essential info notes-1
TRANSCRIPT
7272019 Essential Info Notes-1
httpslidepdfcomreaderfullessential-info-notes-1 157
Biological Databases: Why?
There are two main functions of biological databases:
Making biological data available to scientists. As much information as possible should be
available in one single place (book, site, database). Published data may be difficult to find or
access, and collecting it from the literature is very time consuming. Moreover, not all data is
actually published explicitly in an article.
Making biological data available in computer-readable form. Since analysis of
biological data almost always involves computers, having the data in computer-readable
form (rather than in print or on paper) is a necessary first step.
One of the first biological sequence databases was probably the book "Atlas of Protein
Sequence and Structure" by Margaret Dayhoff and colleagues, first published in 1965. It
contained the protein sequences determined at the time, and new editions of the book
were published well into the 1970s.
Computers became the storage medium of choice as soon as they came within the reach
of ordinary scientists. Databases were distributed on tapes, and later on various kinds of
discs. Once universities and research institutions were connected to the Internet or its
precursors (national computer networks), it is easy to understand why the network became the
medium of choice. And it is even easier to see why the World Wide Web (WWW), based on HTTP
(Hypertext Transfer Protocol), has since the beginning of the 1990s been the standard method of
communication and access for nearly all biological databases.
As biology has increasingly turned into a data-rich science, the need for storing and
communicating large datasets has grown tremendously. The obvious examples are the
nucleotide sequences, the protein sequences and the 3D structural data produced by X-ray
crystallography and macromolecular NMR. A new field of science dealing with the
issues, challenges and new possibilities created by these databases has emerged:
bioinformatics. Other types of data that are or will soon be available in databases are
metabolic pathways (KEGG), gene expression data (microarrays), protein-protein
interactions and other types of data related to biological function and processes.
Biological databases have become an important tool in assisting scientists to understand
and explain a host of biological phenomena, from the structure of biomolecules and their
interaction, to the whole metabolism of organisms, and to understanding the evolution of
species. This knowledge helps facilitate the fight against diseases, assists in the
development of medications, and helps in discovering basic relationships amongst species in
the history of life.
Biological knowledge is distributed amongst many different general and specialized
databases. This sometimes makes it difficult to ensure the consistency of information.
Biological databases cross-reference other databases with accession numbers as one way
of linking their related knowledge together.
An important resource for finding biological databases is a special yearly issue of the
journal Nucleic Acids Research (NAR). The Database Issue of NAR is freely available
and categorizes many of the publicly available online databases related to biology and
bioinformatics.
Most important public databases for molecular biology
Primary Sequence DBs (collaborative project)
DDBJ (DNA DataBase of Japan)
EMBL Nucleotide DB (European Molecular Biology Laboratory)
GenBank (National Center for Biotechnology Information)
Meta-DBs
Entrez Gene Unified retrieval of gene-centred information (NCBI)
euGenes Assembled information on eukaryotic genomes (Univ of Indiana)
GeneCards (Weizmann Inst)
GenLoc UDB (Weizmann Inst)
SOURCE (Univ of Stanford)
LocusLink (National Center for Biotechnology Information)
Genome Annotation Systems
Ensembl Genome Browser Automatically Annotated Genomes (EMBL-EBI and Wellcome Trust Sanger Inst)
UniGene Automatic partitioning of GenBank sequences (NCBI)
Golden Path UCSC (Univ of California Santa Cruz)
Specialized DBs
CGAP Cancer Genes (National Cancer Institute)
Clone Registry Clone Collections (National Center for Biotechnology Information)
IMAGE Clone Collections (Image Consortium)
DBGET Hsapiens retrieval system (Univ of Kyoto)
DIP Interacting Proteins (Univ of California)
GDB (Human Genome Organization)
KEGG Functional Db (Univ of Kyoto)
MGI Mouse Genome (Jackson Lab)
OMIM Inherited Diseases (National Center for Biotechnology Information)
SWISS-PROT Protein Db (Swiss Institute of Bioinformatics)
PEDANT Protein Db (Forschungszentrum für Umwelt & Gesundheit)
List with SNP-Databases
Reactome The Genome Knowledgebase (EBI)
Microarray-DBs
ArrayExpress (European Bioinformatics Institute)
Gene Expression Omnibus (National Center for Biotechnology Information)
maxd (Univ of Manchester)
SMD (Univ of Stanford)
Accession codes vs. identifiers
Many databases in bioinformatics (SWISS-PROT, EMBL, GenBank, Pfam) use a
system where an entry can be identified in two different ways. Basically, it has two
names:
Identifier
Accession code (or number)
The question of how to deal with changed, updated and deleted entries in databases is
a very tricky problem, and the policies for how accession codes and identifiers are
changed or kept constant are not completely consistent between databases, or even
over time for one single database.
The exact definition of what the identifier and accession code are supposed to denote
varies between the different databases, but the basic idea is the following.
Identifier
An identifier ("locus" in GenBank, "entry name" in SWISS-PROT) is a string of
letters and digits that generally is interpretable in some meaningful way by a human,
for instance as a recognizable abbreviation of the full protein or gene name.
SWISS-PROT uses a system where the entry name consists of two parts: the first
denotes the protein and the second part denotes the species it is found in. For
example, KRAF_HUMAN is the entry name for the Raf-1 oncogene from Homo
sapiens.
An identifier can usually change. For example, the database curators may decide
that the identifier for an entry is no longer appropriate. However, this does not
happen very often. In fact, it happens so rarely that it's not really a big problem.
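The two-part structure of a SWISS-PROT entry name can be illustrated with a minimal Python sketch. The function name split_entry_name is made up for this illustration, and the sketch assumes only what the notes state: a protein part and a species part joined by an underscore.

```python
def split_entry_name(entry_name: str) -> tuple[str, str]:
    """Split a two-part entry name such as 'KRAF_HUMAN' into (protein, species).

    Hypothetical helper for illustration; it only assumes the
    'PROTEIN_SPECIES' layout described in the notes.
    """
    protein, separator, species = entry_name.partition("_")
    if not separator or not protein or not species:
        raise ValueError(f"not a two-part entry name: {entry_name!r}")
    return protein, species


protein, species = split_entry_name("KRAF_HUMAN")
print(protein, species)  # prints: KRAF HUMAN
```

Because identifiers can change, a program should treat the parsed parts as a convenience for human readers and not as a stable key.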
Accession code (number)
An accession code (or number) is a number (possibly with a few characters in
front) that uniquely identifies an entry in its database. For example, the accession
code for KRAF_HUMAN in SWISS-PROT is P04049.
The main conceptual difference from the identifier is that it is supposed to be stable:
any given accession code will, as soon as it has been issued, always refer to that
entry or its ancestors. It is often called the primary key for the entry. The
accession code, once issued, must always point to its entry, even after large changes
have been made to the entry. This means that in discussions about specific database
entries (e.g. an article about a specific protein), one should always give the
accession code for the entry in the relevant database.
In the case where two entries are merged into a single one, the new entry
will have both accession codes, where one will be the primary and the other
the secondary accession code. When an entry is split into two, both new entries
will get new accession codes, but will also have the old accession code as a secondary
code.
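The merge and split rules above can be sketched as a small Python data structure. This is a toy model under stated assumptions, not how any real database implements its bookkeeping: the class AccessionIndex, its methods, and the accession codes A00001 to A00004 are all invented for the illustration.

```python
class AccessionIndex:
    """Toy model of primary and secondary accession codes (hypothetical)."""

    def __init__(self):
        # primary accession code -> list of secondary accession codes
        self.secondary = {}

    def add(self, primary):
        self.secondary[primary] = []

    def merge(self, keep, absorb):
        """Merge two entries: `keep` stays primary, `absorb` becomes secondary."""
        inherited = self.secondary.pop(absorb)
        self.secondary[keep] += [absorb] + inherited

    def split(self, old, new_a, new_b):
        """Split one entry: both new entries keep the old codes as secondary."""
        inherited = [old] + self.secondary.pop(old)
        self.secondary[new_a] = list(inherited)
        self.secondary[new_b] = list(inherited)

    def resolve(self, code):
        """Return the primary codes of all current entries that `code` refers to."""
        return sorted(
            primary
            for primary, secondaries in self.secondary.items()
            if code == primary or code in secondaries
        )


index = AccessionIndex()
index.add("A00001")
index.add("A00002")
index.merge("A00001", "A00002")
print(index.resolve("A00002"))  # ['A00001']
index.split("A00001", "A00003", "A00004")
print(index.resolve("A00002"))  # ['A00003', 'A00004']
```

The point of the sketch is that a code, once issued, never stops resolving: after a merge it leads to the surviving entry, and after a split it leads to both new entries.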
NUCLEOTIDE DATABASES
NCBI's sequence databases accept genome data from sequencing projects
from around the world and serve as the cornerstone of bioinformatics
research.
GenBank An annotated collection of all publicly available nucleotide and amino acid
sequences
EST database A collection of expressed sequence tags or short single-pass
sequence reads from mRNA (cDNA)
GSS database A database of genome survey sequences or short single-pass
genomic sequences
HomoloGene A gene homology tool that compares nucleotide sequences between
pairs of organisms in order to identify putative orthologs
HTG database A collection of high-throughput genome sequences from large-scale
genome sequencing centers including unfinished and finished sequences
SNPs database A central repository for both single-base nucleotide substitutions
and short deletion and insertion polymorphisms
RefSeq A database of non-redundant reference sequence standards, including
genomic DNA contigs, mRNAs and proteins for known genes. Multiple collaborations,
both within NCBI and with external groups, support NCBI's data-gathering efforts
STS database A database of sequence tagged sites or short sequences that are
operationally unique in the genome
UniSTS A unified non-redundant view of sequence tagged sites (STSs)
UniGene A collection of ESTs and full-length mRNA sequences organized into
clusters each representing a unique known or putative human gene annotated with
mapping and expression information and cross-references to other sources
DNA & RNA Databases
Major Sequence Repositories – Human Chromosome Information – Organelle
Genome Databases – RNA Databases – Comparative & Phylogenetic Databases –
SNPs, Mutations and Variations Databases – Alternative Splicing Databases –
Specialized Databases
Major Sequence Repositories
DDBJ DNA databank of Japan
EMBL Maintained by EMBL
GenBank Maintained by NCBI
Human Chromosome Information
Chromosome information is available for chromosomes 1 to 22, X and Y.
Organelle Genome Databases
OGMP Organelle genome megasequencing program
GOBASE An organelle genome database
MitoMap Human mitochondrial genome database
RNA Databases
Rfam RNA family database
RNA base Database of RNA structures
tRNA database Database of tRNAs
tRNA tRNA sequences and genes
sRNA Small RNA database
Comparative & Phylogenetic Databases
COG Phylogenetic classification of proteins
DHMHD Human-mouse homology database
HomoloGene Gene homologies across species
Homophila Human disease to Drosophila gene database
HOVERGEN Database of homologous vertebrate genes
TreeBase A database of phylogenetic knowledge
XREF Cross-referencing with model organisms
SNPs, Mutations & Variations Databases
ALPSbase Database of mutations causing human ALPS
dbSNP Single nucleotide polymorphism database at NCBI
HGVbase Human Genome Variation database
Alternative Splicing Databases
ASAP Alternate splicing analysis tool at UCLA
ASG Alternate splicing gallery
HASDB Human alternative splicing database at UCLA
AsMamDB alternatively spliced genes in human mouse and rat
ASD Alternative splicing database at CSHL
Specialised Databases
ABIM Links to several genomics database
ACUTS Ancient conserved untranslated sequences
AGSD Animal genome size database
AmiGO The Gene Ontology database
ARGH The acronym database
ASDB Database of alternatively spliced genes
BACPAC BAC and PAC genomic DNA library info
BBID Biological Biochemical image database
Cardiac gene database
CHLC Genetic markers on chromosomes
COGENT Complete genome tracking database
COMPEL Composite regulatory elements in eukaryotes
CUTG Codon usage database
dbEST Database of expressed sequences or mRNA
dbGSS Genome survey sequence database
dbSTS Sequence tagged sites (STS)
DBTSS Database of transcriptional start sites
DOGS Database of genome sizes
EID The exon-intron database (Harvard)
Exon-Intron Exon-Intron database (Singapore)
EPD Eukaryotic promoter database
FlyTrap HTML based gene expression database
GDB The genome database
GenLink Resources for human genetic and telomere research
GeneKnockouts Gene knockout information
GENOTK Human cDNA database
GEO Gene expression omnibus NCBI
GOLD Information on genome projects around the world
GSDB The Genome Sequence DataBase
HGI TIGR human gene index
HTGS High-throughput genomic sequences at NCBI
IMAGE The largest collection of DNA sequence clones
IMGT The international ImMunoGeneTics information system
IPCN Index to Plant Chromosome Numbers database
LocusLink Single query interface to sequence and genetic loci
TelDB The telomere database
MitoDat Mitochondrial nuclear genes
Mouse EST NIA mouse cDNA project
MPSS Searchable databases of several species
NDB Nucleic acid database
NEDO Human cDNA sequence database
NPD Nuclear protein database
Oomycetes DB Oomycetes database at Virginia Bioinformatics Institute
PLACE Database of plant cis-acting regulatory DNA elements
RDP Ribosomal database project
RDB Receptor database at NIHS Japan
Refseq The NCBI reference sequence project
RHdb Radiation hybrid physical map of chromosomes
SHIGAN SHared Information of GENetic resources Japan
SpliceDB Canonical and non-canonical splice site sequences
STACK Consensus human EST database
TAED The adaptive evolution database
TIGR Curated databases of microbes plants and humans
TRANSFAC The Transcription Factor Database
TRRD Transcription Regulatory region database
UniGene Cluster of sequences for unique genes at NCBI
UniSTS Nonredundant collection of STS
Protein Databases
Protein Sequence Databases – Protein Structure Databases – Protein Domains, Motifs
and Signatures – Others
Protein Sequence Databases
Antibodies Sequence and Structure
BRENDA Enzyme database
CD Antigens Database of CD antigens
dbCFC Cytokine family database
Histones Histone sequence database
HPRD Human protein reference database
InterPro Integrated documentation resources for protein families
iProClass An integrated protein classification database
KIND A non-redundant protein sequence database
MHCPEP Database of MHC binding peptides
MIPS Munich information centre for protein sequences
PIR Annotated and non-redundant protein sequence database
PIR-ALN Curated database of protein sequence alignments
PIR-NREF PIR nonredundant reference protein database
PMD Protein mutant database
PRF Protein research foundation Japan
ProClass Non-redundant protein database
ProtoMap Hierarchical classification of SwissProt proteins
REBASE Restriction enzyme database
RefSeq Reference sequence database at NCBI
SwissProt Curated protein sequence database
SPTR Comprehensive protein sequence database
Transfac Transcription factor database
TrEMBL Annotated translations of EMBL nucleotide sequences
Tumor gene database Genes with cancer-causing mutations
WD repeats WD-repeat family of proteins
Protein Structure Databases
Cath Protein structure classification
HIV Protease HIV protease database 3D structure
PDB 3-D macromolecular structure data
PSI Protein structure initiative
S2F Structure to function project
SCOP Structural Classification of Proteins
Protein Domains, Motifs & Signatures
BLOCKS Multiple aligned segments of conserved protein regions
CDD Conserved domain database and search service
DOMO Homologous protein domain families
Pfam Database of protein domains and HMMs
ProDom Protein domain database
Prints Protein motif fingerprint database
Prosite Database of protein families and domains
SMART Simple modular architecture research tool
TIGRFAM Protein families based on HMMs
Others
Phospho Site Database of phosphorylation sites
PROW Protein reviews on the web
Protein Lounge Complete systems biology
Other Databases
Carbohydrate Databases
Carb DB Carbohydrate Sequence and Structure Database
GlycoWord Glycoscience related information
SPECARB Raman Spectra of carbohydrates
Other Databases
AlzGene Alzheimer's disease
Polygenic pathways Alzheimer's disease, Bipolar disorder or Schizophrenia
Model Organism Databases and Resources
Arabidopsis – Bacteria – Sea Bass – Cat – Cattle – Chicken – Cotton – CyanoBacteria –
Daphnia – Deer – Dictyostelium – Dog – Frog – Fruit Fly – Fungus –
Goat – Horse – Medaka Fish – Maize – Malaria – Mosquito – Mouse – Pig – Plants –
Protozoa – Puffer Fish – Rat – Rice – Rickettsia – Salmon – Sheep – Soy –
Sorghum – Tetraodon – Tilapia – Turkey – Viruses – Worm – Yeast – Zebra Fish
General Information
GMOD Generic Model Organism Database
Model Organisms The WWW virtual library of model organisms
Arabidopsis thaliana
ABRC Arabidopsis biological resource center
AGI Arabidopsis genome initiative
AREX Arabidopsis gene expression database
Arabinet Arabidopsis information on the www
AtGDB An Arabidopsis thaliana plant genome database
AtGI TIGR Arabidopsis thaliana gene index
ATGC Genome sequencing at ATGC
ATIDB Arabidopsis insertion database
CSHL Arabidopsis genome analysis at Cold Spring
ESSA Arabidopsis thaliana project at MIPS
Genoscope AGI in France
Kazusa Arabidopsis thaliana genome info Japan
MPSS Massively parallel signature sequencing
NASC Nottingham Arabidopsis stock center
Stanford Sequencing of the Arabidopsis genome at Stanford
TAIR Arabidopsis information resource
TIGR TIGR Arabidopsis genome annotation database
Wustl Arabidopsis genome at Washington university
Trees A forest tree genome database
Bacterial genomes
B subtilis Bacillus subtilis database
Chlamydomonas Chlamydomonas genetics center
E coli Ecoli genome project
MGD Microbial germ plasm database
Microbial Microbial Genome Gateway
Microbial Microbial genomes
Micado Genetics maps of B subtilis and E coli
MycDB An integrated Mycobacterial database
Neisseria Neisseria meningitidis genome
Neurospora Neurospora crassa database
OralGen Oral pathogen database
Salmonella Salmonella information
STDGen Sexually transmitted disease database
Bass
Bass Sea Bass Mapping project
Cat (Felis catus)
Cat ArkDB Cat mapping database
Cattle (Bos taurus)
ARK Farm animals
BoLA Bovine MHC information
Bovin Bovine genome database
BovMap Mapping the bovine genome
CaDBase Genetic diversity in cattle
ComRad Comparative radiation hybrid mapping
Cow ArkDB Bovine ArkDB
GemQual Genetics of meat quality
Chicken (Gallus gallus)
Chicken Poultry gene mapping project
ChickMap Chicken genome project
Chicken ArkDB Chicken database
ChickEST Chick EST database
Poultry Poultry genome project
Cotton
Cotton Cotton data collection site
Cyano Bacteria (Blue green algae)
Cyano Bacteria Anabaena genome
Daphnia (Crustacea)
Daphnia pulex Daphnia genomics consortium
Deer
Deer ArkDB Deer mapping database
Dictyostelium discoideum
Dicty_cDB Dictyostelium discoideum cDNA project
DGP Dictyostelium discoideum genome project
Dictybase Online informatics resources for Dictyostelium
Dog (Canis familiaris)
Dog Dog genome project
Dog genome project
Frog (Xenopus)
Xenbase A Xenopus web resource
Xenopus Xenopus tropicalis genome
Fruit fly (Drosophila melanogaster)
ENSEMBL Drosophila Genome Browser at ENSEMBL
Fruitfly Drosophila genome project at Berkeley
FlyBase A Database of the Drosophila Genome
FlyMove A Drosophila multimedia database
FlyView A Drosophila image database
Fungus
Aspergillus Aspergillus Genomics
Candida Candida albicans information page
FungalWeb Fungi database
FGSC Fungal genetic stocks center
Goat (Capra hircus)
Goat GoatMap mapping the caprine genome
Horse (Equus caballus)
Horse ArkDB Horse mapping database
Medaka Fish
Medaka Medaka fish home page
Maize
Maize Maize genome database
Malaria (Plasmodium spp)
Malaria Malaria genetics and genomics
PlasmoDB Plasmodium falciparum genome database
Parasites Parasite databases of clustered ESTs
Parasite Genome Parasite genome databases
Mosquito
Mosquito Mosquito genome web server
Mouse (Mus musculus)
ENSEMBL Mouse genome server at ENSEMBL
Jackson Lab Mouse Resources
MRC Mouse genome center at MRC UK
MGI Mouse genome informatics at Jackson Labs
MGD Mouse genome database
MGS Mouse genome sequencing at NIH
MIT Genetic and physical maps of the mouse genome
Mouse SNP Mouse SNP database
NCI Mouse repository
NIH NIH mouse initiative
ORNL Mutant mouse database
RIKEN Mouse resources
Rodentia The whole mouse catalog
Pig (Sus scrofa)
INCO Pig trait gene mapping
Pig Pig EST database
Pig Pig gene mapping project
PiGBase Pig genome mapping
Pig ArkDB Pig Ark DB
Plants
PlantGDB Resources for plant comparative genomics
Protozoa
Protozoa Protozoan genomes
Pufferfish
Fugu Puffer fish project UK site
Fugu Fugu genome project Singapore
Fugu Puffer fish project USA
Rat (Rattus norvegicus)
MIT Genetic maps of the Rat genome
NIH Rat genomics and genetics
Rat RatMap
RGD Rat genome database
Rice (Oryza sativa)
MPSS Massively parallel signature sequencing
Rice-research Rice genome sequence database
Rice Rice genome project
Rickettsia
RicBase Rickettsia genome database
Salmon
Salmon ArkDB Salmon mapping database
Sheep (Ovis aries)
Sheep Sheep gene mapping
SheepBase Sheep gene mapping
Sheep ArkDB Sheep mapping database
Soy
Soy Soybeans database
Sorghum
Sorghum Sorghum Genomics
Tetraodon
Tetraodon Tetraodon nigroviridis genome
Tetraodon Tetraodon nigroviridis genome at Whitehead
Tilapia
HCGS Tilapia genome
Tilapia ArkDB Tilapia mapping database
Turkey
Turkey ArkDB Turkey mapping database
Viruses
HIV HIV sequence database
Herpes Human herpes virus 5 database
Worm (Caenorhabditis elegans)
C elegans C elegans genome sequencing project
NemBase Resource for nematode sequence and functional data
WormAtlas Anatomy of C elegans
WormBase The Genome and biology of C elegans
ACEDB A C elegans database
WWW Server C elegans web server
Yeast
SCPD The promoter database of Saccharomyces cerevisiae
SGD Saccharomyces genome database
S pombe Schizosaccharomyces pombe genome project
TRIPLES Functional analysis of Yeast genome at Yale
Yeast Intron database Spliceosomal introns of the yeast
Zebra fish (Danio rerio)
ZFIN Zebrafish information network
ZGR Zebrafish genome resources
ZIS Zebrafish information server
Zebrafish Zebrafish webserver
DOMAIN DATABASE
Domains can be thought of as distinct functional and/or structural units of a
protein. These two classifications coincide rather often, as a matter of fact, and what is
found as an independently folding unit of a polypeptide chain often also carries a specific
function. Domains are often identified as recurring (sequence or structure) units,
which may exist in various contexts. In molecular evolution such domains may have
been utilized as building blocks, and may have been recombined in different
arrangements to modulate protein function. We can define conserved domains as
recurring units in molecular evolution, the extents of which can be determined by
sequence and structure analysis. Conserved domains contain conserved
sequence patterns or motifs, which allow for their detection in polypeptide sequences.
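Detection of a conserved sequence pattern in a polypeptide can be illustrated with a short Python sketch. It is an illustration under assumptions and not a method described in these notes: the helper names prosite_to_regex and find_motif are invented, only a small subset of PROSITE-style pattern syntax is handled (single residues, bracketed alternatives, x and x(n)), and both the pattern and the toy sequence were chosen simply so that one match occurs.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a simple PROSITE-style pattern into a regular expression.

    Supported elements (an illustrative subset only): a single residue
    letter, a bracketed set such as [ST], the wildcard x, and a repeat
    count such as x(4).
    """
    pieces = []
    for element in pattern.split("-"):
        match = re.fullmatch(r"(x|[A-Z]|\[[A-Z]+\])(?:\((\d+)\))?", element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, count = match.groups()
        core = "." if core == "x" else core
        pieces.append(core + (f"{{{count}}}" if count else ""))
    return "".join(pieces)


def find_motif(sequence: str, pattern: str):
    """Return (start, end, matched_text) with 1-based inclusive positions."""
    regex = re.compile(prosite_to_regex(pattern))
    return [(m.start() + 1, m.end(), m.group()) for m in regex.finditer(sequence)]


toy_sequence = "MTEYKLVVVGAGGVGKSALTIQ"  # toy input chosen for this example
print(find_motif(toy_sequence, "[AG]-x(4)-G-K-[ST]"))
# [(10, 17, 'GAGGVGKS')]
```

A plain pattern match like this is all-or-nothing; the position-specific scoring matrices and profile HMMs discussed later in these notes score each position instead, which makes them more tolerant of divergence.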
The goal of the NCBI conserved domain curation project is to provide database users with
insights into how patterns of residue conservation and divergence in a family relate to functional
properties, and to provide useful links to more detailed information that may help to understand
those sequence/structure/function relationships. To do this, CDD curators include the following
types of information in order to supplement and enrich the traditional multiple sequence
alignments that form the foundation of domain models: 3-dimensional structures and conserved
core motifs, conserved features/sites, phylogenetic organization, and links to electronic literature
resources.
CDD
Conserved Domain Database (CDD)
CDD is a protein annotation resource that consists of a collection of well-annotated multiple
sequence alignment models for ancient domains and full-length proteins. These are available as
position-specific score matrices (PSSMs) for fast identification of conserved domains in protein
sequences via RPS-BLAST. CDD content includes NCBI manually curated domains, which use
3D-structure information to explicitly define domain boundaries and provide insights into
sequence/structure/function relationships, as well as domain models imported from a number
of external source databases (Pfam, SMART, COG, PRK, TIGRFAM).
CD-Search & Batch CD-Search
CD-Search is NCBI's interface to searching the Conserved Domain Database with protein or
nucleotide query sequences. It uses RPS-BLAST, a variant of PSI-BLAST, to quickly scan a set of
pre-calculated position-specific scoring matrices (PSSMs) with a protein query. The results of
CD-Search are presented as an annotation of protein domains on the user query sequence and
can be visualized as domain multiple sequence alignments with embedded user queries.
High-confidence associations between a query sequence and conserved domains are shown as
specific hits. The CD-Search Help provides additional details, including information about
running CD-Search locally.
Batch CD-Search serves as both a web application and a script interface for a conserved domain
search on multiple protein sequences, accepting up to 100000 proteins in a single job. It enables
you to view a graphical display of the concise or full search result for any individual protein from
your input list, or to download the results for the complete set of proteins. The Batch CD-Search
Help provides additional details.
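The idea of scanning a query with a position-specific scoring matrix can be shown with a deliberately tiny Python sketch. It is not RPS-BLAST, which adds search heuristics and statistical evaluation of hits; the function best_window, the three-column toy matrix, its scores, and the default penalty of -1 are all assumptions made for the illustration.

```python
def best_window(sequence, pssm, default=-1):
    """Slide a PSSM along a sequence and return (start, score) of the best window.

    `pssm` is a list of dicts, one per alignment column, mapping a residue
    to its score; residues missing from a column get `default`.
    `start` is a 0-based index into `sequence`.
    """
    width = len(pssm)
    best = None
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(column.get(residue, default)
                    for column, residue in zip(pssm, window))
        if best is None or score > best[1]:
            best = (start, score)
    return best


# Toy three-column matrix (hypothetical scores, not taken from CDD).
toy_pssm = [
    {"G": 3},
    {"K": 4},
    {"S": 2, "T": 2},
]

print(best_window("AAGKSAA", toy_pssm))  # (2, 9)
```

Unlike a fixed pattern, every position contributes its own score, so a window with one unusual residue can still outscore the background.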
CDART: Domain Architectures
Conserved Domain Architecture Retrieval Tool (CDART) performs similarity searches of the
Entrez Protein database based on domain architecture, defined as the sequential order of
conserved domains in protein queries. CDART finds protein similarities across significant
evolutionary distances using sensitive domain profiles rather than direct sequence similarity.
Proteins similar to the query are grouped and scored by architecture. You can search CDART
directly with a query protein sequence or, if a sequence of interest is already in the Entrez Protein
database, simply retrieve the record, open its Links menu, and select Domain Relatives to see the
precalculated CDART results. Relying on domain profiles allows CDART to be fast and, because it
relies on annotated functional domains, informative.
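Domain architecture as "the sequential order of conserved domains" can be sketched in a few lines of Python. This only illustrates the grouping idea and says nothing about how CDART scores architectures; the protein identifiers, domain names and start positions below are invented.

```python
from collections import defaultdict


def architecture(hits):
    """Return the domain names ordered by their start position on the protein."""
    return tuple(name for start, name in sorted(hits))


def group_by_architecture(proteins):
    """Group protein identifiers that share the same ordered domain list."""
    groups = defaultdict(list)
    for protein_id, hits in proteins.items():
        groups[architecture(hits)].append(protein_id)
    return dict(groups)


# Hypothetical annotation: protein id -> list of (start position, domain name).
toy_proteins = {
    "protA": [(150, "SH2"), (80, "SH3"), (270, "Kinase")],
    "protB": [(60, "SH3"), (140, "SH2"), (250, "Kinase")],
    "protC": [(10, "Kinase")],
}

for arch, members in group_by_architecture(toy_proteins).items():
    print(" - ".join(arch), members)
# SH3 - SH2 - Kinase ['protA', 'protB']
# Kinase ['protC']
```

Two proteins land in the same group even if their sequences have diverged, as long as the same domains occur in the same order.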
CDTree
CDTree is a helper application for your web browser that allows you to interactively view and
examine conserved domain hierarchies curated at NCBI. CDTree works with Cn3D as its
alignment viewer/editor; it is used in the CDD curation process and is both a classification and
research tool for functional annotation and the study of protein and protein domain families.
Content
CDD content includes NCBI manually curated domain models and domain models imported from
a number of external source databases (Pfam, SMART, COG, PRK, TIGRFAMs). What is unique
about NCBI-curated domains is that they use 3D-structure information to explicitly define domain
boundaries, align blocks, amend alignment details, and provide insights into
sequence/structure/function relationships. Manually curated models are organized hierarchically if
they describe domain families that are clearly related by common descent. To provide a
non-redundant view of the data, CDD clusters similar domain models from various sources into
superfamilies.
Searching the database
The collection is also part of NCBI's Entrez query and retrieval system, crosslinked to numerous
other resources. CDD provides annotation of domain footprints and conserved functional
sites on protein sequences. Precalculated domain annotation can be retrieved for protein
sequences tracked in NCBI's Entrez system, and CDD's collection of models can be
queried with novel protein sequences via the CD-Search service or the Batch CD-Search
service, which allows the computation and download of annotation for large sets of protein
queries. CDD also contains data from additional research projects, such as KOGs (a
eukaryotic counterpart to COGs) and the Library of Ancient Domains (LOAD).
Accessions that start with "cl" are for superfamily cluster records and can contain
domain models from one or more source databases.
When searching CDD, it is possible to limit search results to domains from any
given source database by using the Database Search Field.
Phylogenetic organization: Based on evidence from sequence comparison, NCBI
Conserved Domain Curators attempt to organize related domain models into
phylogenetic family hierarchies.
Links to electronic literature resources: NCBI curated domains also provide links to
citations in PubMed and NCBI Bookshelf that discuss the domain. These references
are selected by curators and, whenever possible, include articles that provide
evidence for the biological function of the domain and/or discuss the evolution and
classification of a domain family.
PFAM
Pfam 27.0 (March 2013, 14831 families)
Proteins are generally composed of one or more functional regions, commonly termed domains.
The presence of different domains in varying combinations in different proteins gives rise to the
diverse repertoire of proteins found in nature. Identifying the domains present in a protein can
provide insights into the function of that protein.
The Pfam database is a large collection of protein domain families. Each family is represented by
multiple sequence alignments and hidden Markov models (HMMs).
There are two levels of quality to Pfam families: Pfam-A and Pfam-B. Pfam-A entries are derived
from the underlying sequence database, known as Pfamseq, which is built from the most recent
release of UniProtKB at a given time-point. Each Pfam-A family consists of a curated seed
alignment containing a small set of representative members of the family, profile hidden Markov
models (profile HMMs) built from the seed alignment, and an automatically generated full
alignment, which contains all detectable protein sequences belonging to the family, as defined by
profile HMM searches of primary sequence databases.
Pfam-B families are un-annotated and of lower quality, as they are generated automatically from
the non-redundant clusters of the latest ADDA release. Although of lower quality, Pfam-B families
can be useful for identifying functionally conserved regions when no Pfam-A entries are found.
Pfam entries are classified in one of four ways:
Family: A collection of related protein regions
Domain: A structural unit
Repeat: A short unit which is unstable in isolation but forms a stable structure when multiple
copies are present
Motif: A short unit found outside globular domains
Related Pfam entries are grouped together into clans; the relationship may be defined by
similarity of sequence, structure or profile-HMM.
Pfam also generates higher-level groupings of related families, known
as clans. A clan is a collection of Pfam-A entries which are related by
similarity of sequence, structure or profile-HMM.
YOU CAN FIND DATA IN PFAM IN VARIOUS WAYS
Analyze your protein sequence for Pfam matches View Pfam family annotation and alignments
7272019 Essential Info Notes-1
httpslidepdfcomreaderfullessential-info-notes-1 2057
See groups of related families
Look at the domain organisation of a protein sequence Find the domains on a PDB structure
Query Pfam by keywords
Go Example
Enter any type of accession or ID to jump to the page for a Pfam family or clan
UniProt sequence PDB structure etc
Browse Pfam
You can use the links on the Pfam website to find lists of families, clans or proteomes which begin with the chosen letter (or number). You can also see a list of Pfam families which are new to this release, or the list of the twenty largest families in terms of number of sequences.
Glossary of terms used in Pfam
These are some of the commonly used terms in the Pfam website
Alignment coordinates
HMMER3 reports two sets of domain coordinates for each profile HMM match. The envelope coordinates delineate the region on the sequence where the match has been probabilistically determined to lie, whereas the alignment coordinates delineate the region over which HMMER is confident that the alignment of the sequence to the profile HMM is correct. Our full alignments contain the envelope coordinates from HMMER3.
Architecture
The collection of domains that are present on a protein
Clan
A collection of related Pfam entries. The relationship may be defined by similarity of sequence, structure or profile-HMM.
Domain
A structural unit
Domain score
The score of a single domain aligned to an HMM. Note that for HMMER2, if there was more than one domain, the sequence score was the sum of all the domain scores for that Pfam entry. This is not quite true for HMMER3.
DUF
Domain of unknown function
Envelope coordinates
See Alignment coordinates
Family
A collection of related protein regions
Full alignment
An alignment of the set of related sequences which score higher than the manually set threshold values for the HMMs of a particular Pfam entry.
Gathering threshold (GA)
Also called the gathering cut-off, this value is the search threshold used to build the full alignment. The gathering threshold is assigned by a curator when the family is built. The GA is the minimum score a sequence must attain in order to belong to the full alignment of a Pfam entry. For each Pfam HMM we have two GA cutoff values: a sequence cutoff and a domain cutoff.
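As a minimal sketch of how the two gathering cutoffs could be applied, the Python below keeps a sequence only if its sequence score reaches the sequence cutoff, and then keeps only those domains whose score reaches the domain cutoff. The function name, the argument layout and the scores are hypothetical illustrations and are not Pfam's own code.

```python
def passes_gathering_threshold(sequence_score, domain_scores, seq_ga, dom_ga):
    """Return the domain scores that would qualify for the full alignment.

    Hypothetical helper: the sequence must reach the sequence cutoff
    (seq_ga), and each reported domain must reach the domain cutoff (dom_ga).
    """
    if sequence_score < seq_ga:
        return []  # the whole sequence is rejected
    return [score for score in domain_scores if score >= dom_ga]


# Illustrative bit scores, not taken from any real Pfam family.
kept = passes_gathering_threshold(
    sequence_score=45.2, domain_scores=[30.1, 15.1], seq_ga=25.0, dom_ga=20.0
)
print(kept)  # [30.1]
```

Here the sequence clears the sequence cutoff, but only one of its two domain matches clears the domain cutoff.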
HMMER
The suite of programs that Pfam uses to build and search HMMs. Since Pfam release 24.0 we have used HMMER version 3 to make Pfam. See the HMMER site for more information.
Hidden Markov model (HMM)
An HMM is a probabilistic model. In Pfam we use HMMs to transform the information contained within a multiple sequence alignment into a position-specific scoring system. We search our HMMs against the UniProt protein database to find homologous sequences.
HMMER3
The suite of programs that Pfam uses to build and search HMMs. See the HMMER site for more information.
iPfam
A resource that describes domain-domain interactions that are observed in PDB entries. Where two or more Pfam domains occur in a single structure, it analyses them to see if they are close enough to form an interaction. If they are close enough, it calculates the bonds forming the interaction.
Metaseq
A collection of sequences derived from various metagenomics datasets
Motif
A short unit found outside globular domains
Noise cutoff (NC)
The bit scores of the highest scoring match not in the full alignment
Pfam-A
An HMM-based, hand-curated Pfam entry which is built using a small number of representative sequences. We manually set a threshold value for each HMM and search our models against the UniProt database. All of the sequences which score above the threshold for a Pfam entry are included in the entry's full alignment.
Pfam-B
An automatically generated alignment which is formed by taking a cluster of sequences from the ADDA database and removing Pfam-A residues from them. Since Pfam-B families are automatically generated, we recommend that you verify that the sequences in a Pfam-B family are related using other methods, such as BLAST. For Pfam 24.0 we have made HMMs for the first (and therefore largest) 20000 Pfam-B families. Users can search their sequences against the Pfam-B HMMs, in addition to the Pfam-A HMMs, when performing both single-sequence searches and batch searches on the website.
Posterior probability
HMMER3 reports a posterior probability for each residue that matches a match or insert state in the profile HMM. A high posterior probability shows that the alignment of the amino acid to the match/insert state is likely to be correct, whereas a low posterior probability indicates that there is alignment uncertainty. This is indicated on a scale with 10 being the highest certainty, down to 1 being complete uncertainty. Within Pfam we display this information as a heat map view, where green residues indicate high posterior probability and red ones indicate a lower posterior probability.
Repeat
A short unit which is unstable in isolation but forms a stable structure when
multiple copies are present
Seed alignment
An alignment of a set of representative sequences for a Pfam entry. We use this alignment to construct the HMMs for the Pfam entry.
Sequence score
The total score of a sequence aligned to an HMM. If there is more than one domain, the sequence score is the sum of all the domain scores for that Pfam entry. If there is only a single domain, the sequence score and the domain score for the protein will be identical. We use the sequence score to determine whether a sequence belongs to the full alignment of a particular Pfam entry.
Trusted cutoff (TC)
The bit scores of the lowest scoring match in the full alignment
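The relationship between the gathering threshold, the trusted cutoff and the noise cutoff can be illustrated with a short Python sketch. It assumes a flat list of bit scores and a single threshold; the function name and the numbers are hypothetical, and real Pfam entries track sequence and domain values separately.

```python
def trusted_and_noise_cutoffs(scores, gathering_threshold):
    """Split bit scores at the gathering threshold (GA).

    TC is the lowest score that is in the full alignment (score >= GA);
    NC is the highest score that is not (score < GA).
    Either value is None when that side of the split is empty.
    """
    included = [s for s in scores if s >= gathering_threshold]
    excluded = [s for s in scores if s < gathering_threshold]
    tc = min(included) if included else None
    nc = max(excluded) if excluded else None
    return tc, nc


# Illustrative scores only.
tc, nc = trusted_and_noise_cutoffs([88.0, 31.5, 25.3, 24.1, 12.0], 25.0)
print(tc, nc)  # 25.3 24.1
```

The gathering threshold always lies between the two values: NC < GA <= TC.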
SMART (a Simple Modular Architecture Research Tool) allows the identification and annotation of genetically mobile domains and the analysis of domain architectures. More than 500 domain families found in signalling, extracellular and chromatin-associated proteins are detectable. These domains are extensively annotated with respect to phyletic distributions, functional class, tertiary structures and functionally important residues. Each domain found in a non-redundant protein database, as well as search parameters and taxonomic information, are stored in a relational database system. User interfaces to this database allow searches for proteins containing specific combinations of domains in defined taxa.
Features
You can use SMART in two different modes: normal or genomic. The main difference is in the underlying protein database used. In Normal SMART, the database contains Swiss-Prot, SP-TrEMBL and stable Ensembl proteomes. In Genomic SMART, only the proteomes of completely sequenced genomes are used: Ensembl for metazoans and Swiss-Prot for the rest.
The protein database in Normal SMART has significant redundancy, even though identical proteins are removed. If you use SMART to explore domain architectures, or want to find exact domain counts in various genomes, consider switching to Genomic mode. The numbers in the domain annotation pages will be more accurate, and there will not be many protein fragments corresponding to the same gene in the architecture query results. Remember that we are exploring a limited set of genomes, though.
Different color schemes are used to easily identify the mode we are in
Normal mode Genomic mode
Alignment
Representation of a prediction of the amino acids in tertiary structures of homologues that overlay in three dimensions. Alignments held by SMART are mostly based on published observations (see domain annotations for details), but are updated and edited manually.
Alignment block
Ungapped alignments that usually represent a single secondary structure
Bits scores
Alignment scores are reported by HMMer and BLAST as bits scores. The likelihood that the query sequence is a bona fide homologue of the database sequence is compared to the likelihood that the sequence was instead generated by a random model. Taking the logarithm (to base 2) of this likelihood ratio gives the bits score.
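The definition above amounts to a base-2 log-likelihood ratio. A minimal Python sketch follows; the function and argument names are hypothetical, and the two likelihoods are assumed to be supplied by whatever model produced the alignment.

```python
import math


def bits_score(likelihood_homologue, likelihood_random):
    """Base-2 logarithm of the likelihood ratio between the two models."""
    return math.log2(likelihood_homologue / likelihood_random)


# A sequence that is 8 times more likely under the homology model
# than under the random model scores 3 bits.
print(bits_score(8.0, 1.0))  # 3.0
```

A score of 0 bits means both models explain the sequence equally well, and each additional bit doubles the likelihood ratio in favour of homology.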
BLAST Basic local alignment search tool
An excellent database searching tool developed at the National Center for Biotechnology Information (NCBI) ([1] [2] [3] [4]). SMART uses NCBI-BLAST for detection of outlier homologues and homologues of known structure. WU-BLAST is used for nrdb searches with user supplied sequences.
Cellular Role
Chromatin-associated: Chromatin is the tangled fibrous complex of DNA and protein within a eukaryotic nucleus.
Interaction (with the environment): Molecules that sense cellular environmental change such as osmolarity, light flux, acidity, ion concentration, etc.
Metabolic: Enzymes that catalyze reactions in living cells that transform organic molecules.
Replication: The process of making an identical copy of a section of duplex (double-stranded) DNA, using existing DNA as a template for the synthesis of new DNA strands.
Signalling: Proteins that participate in a pathway that is initiated by a molecular stimulus and terminated by a cellular response.
Transport: Transporters (permeases) are proteins that straddle the cell membrane and carry specific nutrients, ions, etc. across the membrane.
Translation: The process in which the genetic code carried by messenger RNA directs the synthesis of proteins from amino acids.
Transcription: The synthesis of an RNA copy from a sequence of DNA (a gene); the first step in gene expression.
Coiled coils
Intimately-associated bundles of long alpha-helices ([1] [2] [3]). Coiled coils are detected in SMART using the method of Lupas et al. ([4], COILS home). Coiled coil predictions are indicated on the second line in SMART's graphical output.
Domain
Conserved structural entities with distinctive secondary structure content and an hydrophobic core. In small disulphide-rich and Zn2+-binding or Ca2+-binding domains, the hydrophobic core may be provided by cystines and metal ions, respectively. Homologous domains with common functions usually show sequence similarities.
Domain composition
Proteins with the same domain composition have at least one copy of each of the domains of the query.
Domain organisation
Proteins having all the domains as the query in the same order (Additional domains are allowed)
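The difference between domain composition and domain organisation can be shown with a short Python sketch. It treats a protein as an ordered list of domain names and reads "same order, additional domains allowed" as a subsequence test; that reading, the function names and the example architectures are assumptions made for illustration, not SMART's implementation.

```python
def same_composition(query, target):
    """True if target has at least one copy of every domain in the query."""
    return set(query) <= set(target)


def same_organisation(query, target):
    """True if the query's domains occur in target in the same order.

    Additional domains are allowed anywhere, so this is a subsequence test.
    """
    remaining = iter(target)
    # 'domain in remaining' advances the iterator until it finds the domain,
    # so each later query domain must appear after the previous match.
    return all(domain in remaining for domain in query)


query = ["SH3", "SH2", "TyrKc"]
print(same_composition(query, ["SH2", "SH3", "TyrKc"]))         # True
print(same_organisation(query, ["SH2", "SH3", "TyrKc"]))        # False
print(same_organisation(query, ["SH3", "PH", "SH2", "TyrKc"]))  # True
```

Every protein that matches the organisation also matches the composition, but not the other way round.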
Entrez
A WWW-based system that allows easy retrieval of sequence, structure, molecular biology and literature data (Entrez). SMART's domain annotation pages contain links to the Entrez system, thereby providing extensive literature, structure and sequence information.
E-value
This represents the number of sequences with a score greater-than or equal to X, expected absolutely by chance. The E-value connects the score (X) of an alignment between a user-supplied sequence and a database sequence, generated by any algorithm, with how many alignments with similar or greater scores would be expected from a search of a random sequence database of equivalent size. Since version 2.0, E-values are calculated using Hidden Markov Models, leading to more accurate estimates than before.
Extracellular Domains
Domain families that are most prevalent in proteins outside of the cytoplasm and the nucleus
Gap
A position in an alignment that represents a deletion within one sequence relative to another. Gap penalties are requirements for alignment algorithms in order to reduce excessively-gapped regions. Gaps in alignments represent insertions that usually occur in protruding loops or beta-bulges within protein structures.
Genomic database
Protein database used in SMART's Genomic mode. It contains data from completely sequenced genomes only. Ensembl data is used for Metazoan genomes and Swiss-Prot for others. A complete list of genomes in the database is available.
HMM Hidden Markov model
HMMs are statistical models of the sequence consensus of an homologous family (see the docu). A particular class of HMMs has been shown to be equivalent to generalised profiles (8867839). Applications of HMMs to sequence analysis are nicely provided by HMMer and SAM.
HMM consensus
The HMM consensus is a one line summary of the corresponding HMM. The amino acid shown for the consensus is the highest probability amino acid at that position according to the HMM. Capital letters mean highly conserved residues (probability > 0.5 for protein models) (modified from the HMMer User's Guide).
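The rule for building the consensus line can be sketched in Python as follows. The model is reduced to a list of per-position probability tables, which is a simplification; the function name and the toy numbers are hypothetical, and showing weakly conserved residues in lower case is an assumption about the output format.

```python
def hmm_consensus(position_probs, threshold=0.5):
    """Build a one-line consensus from per-position residue probabilities.

    position_probs: list of dicts mapping amino-acid letter -> probability.
    The most probable residue is shown at each position, in upper case when
    its probability exceeds the threshold and in lower case otherwise.
    """
    consensus = []
    for probs in position_probs:
        residue, p = max(probs.items(), key=lambda item: item[1])
        consensus.append(residue.upper() if p > threshold else residue.lower())
    return "".join(consensus)


# Toy three-position model with made-up probabilities.
toy_model = [
    {"G": 0.9, "A": 0.1},
    {"L": 0.4, "I": 0.35, "V": 0.25},
    {"K": 0.6, "R": 0.4},
]
print(hmm_consensus(toy_model))  # GlK
```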
HMMer
The HMMer package ([1] [2]) provides multiple alignment and database searching capabilities. There are several programs in the package (see the docu), including one (hmmfs) that searches databases for non-overlapping LOCAL similarities (i.e. that match across at least part of the HMM) and another (hmmls) that searches databases for non-overlapping GLOBAL similarities (i.e. that match across the full HMM). These correspond approximately to profile-based searches using negative and positive profiles, respectively (see WiseTools). Database searches using hmmls or hmmfs provide alignment scores as bits scores.
Homology
Evolutionary descent from a common ancestor due to gene duplication
Intracellular Domains
Domain families that are most prevalent in proteins within the cytoplasm
Localisation
Numbers of domains that are thought, from SwissProt annotations, to be present in different cellular compartments (cytoplasm, extracellular space, nucleus and membrane-associated) are shown in annotation pages.
Motif
Sequence motifs are short conserved regions of polypeptides. Sets of sequence motifs need not necessarily represent homologues.
NRDB non-redundant database
A database that contains no identical pairs of sequences. It can contain multiple sequences originating from the same gene (fragments, alternative splicing products). SMART in Standard mode uses such a database.
ORF
Open reading frame
Outlier homologues
These are often difficult to detect using HMM methodology. A complementary approach to their detection is to query a database of sequences taken from multiple sequence alignments using BLAST. Selecting this option will also activate searches against sequence databases derived from proteins of known structure. A simple BLAST search of the PDB is performed, together with a search of RPS_Blast profiles derived from SCOP. These profiles were kindly provided by Steffen Schmidt (see Schmidt et al., J. Chem. Inf. Comput. Sci. 2002 (42) 405-7).
P-value
This represents a probability that, given a database of a particular size, random sequences score higher than a value X. P-values are generated by the BLAST algorithm that has been integrated into SMART.
PDB protein data bank
PDB is an archive of experimentally-determined three-dimensional structures (Brookhaven Natl. Labs or EBI). Domain families represented in SMART and in the PDB are annotated as being of known structure; links are provided in SMART to the PDB via PDBsum and MMDB. PDBsum links can be used to access a variety of sequence-based and structure-based tools, whereas MMDB provides access to literature information and structure similarities.
PFAM
Pfam is a database of protein domain families represented as (i) multiple alignments and (ii) HMM-profiles ([1] [2]). Pfam WWW servers allow comparison of user-supplied sequences with the Pfam database (Sanger Center and Washington Univ.). SMART contains a facility to search the Pfam collection using HMMer.
Prokaryotic domains
SMART now also searches for domains found in two component regulatory systems. These can be found mainly in Prokaryotes, but a few were also found in eukaryotes like yeast and plants.
Profile
A profile is a table of position-specific scores and gap penalties, representing an homologous family, that may be used to search sequence databases (Ref. [1] [2] [3]). In CLUSTAL-W-derived profiles, those sequences that are more distantly related are assigned higher weights ([4] [5] [6]). Issues in profile-based database searching are discussed in Bork & Gibson (1996) [7].
ProfileScan
An excellent WWW server that allows a user to compare a protein or DNA sequence against a database of profiles (located at the ISREC).
PROSITE
This is a dictionary of protein sites and motif patterns. Some SMART domain annotations contain links to PROSITE.
Schnipsel database domain sequence database
Schnipsel is a German word meaning snippet or fragment. The schnipsel database consists of the sequences of all domains found with SMART in NRDB. Outliers of a family often cannot be detected by a profile, yet are detectable by pairwise similarity to one or more established members of a sequence family. So searching against the schnipsel database gives complementary information to the profile searches.
Searched domains
In the first version of SMART, only eukaryotic signalling domains could be searched. In 1998 we extended this set by prokaryotic signalling and extracellular domains. In the input page you can choose whether you only want to search cytoplasmic (prokaryotic + eukaryotic), extracellular or all domains.
Secondary Literature
The secondary literature is derived by the following procedure: For each of the hand selected papers referenced by a domain, 100 neighbouring papers are retrieved using Medline. If one of these neighbouring papers is referenced from more than two original papers, it is included into the secondary literature list.
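The selection step of this procedure can be sketched in Python. The sketch assumes the neighbour lists have already been retrieved and are passed in as a dictionary; the function name, the paper identifiers and the data layout are hypothetical, and no Medline access is shown.

```python
from collections import Counter


def secondary_literature(neighbours_by_paper, min_sources=3):
    """Select neighbouring papers shared by several hand-selected papers.

    neighbours_by_paper: dict mapping each original paper ID to the IDs of
    its neighbouring papers. A neighbour is kept when it is linked from more
    than two original papers, i.e. from at least min_sources = 3 of them.
    """
    counts = Counter()
    for neighbours in neighbours_by_paper.values():
        counts.update(set(neighbours))  # count each neighbour once per paper
    return sorted(paper for paper, n in counts.items() if n >= min_sources)


# Made-up identifiers for illustration.
neighbours = {
    "P1": ["N1", "N2"],
    "P2": ["N1", "N3"],
    "P3": ["N1", "N2"],
}
print(secondary_literature(neighbours))  # ['N1']
```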
Seed Alignment
Alignment that contains only one of each pair of homologues that are represented in a CLUSTALW-derived phylogenetic tree linked by a branch of length less than a distance of 0.2 (see the related article).
SEG
A program of Wootton & Federhen [1] that detects regions of the query sequence that have low compositional complexity [2].
Sequence ID or ACC
Sequence identifiers or accession codes may be entered via the SMART homepage to initiate a query. You can use either Uniprot or Ensembl sequence identifiers.
Signalling Domains
The original set of domains used in SMART were collected as those that satisfied one or both of two criteria:
1. cytoplasmic domains that possess kinase, phosphatase, ubiquitin ligase or phospholipase enzymatic activities, or those that stimulate GTPase-activation or guanine nucleotide exchange
2. cytoplasmic domains that occur in at least two proteins with different domain organisations, of which one also contains a domain that satisfies criterion 1
These domains mostly mediate or regulate the transduction of an extracellular signal towards the nucleus, resulting in the initiation of a cellular response. More recently, prokaryotic two-component signalling domains have been added to the SMART set.
SignalP
This program predicts the presence and location of signal peptide cleavage sites in amino acid sequences (SignalP home page).
SMART Simple Modular Architecture Research Tool
Our main goal in providing this tool is to allow automatic identification and annotation of domains in user-supplied protein sequences.
Species
Numbers of domains present in a variety of selected taxa (animal, archaea, bacteria, fungi, plants and protozoa) are shown in annotation pages.
SwissProt
The SwissProt database is an extensively annotated and non-redundant collection of protein sequences. SwissProt annotations have been mined for SMART-derived annotations of alignments.
TMHMM2
This program predicts the location and topology of transmembrane helices (TMHMM2)
Thresholds
For each of the domains found by SMART, a combination of thresholds is used to distinguish between true and false hits. The different thresholds are described in the SMART paper.
WiseTools
A package that is based on database searches using profiles. Profiles may be generated using PairWise and then compared with sequence databases using SearchWise. Scores are generated for alignments that match the whole of the profile (using a positive profile) or else that match at least part of the profile (using a negative profile). Only the top-scoring optimal alignment of each sequence is reported; hence SMART relies on re-iterating the search for new repeats until none are reported that score above threshold. Score thresholds have been set manually that are considered to represent a score just above the top-scoring true negative. Additional thresholds have been estimated for domains that are repeated in single polypeptides. A more recent package allows comparison of DNA sequences at the level of their conceptual translations, regardless of sequence error and introns (see Wise2).
What's new
Changes from version 6.0 to 7.0
Full text search engine
Perform a full text search of SMART and Pfam domain annotations, plus the complete protein descriptions for Uniprot and Ensembl proteins.
metaSMART
Explore and compare domain architectures in various publicly available metagenomics datasets.
iTOL export and visualization
Domain architecture analysis results can be exported and visualized in interactive Tree Of Life. The new option can be found in the protein list function select list.
User interface cleanup
Various small changes to the UI resulting in faster and easier navigation.
Changes from version 5.1 to 6.0
Metabolic pathways information
SMART domains and proteins available in the genomic mode now have basic metabolic pathways information. It is generated by mapping our genomic mode protein database to the KEGG orthologous groups.
Updated genomic mode protein database
Greatly expanded genomic mode database now includes sequences from 630 completely sequenced species.
Changes from version 5.0 to 5.1
SMART webservice
You can access SMART using our webservice. Check the WSDL file for details. The webservice is still under active development and only works for sequences and IDs which are in our database (complete Uniref100 and stable Ensembl genomes).
SMART DAS server
SMART DAS server is available at URL http://smart.embl.de/smart/das. It provides all the protein features from our database (SMART domains, signal peptides, transmembrane regions and coiled coils) for all Uniref100 and Ensembl proteins.
If you need help with these services, or have questions/feedback, please contact us.
Changes from version 4.1 to 5.0
New protein database in Normal mode
SMART now uses Uniprot as the main source of protein sequences. All Ensembl proteomes (except pre-releases) are also included. To lower redundancy in the database, the following procedure is used:
o only one copy of 100% identical proteins is kept (different IDs are still available)
o each species' proteins are separated
o CD-HIT clustering with 96% identity cutoff is performed on each species separately
o longest member of each protein cluster is used as the representative
o only representative cluster members and single proteins (i.e. proteins which are not members of any clusters) are used in all domain architecture queries and for domain counts in the annotation pages
Even though the number of proteins in the database is almost doubled (current version has around 2.9 million proteins), the redundancy should be minimal.
Domain architecture invention dating
As a further step from the single domain to the understanding of multi domain proteins, SMART now predicts the taxonomic class where the concept of a protein, that is its domain architecture, was invented. The domain architecture is defined as the linear order of all SMART domains in the protein sequence. To derive the point of its invention, all proteins with the same domain architecture are mapped onto NCBI's taxonomy. The last common ancestor of all organisms containing at least one protein with the domain architecture is defined as the point of its origin.
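The last-common-ancestor step can be sketched in Python by comparing root-to-leaf lineages level by level. This is only an illustration under stated assumptions: the function name is hypothetical, the lineages are shortened and simplified rather than real NCBI taxonomy records, and SMART's actual mapping is not shown.

```python
def last_common_ancestor(lineages):
    """Return the deepest taxon shared by all lineages, or None.

    lineages: list of root-to-leaf lists of taxon names, one per organism
    that contains a protein with the domain architecture of interest.
    """
    ancestor = None
    for level in zip(*lineages):      # walk down all lineages in parallel
        if len(set(level)) != 1:      # the lineages diverge at this rank
            break
        ancestor = level[0]
    return ancestor


# Simplified, illustrative lineages.
lineages = [
    ["cellular organisms", "Eukaryota", "Metazoa", "Chordata"],
    ["cellular organisms", "Eukaryota", "Metazoa", "Arthropoda"],
    ["cellular organisms", "Eukaryota", "Fungi"],
]
print(last_common_ancestor(lineages))  # Eukaryota
```

In this toy case the architecture would be dated to the eukaryotic ancestor, because one of the organisms carrying it lies outside Metazoa.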
Changes from version 4.0 to 4.1
Two modes of operation: Normal and Genomic
For more details, visit the change mode page.
Intrinsic protein disorder prediction
You can now include DisEMBL prediction of protein disorder in your search parameters. The results are displayed as blue regions in protein schematics. DisEMBL's HOTLOOPS and REM465 methods are currently used. Visit the DisEMBL page for more information on the method.
Catalytic activity check
SMART now includes data on specific requirements for catalytic activity for some of our domains (50 catalytic domains at the moment). If the required amino acids are not present in the predicted domain, it will be marked as 'Inactive'. The domain annotation page will show you the details on which amino acids are missing and the links to relevant literature.
Taxonomic trees
Architecture query results are now displayed as simple taxonomic trees. In addition to individual proteins, you can select any taxonomic node (or multiple nodes) and display all the proteins in those nodes. The Evolution section of domain annotation pages is now also represented as a taxonomic tree. The basic tree contains only several hand-picked representative species, with a link to the full tree.
User interface redesigned
SMART has been completely rewritten and all pages conform to the XHTML 1.0 Strict and CSS Level 2 standards. We recommend a modern standards-compliant browser for the best experience. Mozilla Firefox and Opera are our favorites.
Changes from version 3.5 to 4.0
Intron positions are shown in protein schematics
For proteins that match any of the Ensembl predictions, SMART will show intron positions as vertical coloured lines in graphical representations. This information is retrieved from a pre-calculated mapping of Ensembl gene structures to protein sequences.
The vertical line at the end of the protein is not an actual intron, but a mark to show that intron mapping was performed. If that is the only line, there are no introns annotated. If there is no line at all, there is no data available in Ensembl for that particular sequence.
You can switch off intron display on your SMART preferences page.
Alternative splicing information
Since SMART now incorporates Ensembl genomes, the 'Additional information' page shows a list of alternative splices of the gene encoding the analyzed protein (if there are any). It is possible to either display SMART protein annotation for any of the alternative splices or get a graphical multiple sequence alignment of all of them.
Orthology information
There are 2 separate sets of orthologs for each Ensembl protein: 1:1 reciprocal best matches in other genomes, and orthologous groups with reciprocal best hits from all genomes analyzed (i.e. each of these proteins has exactly one ortholog in all 6 genomes). This data is displayed on the 'Additional information' page.
Graphical multiple sequence alignments
Orthologous groups and different alternative splices can be displayed as graphical multiple sequence alignments. Proteins are aligned using ClustalW. Domains, intrinsic features and introns are mapped onto the alignment, with their positions adjusted according to gaps (black boxes).
Changes from version 3.4 to 3.5
Features of all Ensembl genomes are stored in SMART
You can use standard Ensembl protein identifiers (for example ENSP00000264122) and sequences in all queries.
Batch access
Quickly display results for hundreds of sequences. Currently limited to 500 sequences per access, 2000 per day.
Smarter handling of identical sequences
If there are multiple IDs associated with a particular sequence, an extra table will be displayed showing all of them.
Changes from version 3.3 to 3.4
Search structure based profiles using RPS-Blast
Clicking on the 'search schnipsel and structures' checkbox will now also initiate a search of profiles based on scop domain families using RPS-Blast. These profiles were kindly provided by Steffen Schmidt (see Schmidt et al., J. Chem. Inf. Comput. Sci. 2002 (42) 405-7).
Improved Architecture analysis
In addition to standard 'Domain selection' querying, it is now possible to do queries based on GO (Gene Ontology) terms associated with domains. In the first step, you get a list of domains matching the GO terms entered. After selecting the domains of interest from the list, proteins containing those domains are displayed. Use the 'Taxonomic selection' box to limit the results based on taxonomic ranges.
Pfam domains are stored in the database
The SMART database now contains precomputed results for all Pfam domains. To use Pfam domains in the architecture queries, prepend the domain name with 'Pfam:' (for example: TyrKc AND Pfam:Fz AND TRANS).
Try our SMART Toolbar for the Mozilla web browser.
Changes from version 3.2 to 3.3
Fantastic new protein picture generator
Proteins are now displayed as dynamically generated PNG images. This means that you can download the entire protein representation as a single image. You have our permission to do this and use these diagrams in any way that you like; do acknowledge us though. All domain bubbles have been script-generated using The Gimp and its Perl-Fu extension. The script is GPL'd and is available for download.
Fancy colour alignments
All alignments we generate are coloured in using Leo Goodstadt's excellent CHROMA program. CHROMA is available for download, and it's great. You may experience some problems if you're using a klunky browser. This will be fixed when you change your browser.
Excellent transmembrane domains
These are now calculated using the fine TMHMM2 program, kindly provided by Anders Krogh and co-workers.
Selective SMART
We now store intrinsic features such as transmembrane domains and signal peptides in our database. This means that these can be queried for (e.g. SIGNAL AND TyrKc). The feature names are: signal peptide, SIGNAL; transmembrane domain, TRANS; coiled-coil, COIL.
Technical changes
SMART code has been modified to run under the Apache mod_perl module. SMART is now using Apache::DBI for persistent database connections. The database engine has been updated to PostgreSQL 7.1. These changes resulted in significant speed improvements.
Bug fixes
As usual. The show_many_proteins script now uses the POST method, so there is no longer a limit on the number of proteins you can display. Fixed a display problem with proteins having thousands of different representations (if you try to display those, make sure you have a good browser (like Mozilla or Opera :-) and a bunch of RAM).
Changes from version 3.1 to 3.2
Start page
There is an indicator of the current database status in the header now. If the database is down or is being updated, you'll know immediately.
Literature
Changed literature identifiers to PMID. New secondary literature generator that parses all neighbouring papers, not just the first 100.
Numerous small bug fixes and improvements.
Changes from version 3.0 to 3.1
Startup page
The start page now includes selective SMART and allows searching for keywords in the annotation of domains.
Schnipsel Blast
The results of a schnipsel Blast search are included in the bubble diagram. Additionally, PDB is searched.
Taxonomic breakdown
When selecting multiple proteins, e.g. via selective SMART or the annotation pages, an overview of the taxonomy of all species ('tax break') is offered.
Links
Links don't contain version numbers. This allows stable links from external sources.
Selective SMART
Allows searching for multiple copies of domains and is case insensitive. You can now search for e.g. 'Sh3 AND sH3 AND sh3'.
Annotation
You can align your query sequence to the SMART alignment using hmmalign.
Update of underlying database
SMART now uses PostgreSQL 6.5.2.
Changes from version 2.0 to 3.0
Digest output
SMART now only produces a single diagram representing a best interpretation of all the annotation that has been performed. A comprehensive summary of the results is also provided in table format.
selective SMART
Selective SMART allows you to look for proteins with combinations of specific domains in different species or taxonomic ranges.
alert SMART
The SMART database gets updated about once a week. If you are interested in specific domains, or combinations of them, in specific taxonomic ranges, you can use the SMART alerting service. This provides the identities of newly-deposited proteins that match your query.
Domain queries
You can ask for proteins having the same domain order/composition as your query protein.
SMARTed Genomes
You can get the result of SMARTing the genomes of Caenorhabditis elegans and Saccharomyces cerevisiae via the annotation pages.
Faster PFAM searches
The PFAM searches now run on a PVM cluster.
Changes from version 1.03 to 2.0
Domain coverage
The original set of signalling domains has now been extended to include extracellular domains.
Default search method is now HMMer
The previously used underlying methodology of SMART (i.e. SWise) tended to over-extend gaps, leading to problems in defining domain borders. Although a SWise search of the SMART database remains possible (using the "Wise searching SMART database" option available from the Home Page), the default searching method is now HMMer2. This now allows improved statistical estimates (Expectation- or E-values) of the significance of a domain hit.
Complete rewrite of the annotation pages
As with the previous version (1.03), annotation pages provide information concerning domain functions and include hyperlinks to PubMed. However, we now offer automatically derived data showing the taxonomic range and the predicted cellular localisation of proteins containing the domain in question.
Literature database
In addition to the manually derived literature sources given in version 1.03, we now provide automatically derived secondary literature from the annotation pages. These are extracted from PubMed and are cross-linked to additional abstracts.
The CATH domain structure database: new protocols and classification levels give a more comprehensive resource for exploring evolution
Lesley H Greene, Tony E Lewis, Sarah Addou, Alison Cuff, Tim Dallman, Mark Dibley, Oliver Redfern, Frances Pearl, Rekha Nambudiry, Adam Reid, Ian Sillitoe, Corin Yeats, Janet M Thornton and Christine A Orengo
WHAT'S NEW
We report the latest release (version 3.0) of the CATH protein domain database (http://www.cathdb.info). There has been a 20% increase in the number of structural domains classified in CATH, up to 86 151 domains. Release 3.0 comprises 1110 fold groups and 2147 homologous superfamilies. To cope with the increases in diverse structural homologues being determined by the structural genomics initiatives, more sensitive methods have been developed for identifying boundaries in multi-domain proteins and for recognising homologues. The CATH classification update is now being driven by an integrated pipeline that links these automated procedures with validation steps that have been made easier by the provision of information-rich web pages summarising comparison scores and relevant links to external sites for each domain being classified. An analysis of the population of domains in the CATH hierarchy and several domain characteristics are presented for version 3.0. We also report an update of the CATH Dictionary of Homologous Structures (CATH-DHS), which now contains multiple structural alignments, consensus information and functional annotations for 1459 well-populated superfamilies in CATH. CATH is directly linked to the Gene3D database, which is a projection of CATH structural data onto ~2 million sequences in completed genomes and UniProt.
GENERAL INTRODUCTION
The number of new structures being deposited in the Protein Data Bank (PDB) continues to grow at a considerable rate. In addition, structures being targeted by worldwide structural genomics initiatives are more likely to be novel or only very remotely related to domains previously classified. Only 2% of structures currently solved by conventional crystallography or NMR are likely to adopt novel folds (see Figures 1 and 2). A higher proportion of new folds is expected among structures solved by structural genomics. Although the influx of more diverse structures and subsequent analysis will inform our understanding of how domains evolve, it has resulted in increasing lags between the numbers of structures being deposited and classified. In response to this situation, we have significantly improved our automated and manual protocols for domain boundary assignment and homologue recognition.
Figure 1 Annual decrease in the percentage of new structures classified in CATH which are observed to possess a novel fold. The raw data for years 1972–2005 was fit to a single exponential equation by nonlinear regression using Sigma Plot (SPSS Version
Figure 2 Annual proportion of protein structures deposited in the PDB which are classified in CATH, rejected or pending classification. The colour scheme reflects different categories of PDB chains. Black: not accepted by the CATH criteria. Red: unprocessed chains
Significant changes have been implemented in the CATH classification protocol to achieve a more highly automated system. A seamless flow of structures between the constituent programs has been achieved by building a pipeline which integrates web services for each major comparison stage in the classification (see Figure 3). Secondly, completely automatic decisions are now being made for new protein chains with close relatives already assigned in the CATH database. There are two situations that preclude the CATH update process from being fully automated. We rely on expert manual curation for particularly challenging protein domain boundary assignments (DomChop stage) and also for classifications of remote folds and homologues (HomCheck stage). These two manual stages will remain an integral part of the system (Figure 3).
Figure 3 Flow diagram of the CATH classification pipeline. This schematic illustrates the processes involved in classifying newly determined structures in CATH. The CATH update protocol workflow, from new chain to assigned domain, is split into two main processes
In this paper, we report our ongoing development of the automated procedures. These critical new features should better enable CATH to keep pace with the PDB (3) and facilitate its development. Key statistics on the domain structure populations and characteristics are also presented.
A REVISED CATH CLASSIFICATION HIERARCHY: CATHSOLID
In order to provide information on the sequence diversity between superfamily members, we have introduced additional levels into the CATH hierarchy. The CATH hierarchical classification scheme now consists of nine levels. Class is derived from secondary structure content, and Architecture describes the gross orientation of secondary structures, independent of connectivity. The Topology level clusters structures into fold groups according to their topological connections and numbers of secondary structures. The Homologous superfamilies cluster proteins with highly similar structures, sequences and/or functions (4,5). The new extension of the CATH classification system now includes five 'SOLID' sequence levels. S, O, L and I further divide domains within the H-level using multi-linkage clustering based on similarities in sequence identity (35%, 60%, 95% and 100%) (see Table 1). The D-level acts as a counter within the I-level and is appended to the classification hierarchy to ensure that every domain in CATH has a unique CATHsolid identification code (see Table 1). Specific details on the nature of the SOLID-levels can be found in the 'General Information' section of the CATH website, http://www.cathdb.info. CATH only includes experimentally determined protein structures with a resolution of 4 Å or better, 40 residues in length or longer, and having 70% or more side chains resolved.
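As a rough illustration of the nine-level code, the sketch below splits an identifier into its named levels. It assumes the code is written as nine dot-separated integers in the order C, A, T, H, S, O, L, I, D; the separator, the example identifier in the usage note and the function names parse_cathsolid and superfamily_of are assumptions for illustration, not part of the CATH software.

```python
# Level names in hierarchy order: four structural levels, then five SOLID levels.
LEVELS = ("C", "A", "T", "H", "S", "O", "L", "I", "D")

def parse_cathsolid(code):
    """Split a dot-separated CATHSOLID code into its nine named levels.

    Assumes one integer per level; raises ValueError otherwise.
    """
    parts = code.split(".")
    if len(parts) != len(LEVELS):
        raise ValueError(f"expected {len(LEVELS)} levels, got {len(parts)}")
    return dict(zip(LEVELS, (int(p) for p in parts)))

def superfamily_of(code):
    """Return the C.A.T.H prefix, i.e. the homologous superfamily."""
    return ".".join(code.split(".")[:4])
```

For a made-up identifier such as "3.40.50.200.1.1.1.2.1", the first four numbers would name the superfamily and the final number would be the D-level counter within its I-level cluster.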
Table 1 CATH version 3.0 statistics
DOMAIN BOUNDARY ASSIGNMENTS
We have further improved our automated domain boundary prediction method, CATHEDRAL (4). This is used to search a newly determined multi-domain structure against a library of representative structures from different fold groups in the CATH database to recognise constituent domains. CATHEDRAL performs an initial rapid secondary structure comparison between structures, using graph theory to identify putative fold matches, which are then more carefully aligned using a slower, more accurate dynamic programming method. A new scoring scheme has been implemented which combines information on the class of the domains being compared, their sizes, similarity in the structural environments and the number of equivalent residues. A support vector machine is used to combine the different scores and select the best fold match for each putative domain in a new multi-domain protein.
Benchmarking against a set of 964 'difficult' multi-domain chains, whose 1593 constituent domains were remotely related to folds in CATH (<35% sequence identity) and originated from 245 distinct fold groups and 462 superfamilies, showed that 90% of domains within these chains could be assigned to the correct fold group, and for 78% of them the domain boundaries were within ±15 residues of boundaries assigned by careful manual validation. Larger variations in domain boundaries are often due to the fact that, in many families, significant structural variation can occur during evolution so that distant relatives vary considerably in size. If no close relative has been classified in CATH, it is likely that the only CATHEDRAL match will be to a relative with significant structural embellishments, thus making it harder to determine the correct boundaries.
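The ±15 residue benchmark criterion can be made concrete with a short sketch. It assumes each domain is a single contiguous (start, end) segment and that predicted and reference domains are listed in the same order; the function name and this representation are illustrative assumptions, and real CATH domains may be discontinuous.

```python
def boundaries_within_tolerance(predicted, reference, tolerance=15):
    """Check whether predicted domain boundaries agree with reference ones.

    predicted, reference: lists of (start, end) residue positions, one per domain.
    Returns True only if the domain counts match and every start and end
    lies within `tolerance` residues of its reference position.
    """
    if len(predicted) != len(reference):
        return False
    return all(
        abs(p_start - r_start) <= tolerance and abs(p_end - r_end) <= tolerance
        for (p_start, p_end), (r_start, r_end) in zip(predicted, reference)
    )
```

A chain chopped at residue 120 by the predictor and at residue 110 by a curator would pass this check, while a 20-residue disagreement would not.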
Since domain boundary assignment of remote homologues is one of the most time-consuming stages in the classification, we combine multiple sources of information for each new structure on a web page to guide manual curation. Pages display scores from a range of algorithms, which include structure-based methods (CATHEDRAL (4), SSAP (6,7), DETECTIVE (8), PUU (9), DOMAK (10)), sequence-based methods such as hidden Markov models (HMMs) (11), and relevant literature. These pages are now viewable for information on putative boundaries for new multi-domain structures currently being classified in CATH (e.g. http://www.cathdb.info/cgi-bin/cath/Chain.pl?chain_id=2g3aA).
For protein chains which are closely related to chains that are already chopped in CATH, an automated protocol has been developed (ChopClose). ChopClose identifies any previously chopped chains that have sufficiently high sequence identity and overlap with the query chain. Using SSAP (6), the query is aligned against each of these chains in turn, and in each case the domain boundaries are inherited across the alignment. The process of inheriting the boundaries often requires some adjustments to be made to account for insertions, deletions or unresolved residues. If the inheritance from one of the chains meets various criteria (SSAP score ≥80, sequence identity >80%, RMSD ≤6.0 Å, longest end extension ≤10 residues, etc.), then the resulting boundaries are used to chop the chain automatically (AutoChop). For cases where ChopClose's best result does not meet all the criteria for automatic chopping, it is provided as support information for a manual domain boundary assignment.
Refer to Figure 3 for the location of AutoChop/ChopClose within the CATH update protocol.
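The automatic-chopping decision can be sketched as a simple threshold check. Only the four criteria named in the text are encoded; the text ends its list with "etc.", so the real protocol applies further checks that are not shown here. The class and function names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class InheritedChop:
    """Summary of boundaries inherited from one previously chopped chain."""
    ssap_score: float             # SSAP structural similarity score (0-100)
    sequence_identity: float      # percent identity to the chopped chain
    rmsd: float                   # RMSD of the superposition, in angstroms
    longest_end_extension: int    # residues extending past the aligned region

def passes_autochop(result):
    """Apply the four published thresholds (a partial, illustrative check)."""
    return (
        result.ssap_score >= 80
        and result.sequence_identity > 80
        and result.rmsd <= 6.0
        and result.longest_end_extension <= 10
    )
```

A result failing any one of the thresholds would, following the text, be passed on as supporting information for manual boundary assignment rather than used automatically.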
NEW HOMOLOGUE RECOGNITION METHODS
We have assessed a number of HMM-based protocols for improving homologue recognition. A new protocol (Samosa), exploiting models built using multiple structure alignments to improve accuracy, gives some improvements in sensitivity (4–5%). However, a protocol exploiting an 8-fold expanded HMM library based on sequence relatives of structural domains gives an increase of nearly 10% in sensitivity (12). In addition, HMM–HMM-based approaches have been implemented using the PRC protocol of Madera and co-workers (http://supfam.org/PRC). These allow recognition of extremely remote homologues, some of which are not easily detected by the structure comparison methods (discussed further below). HMM-based database scans developed for the CATH classification protocol are collectively referred to as HMMscan below.
For some very remotely related homologues, confidence in an assignment can be improved by combining information from multiple prediction methods. We have investigated the benefits of using machine learning methods to do this automatically. A neural network was trained using a dataset of 14 000 diverse homologues (<35% sequence identity) and 14 000 non-homologous pairs, with data from different homologue comparison methods, including structure comparison (CATHEDRAL, SSAP), sequence comparison (HMM–HMM) and information on functional similarity. The latter was obtained by comparing EC classification codes between close relatives of the distant homologues and using a semantic similarity scoring scheme for comparing GO terms based on a method developed by Lord et al. (13). On a separate validation set of 14 000 homologous pairs and 14 000 non-homologous pairs, 97% of the homologues can be recognised at an error rate of <4%.
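To show how figures such as "97% recognised at an error rate of <4%" might be computed on a validation set, here is a minimal sketch. The source does not define "error rate"; this sketch assumes it means the fraction of non-homologous pairs wrongly accepted, and the classifier scores and threshold are hypothetical inputs rather than outputs of the neural network described.

```python
def sensitivity_and_error_rate(homologous_scores, non_homologous_scores, threshold):
    """Evaluate a score threshold on labelled validation pairs.

    Returns (sensitivity, error_rate), where sensitivity is the fraction of
    homologous pairs scoring at or above the threshold, and error_rate is
    assumed here to be the fraction of non-homologous pairs doing the same.
    """
    true_positives = sum(s >= threshold for s in homologous_scores)
    false_positives = sum(s >= threshold for s in non_homologous_scores)
    sensitivity = true_positives / len(homologous_scores)
    error_rate = false_positives / len(non_homologous_scores)
    return sensitivity, error_rate
```

With balanced sets of 14 000 pairs each, as in the text, the reported figures would correspond to roughly 13 580 homologous pairs accepted and fewer than 560 non-homologous pairs accepted under this definition.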
NEW UPDATE PROTOCOL
Automatic methods
Previously, CATH data was generated using a group of independent programs and flat files. Over the past two years we have developed an update protocol for CATH that is driven by a suite of programs with a central library and a PostgreSQL database system. A classification pipeline has been established which links, in a completely automated fashion, the different programs that analyse the sequences and structures of both protein chains and domains. The CATH update protocol can essentially be divided into two parts: domain boundary assignment and domain homology classification (see Figure 3). The aim of the protocol is to minimise manual assignment and provide as much support as possible when manual validation is necessary.
Processing of both parts of the classification protocol is similar, requiring related meta-data and the triggering of the same automated algorithms. Methods include pairwise sequence similarity comparisons and scans by other homologue detection or fold recognition algorithms, such as HMMscan and CATHEDRAL, that provide data for either manual or automated assignment. Many of the automated steps in the protocol have been established as a web service, and the pipeline integrates automated steps together with 'holding stages' in which domains are held prior to processing and await the completion of manual validation of predictions (see below).
Web pages to support manual validation
For each manual stage in the classification (domain boundary assignment, DomChop, and homologue recognition, HomCheck), we have developed a suite of web pages bringing together all available meta-data from prediction algorithms (e.g. DBS, CATHEDRAL, HMMscan for DomChop; CATHEDRAL, HMMscan for HomCheck) and information from the literature and from other family classifications with relevant data (e.g. Pfam). For each protein or domain shown on the pages, information on the statistical significance of matches is presented. The web pages will shortly be made viewable and will provide interim data on protein chains and domains not fully classified in CATH for biologists interested in any entries pending classification.
OVERVIEW OF THE CURRENT RELEASE (VERSION 3.0)
Assigning domain boundaries and relationships between protein structures is computationally challenging. Since the last CATH release (version 2.6), the number of domains in the CATH database has increased by 20% in version 3.0 and now totals 86 151. This is a more than 10-fold increase in the number of domains classified in CATH since its creation. Improvements in automation, and also in the web-based resources used to aid manual validation, have allowed us to increase the proportion of hard-to-classify structures processed in CATH, and this is reflected in a significant increase in the proportion of new folds in the database, which now contains more than 1000 folds. The detailed breakdown of numbers of domains in the nine CATH levels is given in Table 1. We conducted an analysis of the domains in version 3.0 and have derived statistics for several fundamental features.
Percentage of new topologies
An analysis of the percentage of new folds arising from the early 1970s to the present is shown in Figure 1. The number of new folds has been decreasing over time with respect to the number of new structures being deposited, and it can be seen that currently approximately 2% of new structures classified in CATH are observed to be novel folds. For comparison, the number of domain structures solved over time is also graphically represented in Figure 1.
Number of domains within a protein chain
Integral to the construction of the CATH database is designating domain boundaries. We conducted an analysis of the number of chains versus the number of domains in a chain. It is interesting to note that 64% of all protein structures currently solved and classified in CATH are single-domain chains (data not shown). The next most prevalent are two-domain chains (27%), and following this we find that the number of chains containing three or more domains rapidly decreases. The average size of the single-domain chains is 159 residues.
The CATH Dictionary of Homologous Superfamilies (CATH-DHS)
The CATH-DHS has also been recently updated. Data on structural similarity and superfamily variability is presented as a significant update to the Dictionary of Homologous Superfamilies (DHS) web resource (14). The DHS also provides functional annotations of domains within each H-level (superfamily) in CATH v2.5.1.
For each superfamily, pair-wise structural similarity scores between relatives, measured by SSAP, are presented. The DHS now contains 3307 multiple structural alignments for 1459 superfamilies. For each superfamily, multiple alignments are generated for all the relatives and also for subgroups of structurally similar relatives and sequence-similar relatives. Alignments are performed using the residue-based CORA algorithm (15) and presented both as CORAPLOTS (14) and in the form of a 2DSEC diagram (16), alongside co-ordinate data of the superposed structures in PDB format. Sequence representations of the alignments are available to download in FASTA format. In the CORAPLOT images of the multiple alignment, residues in each domain are coloured according to ligand binding and residue type. EquivSEC plots are also shown that describe the variability in orientation and packing between equivalent secondary structures (16).
To identify sequence relatives for CATH superfamilies, sequences from UniProt (17) were scanned against HMMs of all CATH domains (12). Homologous sequences were identified as those hits with an E-value <0.01 and a 60% residue overlap with the CATH domain. This protocol recognised over one million domain sequences in UniProt which could be integrated in the CATH-DHS. The harvested sequences in each superfamily were compared against other relatives by BLAST (18) to determine the pair-wise sequence identity and then clustered at appropriate levels of sequence identity (35% and 95%) using multi-linkage clustering. Information and links to other functional databases, ENZYME (19), GO (Gene Ontology Consortium 2000), KEGG (20), COG (21) and SWISSPROT (22), are also included by BLASTing the sequences from each superfamily against sequences provided by these resources. Only hits with 95% sequence identity and an 80% residue overlap were used to annotate sequences.
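The two filters described above can be sketched as follows. The thresholds come from the text; how overlap is measured (here, aligned residues divided by the length of the CATH domain or database sequence) and whether the thresholds are inclusive are assumptions, and the function names are hypothetical.

```python
def is_homologous_hit(e_value, aligned_residues, domain_length,
                      max_e_value=0.01, min_overlap=0.60):
    """HMM scan filter: E-value below 0.01 and at least 60% residue overlap.

    Overlap is assumed to be aligned residues / CATH domain length.
    """
    overlap = aligned_residues / domain_length
    return e_value < max_e_value and overlap >= min_overlap

def can_transfer_annotation(percent_identity, aligned_residues, target_length,
                            min_identity=95.0, min_overlap=0.80):
    """BLAST filter for annotation: 95% identity and 80% residue overlap.

    Both thresholds are treated as inclusive minimums in this sketch.
    """
    overlap = aligned_residues / target_length
    return percent_identity >= min_identity and overlap >= min_overlap
```

The stricter second filter reflects the text's point that functional annotation is only inherited from near-identical sequences, whereas superfamily membership is assigned at a much more permissive level.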
Recent analysis of structural and functional divergence in highly populated CATH superfamilies (>5 structural relatives with <35% sequence identity) has been undertaken using data from the DHS. The 2DSEC algorithm was used to analyse multiple structural alignments of families and identify highly conserved structural cores and secondary structure embellishments or decorations to the common core. In some large superfamilies, extensive embellishments were observed outside the core, and although these secondary structure insertions were frequently discontinuous in the protein chain, they were often co-located in 3D space (16). In many cases, manual inspection revealed that the embellishment had aggregated to form a larger structural feature that was modifying the active site of the domain or creating new surfaces for domain or protein interactions. Data collected in the DHS clearly shows a relationship between structural divergence within a superfamily, sequence divergence of this superfamily amongst predicted domains in the genomes, and the number of distinct functional groups that can be identified for the superfamily (see Figure 4).
Figure 4 Relationship between sequence variability, structural variability and functional diversity in CATH superfamilies. Structural variation in a CATH superfamily as measured by the number of diverse structural subgroups (SSAP score <80 between groups)
LATERAL LINKS ACROSS THE CATH HIERARCHY TO CAPTURE EVOLUTIONARY DIVERGENCE AND EXPLORE THE STRUCTURAL CONTINUUM
Our analysis of structural divergence in CATH superfamilies (16) has revealed families where significant changes in the structures had occurred. In some cases, 5-fold differences in the sizes of domains were identified, and sometimes it was apparent that the 'folds' of these very diverse relatives had effectively changed. Therefore, in these superfamilies more than one fold group can be identified, effectively breaking the hierarchical nature of the CATH classification, which implies that each relative within a CATH homologous superfamily should belong to the same CAT fold group.
In addition, an 'all versus all' HMM–HMM scan between all superfamily representatives revealed several cases of extremely remote homologues which had been classified into separate superfamilies and yet match with significant E-values. Structure comparison had failed to detect the relationship between these superfamilies because the structural divergence of the relatives was so extreme, sometimes constituting a change in architecture as well as fold group. In these cases, homology was only suggested by the HMM-based scans and then manually validated by considering functional information and detailed evidence from the literature. In order to capture information on these distant homologies, links have been created between the superfamilies, both on our web pages and in the CATH database. The data can now be found as a link from the CATH homepage (http://www.cathdb.info).
In the near future, we also plan to provide web pages presenting cases of significant structural overlaps between superfamilies or fold groups. For these cases, we are not currently able to find any additional evidence to support a distant evolutionary relationship, and these examples highlight the recurrence of large structural motifs between some folds and the existence of a structural continuum in some regions of fold space.
FEATURES
The CATH database can be accessed at http://www.cathdb.info. The web interface may be browsed or, alternatively, searched with PDB codes or CATH domain identifiers. There is also a facility for keyword searches. With the version 3.0 release, we now make the raw and processed data files available, which include, for example, CATH domain PDB files, sequences and dssp files, and they can be accessed through the CATH database main page. The Gene3D resource can be accessed through the CATH database or directly at http://www.cathdb.info/Gene3D. The DHS can be accessed through the CATH database or directly at http://www.cathdb.info/bsm/dhs.
Structural similarity is assessed using an automatic method (SSAP) (3,4), which scores 100 for identical proteins and generally returns scores above 80 for homologous proteins. More distantly related folds generally give scores above 70 (Topology or fold level), though in the absence of any sequence or functional similarity this may simply represent examples of convergent evolution, reinforcing the hypothesis that there exists a limited number of folds in nature (5,6).
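The score ranges just described can be turned into a rough lookup. The cut-offs are the ones quoted in the text, which describes them as general tendencies rather than strict rules; the function name and the wording of the returned labels are illustrative choices.

```python
def interpret_ssap(score):
    """Map an SSAP score (0-100) to the rough interpretation given in the text."""
    if score == 100:
        return "identical"
    if score > 80:
        return "likely homologous"
    if score > 70:
        # May also reflect convergent evolution when no sequence or
        # functional similarity supports the relationship.
        return "similar fold"
    return "no clear structural relationship"
```

A score in the 70 to 80 band should therefore be read alongside sequence and functional evidence before any evolutionary relationship is inferred.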
Abstract
We report the latest release (version 1.4) of the CATH protein domains database (http://www.biochem.ucl.ac.uk/bsm/cath). This is a hierarchical classification of 13 359 protein domain structures into evolutionary families and structural groupings. We currently identify 827 homologous families in which the proteins have both structural similarity and sequence and/or functional similarity. These can be further clustered into 593 fold groups and 32 distinct architectures. Using our structural classification and associated data on protein functions stored in the database (EC identifiers, SWISS-PROT keywords and information from the Enzyme database and literature), we have been able to analyse the correlation between the 3D structure and function. More than 96% of folds in the PDB are associated with a single homologous family. However, within the superfolds, three or more different functions are observed. Considering enzyme functions, more than 95% of clearly homologous families exhibit either single or closely related functions, as demonstrated by the EC identifiers of their relatives. Our analysis supports the view that determining structures, for example as part of a 'structural genomics' initiative, will make a major contribution to interpreting genome data.
Introduction FOR CATH
The CATH classification of protein domain structures was established in 1993 (1) as a hierarchical clustering of protein domain structures into evolutionary families and structural groupings, depending on sequence and structure similarity. There are four major levels, corresponding to protein class, architecture, topology or fold, and homologous family (Fig. 1). Since 1995, information about these structural groups and protein families has been accessible over the Web (http://www.biochem.ucl.ac.uk/bsm/cath), together with summary information about each individual protein structure (PDBsum) (2).
CATH consists of both phylogenetic and phenetic descriptors for protein domain relationships. At the lowest levels in the hierarchy, proteins are grouped into evolutionary families (Homologous families) for having either significant sequence similarity (≥35% identity) or high structural similarity and some sequence similarity (≥20% identity).
The CATH database of protein structures contains approximately 18 000 domains organized according to their (C)lass, (A)rchitecture, (T)opology and (H)omologous superfamily. Relationships between evolutionarily related structures (homologues) within the database have been used to test the sensitivity of various sequence search methods in order to identify relatives in GenBank and other sequence databases. Subsequent application of the most sensitive and efficient algorithms, gapped BLAST and the profile-based method Position-Specific Iterated Basic Local Alignment Search Tool (PSI-BLAST), could be used to assign structural data to between 22% and 36% of microbial genomes in order to improve functional annotation and enhance understanding of biological mechanism. However, on a cautionary note, an analysis of functional conservation within fold groups and homologous superfamilies in the CATH database revealed that, whilst function was conserved in nearly 55% of enzyme families, function had diverged considerably in some highly populated families. In these families, functional properties should be inherited far more cautiously, and the probable effects of substitutions in key functional residues must be carefully assessed.
Figure 1
Schematic representation of the (C)lass, (A)rchitecture and (T)opology/fold levels in the CATH database.
Figure 2
Snapshot of a web page showing data available in the CATH dictionary of homologous superfamilies for the subtilisin family (CATH id 3.40.50.200). Tables display the PDB codes for non-identical relatives in the family, together with EC identifier codes and information about the enzyme reactions. The multiple structural alignment shown has been coloured according to secondary structure assignments (red for helix, blue for strands).
The Architecture level in CATH groups proteins whose folds have similar 3D arrangements of secondary structures (e.g. barrel, sandwich or propellor), regardless of their connectivity, whilst the top level, Class, simply reflects the proportion of α-helix or β-strand secondary structures. Three major classes are recognised, mainly-α, mainly-β and α−β, since analysis revealed considerable overlap between the α+β and alternating α/β classes originally described by Levitt and Chothia (7).
Before classification, multidomain proteins are first separated into their constituent folds using a consensus method which seeks agreement between three independent algorithms (8). Whilst the protocol for updating CATH is largely automatic (9), several stages require manual validation, in particular establishing domain boundaries in proteins for which no consensus could be reached and checking the relationships of very distant homologues and proteins having borderline fold similarity. Although there are plans to assign the more regular architectures automatically, all architecture groupings are currently assigned manually.
A homologous family Dictionary is now available within CATH, which contains functional data, where available, for each protein within a homologous family. This includes EC identifiers, SWISS-PROT keywords and information from the Enzyme database or the literature (Fig. 2). Multiple structure-based alignments are also available, coloured according to secondary structure assignments or residue properties, and there are schematic plots showing domain representations annotated by protein-ligand interactions (DOMPLOTS) (A.E.Todd, C.A.Orengo and J.M.Thornton, submitted to Protein Engng). The topology of each domain is illustrated by schematic TOPS diagrams (http://www3.ebi.ac.uk/tops; 10).
Figure 3
CATH wheel plot showing the population of homologous families in different fold groups, architectures and classes. The wheel is coloured according to protein class (red, mainly-α; green, mainly-β; yellow, αβ; blue, few secondary structures). The size of the outer wheel represents the number of homologous families in CATH, whilst each band in the outer wheel corresponds to a single fold family. The size of each fold 'band' therefore reflects the number of homologous families having that fold. It can be seen that most fold families contain a single homologous family. The superfold families are shown as paler bands containing many homologous families. The inner wheel shows the population of homologous families in the different architectures.
We have also recently set up a Web Server (11) which enables the user to scan the CATH database with a newly determined protein structure and identify possible fold similarities or evolutionary relationships. There are also plans to incorporate sequence searches (using BLAST or PSI-BLAST) (12) to identify a probable fold for a new sequence.
The latest release of CATH (version 1.4, April 1998) contains 9342 protein chains from the PDB (13), which divide into 13 359 domain folds. Currently 32 different architectures are recognised. Since the last release, three new architectures have been described, including the five-bladed α-β propellor. Grouping proteins on the basis of sequence, structure and functional similarity gives 827 evolutionary homologous families (H-level), whilst recognising more distant structural similarity with no accompanying sequence or function similarity gives rise to 593 different fold groups (T-level).
The population of the different levels in the CATH hierarchy is illustrated by the CATH wheel shown in Figure 3. It can be seen that several highly populated fold families, which we describe as superfolds (6) as they support a diverse range of sequences and more than three different functions, still account for nearly 30% of non-homologous structures.
Implications for Structural Genomics
As the sequence databases grow rapidly, the need to interpret these sequences and assign functions to
specific genes becomes increasingly important. Many techniques exist for matching protein sequences and thereby inheriting functional information. However, for very distant homologues there is often no
detectable sequence similarity, despite conservation of 3D structure and function. For these cases
evolutionary relationships, and thereby functions, can only be assigned by comparing the structures.
Therefore a number of structural genomics initiatives are being proposed (14) which aim to identify
all the folds in nature, with the ultimate goal of being able to predict the function of a new protein from
its known or probable structure. The important questions to ask are: how many more folds do we need
to determine before we have the complete set, and how confident can we be in assigning function between proteins having similar structures?
In the current genomes, on average only 30–46% of sequences can be assigned to a structural family
by recognising sequence similarity to a protein of known structure (15,16). With only ~600 unique
structures currently in the PDB, compared with ~20 000 sequence families, it is clear that we still
need to determine many more structures if we are to understand biology at the molecular level.
However, analysis of recently deposited structural data is very revealing. Figure 4a illustrates the
distribution of 2159 new structural domains classified in the 10 months from June 1997 to March
1998. A large proportion of these (79%) were clearly homologous (≥30% identity) to proteins of known structure.
Of the remaining 443 structures (Fig. 4b), corresponding to 'new' sequences, we found only 8% were
novel folds, the remainder resembling a previously determined structure. Many of these, 199 (45%),
could be identified as clear homologues by having significant structure and sequence similarity (SSAP
≥80 and ≥20% sequence identity). A further 169 (38%) were probable homologues: although the sequence identity was below 20%, they had functional similarity and/or gave significant scores using
sequence search methods designed to detect very distant homologues (PSI-BLAST) (12). There
remained a further 40 (9%) proteins which were analogous, i.e. they had the same fold as a previous
entry but neither the sequence nor the function gave definite evidence of a common ancestor.
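The percentages quoted for this survey can be cross-checked from the counts given in the text. The short sketch below is only an arithmetic check; the count of novel folds is not stated in the text and is derived here as the remainder, which is an assumption about how the categories partition the 443 structures.

```python
# Counts quoted in the text for the June 1997 to March 1998 survey.
total_domains = 2159
new_sequences = 443
clear_homologues = 199
probable_homologues = 169
analogues = 40

# Assumption: the four categories partition the 443 'new' structures,
# so the novel folds are whatever is left over.
novel_folds = new_sequences - clear_homologues - probable_homologues - analogues


def pct(part, whole):
    """Percentage of `whole` represented by `part`, rounded to an integer."""
    return round(100 * part / whole)


summary = {
    "homologous to known structure": pct(total_domains - new_sequences, total_domains),
    "clear homologues": pct(clear_homologues, new_sequences),
    "probable homologues": pct(probable_homologues, new_sequences),
    "analogues": pct(analogues, new_sequences),
    "novel folds": pct(novel_folds, new_sequences),
    "homologues found via structure": pct(clear_homologues + probable_homologues, new_sequences),
}

for label, value in summary.items():
    print(f"{label}: {value}%")
```

The combined figure for clear and probable homologues is the share of 'new' sequences that could be assigned as homologues once the structure was known.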
At the homologous superfamily level in CATH, a more detailed analysis of enzyme functions showed
that the majority of homologous enzyme families in CATH (>90%) contained proteins for which the
first three EC identifiers were the same. Considering those families where homologues have
significant sequence identity (≥20%) after structural alignment, 95% were found to have a single EC
identifier, whilst for families where proteins have more than 30% sequence similarity we observed
that 98% had a single EC code.
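The test "the first three EC identifiers are the same" can be sketched as a comparison of EC number prefixes. This is a minimal illustration, not the procedure used in the CATH analysis; the function names are hypothetical, and the EC numbers in the example are chosen only to show the comparison, not taken from any CATH family.

```python
def ec_prefix(ec_number, levels=3):
    """Return the first `levels` fields of a dotted EC number as a tuple."""
    return tuple(ec_number.split(".")[:levels])


def share_ec_prefix(ec_numbers, levels=3):
    """True if every EC number in the collection agrees on its first `levels` fields."""
    return len({ec_prefix(ec, levels) for ec in ec_numbers}) == 1


# Illustrative EC numbers only (alcohol dehydrogenase, L-lactate dehydrogenase, trypsin).
same_at_three = share_ec_prefix(["1.1.1.1", "1.1.1.27"])            # agree on 1.1.1
same_at_four = share_ec_prefix(["1.1.1.1", "1.1.1.27"], levels=4)   # differ in the last field
different = share_ec_prefix(["1.1.1.1", "3.4.21.4"])                # differ in the first field
print(same_at_three, same_at_four, different)
```

Setting `levels=4` corresponds to the stricter "single EC code" criterion mentioned for families with higher sequence identity.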
Although assigning function on the basis of homology is common practice, it is clear that some
caution should be exercised, particularly where there is little or no sequence similarity. There are also
some clear examples where homologues with significant sequence similarity perform different
functions. The role of 'gene recruitment' is especially clear in the eye lens proteins, which function as
enzymes in other cellular environments but which are used as structural proteins in this context (17).
The extent of such 'gene recruitment' and context-sensitive function is really not known at this time.
For enzymes it is clear that catalytic function can change and evolve, usually to act on a different but
related substrate. Similarly, within the lipocalin family (CATH id 24013010) several proteins are found with very similar structures which bind different fatty acids in the same region at the base of
the β-barrel (e.g. retinol, bilin, biotin).
Nearly half of the homologous families where two or more different EC numbers were observed
belong to the superfolds. This suggests that if a new protein is assigned to a superfold family, more
caution should be used when inheriting functional information, as there appears to be greater tolerance to changes in sequence, and ultimately function, for these families. However, it is interesting to note
that many of these were TIM barrel or Rossmann folds. These are superfolds in which the substrate or
ligand commonly binds in the same place: in the base of the β-barrel for the TIMs and at the
crossover of the polypeptide chain for the doubly wound Rossmann structures.
Assignment of Function Through Structure
One of the reasons for determining structures is to derive more information to facilitate the assignment of function. From our analysis of proteins in CATH we suggest that structural data can help to assign
function in several ways:
i. The structural data allow recognition of more distant homologues compared with sequence data. In our analysis 83% of structures with novel sequences could be assigned as homologues in this way (note that such assignment of function is again subject to the caveats
imposed by 'gene recruitment' discussed above).
ii. The structural data allow detailed inspection of the functional site, to suggest if and
how the function may have evolved. For example, if an enzyme has evolved to act on a
different substrate, the binding site may reveal, or at least suggest, possible changes in the
substrate.
iii. For the superfolds, similarity of structure does not necessarily mean similarity of
function. However, the active site/binding sites are often conserved, e.g. in the TIM barrel or
Rossmann fold structures the ligand always binds at the same end of the barrel or sheet.
iv. Some methods have already been developed, and will increasingly be the focus of attention over the next few years, which aim to predict function ab initio from structure. For
example, enzymes can often be identified by the presence of a major cleft which also locates
the active site (18). Similarly, critical surface patches which are used for molecular
recognition in binding other proteins or ligands may be identified using knowledge-based approaches (19,20).
In summary, extrapolating the data from Figure 4 to a new genome, we can expect that of the 54–70%
of sequences which currently have no obvious sequence matches in the PDB, we will find nearly 80–90% to be homologous to a known family using the structural data alone. For the singlet folds this
will almost certainly reveal some clues to the function. For the superfolds, some folds will reveal
information on the functional class (e.g. enzyme for TIM barrels) or the location of the active site, if
not the specific function. Only 10–20% will be expected to be 'novel' folds. For these, the ab
initio methods referred to above may provide some clues to guide experiments. Therefore it is clear
that determining structures, as part of a 'structural genomics' initiative for example, will make a
major contribution to interpreting genome data.
Jail
Just another interface library
Interfaces of macromolecules are a valuable basis to analyse the process of molecular recognition. JAIL classifies not only the interfaces between domain architectures but also those between protein chains and those between proteins and nucleic acids.
Figure: Gt-alpha/Gi-alpha chimera (PDB-ID 1GOT) and the interfaces of 1GOT
Interacting proteins are difficult to crystallize and are rarely present within the Protein Data Bank.
Nevertheless, it is essential to analyse the interacting parts of the proteins to understand the
process of protein-protein docking.
To overcome this problem we have built up the JAIL database. Since interacting domains
exhibit structural features similar to those of proteins, all known interfaces between interacting
domains of the SCOP database were extracted and classified in JAIL.
Only a part of all protein structures are included in SCOP. In particular, new PDB entries are
not yet annotated. To overcome this problem, all interfaces between protein
chains were additionally calculated and included in the database. This type of interface also comprises
the interacting parts of the assumed biological units. The last important type of interface
provided here is composed of the interacting parts between proteins and nucleic acids.
Overall, the data set consists of about 180000 interfaces.
JAIL is a convenient tool to browse through the interface library and to analyze single
interfaces. However, more general questions require large-scale analysis. For this purpose,
a detailed form enables the compiling of comprehensive non-redundant data sets for
download.
How is an interface defined?
A complete residue is part of an interface if at least one atom of the amino acid is located
within a range of 4.5 Angstroem of any atom of the interacting domain or chain. One part of
an interface must consist of at least 5 C-alpha atoms in the case of protein chains. In the case of nucleic acids the relevant atoms are calculated on the basis of the phosphorus atoms
of the RNA/DNA backbone.
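The residue-level definition above can be sketched in a few lines. This is a minimal illustration, not JAIL's implementation: the `Atom` type and function names are hypothetical, the cutoff default assumes the extracted "45 Angstroem" stands for 4.5 Å, and "within a range of" is assumed to include the boundary. It uses a plain all-against-all distance check, which is adequate for small examples only.

```python
import math
from typing import NamedTuple


class Atom(NamedTuple):
    residue: str   # residue label, e.g. "A:42" (hypothetical format)
    name: str      # atom name, e.g. "CA" for C-alpha
    xyz: tuple     # (x, y, z) coordinates in Angstrom


def interface_residues(side, partner, cutoff=4.5):
    """Residues of `side` with at least one atom within `cutoff` of any `partner` atom."""
    return {
        a.residue
        for a in side
        if any(math.dist(a.xyz, b.xyz) <= cutoff for b in partner)
    }


def is_interface_part(side, partner, cutoff=4.5, min_ca=5):
    """Apply the minimum-size rule: at least `min_ca` C-alpha atoms on this side."""
    residues = interface_residues(side, partner, cutoff)
    n_ca = sum(1 for a in side if a.name == "CA" and a.residue in residues)
    return n_ca >= min_ca


# Toy coordinates: residue A:1 touches the partner through its CB atom only.
side = [
    Atom("A:1", "CA", (0.0, 0.0, 0.0)),
    Atom("A:1", "CB", (3.0, 0.0, 0.0)),
    Atom("A:2", "CA", (20.0, 0.0, 0.0)),
]
partner = [Atom("B:1", "CA", (7.0, 0.0, 0.0))]

print(interface_residues(side, partner))
print(is_interface_part(side, partner))
```

Note that the whole residue A:1 is reported even though only one of its atoms lies within the cutoff, which is what "a complete residue is part of an interface" implies.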
What are biounits?
The primary coordinate file deposited in the PDB generally contains one asymmetric unit.
The asymmetric unit is the smallest portion of a crystal structure to which crystallographic
symmetry can be applied to generate one unit cell. The biological molecule (biounit) is believed
to be the functional unit of the protein. Frequently those units can be assumed or calculated when additional information is available. The biological units of many proteins are deposited
in a separate section of the PDB database and can be used for interface calculations.
In which way are redundant interfaces excluded in the download section?
The redundancy is excluded in two different ways: by structure and by sequence. The
sequential clustering is based on the CD-HIT program. The structural clustering is defined by
the protein families and superfamilies of the SCOP classification. The database classifies
proteins by domain architecture.
Which settings in the download section are best for my own research?
The selection of the datasets depends on the type of interactions (protein-protein or protein-nucleic
acids) and the level of diversity that is desired. A maximal sequence identity of 50%
results in a higher diversity than a setting of 95%. The default settings include interfaces of
domain-domain interactions as well as interfaces between interacting chains. All interfaces of
chains that were already treated by the SCOP domain interfaces are excluded by default.
This procedure results in a high number of interfaces that are still diverse enough for
statistical analysis.
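The effect of the identity threshold can be illustrated with a toy greedy clustering. This sketch is not the CD-HIT program and does not reproduce its algorithm details; the identity measure (ungapped, position by position over the shorter sequence), the function names and the example sequences are all assumptions made for illustration.

```python
def identity(a, b):
    """Fraction of identical positions over the shorter sequence (toy, ungapped)."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / n


def greedy_representatives(sequences, threshold):
    """Keep a sequence only if it is below `threshold` identity to every kept one.

    Sequences are visited longest first, so each cluster is represented by
    its longest member.
    """
    representatives = []
    for seq in sorted(sequences, key=len, reverse=True):
        if all(identity(seq, rep) < threshold for rep in representatives):
            representatives.append(seq)
    return representatives


# Made-up sequences: the second is 90% and the third 50% identical to the first.
sequences = ["ACDEFGHIKL", "ACDEFGHIKV", "ACDEFYYYYY", "WWWWWWWWWW"]

at_95 = greedy_representatives(sequences, 0.95)
at_50 = greedy_representatives(sequences, 0.50)
print(len(at_95), at_95)
print(len(at_50), at_50)
```

A lower threshold keeps fewer representatives, but the ones that remain are more dissimilar from one another, which is the sense in which the 50% setting gives higher diversity than 95%.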
What is meant by 'show conservation' in Jmol?
The conservation of protein sequences is defined by the mutation rates at each amino acid
position. For JAIL this information was retrieved from ConSurf. ConSurf is a derived
database merging structural and sequence information.
Database scheme
Search for:
PDB-Id, e.g. 1aay
SCOP-Id, e.g. d1az0a_
EC-Number, e.g. 311
Accession number, e.g. P03697
Protein name, e.g. capsid protein
Search in the following interface types:
Domain/Domain (SCOP)
Chain/Chain
Protein/Nucleic
Biounit/Biounit
Full-text search by keyword
SCOP text search by keyword
Consider only interfaces having the following location: intra, inter, don't care
MMDB
Experimentally resolved structures of proteins, RNA and DNA, derived from the Protein Data Bank (PDB), with value-added features such as explicit chemical graphs, computationally
identified 3D domains (compact substructures) that are used to identify similar 3D structures, as well as links to literature, similar sequences, information about chemicals bound to the structures, and more. These connections make it possible, for example, to find 3D structures for homologs of a protein sequence of interest, then interactively view the sequence-structure relationships, active sites, bound chemicals, journal articles and more.
Three-dimensional structures are now known within many protein families, and it is
quite likely, in searching a sequence database, that one will encounter a homolog
with known structure. The goal of Entrez's 3D-structure database is to make this
information, and the functional annotation it can provide, easily accessible to molecular biologists. To this end Entrez's search engine provides three powerful
features: (i) Sequence and structure neighbors: one may select all sequences similar
to one of interest, for example, and link to any known 3D structures. (ii) Links
between databases: one may search by term matching in MEDLINE, for example, and
link to 3D structures reported in these articles. (iii) Sequence and structure
visualization: identifying a homolog with known structure, one may view molecular-graphic
and alignment displays to infer approximate 3D structure. In this article we
focus on two features of Entrez's Molecular Modeling Database (MMDB) not
described previously: links from individual biopolymer chains within 3D structures to a systematic taxonomy of organisms represented in molecular databases, and
links from individual chains (and compact 3D domains within them) to structure
neighbors, other chains (and 3D domains) with similar 3D structure. MMDB may be
accessed at http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Structure
SUPERFAMILY is a database of structural and functional annotation for all
proteins and genomes.[1][2][3][4][5]
The SUPERFAMILY annotation is based on a collection of hidden Markov models which
represent structural protein domains at the SCOP superfamily level.[6]
A superfamily groups
together domains which have an evolutionary relationship. The annotation is produced by
scanning protein sequences from completely sequenced genomes against the hidden Markov
models.
For each protein you can:
Submit sequences for SCOP classification
View domain organisation, sequence alignments and protein sequence details
For each genome you can:
Examine superfamily assignments, phylogenetic trees, domain organisation lists and
networks
Check for over- and under-represented superfamilies within a genome
For each superfamily you can:
Inspect SCOP classification, functional annotation, Gene Ontology annotation, InterPro
abstract and genome assignments
Explore taxonomic distribution of a superfamily across the tree of life
All annotation, models and the database dump are freely available for download to everyone.
Purpose
SUPERFAMILY classifies amino acid sequences into known structural domains, especially
into SCOP superfamilies. The superfamilies are groups of proteins which have structural
evidence to support a common evolutionary ancestor but may not have detectable
sequence homology.
Major Features
Sequence search: Submit your protein or DNA sequence for SCOP superfamily and family
level classification
Keyword search: Search for superfamily, family or species names plus
sequence, SCOP, PDB or hidden Markov model IDs
Domain assignments: Domain assignments, alignments and architectures for completely
sequenced eukaryotic and prokaryotic organisms plus sequence
collections
Comparative genomics tools: Browse unusual (over- and under-represented) superfamilies and families, adjacent domain pair lists and graphs, unique domain pairs, domain
combinations, domain architecture co-occurrence networks and domain
distribution across taxonomic kingdoms for each organism
Genome statistics: For each genome: number of sequences, number of sequences with assignment, percentage of sequences with assignment, percentage total
sequence coverage, number of domains assigned, number of superfamilies assigned, number of families assigned, average superfamily size,
percentage produced by duplication, average sequence length, average length matched, number of domain pairs and number of unique domain
architectures
Gene Ontology: Domain-centric Gene Ontology (GO) automatically annotated by Hai Fang
Phenotype Ontology: Domain-centric phenotype/anatomy ontology including Disease
Ontology, Human Phenotype, Mouse Phenotype, Worm Phenotype, Yeast
Phenotype, Fly Phenotype, Fly Anatomy, Zebrafish Anatomy, Xenopus Anatomy, Arabidopsis Plant
Superfamily annotation: InterPro abstracts for 1052 superfamilies and Gene Ontology (GO)
annotation for 763 superfamilies
Functional annotation: Functional annotation of SCOP 1.73 superfamilies by Christine Vogel
Phylogenetic trees: Trees are generated using heuristic parsimony methods and are based on
protein domain architecture data for all genomes in SUPERFAMILY.
Genome combinations or specific clades can be displayed as individual
trees
Similar domain architectures: Find the 10 domain architectures which are most similar to a domain architecture of interest
Hidden Markov models: Produce SCOP domain assignments for your sequences using the SUPERFAMILY models. HMM visualisation by Martin Madera,
e.g. model 0045110
Profile comparison: Find remote domain matches when the HMM search fails to find a significant match. Profile comparison (PRC) for aligning and scoring two
profile hidden Markov models by Martin Madera
Web services: Distributed Annotation Server and linking to SUPERFAMILY
Downloads: Sequences, assignments, models, MySQL database and scripts, updated weekly
The SUPERFAMILY database provides protein domain assignments, at the SCOP superfamily
level, for the predicted protein sequences in over 400 completed genomes. A superfamily groups
together domains of different families which have a common evolutionary ancestor based on
structural, functional and sequence data. SUPERFAMILY domain assignments are generated
using an expert curated set of profile hidden Markov models. All models and structural
assignments are available for browsing and download from http://supfam.org. The web interface
includes services such as domain architectures and alignment details for all protein assignments,
searchable domain combinations, domain occurrence network visualization, detection of over- or
under-represented superfamilies for a given genome by comparison with other genomes,
assignment of manually submitted sequences and keyword searches. In this update we describe
the SUPERFAMILY database and outline two major developments: (i) incorporation of family
level assignments and (ii) a superfamily-level functional annotation. The SUPERFAMILY
database can be used for general protein evolution and superfamily-specific studies, genomic
annotation, and structural genomics target suggestion and assessment.
Primary Sequence DBs (collaborative project)
DDBJ (DNA DataBase of Japan)
EMBL Nucleotide DB (European Molecular Biology Laboratory)
GenBank (National Center for Biotechnology Information)
Meta-DBs
Entrez Gene Unified retrieval of gene-centred information (NCBI)
euGenes Assembled information on eukaryotic genomes (Univ of Indiana)
GeneCards (Weizmann Inst)
GenLoc UDB (Weizmann Inst)
SOURCE (Univ of Stanford)
LocusLink (National Center for Biotechnology Information)
Genome Annotation Systems
Ensembl Genome Browser Automatically Annotated Genomes (EMBL-EBI and Wellcome Trust Sanger Inst)
UniGene Automatic partitioning of GenBank sequences (NCBI)
Golden Path UCSC (Univ of California Santa Cruz)
Specialized DBs
CGAP Cancer Genes (National Cancer Institute)
Clone Registry Clone Collections (National Center for Biotechnology Information)
IMAGE Clone Collections (Image Consortium)
DBGET Hsapiens retrieval system (Univ of Kyoto)
DIP Interacting Proteins (Univ of California)
GDB (Human Genome Organization)
KEGG Functional Db (Univ of Kyoto)
MGI Mouse Genome (Jackson Lab)
OMIM Inherited Diseases (National Center for Biotechnology Information)
SWISS-PROT Protein Db (Swiss Institute of Bioinformatics)
PEDANT Protein Db (Forschungszentrum f. Umwelt & Gesundheit)
List with SNP-Databases
Reactome The Genome Knowledgebase (EBI)
Microarray-DBs
ArrayExpress (European Bioinformatics Institute)
Gene Expression Omnibus (National Center for Biotechnology Information)
maxd (Univ of Manchester)
SMD (Univ of Stanford)
Accession codes vs identifiers
Many databases in bioinformatics (SWISS-PROT, EMBL, GenBank, Pfam) use a
system where an entry can be identified in two different ways. Basically, it has two
names:
Identifier
Accession code (or number)
The question of how to deal with changed, updated and deleted entries in databases is
a very tricky problem, and the policies for how accession codes and identifiers are
changed or kept constant are not completely consistent between databases, or even
over time for one single database.
The exact definition of what the identifier and accession code are supposed to denote
varies between the different databases, but the basic idea is the following.
Identifier
An identifier ("locus" in GenBank, "entry name" in SWISS-PROT) is a string of
letters and digits that generally is interpretable in some meaningful way by a human,
for instance as a recognizable abbreviation of the full protein or gene name.
SWISS-PROT uses a system where the entry name consists of two parts: the first
denotes the protein and the second part denotes the species it is found in. For
example, KRAF_HUMAN is the entry name for the Raf-1 oncogene from Homo
sapiens.
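The two-part structure of a SWISS-PROT entry name can be shown with a small parser. This is a minimal sketch that assumes the two parts are separated by a single underscore, as in the KRAF_HUMAN example; the function name is hypothetical and no further validation rules of the database are implied.

```python
def split_entry_name(entry_name):
    """Split an entry name such as 'KRAF_HUMAN' into (protein, species) parts."""
    protein, separator, species = entry_name.partition("_")
    if not separator or not protein or not species:
        raise ValueError(f"not a two-part entry name: {entry_name!r}")
    return protein, species


protein, species = split_entry_name("KRAF_HUMAN")
print(protein, species)
```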
An identifier can usually change. For example, the database curators may decide
that the identifier for an entry is no longer appropriate. However, this does not
happen very often. In fact, it happens so rarely that it's not really a big problem.
Accession code (number)
An accession code (or number) is a number (possibly with a few characters in
front) that uniquely identifies an entry in its database. For example, the accession
code for KRAF_HUMAN in SWISS-PROT is P04049.
The main conceptual difference from the identifier is that it is supposed to be stable:
any given accession code will, as soon as it has been issued, always refer to that
entry or its ancestors. It is often called the primary key for the entry. The
accession code, once issued, must always point to its entry, even after large changes
have been made to the entry. This means that in discussions about specific database
entries (e.g. an article about a specific protein), one should always give the
accession code for the entry in the relevant database.
In the case where two entries are merged into a single one, the new entry
will have both accession codes, where one will be the primary and the other
the secondary accession code. When an entry is split into two, both new entries
will get new accession codes, but will also have the old accession code as a secondary
code.
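The merge and split rules can be modelled with a small data structure. This is a sketch under stated assumptions, not any database's actual policy code: the `Entry` class and helper names are hypothetical, the accession codes in the example are made up, and the choice of which merged entry supplies the primary code is left to the caller because the text does not specify it.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    primary: str
    secondary: list = field(default_factory=list)

    def accessions(self):
        """All accession codes that resolve to this entry, primary first."""
        return [self.primary, *self.secondary]


def merge(kept, absorbed):
    """Merge two entries: `kept` supplies the primary code, the rest become secondary."""
    return Entry(kept.primary, kept.secondary + absorbed.accessions())


def split(old, new_primary_a, new_primary_b):
    """Split an entry: both results get new primary codes and inherit the old codes."""
    inherited = old.accessions()
    return Entry(new_primary_a, list(inherited)), Entry(new_primary_b, list(inherited))


# Made-up accession codes, for illustration only.
merged = merge(Entry("X00001"), Entry("X00002"))
part_a, part_b = split(Entry("X00003"), "X00004", "X00005")
print(merged.accessions())
print(part_a.accessions(), part_b.accessions())
```

In both cases every code that was ever issued still resolves to an entry, which is the stability property the accession code is meant to provide.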
NUCLEOTIDE DATABASES
NCBI's sequence databases accept genome data from sequencing projects
from around the world and serve as the cornerstone of bioinformatics
research.
GenBank An annotated collection of all publicly available nucleotide and amino acid
sequences
EST database A collection of expressed sequence tags or short single-pass
sequence reads from mRNA (cDNA)
GSS database A database of genome survey sequences or short single-pass
genomic sequences
HomoloGene A gene homology tool that compares nucleotide sequences between
pairs of organisms in order to identify putative orthologs
HTG database A collection of high-throughput genome sequences from large-scale
genome sequencing centers including unfinished and finished sequences
SNPs database A central repository for both single-base nucleotide substitutions
and short deletion and insertion polymorphisms
RefSeq A database of non-redundant reference sequence standards, including
genomic DNA contigs, mRNAs and proteins for known genes. Multiple collaborations,
both within NCBI and with external groups, support our data-gathering efforts
STS database A database of sequence tagged sites or short sequences that are
operationally unique in the genome
UniSTS A unified non-redundant view of sequence tagged sites (STSs)
UniGene A collection of ESTs and full-length mRNA sequences organized into
clusters each representing a unique known or putative human gene annotated with
mapping and expression information and cross-references to other sources
DNA & RNA Databases
Major Sequence Repositories – Human Chromosome Information – Organelle
Genome Databases – RNA Databases – Comparative & Phylogenetic Databases –
SNPs, Mutations and Variations Databases – Alternative Splicing Databases –
Specialized Databases
Major Sequence Repositories
DDBJ DNA databank of Japan
EMBL Maintained by EMBL
GenBank Maintained by NCBI
Human Chromosome Information
Chromosomes 1–22, X and Y
Organelle Genome Databases
OGMP Organelle genome megasequencing program
GOBASE An organelle genome database
MitoMap Human mitochondrial genome database
RNA Databases
Rfam RNA family database
RNA base Database of RNA structures
tRNA database Database of tRNAs
tRNA tRNA sequences and genes
sRNA Small RNA database
Comparative & Phylogenetic Databases
COG Phylogenetic classification of proteins
DHMHD Human-mouse homology database
HomoloGene Gene homologies across species
Homophila Human disease to Drosophila gene database
HOVERGEN Database of homologous vertebrate genes
TreeBase A database of phylogenetic knowledge
XREF Cross-referencing with model organisms
SNPs, Mutations & Variations Databases
ALPSbase Database of mutations causing human ALPS
dbSNP Single nucleotide polymorphism database at NCBI
HGVbase Human Genome Variation database
Alternative Splicing Databases
ASAP Alternate splicing analysis tool at UCLA
ASG Alternate splicing gallery
HASDB Human alternative splicing database at UCLA
AsMamDB alternatively spliced genes in human mouse and rat
ASD Alternative splicing database at CSHL
Specialised Databases
ABIM Links to several genomics database
ACUTS Ancient conserved untranslated sequences
AGSD Animal genome size database
AmiGO The Gene Ontology database
ARGH The acronym database
ASDB Database of alternatively spliced genes
BACPAC BAC and PAC genomic DNA library info
BBID Biological Biochemical image database
Cardiac gene database
CHLC Genetic markers on chromosomes
COGENT Complete genome tracking database
COMPEL Composite regulatory elements in eukaryotes
CUTG Codon usage database
dbEST Database of expressed sequences or mRNA
dbGSS Genome survey sequence database
dbSTS Sequence tagged sites (STS)
DBTSS Database of transcriptional start sites
DOGS Database of genome sizes
EID The exon-intron database ndash Harvard
Exon-Intron Exon-Intron database ndash Singapore
EPD Eukaryotic promoter database
FlyTrap HTML based gene expression database
GDB The genome database
GenLink Resources for human genetic and telomere research
GeneKnockouts Gene knockout information
GENOTK Human cDNA database
GEO Gene expression omnibus NCBI
GOLD Information on genome projects around the world
GSDB The Genome Sequence DataBase
HGI TIGR human gene index
HTGS High-throughput genomic sequence at NCBI
IMAGE The largest collection of DNA sequence clones
IMGT The international ImMunoGeneTics information system
IPCN Index to Plant Chromosome Numbers database
LocusLink Single query interface to sequence and genetic loci
TelDB The telomere database
MitoDat Mitochondrial nuclear genes
Mouse EST NIA mouse cDNA project
MPSS Searchable databases of several species
NDB Nucleic acid database
NEDO Human cDNA sequence database
NPD Nuclear protein database
Oomycetes DB Oomycetes database at Virginia Bioinformatics Institute
PLACE Database of plant cis-acting regulatory DNA elements
RDP Ribosomal database project
RDB Receptor database at NIHS Japan
Refseq The NCBI reference sequence project
RHdb Radiation hybrid physical map of chromosomes
SHIGAN SHared Information of GENetic resources Japan
SpliceDB Canonical and non-canonical splice site sequences
STACK Consensus human EST database
TAED The adaptive evolution database
TIGR Curated databases of microbes plants and humans
TRANSFAC The Transcription Factor Database
TRRD Transcription Regulatory region database
UniGene Cluster of sequences for unique genes at NCBI
UniSTS Nonredundant collection of STS
Protein Databases
Protein Sequence Databases – Protein Structure Databases – Protein Domains, Motifs
and Signatures – Others
Protein Sequence Databases
Antibodies Sequence and Structure
BRENDA Enzyme database
CD Antigens Database of CD antigens
dbCFC Cytokine family database
Histons Histone sequence database
HPRD Human protein reference database
InterPro Integrated documentation resources for protein families
iProClass An integrated protein classification database
KIND A non-redundant protein sequence database
MHCPEP Database of MHC binding peptides
MIPS Munich information centre for protein sequences
PIR Annotated and non-redundant protein sequence database
PIR-ALN Curated database of protein sequence alignments
PIR-NREF PIR nonredundant reference protein database
PMD Protein mutant database
PRF Protein research foundation Japan
ProClass Non-redundant protein database
ProtoMap Hierarchical classification of swissprot proteins
REBASE Restriction enzyme database
RefSeq Reference sequence database at NCBI
SwissProt Curated protein sequence database
SPTR Comprehensive protein sequence database
Transfac Transcription factor database
TrEMBL Annotated translations of EMBL nucleotide sequences
Tumor gene database Genes with cancer-causing mutations
WD repeats WD-repeat family of proteins
Protein Structure Databases
Cath Protein structure classification
HIV Protease HIV protease database 3D structure
PDB 3-D macromolecular structure data
PSI Protein structure initiative
S2F Structure to function project
Scop Structural Classification of Proteins
Protein Domains, Motifs & Signatures
BLOCKS Multiple aligned segments of conserved protein regions
CDD Conserved domain database and search service
DOMO Homologous protein domain families
Pfam Database of protein domains and HMMs
ProDom Protein domain database
Prints Protein motif fingerprint database
Prosite Database of protein families and domains
SMART Simple modular architecture research tool
TIGRFAM Protein families based on HMMs
Others
Phospho Site Database of phosphorylation sites
PROW Protein reviews on the web
Protein Lounge Complete systems biology
Other Databases
Carbohydrate Databases
Carb DB Carbohydrate Sequence and Structure Database
GlycoWord Glycoscience related information
SPECARB Raman Spectra of carbohydrates
Other Databases
AlzGene Alzheimer's disease
Polygenic pathways Alzheimer's disease, Bipolar disorder or Schizophrenia
Model Organism Databases and Resources
Arabidopsis - Bacteria - Sea Bass - Cat - Cattle - Chicken - Cotton - CyanoBacteria - Daphnia - Deer - Dictyostelium - Dog - Frog - Fruit Fly - Fungus -
Goat - Horse - Medaka Fish - Maize - Malaria - Mosquito - Mouse - Pig - Plants -
Protozoa - Puffer Fish - Rat - Rice - Rickettsia - Salmon - Sheep - Soy -
Sorghum - Tetraodon - Tilapia - Turkey - Viruses - Worm - Yeast - Zebra Fish
General Information
GMOD Generic Model Organism Database
Model Organisms The WWW virtual library of model organisms
Arabidopsis thaliana
ABRC Arabidopsis biological resource center
AGI Arabidopsis genome initiative
AREX Arabidopsis gene expression database
Arabinet Arabidopsis information on the www
AtGDB An Arabidopsis thaliana plant genome database
AtGI TIGR Arabidopsis thaliana gene index
ATGC Genome sequencing at ATGC
ATIDB Arabidopsis insertion database
CSHL Arabidopsis genome analysis at Cold Spring
ESSA Arabidopsis thaliana project at MIPS
Genoscope AGI in France
Kazusa Arabidopsis thaliana genome info Japan
MPSS Massively parallel signature sequencing
NASC Nottingham Arabidopsis stock center
Stanford Sequencing of the Arabidopsis genome at Stanford
TAIR Arabidopsis information resource
TIGR TIGR Arabidopsis genome annotation database
Wustl Arabidopsis genome at Washington university
Trees A forest tree genome database
Bacterial genomes
B. subtilis Bacillus subtilis database
Chlamydomonas Chlamydomonas genetics center
E. coli E. coli genome project
MGD Microbial germ plasm database
Microbial Microbial Genome Gateway
Microbial Microbial genomes
Micado Genetic maps of B. subtilis and E. coli
MycDB An integrated Mycobacterial database
Neisseria Neisseria meningitidis genome
Neurospora Neurospora crassa database
OralGen Oral pathogen database
Salmonella Salmonella information
STDGen Sexually transmitted disease database
Bass
Bass Sea Bass Mapping project
Cat (Felis catus)
Cat ArkDB Cat mapping database
Cattle (Bos taurus)
ARK Farm animals
BoLA Bovine MHC information
Bovin Bovine genome database
BovMap Mapping the bovine genome
CaDBase Genetic diversity in cattle
ComRad Comparative radiation hybrid mapping
Cow ArkDB Bovine ArkDB
GemQual Genetics of meat quality
Chicken (Gallus gallus)
Chicken Poultry gene mapping project
ChickMap Chicken genome project
Chicken ArkDB Chicken database
ChickEST Chick EST database
Poultry Poultry genome project
Cotton
Cotton Cotton data collection site
Cyano Bacteria (Blue green algae)
Cyano Bacteria Anabaena genome
Daphnia (Crustacea)
Daphnia pulex Daphnia genomics consortium
Deer
Deer ArkDB Deer mapping database
Dictyostelium discoideum
Dicty_cDB Dictyostelium discoideum cDNA project
DGP Dictyostelium discoideum genome project
Dictybase Online informatics resources for Dictyostelium
Dog (Canis familiaris)
Dog Dog genome project
Dog genome project
Frog (Xenopus)
Xenbase A Xenopus web resource
Xenopus Xenopus tropicalis genome
Fruit fly (Drosophila melanogaster)
ENSEMBL Drosophila Genome Browser at ENSEMBL
Fruitfly Drosophila genome project at Berkeley
FlyBase A Database of the Drosophila Genome
FlyMove A Drosophila multimedia database
FlyView A Drosophila image database
Fungus
Aspergillus Aspergillus Genomics
Candida Candida albicans information page
FungalWeb Fungi database
FGSC Fungal genetic stocks center
Goat (Capra hircus)
Goat GoatMap mapping the caprine genome
Horse (Equus caballus)
Horse ArkDB Horse mapping database
Medaka Fish
Medaka Medaka fish home page
Maize
Maize Maize genome database
Malaria (Plasmodium spp)
Malaria Malaria genetics and genomics
PlasmoDB Plasmodium falciparum genome database
Parasites Parasite databases of clustered ESTs
Parasite Genome Parasite genome databases
Mosquito
Mosquito Mosquito genome web server
Mouse (Mus musculus)
ENSEMBL Mouse genome server at ENSEMBL
Jackson Lab Mouse Resources
MRC Mouse genome center at MRC UK
MGI Mouse genome informatics at Jackson Labs
MGD Mouse genome database
MGS Mouse genome sequencing at NIH
MIT Genetic and physical maps of the mouse genome
Mouse SNP Mouse SNP database
NCI Mouse repository
NIH NIH mouse initiative
ORNL Mutant mouse database
RIKEN Mouse resources
Rodentia The whole mouse catalog
Pig (Sus scrofa)
INCO Pig trait gene mapping
Pig Pig EST database
Pig Pig gene mapping project
PiGBase Pig genome mapping
Pig ArkDB Pig Ark DB
Plants
PlantGDB Resources for plant comparative genomics
Protozoa
Protozoa Protozoan genomes
Pufferfish
Fugu Puffer fish project UK site
Fugu Fugu genome project Singapore
Fugu Puffer fish project USA
Rat (Rattus norvegicus)
MIT Genetic maps of the Rat genome
NIH Rat genomics and genetics
Rat RatMap
RGD Rat genome database
Rice (Oryza sativa)
MPSS Massively parallel signature sequencing
Rice-research Rice genome sequence database
Rice Rice genome project
Rickettsia
RicBase Rickettsia genome database
Salmon
Salmon ArkDB Salmon mapping database
Sheep (Ovis aries)
Sheep Sheep gene mapping
SheepBase Sheep gene mapping
Sheep ArkDB Sheep mapping database
Soy
Soy Soybeans database
Sorghum
Sorghum Sorghum Genomics
Tetraodon
Tetraodon Tetraodon nigroviridis genome
Tetraodon Tetraodon nigroviridis genome at Whitehead
Tilapia
HCGS Tilapia genome
Tilapia ArkDB Tilapia mapping database
Turkey
Turkey ArkDB Turkey mapping database
Viruses
HIV HIV sequence database
Herpes Human herpes virus 5 database
Worm (Caenorhabditis elegans)
C. elegans C. elegans genome sequencing project
NemBase Resource for nematode sequence and functional data
WormAtlas Anatomy of C. elegans
WormBase The Genome and biology of C. elegans
ACEDB A C. elegans database
WWW Server C. elegans web server
Yeast
SCPD The promoter database of Saccharomyces cerevisiae
SGD Saccharomyces genome database
S. pombe Schizosaccharomyces pombe genome project
TRIPLES Functional analysis of Yeast genome at Yale
Yeast Intron database Spliceosomal introns of the yeast
Zebra fish (Danio rerio)
ZFIN Zebrafish information network
ZGR Zebrafish genome resources
ZIS Zebrafish information server
Zebrafish Zebrafish webserver
DOMAIN DATABASE
Domains can be thought of as distinct functional and/or structural units of a protein. These two classifications coincide rather often: what is found as an independently folding unit of a polypeptide chain usually also carries a specific function. Domains are often identified as recurring (sequence or structure) units, which may exist in various contexts. In molecular evolution such domains may have been utilized as building blocks, and may have been recombined in different arrangements to modulate protein function. We can define conserved domains as recurring units in molecular evolution, the extents of which can be determined by sequence and structure analysis. Conserved domains contain conserved sequence patterns or motifs, which allow for their detection in polypeptide sequences.
The goal of the NCBI conserved domain curation project is to provide database users with insights into how patterns of residue conservation and divergence in a family relate to functional properties, and to provide useful links to more detailed information that may help to understand those sequence/structure/function relationships. To do this, CDD Curators include the following types of information in order to supplement and enrich the traditional multiple sequence alignments that form the foundation of domain models: 3-dimensional structures and conserved core motifs, conserved features/sites, phylogenetic organization, and links to electronic literature resources.
CDD
Conserved Domain Database (CDD)
CDD is a protein annotation resource that consists of a collection of well-annotated multiple sequence alignment models for ancient domains and full-length proteins. These are available as position-specific score matrices (PSSMs) for fast identification of conserved domains in protein sequences via RPS-BLAST. CDD content includes NCBI manually curated domains, which use 3D-structure information to explicitly define domain boundaries and provide insights into sequence/structure/function relationships, as well as domain models imported from a number of external source databases (Pfam, SMART, COG, PRK, TIGRFAM).
CD-Search & Batch CD-Search
CD-Search is NCBI's interface to searching the Conserved Domain Database with protein or nucleotide query sequences. It uses RPS-BLAST, a variant of PSI-BLAST, to quickly scan a set of pre-calculated position-specific scoring matrices (PSSMs) with a protein query. The results of CD-Search are presented as an annotation of protein domains on the user query sequence and can be visualized as domain multiple sequence alignments with embedded user queries. High-confidence associations between a query sequence and conserved domains are shown as specific hits. The CD-Search Help provides additional details, including information about running CD-Search locally.
Batch CD-Search serves as both a web application and a script interface for a conserved domain search on multiple protein sequences, accepting up to 100,000 proteins in a single job. It enables you to view a graphical display of the concise or full search result for any individual protein from your input list, or to download the results for the complete set of proteins. The Batch CD-Search Help provides additional details.
CDART Domain Architectures
Conserved Domain Architecture Retrieval Tool (CDART) performs similarity searches of the Entrez Protein database based on domain architecture, defined as the sequential order of conserved domains in protein queries. CDART finds protein similarities across significant evolutionary distances using sensitive domain profiles rather than direct sequence similarity. Proteins similar to the query are grouped and scored by architecture. You can search CDART directly with a query protein sequence or, if a sequence of interest is already in the Entrez Protein database, simply retrieve the record, open its Links menu, and select Domain Relatives to see the precalculated CDART results. Relying on domain profiles allows CDART to be fast and, because it relies on annotated functional domains, informative.
CDTree
CDTree is a helper application for your web browser that allows you to interactively view and examine conserved domain hierarchies curated at NCBI. CDTree works with Cn3D as its alignment viewer/editor; it is used in the CDD curation process and is both a classification and a research tool for functional annotation and the study of protein and protein domain families.
Content
CDD content includes NCBI manually curated domain models and domain models imported from a number of external source databases (Pfam, SMART, COG, PRK, TIGRFAMs). What is unique about NCBI-curated domains is that they use 3D-structure information to explicitly define domain boundaries, align blocks, amend alignment details, and provide insights into sequence/structure/function relationships. Manually curated models are organized hierarchically if they describe domain families that are clearly related by common descent. To provide a non-redundant view of the data, CDD clusters similar domain models from various sources into superfamilies.
Searching the database
The collection is also part of NCBI's Entrez query and retrieval system, crosslinked to numerous other resources. CDD provides annotation of domain footprints and conserved functional sites on protein sequences. Precalculated domain annotation can be retrieved for protein sequences tracked in NCBI's Entrez system, and CDD's collection of models can be queried with novel protein sequences via the CD-Search service, or via Batch CD-Search, which allows the computation and download of annotation for large sets of protein queries. CDD also contains data from additional research projects, such as KOGs (a eukaryotic counterpart to COGs) and the Library of Ancient Domains (LOAD). Accessions that start with cl are for superfamily cluster records and can contain domain models from one or more source databases. When searching CDD, it is possible to limit search results to domains from any given source database by using the Database Search Field.
Phylogenetic organization: Based on evidence from sequence comparison, NCBI Conserved Domain Curators attempt to organize related domain models into phylogenetic family hierarchies.
Links to electronic literature resources: NCBI curated domains also provide links to citations in PubMed and NCBI Bookshelf that discuss the domain. These references are selected by curators and, whenever possible, include articles that provide evidence for the biological function of the domain and/or discuss the evolution and classification of a domain family.
PFAM
Pfam 27.0 (March 2013, 14,831 families)
Proteins are generally composed of one or more functional regions, commonly termed domains. The presence of different domains in varying combinations in different proteins gives rise to the diverse repertoire of proteins found in nature. Identifying the domains present in a protein can provide insights into the function of that protein.
The Pfam database is a large collection of protein domain families. Each family is represented by multiple sequence alignments and hidden Markov models (HMMs).
There are two levels of quality to Pfam families: Pfam-A and Pfam-B. Pfam-A entries are derived from the underlying sequence database, known as Pfamseq, which is built from the most recent release of UniProtKB at a given time-point. Each Pfam-A family consists of a curated seed alignment containing a small set of representative members of the family, profile hidden Markov models (profile HMMs) built from the seed alignment, and an automatically generated full alignment, which contains all detectable protein sequences belonging to the family, as defined by profile HMM searches of primary sequence databases.
Pfam-B families are un-annotated and of lower quality, as they are generated automatically from the non-redundant clusters of the latest ADDA release. Although of lower quality, Pfam-B families can be useful for identifying functionally conserved regions when no Pfam-A entries are found.
Pfam entries are classified in one of four ways:
Family: A collection of related protein regions
Domain: A structural unit
Repeat: A short unit which is unstable in isolation but forms a stable structure when multiple copies are present
Motif: A short unit found outside globular domains
Related Pfam entries are grouped together into clans; the relationship may be defined by similarity of sequence, structure or profile-HMM.
Pfam also generates higher-level groupings of related families, known as clans. A clan is a collection of Pfam-A entries which are related by similarity of sequence, structure or profile-HMM.
YOU CAN FIND DATA IN PFAM IN VARIOUS WAYS
Analyze your protein sequence for Pfam matches
View Pfam family annotation and alignments
See groups of related families
Look at the domain organisation of a protein sequence
Find the domains on a PDB structure
Query Pfam by keywords
Enter any type of accession or ID to jump to the page for a Pfam family or clan, UniProt sequence, PDB structure, etc.
Browse Pfam
You can use the links below to find lists of families, clans or proteomes which begin with the chosen letter (or number). You can also see a list of Pfam families which are new to this release, or the list of the twenty largest families in terms of number of sequences.
Glossary of terms used in Pfam
These are some of the commonly used terms in the Pfam website
Alignment coordinates
HMMER3 reports two sets of domain coordinates for each profile HMM match. The envelope coordinates delineate the region on the sequence where the match has been probabilistically determined to lie, whereas the alignment coordinates delineate the region over which HMMER is confident that the alignment of the sequence to the profile HMM is correct. Our full alignments contain the envelope coordinates from HMMER3.
Architecture
The collection of domains that are present on a protein
Clan
A collection of related Pfam entries. The relationship may be defined by similarity of sequence, structure or profile-HMM.
Domain
A structural unit
Domain score
The score of a single domain aligned to an HMM. Note that, for HMMER2, if there was more than one domain, the sequence score was the sum of all the domain scores for that Pfam entry. This is not quite true for HMMER3.
DUF
Domain of unknown function
Envelope coordinates
See Alignment coordinates
Family
A collection of related protein regions
Full alignment
An alignment of the set of related sequences which score higher than the manually set threshold values for the HMMs of a particular Pfam entry.
Gathering threshold (GA)
Also called the gathering cut-off, this value is the search threshold used to build the full alignment. The gathering threshold is assigned by a curator when the family is built. The GA is the minimum score a sequence must attain in order to belong to the full alignment of a Pfam entry. For each Pfam HMM we have two GA cutoff values: a sequence cutoff and a domain cutoff.
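As a rough illustration of how the two gathering cutoffs might be applied, the Python sketch below keeps a domain match only when both the sequence score and the domain score reach their cutoffs. The function name, its arguments and the example bit scores are hypothetical, and the combination rule is an assumption; the exact procedure Pfam uses may differ in detail.

```python
def domains_in_full_alignment(sequence_score, domain_scores, seq_ga, dom_ga):
    """Return the domain scores that pass both gathering cutoffs.

    Assumption (not taken from the Pfam glossary): a domain is kept only if
    the whole-sequence score reaches the sequence cutoff AND the individual
    domain score reaches the domain cutoff.
    """
    if sequence_score < seq_ga:
        return []
    return [score for score in domain_scores if score >= dom_ga]


# Hypothetical bit scores for one protein with three domain matches.
kept = domains_in_full_alignment(
    sequence_score=61.5,
    domain_scores=[30.0, 21.5, 10.0],
    seq_ga=25.0,
    dom_ga=20.0,
)
print(kept)
```

In this toy case the third domain match falls below the domain cutoff and is dropped, while a protein whose sequence score misses the sequence cutoff contributes nothing at all.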
HMMER
The suite of programs that Pfam uses to build and search HMMs. Since Pfam release 24.0 we have used HMMER version 3 to make Pfam. See the HMMER site for more information.
Hidden Markov model (HMM)
An HMM is a probabilistic model. In Pfam we use HMMs to transform the information contained within a multiple sequence alignment into a position-specific scoring system. We search our HMMs against the UniProt protein database to find homologous sequences.
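To make the idea of a position-specific scoring system concrete, here is a deliberately simplified Python sketch that turns a toy alignment into per-column log-odds scores. It is not a profile HMM (there are no insert or delete states and no transition probabilities), and the uniform background frequency, the pseudocount and the function name are assumptions chosen for illustration only.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def position_specific_scores(alignment, background=0.05, pseudocount=1.0):
    """Return one {residue: log2-odds score} dict per alignment column.

    Simplified stand-in for a profile HMM: each column is scored
    independently and gap characters ('-') are ignored.
    """
    n_cols = len(alignment[0])
    if any(len(row) != n_cols for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    scores = []
    for col in range(n_cols):
        counts = Counter(row[col] for row in alignment if row[col] != "-")
        total = sum(counts.values()) + len(AMINO_ACIDS) * pseudocount
        scores.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return scores


# Toy alignment of three sequences; column 0 is perfectly conserved.
toy_alignment = ["ACD", "ACE", "AC-"]
pssm = position_specific_scores(toy_alignment)
print(round(pssm[0]["A"], 2), round(pssm[0]["W"], 2))
```

A residue that is frequent in a column receives a positive score and a residue never seen there receives a negative one, which is the behaviour a position-specific scoring system is meant to capture.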
HMMER3
The suite of programs that Pfam uses to build and search HMMs. See the HMMER site for more information.
iPfam
A resource that describes domain-domain interactions that are observed in PDB entries. Where two or more Pfam domains occur in a single structure, it analyses them to see if they are close enough to form an interaction. If they are close enough, it calculates the bonds forming the interaction.
Metaseq
A collection of sequences derived from various metagenomics datasets
Motif
A short unit found outside globular domains
Noise cutoff (NC)
The bit scores of the highest scoring match not in the full alignment
Pfam-A
An HMM-based, hand-curated Pfam entry which is built using a small number of representative sequences. We manually set a threshold value for each HMM and search our models against the UniProt database. All of the sequences which score above the threshold for a Pfam entry are included in the entry's full alignment.
Pfam-B
An automatically generated alignment which is formed by taking a cluster of sequences from the ADDA database and removing Pfam-A residues from them. Since Pfam-B families are automatically generated, we recommend that you verify that the sequences in a Pfam-B family are related using other methods, such as BLAST. For Pfam 24.0 we have made HMMs for the first (and therefore largest) 20,000 Pfam-B families. Users can search their sequences against the Pfam-B HMMs, in addition to the Pfam-A HMMs, when performing both single-sequence searches and batch searches on the website.
Posterior probability
HMMER3 reports a posterior probability for each residue that matches a match or insert state in the profile HMM. A high posterior probability shows that the alignment of the amino acid to the match/insert state is likely to be correct, whereas a low posterior probability indicates that there is alignment uncertainty. This is indicated on a scale with '*' being 10, the highest certainty, down to 1 being complete uncertainty. Within Pfam we display this information as a heat map view, where green residues indicate high posterior probability and red ones indicate a lower posterior probability.
Repeat
A short unit which is unstable in isolation but forms a stable structure when
multiple copies are present
Seed alignment
An alignment of a set of representative sequences for a Pfam entry. We use this alignment to construct the HMMs for the Pfam entry.
Sequence score
The total score of a sequence aligned to an HMM. If there is more than one domain, the sequence score is the sum of all the domain scores for that Pfam entry. If there is only a single domain, the sequence and the domain scores for the protein will be identical. We use the sequence score to determine whether a sequence belongs to the full alignment of a particular Pfam entry.
Trusted cutoff (TC)
The bit scores of the lowest scoring match in the full alignment
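The noise and trusted cutoffs defined in this glossary can be read off a list of match scores once the gathering threshold is known. The short Python sketch below assumes that a match belongs to the full alignment exactly when its score is at least the gathering threshold; the function name and the scores are made up for illustration.

```python
def trusted_and_noise_cutoffs(scores, gathering_threshold):
    """Return (TC, NC) for a list of match bit scores.

    TC: lowest score among matches in the full alignment.
    NC: highest score among matches left out of the full alignment.
    Either value is None when the corresponding group is empty.
    """
    included = [s for s in scores if s >= gathering_threshold]
    excluded = [s for s in scores if s < gathering_threshold]
    tc = min(included) if included else None
    nc = max(excluded) if excluded else None
    return tc, nc


# Hypothetical bit scores for matches to one Pfam entry.
tc, nc = trusted_and_noise_cutoffs([55.2, 31.0, 25.4, 19.8, 12.1], 25.0)
print(tc, nc)
```

By construction the gathering threshold always lies between NC and TC, so the gap between the two shows how cleanly the family separates from the nearest non-members.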
SMART (a Simple Modular Architecture Research Tool) allows the identification and annotation of genetically mobile domains and the analysis of domain architectures. More than 500 domain families found in signalling, extracellular and chromatin-associated proteins are detectable. These domains are extensively annotated with respect to phyletic distributions, functional class, tertiary structures and functionally important residues. Each domain found in a non-redundant protein database, as well as search parameters and taxonomic information, are stored in a relational database system. User interfaces to this database allow searches for proteins containing specific combinations of domains in defined taxa.
Features
You can use SMART in two different modes: normal or genomic. The main difference is in the underlying protein database used. In Normal SMART, the database contains Swiss-Prot, SP-TrEMBL and stable Ensembl proteomes. In Genomic SMART, only the proteomes of completely sequenced genomes are used: Ensembl for metazoans and Swiss-Prot for the rest.
The protein database in Normal SMART has significant redundancy, even though identical proteins are removed. If you use SMART to explore domain architectures, or want to find exact domain counts in various genomes, consider switching to Genomic mode. The numbers in the domain annotation pages will be more accurate, and there will not be many protein fragments corresponding to the same gene in the architecture query results. We should remember, though, that we are exploring a limited set of genomes.
Different color schemes are used to easily identify the mode we are in: Normal mode or Genomic mode.
Alignment
Representation of a prediction of the amino acids in tertiary structures of homologues that overlay in three dimensions. Alignments held by SMART are mostly based on published observations (see domain annotations for details), but are updated and edited manually.
Alignment block
Ungapped alignments that usually represent a single secondary structure
Bits scores
Alignment scores are reported by HMMer and BLAST as bits scores. The likelihood that the query sequence is a bona fide homologue of the database sequence is compared to the likelihood that the sequence was instead generated by a random model. Taking the logarithm (to base 2) of this likelihood ratio gives the bits score.
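The definition above translates directly into a single formula: the bits score is the base-2 logarithm of the likelihood ratio. A minimal Python sketch, in which the function name and the two example likelihoods are hypothetical:

```python
import math


def bits_score(likelihood_homologue, likelihood_random):
    """Return log2 of the ratio between the two model likelihoods."""
    if likelihood_homologue <= 0 or likelihood_random <= 0:
        raise ValueError("likelihoods must be positive")
    return math.log2(likelihood_homologue / likelihood_random)


# A sequence a million times more likely under the homologue model
# than under the random model scores roughly 20 bits.
print(round(bits_score(1e-20, 1e-26), 2))
```

Each additional bit therefore corresponds to a doubling of the likelihood ratio, and a negative score means the random model explains the sequence better.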
BLAST Basic local alignment search tool
An excellent database searching tool developed at the National Center for Biotechnology Information (NCBI) ([1] [2] [3] [4]). SMART uses NCBI-BLAST for detection of outlier homologues and homologues of known structure. WU-BLAST is used for nrdb searches with user-supplied sequences.
Cellular Role
Chromatin-associated: Chromatin is the tangled fibrous complex of DNA and protein within a eukaryotic nucleus.
Interaction (with the environment): Molecules that sense cellular environmental change, such as osmolarity, light flux, acidity, ion concentration, etc.
Metabolic: Enzymes that catalyze reactions in living cells that transform organic molecules.
Replication: The process of making an identical copy of a section of duplex (double-stranded) DNA, using existing DNA as a template for the synthesis of new DNA strands.
Signalling: Proteins that participate in a pathway that is initiated by a molecular stimulus and terminated by a cellular response.
Transport: Transporters (permeases) are proteins that straddle the cell membrane and carry specific nutrients, ions, etc. across the membrane.
Translation: The process in which the genetic code carried by messenger RNA directs the synthesis of proteins from amino acids.
Transcription: The synthesis of an RNA copy from a sequence of DNA (a gene); the first step in gene expression.
Coiled coils
Intimately-associated bundles of long alpha-helices ([1] [2] [3]). Coiled coils are detected in SMART using the method of Lupas et al ([4], COILS home). Coiled coil predictions are indicated on the second line in SMART's graphical output.
Domain
Conserved structural entities with distinctive secondary structure content and a hydrophobic core. In small disulphide-rich and Zn2+-binding or Ca2+-binding domains, the hydrophobic core may be provided by cystines and metal ions, respectively. Homologous domains with common functions usually show sequence similarities.
Domain composition
Proteins with the same domain composition have at least one copy of each of the domains of the query.
Domain organisation
Proteins having all the domains of the query, in the same order (additional domains are allowed).
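The difference between domain composition and domain organisation can be sketched in Python as a set-containment test versus an ordered-subsequence test. The function names and the example architectures are hypothetical, and this is only one reading of the two definitions, not SMART's own implementation.

```python
def same_domain_composition(query, target):
    """True if the target has at least one copy of every query domain."""
    return set(query) <= set(target)


def same_domain_organisation(query, target):
    """True if the query domains occur in the target in the same order.

    Additional domains in the target are allowed, so this is an
    ordered-subsequence test: each query domain is searched for in the
    part of the target that follows the previous match.
    """
    remaining = iter(target)
    return all(domain in remaining for domain in query)


query = ["SH3", "SH2", "TyrKc"]
extra_domain = ["SH3", "PH", "SH2", "TyrKc"]   # same order, one extra domain
shuffled = ["SH2", "SH3", "TyrKc"]             # same domains, different order

print(same_domain_composition(query, extra_domain),
      same_domain_organisation(query, extra_domain))
print(same_domain_composition(query, shuffled),
      same_domain_organisation(query, shuffled))
```

Under this reading, every protein that matches the query's organisation also matches its composition, but not the other way round.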
Entrez
A WWW-based system that allows easy retrieval of sequence, structure, molecular biology and literature data (Entrez). SMART's domain annotation pages contain links to the Entrez system, thereby providing extensive literature, structure and sequence information.
E-value
This represents the number of sequences with a score greater than or equal to X expected absolutely by chance. The E-value connects the score (X) of an alignment between a user-supplied sequence and a database sequence, generated by any algorithm, with how many alignments with similar or greater scores would be expected from a search of a random sequence database of equivalent size. Since version 2.0, E-values are calculated using Hidden Markov Models, leading to more accurate estimates than before.
Extracellular Domains
Domain families that are most prevalent in proteins outside of the cytoplasm and the nucleus
Gap
A position in an alignment that represents a deletion within one sequence relative to another. Gap penalties are requirements for alignment algorithms in order to reduce excessively-gapped regions. Gaps in alignments represent insertions that usually occur in protruding loops or beta-bulges within protein structures.
Genomic database
Protein database used in SMART's Genomic mode. It contains data from completely sequenced genomes only. Ensembl data is used for Metazoan genomes and Swiss-Prot for others. A complete list of genomes in the database is available.
HMM Hidden Markov model
HMMs are statistical models of the sequence consensus of a homologous family (see the docu). A particular class of HMMs has been shown to be equivalent to generalised profiles (8867839). Applications of HMMs to sequence analysis are nicely provided by HMMer and SAM.
HMM consensus
The HMM consensus is a one-line summary of the corresponding HMM. The amino acid shown for the consensus is the highest-probability amino acid at that position according to the HMM. Capital letters mean highly conserved residues (probability > 0.5 for protein models) (modified from the HMMer User's Guide).
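The consensus rule described above can be sketched in a few lines of Python. The input format (one dictionary of residue probabilities per position) and the function name are assumptions made for illustration, as is the choice to print less conserved positions in lower case.

```python
def hmm_consensus(position_probabilities, threshold=0.5):
    """Build a one-line consensus from per-position residue probabilities.

    The most probable residue is taken at every position; it is written
    in upper case when its probability exceeds the threshold and in
    lower case otherwise (the lower-case convention is an assumption).
    """
    consensus = []
    for probabilities in position_probabilities:
        residue, probability = max(probabilities.items(), key=lambda kv: kv[1])
        if probability > threshold:
            consensus.append(residue.upper())
        else:
            consensus.append(residue.lower())
    return "".join(consensus)


# Toy three-position model; only positions 1 and 3 are highly conserved.
toy_model = [
    {"G": 0.90, "A": 0.10},
    {"L": 0.40, "I": 0.35, "V": 0.25},
    {"K": 0.60, "R": 0.40},
]
print(hmm_consensus(toy_model))
```

For the toy model the middle position has no residue above 0.5, so its most probable residue appears in lower case.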
HMMer
The HMMer package ([1] [2]) provides multiple alignment and database searching capabilities. There are several programs in the package (see the docu), including one (hmmfs) that searches databases for non-overlapping LOCAL similarities (i.e. that match across at least part of the HMM) and another (hmmls) that searches databases for non-overlapping GLOBAL similarities (i.e. that match across the full HMM). These correspond approximately to profile-based searches using negative and positive profiles, respectively (see WiseTools). Database searches using hmmls or hmmfs provide alignment scores as bits scores.
Homology
Evolutionary descent from a common ancestor due to gene duplication
Intracellular Domains
Domain families that are most prevalent in proteins within the cytoplasm
Localisation
Numbers of domains that are thought, from SwissProt annotations, to be present in different cellular compartments (cytoplasm, extracellular space, nucleus, and membrane-associated) are shown in annotation pages.
Motif
Sequence motifs are short conserved regions of polypeptides. Sets of sequence motifs need not necessarily represent homologues.
NRDB non-redundant database
A database that contains no identical pairs of sequences. It can contain multiple sequences originating from the same gene (fragments, alternative splicing products). SMART in Standard mode uses such a database.
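The notion of redundancy used here is narrow: only exact duplicates are removed, so fragments and splice variants of the same gene survive. A minimal Python sketch of that filtering step, with hypothetical record names and sequences (this is not the procedure actually used to build NRDB):

```python
def remove_identical_sequences(records):
    """Keep the first record for every distinct sequence.

    `records` is a list of (name, sequence) pairs. Sequences are compared
    case-insensitively; anything that is not an exact match is kept, so
    fragments of a longer protein remain in the result.
    """
    seen = set()
    unique = []
    for name, sequence in records:
        key = sequence.upper()
        if key not in seen:
            seen.add(key)
            unique.append((name, sequence))
    return unique


records = [
    ("protein_1", "MKTAYIAKQR"),
    ("protein_1_copy", "MKTAYIAKQR"),   # identical, will be dropped
    ("protein_1_fragment", "MKTAYIA"),  # fragment, will be kept
]
print(remove_identical_sequences(records))
```

The surviving fragment illustrates why such a database can still over-count domains per gene.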
ORF
Open reading frame
Outlier homologues
These are often difficult to detect using HMM methodology. A complementary approach to their detection is to query a database of sequences taken from multiple sequence alignments using BLAST.
Selecting this option will also activate searches against sequence databases derived from proteins of known structure. A simple BLAST search of the PDB is performed, together with a search of RPS_Blast profiles derived from SCOP. These profiles were kindly provided by Steffen Schmidt (see Schmidt et al, J Chem Inf Comput Sci 2002 (42) 405-7).
P-value
This represents a probability that, given a database of a particular size, random sequences score higher than a value X. P-values are generated by the BLAST algorithm that has been integrated into SMART.
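P-values and E-values are closely related. Under the Poisson assumption commonly used in BLAST statistics, the probability of seeing at least one chance hit is P = 1 - exp(-E). That relationship is not stated in the SMART glossary itself, so the Python sketch below should be read as background under that assumption; the function name is hypothetical.

```python
import math


def p_value_from_e_value(e_value):
    """Convert an E-value to a P-value assuming Poisson-distributed hits.

    P = 1 - exp(-E), computed with expm1 to stay accurate for small E.
    """
    if e_value < 0:
        raise ValueError("E-value must be non-negative")
    return -math.expm1(-e_value)


for e in (1e-6, 0.01, 1.0, 10.0):
    print(e, p_value_from_e_value(e))
```

For small values the two numbers are nearly identical, while for large E-values the P-value saturates towards 1.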
PDB protein data bank
PDB is an archive of experimentally determined three-dimensional structures (Brookhaven Natl. Labs or EBI). Domain families represented in SMART and in the PDB are annotated as being of known structure; links are provided in SMART to the PDB via PDBsum and MMDB. PDBsum links can be used to access a variety of sequence-based and structure-based tools, whereas MMDB provides access to literature information and structure similarities.
PFAM
Pfam is a database of protein domain families represented as (i) multiple alignments and (ii) HMM profiles ([1], [2]). Pfam WWW servers allow comparison of user-supplied sequences with the Pfam database (Sanger Center and Washington Univ.). SMART contains a facility to search the Pfam collection using HMMer.
Prokaryotic domains
SMART now also searches for domains found in two-component regulatory systems. These are found mainly in prokaryotes, but a few were also found in eukaryotes such as yeast and plants.
Profile
A profile is a table of position-specific scores and gap penalties, representing a homologous family, that may be used to search sequence databases (Ref. [1], [2], [3]). In CLUSTAL-W-derived profiles, those sequences that are more distantly related are assigned higher weights ([4], [5], [6]). Issues in profile-based database searching are discussed in Bork & Gibson (1996) [7].
ProfileScan
An excellent WWW server that allows a user to compare a protein or DNA sequence against a database of profiles (located at the ISREC).
PROSITE
This is a dictionary of protein sites and motif patterns. Some SMART domain annotations contain links to PROSITE.
Schnipsel database domain sequence database
Schnipsel is a German word meaning snippet or fragment. The schnipsel database consists of the sequences of all domains found with SMART in NRDB. Outliers of a family often cannot be detected by a profile, yet are detectable by pairwise similarity to one or more established members of a sequence family. Searching against the schnipsel database therefore gives information complementary to the profile searches.
Searched domains
In the first version of SMART, only eukaryotic signalling domains could be searched. In 1998 we extended this set with prokaryotic signalling and extracellular domains. On the input page you can choose whether you want to search only cytoplasmic (prokaryotic + eukaryotic), extracellular, or all domains.
Secondary Literature
The secondary literature is derived by the following procedure. For each of the hand-selected papers referenced by a domain, 100 neighbouring papers are retrieved using Medline. If one of these neighbouring papers is referenced from more than two original papers, it is included in the secondary literature list.
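The selection rule in this procedure can be sketched in a few lines of Python. The sketch assumes the Medline neighbour lists have already been retrieved and are passed in as a dictionary; the paper identifiers are made up. Excluding the hand-selected papers from the result is an assumption of this sketch, not something the notes state.

```python
from collections import Counter

def secondary_literature(neighbours_by_paper, min_sources=3):
    """Keep neighbours reached from more than two hand-selected papers."""
    counts = Counter()
    for neighbours in neighbours_by_paper.values():
        counts.update(set(neighbours))  # count each original paper once
    primary = set(neighbours_by_paper)  # assumption: skip the originals
    return sorted(p for p, n in counts.items()
                  if n >= min_sources and p not in primary)

# Hypothetical neighbour lists for four hand-selected papers A-D.
neighbours = {"A": ["x", "y"], "B": ["x", "z"], "C": ["x", "y"], "D": ["y", "A"]}
selected = secondary_literature(neighbours)
```

"More than two original papers" is read as a count of at least three, hence the default min_sources=3.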
Seed Alignment
Alignment that contains only one of each pair of homologues that are represented in a CLUSTALW-derived phylogenetic tree linked by a branch of length less than a distance of 0.2 (see the related article).
SEG
A program of Wootton & Federhen [1] that detects regions of the query sequence that have low compositional complexity [2].
Sequence ID or ACC
Sequence identifiers or accession codes may be entered via the SMART homepage to initiate a query. You can use either Uniprot or Ensembl sequence identifiers.
Signalling Domains
The original set of domains used in SMART was collected as those that satisfied one or both of two criteria:
1. cytoplasmic domains that possess kinase, phosphatase, ubiquitin ligase, or phospholipase enzymatic activities, or those that stimulate GTPase activation or guanine nucleotide exchange
2. cytoplasmic domains that occur in at least two proteins with different domain organisations, of which one also contains a domain that satisfies criterion 1
These domains mostly mediate or regulate the transduction of an extracellular signal towards the nucleus, resulting in the initiation of a cellular response. More recently, prokaryotic two-component signalling domains have been added to the SMART set.
SignalP
This program predicts the presence and location of signal peptide cleavage sites in amino acid sequences (SignalP home page).
SMART Simple Modular Architecture Research Tool
Our main goal in providing this tool is to allow automatic identification and annotation of domains in user-supplied protein sequences.
Species
Numbers of domains present in a variety of selected taxa (animal, archaea, bacteria, fungi, plants, and protozoa) are shown in annotation pages.
SwissProt
The SwissProt database is an extensively annotated and non-redundant collection of protein sequences. SwissProt annotations have been mined for SMART-derived annotations of alignments.
TMHMM2
This program predicts the location and topology of transmembrane helices (TMHMM2).
Thresholds
For each of the domains found by SMART, a combination of thresholds is used to distinguish between true and false hits. The different thresholds are described in the SMART paper.
WiseTools
A package that is based on database searches using profiles. Profiles may be generated using PairWise and then compared with sequence databases using SearchWise. Scores are generated for alignments that match the whole of the profile (using a positive profile) or else that match at least part of the profile (using a negative profile). Only the top-scoring optimal alignment of each
sequence is reported; hence SMART relies on re-iterating the search for new repeats until none are reported that score above threshold. Score thresholds have been set manually that are considered to represent a score just above the top-scoring true negative. Additional thresholds have been estimated for domains that are repeated in single polypeptides. A more recent package allows comparison of DNA sequences at the level of their conceptual translations, regardless of sequence error and introns (see Wise2).
What's new
Changes from version 6.0 to 7.0
Full text search engine
Perform a full text search of SMART and Pfam domain annotations, plus the complete protein descriptions for Uniprot and Ensembl proteins.
metaSMART
Explore and compare domain architectures in various publicly available metagenomics datasets.
iTOL export and visualization
Domain architecture analysis results can be exported and visualized in interactive Tree Of Life. The new option can be found in the protein list function select list.
User interface cleanup
Various small changes to the UI, resulting in faster and easier navigation.
Changes from version 5.1 to 6.0
Metabolic pathways information
SMART domains and proteins available in the genomic mode now have basic metabolic pathways information. It is generated by mapping our genomic mode protein database to the KEGG orthologous groups.
Updated genomic mode protein database
The greatly expanded genomic mode database now includes sequences from 630 completely sequenced species.
Changes from version 5.0 to 5.1
SMART webservice
You can access SMART using our webservice. Check the WSDL file for details. The webservice is still under active development and only works for sequences and IDs which are in our database (complete Uniref100 and stable Ensembl genomes).
SMART DAS server
SMART DAS server is available at URL http://smart.embl.de/smart/das. It provides all the protein features from our database (SMART domains, signal peptides, transmembrane regions and coiled coils) for all Uniref100 and Ensembl proteins.
If you need help with these services or have questions/feedback, please contact us.
Changes from version 4.1 to 5.0
New protein database in Normal mode
SMART now uses Uniprot as the main source of protein sequences. All Ensembl proteomes (except pre-releases) are also included. To lower redundancy in the database, the following procedure is used:
o only one copy of 100% identical proteins is kept (different IDs are still available)
o each species' proteins are separated
o CD-HIT clustering with 96% identity cutoff is performed on each species separately
o longest member of each protein cluster is used as the representative
o only representative cluster members and single proteins (i.e. proteins which are not members of any clusters) are used in all domain architecture queries and for domain counts in the annotation pages
Even though the number of proteins in the database has almost doubled (the current version has around 2.9 million proteins), the redundancy should be minimal.
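The last two steps of this procedure (choose the longest member of each cluster, then keep representatives plus unclustered proteins) can be sketched as below. The clustering itself is assumed to have been done already by CD-HIT and is simply passed in; the function name and the toy sequences are hypothetical.

```python
def pick_representatives(clusters, singletons):
    """clusters: list of clusters, each a list of (protein_id, sequence)."""
    reps = []
    for members in clusters:
        # longest sequence wins; ties go to the first member listed
        protein_id, _ = max(members, key=lambda m: len(m[1]))
        reps.append(protein_id)
    # proteins outside any cluster are kept as they are
    return reps + list(singletons)

# Hypothetical clusters for one species, plus one unclustered protein.
clusters = [[("A1", "MKT"), ("A2", "MKTAYIAK")], [("B1", "MSS"), ("B2", "MS")]]
representatives = pick_representatives(clusters, ["C1"])
```

The tie-breaking rule is a choice made for this sketch; the notes do not say how SMART resolves members of equal length.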
Domain architecture invention dating
As a further step from the single domain to the understanding of multi-domain proteins, SMART now predicts the taxonomic class where the concept of a protein, that is its domain architecture, was invented. The domain architecture is defined as the linear order of all SMART domains in the protein sequence. To derive the point of its invention, all proteins with the same domain architecture are mapped onto NCBI's taxonomy. The last common ancestor of all organisms containing at least one protein with the domain architecture is defined as the point of its origin.
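A small Python sketch of the last-common-ancestor step, assuming each organism is given as a root-to-species lineage list. The lineages below are simplified for illustration and are not real NCBI taxonomy records; the function name is hypothetical.

```python
def last_common_ancestor(lineages):
    """Each lineage runs root -> species; return the deepest shared node."""
    shared = None
    for rank in zip(*lineages):      # walk all lineages in parallel
        if len(set(rank)) != 1:      # first rank where they disagree
            break
        shared = rank[0]
    return shared

# Simplified lineages of three organisms sharing one domain architecture.
lineages = [
    ["cellular organisms", "Eukaryota", "Metazoa", "Chordata", "Homo sapiens"],
    ["cellular organisms", "Eukaryota", "Metazoa", "Arthropoda", "Drosophila melanogaster"],
    ["cellular organisms", "Eukaryota", "Fungi", "Saccharomyces cerevisiae"],
]
origin = last_common_ancestor(lineages)
```

With these three lineages the architecture would be dated to Eukaryota; adding a bacterial lineage would push the origin back to the root.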
Changes from version 4.0 to 4.1
Two modes of operation: Normal and Genomic
For more details, visit the change mode page.
Intrinsic protein disorder prediction
You can now include DisEMBL prediction of protein disorder in your search parameters. The results are displayed as blue regions in protein schematics. DisEMBL's HOTLOOPS and REM465 methods are currently used. Visit the DisEMBL page for more information on the method.
Catalytic activity check
SMART now includes data on specific requirements for catalytic activity for some of our domains (50 catalytic domains at the moment, full list here). If the required amino acids are not present in the predicted domain, it will be marked as 'Inactive'. The domain annotation page will show you the details on which amino acids are missing and the links to relevant literature.
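The idea of the catalytic activity check can be sketched as follows, assuming the requirements are stored as a map from a 1-based position within the domain to the set of acceptable amino acids. This data layout, the function name and the example residues are hypothetical; SMART's real check presumably works on alignment positions.

```python
def missing_catalytic_residues(domain_seq, required):
    """Return (position, observed residue) for every unmet requirement."""
    missing = []
    for pos, allowed in sorted(required.items()):
        # positions beyond the predicted domain are reported as '-'
        residue = domain_seq[pos - 1] if 0 < pos <= len(domain_seq) else "-"
        if residue not in allowed:
            missing.append((pos, residue))
    return missing

# Hypothetical requirement: Asp at position 3, Lys or Arg at position 5.
required = {3: {"D"}, 5: {"K", "R"}}
active = not missing_catalytic_residues("AADAK", required)
problems = missing_catalytic_residues("AANAK", required)
```

A domain with an empty result would be shown as active; any entry in the list would mark it as inactive and name the residue at fault.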
Taxonomic trees
Architecture query results are now displayed as simple taxonomic trees. In addition to individual proteins, you can select any taxonomic node (or multiple nodes) and display all the proteins in those nodes. The Evolution section of domain annotation pages is now also represented as a taxonomic tree. The basic tree contains only several hand-picked representative species, with a link to the full tree.
User interface redesigned
SMART has been completely rewritten and all pages conform to the XHTML 1.0 Strict and CSS Level 2 standards. We recommend a modern standards-compliant browser for the best experience. Mozilla Firefox and Opera are our favorites.
Changes from version 3.5 to 4.0
Intron positions are shown in protein schematics
For proteins that match any of the Ensembl predictions, SMART will show intron positions as vertical coloured lines in graphical representations (see example). This information is retrieved from a pre-calculated mapping of Ensembl gene structures to protein sequences.
The vertical line at the end of the protein is not an actual intron, but a mark to show that intron mapping was performed. If that is the only line, there are no introns annotated. If there is no line at all, there is no data available in Ensembl for that particular sequence.
You can switch off intron display on your SMART preferences page.
Alternative splicing information
Since SMART now incorporates Ensembl genomes, the Additional information page shows a list of alternative splices of the gene encoding the analyzed protein (if there are any). It is possible to either display SMART protein annotation for any of the alternative splices or get a graphical multiple sequence alignment of all of them.
Orthology information
There are 2 separate sets of orthologs for each Ensembl protein: 1:1 reciprocal best matches in other genomes, and orthologous groups with reciprocal best hits from all genomes analyzed (i.e. each of these proteins has exactly one ortholog in all 6 genomes). This data is displayed on the Additional information page.
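The reciprocal best match rule between two genomes can be sketched in Python. The inputs are assumed to be precomputed best-hit tables (protein in genome A mapped to its top hit in genome B, and the reverse); the identifiers are invented and the function name is hypothetical.

```python
def reciprocal_best_hits(best_a_to_b, best_b_to_a):
    """Pairs (a, b) where a's best hit is b and b's best hit is a."""
    return sorted((a, b) for a, b in best_a_to_b.items()
                  if best_b_to_a.get(b) == a)

# Hypothetical best-hit tables for two genomes.
a_to_b = {"a1": "b1", "a2": "b2", "a3": "b1"}
b_to_a = {"b1": "a1", "b2": "a9"}
ortholog_pairs = reciprocal_best_hits(a_to_b, b_to_a)
```

Here a3 also points at b1, but b1 points back at a1 only, so a3 is not reported as a 1:1 ortholog.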
Graphical multiple sequence alignments
Orthologous groups and different alternative splices can be displayed as graphical multiple sequence alignments. Proteins are aligned using ClustalW. Domains, intrinsic features and introns are mapped onto the alignment, with their positions adjusted according to gaps (black boxes).
Changes from version 3.4 to 3.5
Features of all Ensembl genomes are stored in SMART
You can use standard Ensembl protein identifiers (for example ENSP00000264122) and sequences in all queries.
Batch access
Quickly display results for hundreds of sequences. Currently limited to 500 sequences per access, 2000 per day.
Smarter handling of identical sequences
If there are multiple IDs associated with a particular sequence, an extra table will be displayed showing all of them.
Changes from version 3.3 to 3.4
Search structure based profiles using RPS-Blast
Clicking on the 'search schnipsel and structures' checkbox will now also initiate a search of profiles based on SCOP domain families using RPS-Blast. These profiles were kindly provided by Steffen Schmidt (see Schmidt et al., J Chem Inf Comput Sci 2002 (42) 405-7).
Improved Architecture analysis
In addition to standard Domain selection querying, it is now possible to do queries based on GO (Gene Ontology, click here for more info) terms associated with domains. In the first step, you get a list of domains matching the GO terms entered. After selecting the domains of interest from the list, proteins containing those domains are displayed. Use the Taxonomic selection box to limit the results based on taxonomic ranges.
Pfam domains are stored in the database
SMART database now contains precomputed results for all Pfam domains. To use Pfam domains in the architecture queries, prepend the domain name with 'Pfam:' (for example TyrKc AND Pfam:Fz AND TRANS).
Try our SMART Toolbar for Mozilla web browser. Click here for more info.
Changes from version 3.2 to 3.3
Fantastic new protein picture generator
Proteins are now displayed as dynamically generated PNG images. This means that you can download the entire protein representation as a single image. You have our permission to do this and use these diagrams in any way that you like; do acknowledge us, though. All domain bubbles have been script-generated using The Gimp and its Perl-Fu extension. The script is GPL'd and is available for download.
Fancy colour alignments
All alignments we generate are coloured in using Leo Goodstadt's excellent CHROMA program. CHROMA is available from here, and it's great. You may experience some problems if you're using a klunky browser. This will be fixed when you change your browser.
Excellent transmembrane domains
These are now calculated using the fine TMHMM2 program (the main web site is here), kindly provided by Anders Krogh and co-workers. You can read about the method here.
Selective SMART
We now store intrinsic features such as transmembrane domains and signal peptides in our database. This means that these can be queried for (e.g. SIGNAL AND TyrKc). The feature names are: signal peptide, SIGNAL; transmembrane domain, TRANS; coiled-coil, COIL.
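To make the query semantics concrete, here is a minimal Python sketch of how an AND query over domain and feature names could be evaluated against one protein. It treats names case-insensitively and lets a repeated name demand that many copies, as described for Selective SMART in these notes. The function is a hypothetical illustration and not SMART's actual query engine; it handles only the upper-case AND operator.

```python
from collections import Counter

def matches(query, features):
    """True if the protein carries every queried name often enough."""
    need = Counter(term.strip().lower() for term in query.split(" AND "))
    have = Counter(name.lower() for name in features)
    return all(have[name] >= copies for name, copies in need.items())

# Hypothetical annotations for two proteins.
receptor = ["SIGNAL", "TRANS", "TyrKc"]
adaptor = ["SH3", "SH3", "SH2"]

hit = matches("SIGNAL AND TyrKc", receptor)
three_sh3 = matches("Sh3 AND sH3 AND sh3", adaptor)
```

The adaptor protein has only two SH3 copies, so the three-copy query does not match it.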
Technical changes
SMART code has been modified to run under the Apache mod_perl module. SMART is now using Apache::DBI for persistent database connections. The database engine has been updated to PostgreSQL 7.1. These changes resulted in significant speed improvements.
Bug fixes as usual
The show_many_proteins script now uses the POST method, so there is no longer a limit on the number of proteins you can display. Fixed a display problem with proteins having thousands of different representations (if you try to display those, make sure you have a good browser (like Mozilla or Opera :-) and a bunch of RAM).
Changes from version 3.1 to 3.2
Start page
There is an indicator of the current database status in the header now. If the database is down or is being updated, you'll know immediately.
Literature
Changed literature identifiers to PMID. New secondary literature generator that parses all neighbouring papers, not just the first 100.
Numerous small bug fixes and improvements
Changes from version 3.0 to 3.1
Startup page
The start page now includes selective SMART and allows searching for keywords in the annotation of domains.
Schnipsel Blast
The results of a schnipsel Blast search are included in the bubble diagram. Additionally, PDB is searched.
Taxonomic breakdown
When selecting multiple proteins, e.g. via selective SMART or the annotation pages, an overview of the taxonomy of all species (tax break) is offered.
Links
Links don't contain version numbers. This allows stable links from external sources.
Selective SMART
Allows searching for multiple copies of domains and is case insensitive. You can now search for e.g. Sh3 AND sH3 AND sh3.
Annotation
You can align your query sequence to the SMART alignment using hmmalign.
Update of underlying database
SMART now uses PostgreSQL 6.5.2.
Changes from version 2.0 to 3.0
Digest output
SMART now only produces a single diagram representing a best interpretation of all the annotation that has been performed. A comprehensive summary of the results is also provided in table format.
Selective SMART
Selective SMART allows you to look for proteins with combinations of specific domains in different species or taxonomic ranges.
Alert SMART
The SMART database gets updated about once a week. If you are interested in specific domains, or combinations of them, in specific taxonomic ranges, you can use the SMART alerting service. This provides the identities of newly deposited proteins that match your query.
Domain queries
You can ask for proteins having the same domain order/composition as your query protein.
SMARTed Genomes
You can get the result of SMARTing the genomes of Caenorhabditis elegans and Saccharomyces cerevisiae via the annotation pages.
Faster PFAM searches
The PFAM searches now run on a PVM cluster.
Changes from version 1.03 to 2.0
Domain coverage
The original set of signalling domains has now been extended to include extracellular domains.
Default search method is now HMMer
The previously used underlying methodology of SMART (i.e. SWise) tended to over-extend gaps, leading to problems in defining domain borders. Although a SWise search of the SMART database remains possible (using the 'Wise searching SMART database' option available from the Home Page), the default searching method is now HMMer2. This now allows improved statistical estimates (Expectation- or E-values) of the significance of a domain hit.
Complete rewrite of the annotation pages
As with the previous version (1.03), annotation pages provide information concerning domain functions and include hyperlinks to PubMed. However, we now offer automatically derived data showing the taxonomic range and the predicted cellular localisation of proteins containing the domain in question.
Literature database
In addition to the manually derived literature sources given in version 1.03, we now provide automatically derived secondary literature from the annotation pages. These are extracted from PubMed and are cross-linked to additional abstracts.
The CATH domain structure database: new protocols and classification levels give a more comprehensive resource for exploring evolution
Lesley H. Greene, Tony E. Lewis, Sarah Addou, Alison Cuff, Tim Dallman, Mark Dibley, Oliver Redfern, Frances Pearl, Rekha Nambudiry, Adam Reid, Ian Sillitoe, Corin Yeats, Janet M. Thornton and Christine A. Orengo
WHAT'S NEW
We report the latest release (version 3.0) of the CATH protein domain database (http://www.cathdb.info). There has been a 20% increase in the number of structural domains classified in CATH, up to 86 151 domains. Release 3.0 comprises 1110 fold groups and 2147 homologous superfamilies. To cope with the increases in diverse structural homologues being determined by the structural genomics initiatives, more sensitive methods have been developed for identifying boundaries in multi-domain proteins and for recognising homologues. The CATH classification update is now being driven by an integrated pipeline that links these automated procedures with validation steps that have been made easier by the provision of information-rich web pages summarising comparison scores and relevant links to external sites for each domain being classified. An analysis of the population of domains in the CATH hierarchy and several domain characteristics are presented for version 3.0. We also report an update of the CATH Dictionary of homologous structures (CATH-DHS), which now contains multiple structural alignments, consensus information and functional annotations for 1459 well-populated superfamilies in CATH. CATH is directly linked to the Gene3D database, which is a projection of CATH structural data onto ~2 million sequences in completed genomes and UniProt.
GENERAL INTRODUCTION
The number of new structures being deposited in the Protein Data Bank (PDB) continues to grow at a considerable rate. In addition, structures being targeted by world-wide structural genomics initiatives are more likely to be novel or only very remotely related to domains previously classified. Only 2% of structures currently solved by conventional crystallography or NMR are likely to adopt novel folds (see Figures 1 and 2). A higher proportion of new folds is expected among structural genomics structures. Although the influx of more diverse structures and subsequent analysis will inform our understanding of how domains evolve, it has resulted in increasing lags between the numbers of structures being deposited and classified. In response to this situation, we have significantly improved our automated and manual protocols for domain boundary assignment and homologue recognition.
Figure 1 Annual decrease in the percentage of new structures classified in CATH which are observed to possess a novel fold. The raw data for years 1972–2005 was fit to a single exponential equation by nonlinear regression using Sigma Plot (SPSS Version
Figure 2 Annual proportion of protein structures deposited in the PDB which are classified in CATH, rejected or pending classification. The colour scheme reflects different categories of PDB chains. Black: not accepted by the CATH criteria. Red: unprocessed chains
Significant changes have been implemented in the CATH classification protocol to achieve a more highly automated system. Firstly, a seamless flow of structures between the constituent programs has been achieved by building a pipeline which integrates web services for each major comparison stage in the classification (see Figure 3). Secondly, completely automatic decisions are now being made for new protein chains with close relatives already assigned in the CATH database. There are two situations that preclude the CATH update process from being fully automated. We rely on expert manual curation for particularly challenging protein domain boundary assignments (DomChop stage) and also for classifications of remote folds and homologues (HomCheck stage). These two manual stages will remain an integral part of the system (Figure 3).
Figure 3 Flow diagram of the CATH classification pipeline. This schematic illustrates the processes involved in classifying newly determined structures in CATH. The CATH update protocol workflow from new chain to assigned domain is split into two main processes
In this paper we report our ongoing development of the automated procedures. These critical new features should better enable CATH to keep pace with the PDB (3) and facilitate its development. Key statistics on the domain structure populations and characteristics are also presented.
A REVISED CATH CLASSIFICATION HIERARCHY: CATHSOLID
In order to provide information on the sequence diversity between superfamily members, we have introduced additional levels into the CATH hierarchy. The CATH hierarchical classification scheme now consists of nine levels. Class is derived from secondary structure content, and Architecture describes the gross orientation of secondary structures, independent of connectivity. The Topology level clusters structures into fold groups according to their topological connections and numbers of secondary structures. The Homologous superfamilies cluster proteins with highly similar structures, sequences and/or functions (4,5). The new extension of the CATH classification system now includes five 'SOLID' sequence levels. S, O, L, I further divide domains within the H-level using multi-linkage clustering based on similarities in sequence identity (35, 60, 95 and 100%) (see Table 1). The D-level acts as a counter within the I-level and is appended to the classification hierarchy to ensure that every domain in CATH has a unique CATHsolid identification code (see Table 1). Specific details on the nature of the SOLID-levels can be found in the 'General Information' section of the CATH website, http://www.cathdb.info. CATH only includes experimentally determined protein structures with a 4 Å resolution or better, 40 residues in length or longer, and having 70% or more side chains resolved.
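The inclusion criteria at the end of this section can be written as a simple filter. This is a sketch only: it reads the side-chain threshold as 70% and assumes every structure comes with a numeric resolution, which the text does not spell out (for example, for NMR structures). The function and parameter names are hypothetical.

```python
def eligible_for_cath(resolution_angstrom, n_residues, frac_side_chains_resolved):
    """Apply the three stated inclusion criteria to one structure."""
    return (resolution_angstrom <= 4.0             # 4 A resolution or better
            and n_residues >= 40                   # 40 residues or longer
            and frac_side_chains_resolved >= 0.70) # assumed to mean 70%

good = eligible_for_cath(2.1, 180, 0.98)
too_coarse = eligible_for_cath(4.5, 180, 0.98)
too_short = eligible_for_cath(1.8, 35, 1.00)
```

A lower resolution value means a better structure, so "4 Å or better" becomes a less-than-or-equal test.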
Table 1 CATH version 3.0 statistics
DOMAIN BOUNDARY ASSIGNMENTS
We have further improved our automated domain boundary prediction method, CATHEDRAL (4). This is used to search a newly determined multi-domain structure against a library of representative structures from different fold groups in the CATH database to recognise constituent domains. CATHEDRAL performs an initial rapid secondary structure comparison between structures, using graph theory to identify putative fold matches, which are then more carefully aligned using a slower, more accurate dynamic programming method. A new scoring scheme has been implemented which combines information on the class of the domains being compared, their sizes, similarity in the structural environments and the number of equivalent residues. A support vector machine is used to combine the different scores and select the best fold match for each putative domain in a new multi-domain protein.
Benchmarking against a set of 964 'difficult' multi-domain chains, whose 1593 constituent domains were remotely related to folds in CATH (<35% sequence identity) and originated from 245 distinct fold groups and 462 superfamilies, showed that 90% of domains within these chains could be assigned to the correct fold group, and for 78% of them the domain boundaries were within ±15 residues of boundaries assigned by careful manual validation. Larger variations in domain boundaries are often due to the fact that in many families significant structural variation can occur during evolution, so that distant relatives vary considerably in size. If no close relative has been classified in CATH, it is likely that the only CATHEDRAL match will be to a relative with significant structural embellishments, thus making it harder to determine the correct boundaries.
Since domain boundary assignment of remote homologues is one of the most time-consuming stages in the classification, we combine multiple sources of information for each new structure on a web page to guide manual curation. Pages display scores from a range of algorithms, which include structure-based methods [CATHEDRAL (4), SSAP (6,7), DETECTIVE (8), PUU (9), DOMAK (10)], sequence-based methods such as hidden Markov models (HMMs) (11), and relevant literature. These pages are now viewable for information on putative boundaries for new multi-domain structures currently
being classified in CATH (e.g. http://www.cathdb.info/cgi-bin/cath/Chain.pl?chain_id=2g3aA).
For protein chains which are closely related to chains that are already chopped in CATH, an automated protocol has been developed (ChopClose). ChopClose identifies any previously chopped chains that have sufficiently high sequence identity and overlap with the query chain. Using SSAP (6), the query is aligned against each of these chains in turn, and in each case the domain boundaries are inherited across the alignment. The process of inheriting the boundaries often requires some adjustments to be made to account for insertions, deletions or unresolved residues. If the inheritance from one of the chains meets various criteria (SSAP score ≥80, sequence identity >80%, RMSD ≤6.0 Å, longest end extension ≤10 residues, etc.), then the resulting boundaries are used to chop the chain automatically (AutoChop). For cases where ChopClose's best result does not meet all the criteria for automatic chopping, it is provided as support information for a manual domain boundary assignment.
Refer to Figure 3 for the location of AutoChop/ChopClose within the CATH update protocol.
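The decision between automatic and manual chopping can be sketched as a threshold check. Only the four criteria named in the text are encoded; the source says there are further ones ("etc."), which are not reproduced here. The class and field names are hypothetical, and the RMSD cutoff of 6.0 Å is this sketch's reading of the source.

```python
from dataclasses import dataclass

@dataclass
class InheritedChopping:
    """Scores for boundaries inherited from one previously chopped chain."""
    ssap_score: float
    seq_identity_pct: float
    rmsd_angstrom: float
    longest_end_extension: int  # residues

def can_autochop(result):
    """True if the four listed criteria are all met."""
    return (result.ssap_score >= 80
            and result.seq_identity_pct > 80
            and result.rmsd_angstrom <= 6.0      # assumed cutoff, see lead-in
            and result.longest_end_extension <= 10)

close_relative = InheritedChopping(92.5, 97.0, 1.2, 3)
borderline = InheritedChopping(85.0, 80.0, 2.0, 4)   # identity not above 80%
decisions = [can_autochop(close_relative), can_autochop(borderline)]
```

A result that fails any one test would be passed on as supporting evidence for manual boundary assignment rather than discarded.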
NEW HOMOLOGUE RECOGNITION METHODS
We have assessed a number of HMM-based protocols for improving homologue recognition. A new protocol (Samosa), exploiting models built using multiple structure alignments to improve accuracy, gives some improvements in sensitivity (4–5%). However, a protocol exploiting an 8-fold expanded HMM library based on sequence relatives of structural domains gives an increase of nearly 10% in sensitivity (12). In addition, HMM–HMM-based approaches have been implemented using the PRC protocol of Madera and co-workers (http://supfam.org/PRC). These allow recognition of extremely remote homologues, some of which are not easily detected by the structure comparison methods (discussed further below). HMM-based database scans developed for the CATH classification protocol are collectively referred to as HMMscan below.
For some very remotely related homologues, confidence in an assignment can be improved by combining information from multiple prediction methods. We have investigated the benefits of using machine learning methods to do this automatically. A neural network was trained using a dataset of 14 000 diverse homologues (<35% sequence identity) and 14 000 non-homologous pairs, with data from different homologue comparison methods including structure comparison (CATHEDRAL, SSAP), sequence comparison (HMM–HMM) and information on functional similarity. The latter was obtained by comparing EC classification codes between close relatives of the distant homologues and using a semantic similarity scoring
scheme for comparing GO terms based on a method developed by Lord et al. (13). On a separate validation set of 14 000 homologous pairs and 14 000 non-homologous pairs, 97% of the homologues can be recognised at an error rate of <4%.
NEW UPDATE PROTOCOL
Automatic methods
Previously, CATH data was generated using a group of independent programs and flat files. Over the past two years we have developed an update protocol for CATH that is driven by a suite of programs with a central library and a PostgreSQL database system. A classification pipeline has been established which links, in a completely automated fashion, the different programs that analyse the sequences and structures of both protein chains and domains. The CATH update protocol can essentially be divided into two parts: domain boundary assignment and domain homology classification (see Figure 3). The aim of the protocol is to minimise manual assignment and provide as much support as possible when manual validation is necessary.
Processing in both parts of the classification protocol is similar, requiring related meta-data and the triggering of the same automated algorithms. Methods include pairwise sequence similarity comparisons and scans by other homologue detection or fold recognition algorithms, such as HMMscan and CATHEDRAL, that provide data for either manual or automated assignment. Many of the automated steps in the protocol have been established as a web service, and the pipeline integrates automated steps together with 'holding stages' in which domains are held prior to processing and await the completion of manual validation of predictions (see below).
Web pages to support manual validation
For each manual stage in the classification (domain boundary assignment, DomChop, and homologue recognition, HomCheck) we have developed a suite of web pages bringing together all available meta-data from prediction algorithms (e.g. DBS, CATHEDRAL and HMMscan for DomChop; CATHEDRAL and HMMscan for HomCheck), information from the literature and information from other family classifications with relevant data (e.g. Pfam). For each protein or domain shown on the pages, information on the statistical significance of matches is presented. The web pages will shortly be made viewable and will provide interim data on protein chains and domains not fully classified in CATH, for biologists interested in any entries pending classification.
OVERVIEW OF THE CURRENT RELEASE (VERSION 3.0)
Assigning domain boundaries and relationships between protein structures is computationally challenging. Since the last CATH release (version 2.6), the number of domains in the CATH database has increased by 20% in version 3.0 and now totals 86 151. This is a more than 10-fold increase in the number of domains classified in CATH since its creation. Improvements in automation, and also in the web-based resources used to aid manual validation, have allowed us to increase the proportion of hard-to-classify structures processed in CATH, and this is reflected in a significant increase in the proportion of new folds in the database (now more than 1000). The detailed breakdown of numbers of domains in the nine CATH levels is given in Table 1. We conducted an analysis of the domains in version 3.0 and have derived statistics for several fundamental features.
Percentage of new topologies
An analysis of the percentage of new folds arising from the early 1970s to the present is shown in Figure 1. The number of new folds has been decreasing over time with respect to the number of new structures being deposited, and it can be seen that currently approximately 2% of new structures classified in CATH are observed to be novel folds. For comparison, the number of domain structures solved over time is also represented graphically in Figure 1.
Number of domains within a protein chain
Integral to the construction of the CATH database is the designation of domain boundaries. We conducted an analysis of the number of chains versus the number of domains in a chain. It is interesting to note that 64% of all protein structures currently solved and classified in CATH are single-domain chains (data not shown). The next most prevalent are two-domain chains (27%), and beyond this the number of chains containing three or more domains decreases rapidly. The average size of the single-domain chains is 159 residues.
The CATH Dictionary of Homologous Superfamilies (CATH-DHS)
The CATH-DHS has also been recently updated. Data on structural similarity and superfamily variability is presented as a significant update to the Dictionary of Homologous Superfamilies (DHS) web resource (14). The DHS also provides functional annotations of domains within each H-level (superfamily) in CATH v2.5.1.
For each superfamily, pair-wise structural similarity scores between relatives, measured by SSAP, are presented. The DHS now contains 3307 multiple structural alignments for 1459 superfamilies. For each superfamily, multiple alignments are generated for all the relatives and also for subgroups of structurally similar relatives and sequence-similar relatives. Alignments are performed using the residue-based CORA algorithm (15) and presented both as CORAPLOTS (14) and in the form of a 2DSEC diagram (16), alongside co-ordinate data of the superposed structures in PDB format. Sequence representations of the alignments are available to download in FASTA format. In the CORAPLOT images of the multiple alignment, residues in each domain are coloured according to ligand binding and residue type. EquivSEC plots are also shown that describe the variability in orientation and packing between equivalent secondary structures (16).
To identify sequence relatives for CATH superfamilies, sequences from UniProt (17) were scanned against HMMs of all CATH domains (12). Homologous sequences were identified as those hits with an E-value < 0.01 and a 60% residue overlap with the CATH domain. This protocol recognised over one million domain sequences in UniProt which could be integrated in the CATH-DHS. The harvested sequences in each superfamily were compared against other relatives by BLAST (18) to determine the pair-wise sequence identity and then clustered at appropriate levels of sequence identity (35% and 95%) using multi-linkage clustering. Information and links to other functional databases, ENZYME (19), GO (Gene Ontology Consortium, 2000), KEGG (20), COG (21) and SWISS-PROT (22), are also included by BLASTing the sequences from each superfamily against sequences provided by these resources. Only hits with 95% sequence identity and an 80% residue overlap were used to annotate sequences.
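The hit-selection step described above (an E-value cutoff combined with a minimum residue overlap) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `HmmHit` record, its field names and the `harvest` helper are hypothetical and are not part of the CATH software, and the cutoffs are passed as parameters so that they can be set to the exact values used in the protocol.

```python
from dataclasses import dataclass


@dataclass
class HmmHit:
    """One hit of a sequence against a domain HMM (hypothetical record layout)."""
    sequence_id: str
    e_value: float
    aligned_residues: int  # domain residues covered by the alignment
    domain_length: int     # total residues in the domain


def passes_cutoffs(hit, max_e_value=0.01, min_overlap=0.60):
    """Return True if the hit is significant and covers enough of the domain."""
    overlap = hit.aligned_residues / hit.domain_length
    return hit.e_value < max_e_value and overlap >= min_overlap


def harvest(hits, max_e_value=0.01, min_overlap=0.60):
    """Keep the sequence ids of all hits that pass both cutoffs."""
    return [h.sequence_id for h in hits
            if passes_cutoffs(h, max_e_value, min_overlap)]


# Made-up example hits, for illustration only.
hits = [
    HmmHit("seqA", 1e-8, 85, 100),  # significant, 85% overlap
    HmmHit("seqB", 1e-8, 40, 100),  # significant, but only 40% overlap
    HmmHit("seqC", 0.5, 95, 100),   # good overlap, not significant
]
print(harvest(hits))  # ['seqA']
```

Applying both conditions together is what keeps short, fragmentary matches (high significance but low coverage) out of the harvested set.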
Recent analysis of structural and functional divergence in highly populated CATH superfamilies (>5 structural relatives with <35% sequence identity) has been undertaken using data from the DHS. The 2DSEC algorithm was used to analyse multiple structural alignments of families and identify highly conserved structural cores and secondary structure embellishments, or decorations, to the common core. In some large superfamilies extensive embellishments were observed outside the core and, although these secondary structure insertions were frequently discontinuous in the protein chain, they were often co-located in 3D space (16). In many cases manual inspection revealed that the embellishments had aggregated to form a larger structural feature that was modifying the active site of the domain or creating new surfaces for domain or protein interactions. Data collected in the DHS clearly show a relationship between structural divergence within a superfamily, sequence divergence of this superfamily amongst predicted domains in the genomes, and the number of distinct functional groups that can be identified for the superfamily (see Figure 4).
Figure 4. Relationship between sequence variability, structural variability and functional diversity in CATH superfamilies. Structural variation in a CATH superfamily, as measured by the number of diverse structural subgroups (SSAP score <80 between groups).
LATERAL LINKS ACROSS THE CATH HIERARCHY TO CAPTURE EVOLUTIONARY DIVERGENCE AND EXPLORE THE STRUCTURAL CONTINUUM
Our analysis of structural divergence in CATH superfamilies (16) has revealed families where significant changes in the structures had occurred. In some cases 5-fold differences in the sizes of domains were identified, and sometimes it was apparent that the 'folds' of these very diverse relatives had effectively changed. Therefore, in these superfamilies more than one fold group can be identified, effectively breaking the hierarchical nature of the CATH classification, which implies that each relative within a CATH homologous superfamily should belong to the same CAT fold group.
In addition, an 'all versus all' HMM–HMM scan between all superfamily representatives revealed several cases of extremely remote homologues which had been classified into separate superfamilies and yet match with significant E-values. Structure comparison had failed to detect the relationship between these superfamilies because the structural divergence of the relatives was so extreme, sometimes constituting a change in architecture as well as fold group. In these cases homology was only suggested by the HMM-based scans and then manually validated by considering functional information and detailed evidence from the literature. In order to capture information on these distant homologies, links have been created between the superfamilies, both on our web pages and in the CATH database. The data can now be found as a link from the CATH homepage (http://www.cathdb.info).
In the near future we also plan to provide web pages presenting cases of significant structural overlaps between superfamilies or fold groups. For these cases we are not currently able to find any additional evidence to support a distant evolutionary relationship, and these examples highlight the recurrence of large structural motifs between some folds and the existence of a structural continuum in some regions of fold space.
FEATURES
The CATH database can be accessed at http://www.cathdb.info. The web interface may be browsed or, alternatively, searched with PDB codes or CATH domain identifiers. There is also a facility for keyword searches. With the version 3.0 release we now make the raw and processed data files available, including for example CATH domain PDB files, sequences and DSSP files; they can be accessed through the CATH database main page. The Gene3D resource can be accessed through the CATH database or directly at http://www.cathdb.info/Gene3D. The DHS can be accessed through the CATH database or directly at http://www.cathdb.info/bsm/dhs.
Structural similarity is assessed using an automatic method (SSAP) (3,4), which scores 100 for identical proteins and generally returns scores above 80 for homologous proteins. More distantly related folds generally give scores above 70 (Topology or fold level), though in the absence of any sequence or functional similarity this may simply represent examples of convergent evolution, reinforcing the hypothesis that there exists a limited number of folds in nature (5,6).
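The score ranges quoted above can be written down as a small lookup, shown here as a Python sketch. The function name and the returned labels are illustrative assumptions, and the thresholds are the rule-of-thumb values from the text ("generally above 80", "generally above 70"), not hard boundaries used by SSAP itself.

```python
def ssap_level(score):
    """Map an SSAP score (0-100) to the similarity range described in the text."""
    if not 0 <= score <= 100:
        raise ValueError("SSAP scores lie between 0 and 100")
    if score == 100:
        return "identical"
    if score > 80:
        return "homologous range"
    if score > 70:
        return "fold (topology) range"
    return "below fold-level similarity"


for s in (100, 86.5, 74, 52):
    print(s, ssap_level(s))
```

A score in the fold range says nothing about ancestry on its own; as the text notes, sequence or functional evidence is still needed to separate distant homologues from convergent folds.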
Abstract
We report the latest release (version 1.4) of the CATH protein domains database (http://www.biochem.ucl.ac.uk/bsm/cath). This is a hierarchical classification of 13 359 protein domain structures into evolutionary families and structural groupings. We currently identify 827 homologous families in which the proteins have both structural similarity and sequence and/or functional similarity. These can be further clustered into 593 fold groups and 32 distinct architectures. Using our structural classification and associated data on protein functions stored in the database (EC identifiers, SWISS-PROT keywords and information from the Enzyme database and literature) we have been able to analyse the correlation between the 3D structure and function. More than 96% of folds in the PDB are associated with a single homologous family. However, within the superfolds, three or more different functions are observed. Considering enzyme functions, more than 95% of clearly homologous families exhibit either single or closely related functions, as demonstrated by the EC identifiers of their relatives. Our analysis supports the view that determining structures, for example as part of a 'structural genomics' initiative, will make a major contribution to interpreting genome data.
Introduction FOR CATH
The CATH classification of protein domain structures was established in 1993 (1) as a hierarchical clustering of protein domain structures into evolutionary families and structural groupings, depending on sequence and structure similarity. There are four major levels, corresponding to protein class, architecture, topology or fold, and homologous family (Fig. 1). Since 1995, information about these structural groups and protein families has been accessible over the Web (http://www.biochem.ucl.ac.uk/bsm/cath), together with summary information about each individual protein structure (PDBsum) (2).
CATH consists of both phylogenetic and phenetic descriptors for protein domain relationships. At the lowest levels in the hierarchy, proteins are grouped into evolutionary families (Homologous families) for having either significant sequence similarity (≥35% identity) or high structural similarity and some sequence similarity (≥20% identity).
The CATH database of protein structures contains approximately 18 000 domains organized according to their (C)lass, (A)rchitecture, (T)opology and (H)omologous superfamily. Relationships between evolutionarily related structures (homologues) within the database have been used to test the sensitivity of various sequence search methods in order to identify relatives in GenBank and other sequence databases. Subsequent application of the most sensitive and efficient algorithms, gapped BLAST and the profile-based method Position-Specific Iterated Basic Local Alignment Search Tool (PSI-BLAST), could be used to assign structural data to between 22% and 36% of microbial genomes in order to improve functional annotation and enhance understanding of biological mechanism. However, on a cautionary note, an analysis of functional conservation within fold groups and homologous superfamilies in the CATH database revealed that whilst function was conserved in nearly 55% of enzyme families, function had diverged considerably in some highly populated families. In these families, functional properties should be inherited far more cautiously, and the probable effects of substitutions in key functional residues must be carefully assessed.
Figure 1
Schematic representation of the (C)lass, (A)rchitecture and (T)opology/fold levels in the CATH database.
Figure 2
Snapshot of a web page showing data available in the CATH dictionary of homologous superfamilies for the subtilisin family (CATH id 3.40.50.200). Tables display the PDB codes for non-identical relatives in the family, together with EC identifier codes and information about the enzyme reactions. The multiple structural alignment shown has been coloured according to secondary structure assignments (red for helix, blue for strands).
The Architecture level in CATH groups proteins whose folds have similar 3D arrangements of secondary structures (e.g. barrel, sandwich or propellor), regardless of their connectivity, whilst the top level, Class, simply reflects the proportion of α-helix or β-strand secondary structures. Three major classes are recognised, mainly-α, mainly-β and α-β, since analysis revealed considerable overlap between the α+β and alternating α/β classes originally described by Levitt and Chothia (7).
Before classification, multidomain proteins are first separated into their constituent folds using a consensus method which seeks agreement between three independent algorithms (8). Whilst the protocol for updating CATH is largely automatic (9), several stages require manual validation, in particular establishing domain boundaries in proteins for which no consensus could be reached and checking the relationships of very distant homologues and proteins having borderline fold similarity. Although there are plans to assign the more regular architectures automatically, all architecture groupings are currently assigned manually.
A homologous family Dictionary is now available within CATH, which contains functional data, where available, for each protein within a homologous family. This includes EC identifiers, SWISS-PROT keywords and information from the Enzyme database or the literature (Fig. 2). Multiple structure-based alignments are also available, coloured according to secondary structure assignments or residue properties, and there are schematic plots showing domain representations annotated by protein-ligand interactions (DOMPLOTS) (A.E. Todd, C.A. Orengo and J.M. Thornton, submitted to Protein Engng). The topology of each domain is illustrated by schematic TOPS diagrams (http://www3.ebi.ac.uk/tops/) (10).
Figure 3
CATH wheel plot showing the population of homologous families in different fold groups, architectures and classes. The wheel is coloured according to protein class (red, mainly-α; green, mainly-β; yellow, α-β; blue, few secondary structures). The size of the outer wheel represents the number of homologous families in CATH, whilst each band in the outer wheel corresponds to a single fold family. The size of each fold 'band' therefore reflects the number of homologous families having that fold. It can be seen that most fold families contain a single homologous family. The superfold families are shown as paler bands containing many homologous families. The inner wheel shows the population of homologous families in the different architectures.
We have also recently set up a Web Server (11) which enables the user to scan the CATH database with a newly determined protein structure and identify possible fold similarities or evolutionary relationships. There are also plans to incorporate sequence searches (using BLAST or PSI-BLAST) (12) to identify a probable fold for a new sequence.
The latest release of CATH (version 1.4, April 1998) contains 9342 protein chains from the PDB (13), which divide into 13 359 domain folds. Currently 32 different architectures are recognised. Since the last release three new architectures have been described, including the five-bladed α-β propellor. Grouping proteins on the basis of sequence, structure and functional similarity gives 827 evolutionary homologous families (H-level), whilst recognising more distant structural similarity with no accompanying sequence or function similarity gives rise to 593 different fold groups (T-level).
The population of the different levels in the CATH hierarchy is illustrated by the CATH wheel shown in Figure 3. It can be seen that several highly populated fold families, which we describe as superfolds (6) as they support a diverse range of sequences and more than three different functions, still account for nearly 30% of non-homologous structures.
Implications for Structural Genomics
As the sequence databases grow rapidly, the need to interpret these sequences and assign functions to specific genes becomes increasingly important. Many techniques exist for matching protein sequences and thereby inheriting functional information. However, for very distant homologues there is often no detectable sequence similarity despite conservation of 3D structure and function. For these cases evolutionary relationships, and thereby functions, can only be assigned by comparing the structures. Therefore a number of structural genomics initiatives are being proposed (14) which aim to identify all the folds in nature, with the ultimate goal of being able to predict the function of a new protein from its known or probable structure. The important questions to ask are: how many more folds do we need to determine before we have the complete set, and how confident can we be in assigning function between proteins having similar structures?
In the current genomes, on average only 30–46% of sequences can be assigned to a structural family by recognising sequence similarity to a protein of known structure (15,16). With only ∼600 unique structures currently in the PDB, compared with ∼20 000 sequence families, it is clear that we still need to determine many more structures if we are to understand biology at the molecular level. However, analysis of recently deposited structural data is very revealing. Figure 4a illustrates the distribution of 2159 new structural domains classified in the 10 months from June 1997 to March 1998. A large proportion of these (79%) were clearly homologous (≥30% identity) to proteins of known structure.
Of the remaining 443 structures (Fig. 4b), corresponding to 'new' sequences, we found only 8% were novel folds, the remainder resembling a previously determined structure. Many of these, 199 (45%), could be identified as clear homologues by having significant structure and sequence similarity (SSAP ≥80 and ≥20% sequence identity). A further 169 (38%) were probable homologues as, although the sequence identity was below 20%, they had functional similarity and/or gave significant scores using sequence search methods designed to detect very distant homologues (PSI-BLAST) (12). There remained a further 40 (9%) proteins which were analogous, i.e. they had the same fold as a previous entry but neither the sequence nor the function gave definite evidence of a common ancestor.
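The three categories described above, and the arithmetic behind the quoted percentages, can be checked with a short Python sketch. The function name, argument names and labels are illustrative assumptions; the rule only covers structures that already resemble a known fold, and borderline combinations that the text does not discuss fall through to "analogue" here as a simplification.

```python
def classify(ssap, seq_identity, functional_similarity=False, distant_hit=False):
    """Classify a structure that resembles a known fold (simplified rule)."""
    if ssap >= 80 and seq_identity >= 20:
        return "clear homologue"
    if seq_identity < 20 and (functional_similarity or distant_hit):
        return "probable homologue"
    return "analogue"


# Counts quoted in the text for the 443 structures with 'new' sequences.
TOTAL = 443
counts = {"clear homologue": 199, "probable homologue": 169, "analogue": 40}
counts["novel fold"] = TOTAL - sum(counts.values())

percentages = {name: round(100 * n / TOTAL) for name, n in counts.items()}
print(counts["novel fold"])  # 35
print(percentages)
```

The remainder of 35 structures corresponds to the 8% of novel folds, and the clear and probable homologues together account for 45% + 38% = 83% of the set.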
At the homologous superfamily level in CATH, a more detailed analysis of enzyme functions showed that the majority of homologous enzyme families in CATH (>90%) contained proteins for which the first three EC identifiers were the same. Considering those families where homologues have significant sequence identity (≥20%) after structural alignment, 95% were found to have a single EC identifier, whilst for families where proteins have more than 30% sequence similarity we observed that 98% had a single EC code.
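Whether the "first three EC identifiers" agree across a family is a simple prefix comparison on the four-level EC number. The Python sketch below illustrates the idea; the function names are hypothetical, the EC strings are used only as example inputs, and the handling of unassigned levels (written "-") as non-matching is an assumption of this sketch.

```python
def shared_ec_levels(ec_a, ec_b):
    """Count how many leading EC levels two EC numbers have in common."""
    shared = 0
    for a, b in zip(ec_a.split("."), ec_b.split(".")):
        if a != b or a == "-":  # stop at the first difference or unassigned level
            break
        shared += 1
    return shared


def same_to_level(ec_numbers, level=3):
    """True if every EC number agrees with the first one to the given level."""
    reference = ec_numbers[0]
    return all(shared_ec_levels(reference, ec) >= level for ec in ec_numbers)


family = ["3.4.21.62", "3.4.21.64"]      # example EC strings
print(shared_ec_levels(*family))         # 3
print(same_to_level(family, level=3))    # True
print(same_to_level(family, level=4))    # False
```

With `level=4` the same check corresponds to a family having a single EC code, the stricter criterion used for the 95% and 98% figures.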
Although assigning function on the basis of homology is common practice, it is clear that some caution should be exercised, particularly where there is little or no sequence similarity. There are also some clear examples where homologues with significant sequence similarity perform different functions. The role of 'gene recruitment' is especially clear in the eye lens proteins, which function as enzymes in other cellular environments but which are used as structural proteins in this context (17). The extent of such 'gene recruitment' and context-sensitive function is really not known at this time. For enzymes it is clear that catalytic function can change and evolve, usually to act on a different but related substrate. Similarly, within the lipocalin family (CATH id 2.40.130.10) several proteins are found with very similar structures which bind different fatty acids in the same region at the base of the β-barrel (e.g. retinol, bilin, biotin).
Nearly half of the homologous families where two or more different EC numbers were observed belong to the superfolds. This suggests that if a new protein is assigned to a superfold family, more caution should be used when inheriting functional information, as there appears to be greater tolerance to changes in sequence, and ultimately function, for these families. However, it is interesting to note that many of these were TIM barrel or Rossmann folds. These are superfolds in which the substrate or ligand commonly binds in the same place: in the base of the β-barrel for the TIMs and at the crossover of the polypeptide chain for the doubly wound Rossmann structures.
Assignment of Function Through Structure
One of the reasons for determining structures is to derive more information to facilitate the assignment of function. From our analysis of proteins in CATH we suggest that structural data can help to assign function in several ways:
(i) The structural data allow recognition of more distant homologues compared with sequence data. In our analysis, 83% of structures with novel sequences could be assigned as homologues in this way (note that such assignment of function is again subject to the caveats imposed by 'gene recruitment' discussed above).
(ii) The structural data allow detailed inspection of the functional site, to suggest if and how the function may have evolved. For example, if an enzyme has evolved to act on a different substrate, the binding site may reveal, or at least suggest, possible changes in the substrate.
(iii) For the superfolds, similarity of structure does not necessarily mean similarity of function. However, the active site/binding sites are often conserved, e.g. in the TIM barrel or Rossmann fold structures the ligand always binds at the same end of the barrel or sheet.
(iv) Some methods have already been developed, and will increasingly be the focus of attention over the next few years, which aim to predict function ab initio from structure. For example, enzymes can often be identified by the presence of a major cleft, which also locates the active site (18). Similarly, critical surface patches which are used for molecular recognition in binding other proteins or ligands may be identified using knowledge-based approaches (19,20).
In summary, extrapolating the data from Figure 4 to a new genome, we can expect that of the 54–70% of sequences which currently have no obvious sequence matches in the PDB, we will find nearly 80–90% to be homologous to a known family using the structural data alone. For the singlet folds this will almost certainly reveal some clues to the function. For the superfolds, some folds will reveal information on the functional class (e.g. enzyme for TIM barrels) or the location of the active site, if not the specific function. Only 10–20% will be expected to be 'novel' folds. For these, the ab initio methods referred to above may provide some clues to guide experiments. Therefore it is clear that determining structures, as part of a 'structural genomics' initiative for example, will make a major contribution to interpreting genome data.
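As a rough worked example of this extrapolation, the quoted ranges can be multiplied through to express the outcome as a share of a whole genome rather than of the unmatched sequences only. The Python sketch below is back-of-the-envelope arithmetic under the assumption that the two ranges can simply be combined bound by bound; the variable and function names are illustrative.

```python
def combine(outer, inner):
    """Multiply two (low, high) fraction ranges bound by bound."""
    return outer[0] * inner[0], outer[1] * inner[1]


unmatched = (0.54, 0.70)    # sequences with no obvious sequence match in the PDB
homologous = (0.80, 0.90)   # of those, homologous to a known family by structure
novel = (0.10, 0.20)        # of those, expected to be novel folds

for label, share in (("homologous by structure", homologous),
                     ("novel folds", novel)):
    low, high = combine(unmatched, share)
    print(f"{label}: {100 * low:.1f}% to {100 * high:.1f}% of all sequences")
```

On these assumptions, structure-based assignment would cover roughly 43% to 63% of all sequences in a genome beyond those already matched by sequence, while roughly 5% to 14% would be expected to have novel folds.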
JAIL
Just Another Interface Library
Interfaces of macromolecules are a valuable basis for analysing the process of molecular recognition. JAIL classifies not only the interfaces between domain architectures but also those between protein chains and those between proteins and nucleic acids.
Gt-alpha/Gi-alpha chimera (PDB-ID 1GOT)
Interfaces of 1GOT
Interacting proteins are difficult to crystallize and are rarely present within the Protein Data Bank. Nevertheless, it is essential to analyse the interacting parts of the proteins to understand the process of protein-protein docking.
To overcome this problem we have built the JAIL database. Since interacting domains exhibit structural features similar to those of interacting proteins, all known interfaces between interacting domains of the SCOP database were extracted and classified in JAIL.
Only a part of all protein structures is included in SCOP. In particular, new PDB entries are not yet annotated. To overcome this problem, all interfaces between protein chains were additionally calculated and included in the database. This type of interface also comprises the interacting parts of the assumed biological units. The last important type of interface provided here is composed of the interacting parts between proteins and nucleic acids. Overall, the data set consists of about 180 000 interfaces.
JAIL is a convenient tool for browsing the interface library and analysing single interfaces. However, more general questions require large-scale analysis. For this purpose, a detailed form enables the compilation of comprehensive non-redundant data sets for download.
How is an interface defined?
A complete residue is part of an interface if at least one atom of the amino acid is located within 4.5 Ångström of any atom of the interacting domain or chain. In the case of protein chains, each side of an interface must consist of at least 5 C-alpha atoms. In the case of nucleic acids, the relevant atoms are calculated on the basis of the phosphorus atoms of the RNA/DNA backbone.
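The distance rule above can be illustrated with a minimal Python sketch. The `Atom` record, the function names and the toy coordinates are assumptions made for illustration and are not taken from JAIL; the cutoff is taken as 4.5 Å, the size check counts residues as a stand-in for C-alpha atoms (one per residue), and the brute-force all-against-all loop is chosen for clarity rather than speed.

```python
from collections import namedtuple

# Minimal atom record (hypothetical layout): chain id, residue number,
# atom name and Cartesian coordinates in Angstrom.
Atom = namedtuple("Atom", "chain residue name x y z")


def interface_residues(side_a, side_b, cutoff=4.5):
    """Residues of side_a with at least one atom within `cutoff` of any atom of side_b."""
    cutoff_sq = cutoff * cutoff
    selected = set()
    for a in side_a:
        key = (a.chain, a.residue)
        if key in selected:
            continue  # the whole residue is already part of the interface
        for b in side_b:
            dist_sq = (a.x - b.x) ** 2 + (a.y - b.y) ** 2 + (a.z - b.z) ** 2
            if dist_sq <= cutoff_sq:
                selected.add(key)
                break
    return selected


def large_enough(residues, min_residues=5):
    """Size check for one protein side of an interface."""
    return len(residues) >= min_residues


# Toy coordinates, for illustration only.
chain_a = [
    Atom("A", 10, "CA", 0.0, 0.0, 0.0),
    Atom("A", 10, "CB", 1.5, 0.0, 0.0),
    Atom("A", 11, "CA", 20.0, 0.0, 0.0),
]
chain_b = [Atom("B", 50, "CA", 5.0, 0.0, 0.0)]

print(interface_residues(chain_a, chain_b))  # {('A', 10)}
print(interface_residues(chain_b, chain_a))  # {('B', 50)}
```

Residue 10 is selected because its CB atom lies 3.5 Å from the partner atom, even though its CA atom is 5.0 Å away; this is what "a complete residue" means in the definition. For real structures a spatial index (grid or k-d tree) would normally replace the double loop.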
What are biounits?
The primary coordinate file deposited in the PDB generally contains one asymmetric unit. The asymmetric unit is the smallest portion of a crystal structure to which crystallographic symmetry can be applied to generate one unit cell. The biological molecule (biounit) is believed to be the functional unit of the protein. Frequently those units can be assumed or calculated when additional information is available. The biological units of many proteins are deposited in a separate section of the PDB database and can be used for interface calculations.
How are redundant interfaces excluded in the download section?
Redundancy is excluded in two different ways: by structure and by sequence. The sequence clustering is based on the CD-HIT program. The structural clustering is defined by the protein families and superfamilies of the SCOP classification. The database classifies proteins by domain architecture.
Which settings in the download section are best for my own research?
The selection of the datasets depends on the type of interactions (protein-protein or protein-nucleic acid) and the level of diversity that is desired. A maximum sequence identity of 50% results in higher diversity than a setting of 95%. The default settings include interfaces of domain-domain interactions as well as interfaces between interacting chains. All interfaces of chains that are already covered by the SCOP domain interfaces are excluded by default. This procedure results in a high number of interfaces that are still diverse enough for statistical analysis.
What is meant by 'show conservation' in Jmol?
The conservation of protein sequences is defined by the mutation rates at each amino acid position. For JAIL this information was retrieved from ConSurf. ConSurf is a derived database merging structural and sequence information.
Database scheme
Search for:
PDB-Id (e.g. 1aay)
SCOP-Id (e.g. d1az0a_)
EC-Number (e.g. 3.1.1)
Accession number (e.g. P03697)
Protein name (e.g. capsid protein)
Search in the following interface types: Domain/Domain (SCOP), Chain/Chain, Protein/Nucleic, Biounit/Biounit, None
Full-text search: keyword
SCOP text search: keyword
Consider only interfaces having the following location: intra, inter, don't care
MMDB
Experimentally resolved structures of proteins, RNA and DNA, derived from the Protein Data Bank (PDB), with value-added features such as explicit chemical graphs, computationally identified 3D domains (compact substructures) that are used to identify similar 3D structures, as well as links to literature, similar sequences, information about chemicals bound to the structures, and more. These connections make it possible, for example, to find 3D structures for homologs of a protein sequence of interest, then interactively view the sequence-structure relationships, active sites, bound chemicals, journal articles, and more.
Three-dimensional structures are now known within many protein families, and it is quite likely, in searching a sequence database, that one will encounter a homolog with known structure. The goal of Entrez's 3D-structure database is to make this information, and the functional annotation it can provide, easily accessible to molecular biologists. To this end Entrez's search engine provides three powerful features: (i) Sequence and structure neighbors: one may select all sequences similar to one of interest, for example, and link to any known 3D structures. (ii) Links between databases: one may search by term matching in MEDLINE, for example, and link to 3D structures reported in these articles. (iii) Sequence and structure visualization: identifying a homolog with known structure, one may view molecular-graphic and alignment displays to infer approximate 3D structure. In this article we focus on two features of Entrez's Molecular Modeling Database (MMDB) not described previously: links from individual biopolymer chains within 3D structures to a systematic taxonomy of organisms represented in molecular databases, and links from individual chains (and compact 3D domains within them) to structure neighbors, other chains (and 3D domains) with similar 3D structure. MMDB may be accessed at http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Structure
SUPERFAMILY is a database of structural and functional annotation for all proteins and genomes.[1][2][3][4][5] The SUPERFAMILY annotation is based on a collection of hidden Markov models which represent structural protein domains at the SCOP superfamily level.[6] A superfamily groups together domains which have an evolutionary relationship. The annotation is produced by scanning protein sequences from completely sequenced genomes against the hidden Markov models.
For each protein you can:
Submit sequences for SCOP classification
View domain organisation, sequence alignments and protein sequence details
For each genome you can:
Examine superfamily assignments, phylogenetic trees, domain organisation lists and
networks
Check for over- and under-represented superfamilies within a genome
For each superfamily you can:
Inspect SCOP classification, functional annotation, Gene Ontology annotation, InterPro
abstract and genome assignments
Explore taxonomic distribution of a superfamily across the tree of life
All annotation, models and the database dump are freely available for download to everyone.
Purpose
SUPERFAMILY classifies amino acid sequences into known structural domains, especially
into SCOP superfamilies. The superfamilies are groups of proteins which have structural
evidence to support a common evolutionary ancestor but may not have detectable
sequence homology.
Major Features
Sequence search: Submit your protein or DNA sequence for SCOP superfamily and family level classification.
Keyword search: Search for superfamily, family or species names, plus sequence, SCOP, PDB or hidden Markov model IDs.
Domain assignments: Domain assignments, alignments and architectures for completely sequenced eukaryotic and prokaryotic organisms, plus sequence collections.
Comparative genomics tools: Browse unusual (over- and under-represented) superfamilies and families, adjacent domain pair lists and graphs, unique domain pairs, domain combinations, domain architecture co-occurrence networks and domain distribution across taxonomic kingdoms for each organism.
Genome statistics: For each genome: number of sequences, number of sequences with assignment, percentage of sequences with assignment, percentage total sequence coverage, number of domains assigned, number of superfamilies assigned, number of families assigned, average superfamily size, percentage produced by duplication, average sequence length, average length matched, number of domain pairs and number of unique domain architectures.
Gene Ontology: Domain-centric Gene Ontology (GO), automatically annotated, by Hai Fang.
Phenotype Ontology: Domain-centric phenotype/anatomy ontology, including Disease Ontology, Human Phenotype, Mouse Phenotype, Worm Phenotype, Yeast Phenotype, Fly Phenotype, Fly Anatomy, Zebrafish Anatomy, Xenopus Anatomy, Arabidopsis Plant.
Superfamily annotation: InterPro abstracts for 1052 superfamilies and Gene Ontology (GO) annotation for 763 superfamilies.
Functional annotation: Functional annotation of SCOP 1.73 superfamilies by Christine Vogel.
Phylogenetic trees: Trees are generated using heuristic parsimony methods and are based on protein domain architecture data for all genomes in SUPERFAMILY. Genome combinations or specific clades can be displayed as individual trees.
Similar domain architectures: Find the 10 domain architectures which are most similar to a domain architecture of interest.
Hidden Markov models: Produce SCOP domain assignments for your sequences using the SUPERFAMILY models. HMM visualisation by Martin Madera, e.g. model 0045110.
Profile comparison: Find remote domain matches when the HMM search fails to find a significant match. Profile comparison (PRC), for aligning and scoring two profile hidden Markov models, by Martin Madera.
Web services: Distributed Annotation Server and linking to SUPERFAMILY.
Downloads: Sequences, assignments, models, MySQL database and scripts, updated weekly.
The SUPERFAMILY database provides protein domain assignments, at the SCOP superfamily
level, for the predicted protein sequences in over 400 completed genomes. A superfamily groups
together domains of different families which have a common evolutionary ancestor based on
structural, functional and sequence data. SUPERFAMILY domain assignments are generated
using an expert curated set of profile hidden Markov models. All models and structural
assignments are available for browsing and download from http://supfam.org. The web interface
includes services such as domain architectures and alignment details for all protein assignments,
searchable domain combinations, domain occurrence network visualization, detection of over- or
under-represented superfamilies for a given genome by comparison with other genomes,
assignment of manually submitted sequences and keyword searches. In this update we describe
the SUPERFAMILY database and outline two major developments: (i) incorporation of family
level assignments and (ii) a superfamily-level functional annotation. The SUPERFAMILY
database can be used for general protein evolution and superfamily-specific studies, genomic
annotation, and structural genomics target suggestion and assessment.
DIP Interacting Proteins (Univ of California)
GDB (Human Genome Organization)
KEGG Functional Db (Univ of Kyoto)
MGI Mouse Genome (Jackson Lab)
OMIM Inherited Diseases (National Center for Biotechnology Information)
SWISS-PROT Protein Db (Swiss Institute of Bioinformatics)
PEDANT Protein Db (Forschungszentrum f. Umwelt & Gesundheit)
List with SNP-Databases
Reactome The Genome Knowledgebase (EBI)
Microarray-DBs
ArrayExpress (European Bioinformatics Institute)
Gene Expression Omnibus (National Center for Biotechnology Information)
maxd (Univ of Manchester)
SMD (Univ of Stanford)
Accession codes vs identifiers
Many databases in bioinformatics (SWISS-PROT, EMBL, GenBank, Pfam) use a
system where an entry can be identified in two different ways. Basically, it has two
names:
Identifier
Accession code (or number)
The question of how to deal with changed, updated and deleted entries in databases is
a very tricky problem, and the policies for how accession codes and identifiers are
changed or kept constant are not completely consistent between databases, or even
over time for one single database.
The exact definition of what the identifier and accession code are supposed to denote
varies between the different databases, but the basic idea is the following.
Identifier
An identifier ("locus" in GenBank, "entry name" in SWISS-PROT) is a string of
letters and digits that generally is interpretable in some meaningful way by a human,
for instance as a recognizable abbreviation of the full protein or gene name.
SWISS-PROT uses a system where the entry name consists of two parts: the first
denotes the protein and the second part denotes the species it is found in. For
example, KRAF_HUMAN is the entry name for the Raf-1 oncogene from Homo
sapiens.
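The two-part structure of an entry name can be illustrated with a short Python sketch. The function name `split_entry_name` is hypothetical, and the sketch assumes only that the two parts are separated by a single underscore, as in the KRAF_HUMAN example; it does not check the parts against any real list of protein or species codes.

```python
def split_entry_name(entry_name):
    """Split an entry name such as 'KRAF_HUMAN' into (protein, species) codes."""
    protein, sep, species = entry_name.partition("_")
    # Reject names without exactly one underscore or with an empty part.
    if not sep or not protein or not species or "_" in species:
        raise ValueError(f"not a two-part entry name: {entry_name!r}")
    return protein, species


protein_code, species_code = split_entry_name("KRAF_HUMAN")
print(protein_code, species_code)
```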
An identifier can usually change. For example, the database curators may decide
that the identifier for an entry is no longer appropriate. However, this does not
happen very often. In fact, it happens so rarely that it's not really a big problem.
Accession code (number)
An accession code (or number) is a number (possibly with a few characters in
front) that uniquely identifies an entry in its database. For example, the accession
code for KRAF_HUMAN in SWISS-PROT is P04049.
The main conceptual difference from the identifier is that it is supposed to be stable:
any given accession code will, as soon as it has been issued, always refer to that
entry or its ancestors. It is often called the primary key for the entry. The
accession code, once issued, must always point to its entry, even after large changes
have been made to the entry. This means that in discussions about specific database
entries (e.g. an article about a specific protein), one should always give the
accession code for the entry in the relevant database.
In the case where two entries are merged into a single one, the new entry
will have both accession codes, where one will be the primary and the other
the secondary accession code. When an entry is split into two, both new entries
will get new accession codes, but will also have the old accession code as a secondary
code.
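The merge and split rules described above can be sketched in Python. This is a minimal illustration, not the bookkeeping of any real database: the `Entry` class, the helper names and the accession codes used in the usage note (A00001 and so on) are all hypothetical.

```python
class Entry:
    """A database entry with one primary and any number of secondary accession codes."""

    def __init__(self, identifier, primary, secondary=()):
        self.identifier = identifier
        self.primary = primary
        self.secondary = list(secondary)

    def accessions(self):
        """All codes that must keep pointing at this entry."""
        return [self.primary] + self.secondary


def merge(kept, absorbed):
    """Merge two entries: the kept primary code stays primary,
    every code of the absorbed entry becomes secondary."""
    return Entry(kept.identifier, kept.primary,
                 kept.secondary + absorbed.accessions())


def split(entry, new_entries):
    """Split one entry: each (identifier, new_primary) pair gets a new
    primary code and inherits the old codes as secondary codes."""
    return [Entry(ident, acc, entry.accessions()) for ident, acc in new_entries]


def build_index(entries):
    """Map every accession code (primary or secondary) to the entries carrying it."""
    index = {}
    for entry in entries:
        for acc in entry.accessions():
            index.setdefault(acc, []).append(entry)
    return index
```

For example, merging hypothetical entries with codes A00001 and A00002 yields one entry that is still found under either code, and splitting that entry yields two entries that are both found under the old codes. This is why a lookup by accession code can return more than one current entry.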
NUCLEOTIDE DATABASES
NCBI's sequence databases accept genome data from sequencing projects
from around the world and serve as the cornerstone of bioinformatics
research.
GenBank: An annotated collection of all publicly available nucleotide and amino acid
sequences.
EST database: A collection of expressed sequence tags, or short, single-pass
sequence reads from mRNA (cDNA).
GSS database: A database of genome survey sequences, or short, single-pass
genomic sequences.
HomoloGene: A gene homology tool that compares nucleotide sequences between
pairs of organisms in order to identify putative orthologs.
HTG database: A collection of high-throughput genome sequences from large-scale
genome sequencing centers, including unfinished and finished sequences.
SNPs database: A central repository for both single-base nucleotide substitutions
and short deletion and insertion polymorphisms.
RefSeq: A database of non-redundant reference sequence standards, including
genomic DNA contigs, mRNAs, and proteins for known genes. Multiple collaborations,
both within NCBI and with external groups, support our data-gathering efforts.
STS database: A database of sequence tagged sites, or short sequences that are
operationally unique in the genome.
UniSTS: A unified, non-redundant view of sequence tagged sites (STSs).
UniGene: A collection of ESTs and full-length mRNA sequences organized into
clusters, each representing a unique known or putative human gene, annotated with
mapping and expression information and cross-references to other sources.
DNA & RNA Databases
Major Sequence Repositories – Human Chromosome Information – Organelle
Genome Databases – RNA Databases – Comparative & Phylogenetic Databases –
SNPs, Mutations and Variations Databases – Alternative Splicing Databases –
Specialized Databases
Major Sequence Repositories
DDBJ DNA databank of Japan
EMBL Maintained by EMBL
GenBank Maintained by NCBI
Human Chromosome Information
Chromosome information is available for each of chromosomes 1 to 22, X and Y.
Organelle Genome Databases
OGMP Organelle genome megasequencing program
GOBASE An organelle genome database
MitoMap Human mitochondrial genome database
RNA Databases
Rfam RNA family database
RNA base Database of RNA structures
tRNA database Database of tRNAs
tRNA tRNA sequences and genes
sRNA Small RNA database
Comparative & Phylogenetic Databases
COG Phylogenetic classification of proteins
DHMHD Human-mouse homology database
HomoloGene Gene homologies across species
Homophila Human disease to Drosophila gene database
HOVERGEN Database of homologous vertebrate genes
TreeBase A database of phylogenetic knowledge
XREF Cross-referencing with model organisms
SNPs, Mutations & Variations Databases
ALPSbase Database of mutations causing human ALPS
dbSNP Single nucleotide polymorphism database at NCBI
HGVbase Human Genome Variation database
Alternative Splicing Databases
ASAP Alternate splicing analysis tool at UCLA
ASG Alternate splicing gallery
HASDB Human alternative splicing database at UCLA
AsMamDB alternatively spliced genes in human mouse and rat
ASD Alternative splicing database at CSHL
Specialised Databases
ABIM Links to several genomics database
ACUTS Ancient conserved untranslated sequences
AGSD Animal genome size database
AmiGO The Gene Ontology database
ARGH The acronym database
ASDB Database of alternatively spliced genes
BACPAC BAC and PAC genomic DNA library info
BBID Biological Biochemical image database
Cardiac gene database
CHLC Genetic markers on chromosomes
COGENT Complete genome tracking database
COMPEL Composite regulatory elements in eukaryotes
CUTG Codon usage database
dbEST Database of expressed sequences or mRNA
dbGSS Genome survey sequence database
dbSTS Sequence tagged sites (STS)
DBTSS Database of transcriptional start sites
DOGS Database of genome sizes
EID The exon-intron database ndash Harvard
Exon-Intron Exon-Intron database ndash Singapore
EPD Eukaryotic promoter database
FlyTrap HTML based gene expression database
GDB The genome database
GenLink Resources for human genetic and telomere research
GeneKnockouts Gene knockout information
GENOTK Human cDNA database
GEO Gene expression omnibus NCBI
GOLD Information on genome projects around the world
GSDB The Genome Sequence DataBase
HGI TIGR human gene index
HTGS High-throughput genomic sequence at NCBI
IMAGE The largest collection of DNA sequences clones
IMGT The international ImMunoGeneTics information system
IPCN Index to Plant Chromosome Numbers database
LocusLink Single query interface to sequence and genetic loci
TelDB The telomere database
MitoDat Mitochondrial nuclear genes
Mouse EST NIA mouse cDNA project
MPSS Searchable databases of several species
NDB Nucleic acid database
NEDO Human cDNA sequence database
NPD Nuclear protein database
Oomycetes DB Oomycetes database at Virginia Bioinformatics Institute
PLACE Database of plant cis-acting regulatory DNA elements
RDP Ribosomal database project
RDB Receptor database at NIHS Japan
Refseq The NCBI reference sequence project
RHdb Radiation hybrid physical map of chromosomes
SHIGAN SHared Information of GENetic resources Japan
SpliceDB Canonical and non-canonical splice site sequences
STACK Consensus human EST database
TAED The adaptive evolution database
TIGR Curated databases of microbes plants and humans
TRANSFAC The Transcription Factor Database
TRRD Transcription Regulatory region database
UniGene Cluster of sequences for unique genes at NCBI
UniSTS Nonredundant collection of STS
Protein Databases
Protein Sequence Databases – Protein Structure Databases – Protein Domains, Motifs
and Signatures – Others
Protein Sequence Databases
Antibodies Sequence and Structure
BRENDA Enzyme database
CD Antigens Database of CD antigens
dbCFC Cytokine family database
Histons Histone sequence database
HPRD Human protein reference database
InterPro Integrated documentation resources for protein families
iProClass An integrated protein classification database
KIND A non-redundant protein sequence database
MHCPEP Database of MHC binding peptides
MIPS Munich information centre for protein sequences
PIR Annotated and non-redundant protein sequence database
PIR-ALN Curated database of protein sequence alignments
PIR-NREF PIR nonredundant reference protein database
PMD Protein mutant database
PRF Protein research foundation Japan
ProClass Non-redundant protein database
ProtoMap Hierarchical classification of swissprot proteins
REBASE Restriction enzyme database
RefSeq Reference sequence database at NCBI
SwissProt Curated protein sequence database
SPTR Comprehensive protein sequence database
Transfac Transcription factor database
TrEMBL Annotated translations of EMBL nucleotide sequences
Tumor gene database Genes with cancer-causing mutations
WD repeats WD-repeat family of proteins
Protein Structure Databases
Cath Protein structure classification
HIV Protease HIV protease database 3D structure
PDB 3-D macromolecular structure data
PSI Protein structure initiative
S2F Structure to function project
Scop Structural Classification of Proteins
Protein Domains, Motifs & Signatures
BLOCKS Multiple aligned segments of conserved protein regions
CDD Conserved domain database and search service
DOMO Homologous protein domain families
Pfam Database of protein domains and HMMs
ProDom Protein domain database
Prints Protein motif fingerprint database
Prosite Database of protein families and domains
SMART Simple modular architecture research tool
TIGRFAM Protein families based on HMMs
Others
Phospho Site Database of phosphorylation sites
PROW Protein reviews on the web
Protein Lounge Complete systems biology
Other Databases
Carbohydrate Databases
Carb DB Carbohydrate Sequence and Structure Database
GlycoWord Glycoscience related information
SPECARB Raman Spectra of carbohydrates
Other Databases
AlzGene Alzheimer's disease
Polygenic pathways Alzheimer's disease, Bipolar disorder or Schizophrenia
Model Organism Databases and Resources
Arabidopsis – Bacteria – Sea Bass – Cat – Cattle – Chicken – Cotton – CyanoBacteria – Daphnia – Deer – Dictyostelium – Dog – Frog – Fruit Fly – Fungus –
Goat – Horse – Medaka Fish – Maize – Malaria – Mosquito – Mouse – Pig – Plants –
Protozoa – Puffer Fish – Rat – Rice – Rickettsia – Salmon – Sheep – Soy –
Sorghum – Tetraodon – Tilapia – Turkey – Viruses – Worm – Yeast – Zebra Fish
General Information
GMOD Generic Model Organism Database
Model Organisms The WWW virtual library of model organisms
Arabidopsis thaliana
ABRC Arabidopsis biological resource center
AGI Arabidopsis genome initiative
AREX Arabidopsis gene expression database
Arabinet Arabidopsis information on the www
AtGDB An Arabidopsis thaliana plant genome database
AtGI TIGR Arabidopsis thaliana gene index
ATGC Genome sequencing at ATGC
ATIDB Arabidopsis insertion database
CSHL Arabidopsis genome analysis at Cold Spring
ESSA Arabidopsis thaliana project at MIPS
Genoscope AGI in France
Kazusa Arabidopsis thaliana genome info Japan
MPSS Massively parallel signature sequencing
NASC Nottingham Arabidopsis stock center
Stanford Sequencing of the Arabidopsis genome at Stanford
TAIR Arabidopsis information resource
TIGR TIGR Arabidopsis genome annotation database
Wustl Arabidopsis genome at Washington university
Trees A forest tree genome database
Bacterial genomes
B. subtilis Bacillus subtilis database
Chlamydomonas Chlamydomonas genetics center
E. coli E. coli genome project
MGD Microbial germ plasm database
Microbial Microbial Genome Gateway
Microbial Microbial genomes
Micado Genetic maps of B. subtilis and E. coli
MycDB An integrated Mycobacterial database
Neisseria Neisseria meningitidis genome
Neurospora Neurospora crassa database
OralGen Oral pathogen database
Salmonella Salmonella information
STDGen Sexually transmitted disease database
Bass
Bass Sea Bass Mapping project
Cat (Felis catus)
Cat ArkDB Cat mapping database
Cattle (Bos taurus)
ARK Farm animals
BoLA Bovine MHC information
Bovin Bovine genome database
BovMap Mapping the bovine genome
CaDBase Genetic diversity in cattle
ComRad Comparative radiation hybrid mapping
Cow ArkDB Bovine ArkDB
GemQual Genetics of meat quality
Chicken (Gallus gallus)
Chicken Poultry gene mapping project
ChickMap Chicken genome project
Chicken ArkDB Chicken database
ChickEST Chick EST database
Poultry Poultry genome project
Cotton
Cotton Cotton data collection site
Cyano Bacteria (Blue green algae)
Cyano Bacteria Anabaena genome
Daphnia (Crustacea)
Daphnia pulex Daphnia genomics consortium
Deer
Deer ArkDB Deer mapping database
Dictyostelium discoideum
Dicty_cDB Dictyostelium discoideum cDNA project
DGP Dictyostelium discoideum genome project
Dictybase Online informatics resources for Dictyostelium
Dog (Canis familiaris)
Dog Dog genome project
Dog genome project
Frog (Xenopus)
Xenbase A Xenopus web resource
Xenopus Xenopus tropicalis genome
Fruit fly (Drosophila melanogaster)
ENSEMBL Drosophila Genome Browser at ENSEMBL
Fruitfly Drosophila genome project at Berkeley
FlyBase A Database of the Drosophila Genome
FlyMove A Drosophila multimedia database
FlyView A Drosophila image database
Fungus
Aspergillus Aspergillus Genomics
Candida Candida albicans information page
FungalWeb Fungi database
FGSC Fungal genetic stocks center
Goat (Capra hircus)
Goat GoatMap mapping the caprine genome
Horse (Equus caballus)
Horse ArkDB Horse mapping database
Medaka Fish
Medaka Medaka fish home page
Maize
Maize Maize genome database
Malaria (Plasmodium spp)
Malaria Malaria genetics and genomics
PlasmoDB Plasmodium falciparum genome database
Parasites Parasite databases of clustered ESTs
Parasite Genome Parasite genome databases
Mosquito
Mosquito Mosquito genome web server
Mouse (Mus musculus)
ENSEMBL Mouse genome server at ENSEMBL
Jackson Lab Mouse Resources
MRC Mouse genome center at MRC UK
MGI Mouse genome informatics at Jackson Labs
MGD Mouse genome database
MGS Mouse genome sequencing at NIH
MIT Genetic and physical maps of the mouse genome
Mouse SNP Mouse SNP database
NCI Mouse repository
NIH NIH mouse initiative
ORNL Mutant mouse database
RIKEN Mouse resources
Rodentia The whole mouse catalog
Pig (Sus scrofa)
INCO Pig trait gene mapping
Pig Pig EST database
Pig Pig gene mapping project
PiGBase Pig genome mapping
Pig ArkDB Pig Ark DB
Plants
PlantGDB Resources for plant comparative genomics
Protozoa
Protozoa Protozoan genomes
Pufferfish
Fugu Puffer fish project UK site
Fugu Fugu genome project Singapore
Fugu Puffer fish project USA
Rat (Rattus norvegicus)
MIT Genetic maps of the Rat genome
NIH Rat genomics and genetics
Rat RatMap
RGD Rat genome database
Rice (Oryza sativa)
MPSS Massively parallel signature sequencing
Rice-research Rice genome sequence database
Rice Rice genome project
Rickettsia
RicBase Rickettsia genome database
Salmon
Salmon ArkDB Salmon mapping database
Sheep (Ovis aries)
Sheep Sheep gene mapping
SheepBase Sheep gene mapping
Sheep ArkDB Sheep mapping database
Soy
Soy Soybeans database
Sorghum
Sorghum Sorghum Genomics
Tetraodon
Tetraodon Tetraodon nigroviridis genome
Tetraodon Tetraodon nigroviridis genome at Whitehead
Tilapia
HCGS Tilapia genome
Tilapia ArkDB Tilapia mapping database
Turkey
Turkey ArkDB Turkey mapping database
Viruses
HIV HIV sequence database
Herpes Human herpes virus 5 database
Worm (Caenorhabditis elegans)
C. elegans C. elegans genome sequencing project
NemBase Resource for nematode sequence and functional data
WormAtlas Anatomy of C. elegans
WormBase The Genome and biology of C. elegans
ACEDB A C. elegans database
WWW Server C. elegans web server
Yeast
SCPD The promoter database of Saccharomyces cerevisiae
SGD Saccharomyces genome database
S. pombe Schizosaccharomyces pombe genome project
TRIPLES Functional analysis of Yeast genome at Yale
Yeast Intron database Spliceosomal introns of the yeast
Zebra fish (Danio rerio)
ZFIN Zebrafish information network
ZGR Zebrafish genome resources
ZIS Zebrafish information server
Zebrafish Zebrafish webserver
DOMAIN DATABASE
Domains can be thought of as distinct functional and/or structural units of a
protein. These two classifications coincide rather often, as a matter of fact, and what is
found as an independently folding unit of a polypeptide chain often also carries a specific
function. Domains are often identified as recurring (sequence or structure) units,
which may exist in various contexts. In molecular evolution such domains may have
been utilized as building blocks, and may have been recombined in different
arrangements to modulate protein function. We can define conserved domains as
recurring units in molecular evolution, the extents of which can be determined by
sequence and structure analysis. Conserved domains contain conserved sequence patterns or motifs, which allow for their detection in polypeptide sequences.
The goal of the NCBI conserved domain curation project is to provide database users with
insights into how patterns of residue conservation and divergence in a family relate to functional
properties, and to provide useful links to more detailed information that may help to understand
those sequence/structure/function relationships. To do this, CDD Curators include the following
types of information in order to supplement and enrich the traditional multiple sequence
alignments that form the foundation of domain models: 3-dimensional structures and conserved core motifs, conserved features/sites, phylogenetic organization, and links to electronic literature
resources.
CDD
Conserved Domain Database (CDD)
CDD is a protein annotation resource that consists of a collection of well-annotated multiple sequence alignment models for ancient domains and full-length proteins. These are available as position-specific score matrices (PSSMs) for fast
identification of conserved domains in protein sequences via RPS-BLAST. CDD content includes NCBI manually curated domains, which use 3D-structure information to explicitly define domain boundaries and provide insights into sequence/structure/function relationships, as well as domain models imported from a number of external source databases (Pfam, SMART, COG, PRK, TIGRFAM).
CD-Search & Batch CD-Search
CD-Search is NCBI's interface to searching the Conserved Domain Database with protein or nucleotide query sequences. It uses RPS-BLAST, a variant of PSI-BLAST, to quickly scan a set of pre-calculated position-specific scoring matrices (PSSMs) with a protein query. The results of CD-Search are presented as an annotation of protein domains on the user query sequence, and can be visualized as domain multiple sequence alignments with embedded user queries. High confidence associations between a query sequence and conserved domains are shown as specific hits. The CD-Search Help provides additional details, including information about running CD-Search locally.
Batch CD-Search serves as both a web application and a script interface for a conserved domain search on multiple protein sequences, accepting up to 100000 proteins in a single job. It enables you to view a graphical display of the concise or full search result for any individual protein from your input list, or to download the results for the complete set of proteins. The Batch CD-Search Help provides additional details.
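The idea of scanning a query against a position-specific scoring matrix can be sketched in a few lines of Python. This is a deliberately simplified illustration and not how RPS-BLAST works internally: it does an exhaustive, ungapped scan, the function name `best_pssm_hit` is hypothetical, and the three-column matrix and the default penalty of -4 are invented values.

```python
def best_pssm_hit(query, pssm, default=-4):
    """Slide an ungapped PSSM along the query and return (best_score, start).

    pssm is a list of dicts, one per model position, mapping residue -> score.
    Residues missing from a column receive the default penalty (an assumption).
    """
    width = len(pssm)
    if len(query) < width:
        raise ValueError("query is shorter than the PSSM")
    best = None
    for start in range(len(query) - width + 1):
        score = sum(col.get(query[start + i], default)
                    for i, col in enumerate(pssm))
        if best is None or score > best[0]:
            best = (score, start)
    return best


# Toy three-column matrix with invented scores.
toy_pssm = [
    {"G": 5, "A": 1},
    {"K": 4, "R": 3},
    {"S": 4, "T": 3},
]
print(best_pssm_hit("MAGKSLL", toy_pssm))  # prints (13, 2): G, K, S at offset 2
```

Because each column has its own scores, the same residue can be rewarded at one position and penalised at another, which is what distinguishes a PSSM from a single substitution matrix.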
CDART: Domain Architectures
Conserved Domain Architecture Retrieval Tool (CDART) performs similarity searches of the Entrez Protein database based on domain architecture, defined as the sequential order of conserved domains in protein queries. CDART finds protein similarities across significant evolutionary distances using sensitive domain profiles rather than direct sequence similarity. Proteins similar to the query are grouped and scored by architecture. You can search CDART directly with a query protein sequence or, if a sequence of interest is already in the Entrez Protein database, simply retrieve the record, open its Links menu and select Domain Relatives to see the precalculated CDART results. Relying on domain profiles allows CDART to be fast and, because it relies on annotated functional domains, informative.
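Grouping and scoring proteins by architecture can be illustrated with a small Python sketch. The text does not describe how CDART actually scores architectures, so the scoring used here (the number of distinct domains shared with the query) is an assumption chosen for simplicity, and the function names and protein IDs are hypothetical.

```python
from collections import defaultdict


def group_by_architecture(proteins):
    """Group protein IDs by architecture (the ordered tuple of their domains)."""
    groups = defaultdict(list)
    for protein_id, domains in proteins.items():
        groups[tuple(domains)].append(protein_id)
    return dict(groups)


def shared_domains(query_arch, other_arch):
    """Crude similarity score: number of distinct domains in common."""
    return len(set(query_arch) & set(other_arch))


def rank_architectures(query_arch, proteins):
    """Return (score, architecture, protein IDs) for every architecture that
    shares at least one domain with the query, best score first."""
    scored = [(shared_domains(query_arch, arch), arch, sorted(ids))
              for arch, ids in group_by_architecture(proteins).items()]
    scored = [item for item in scored if item[0] > 0]
    return sorted(scored, key=lambda item: (-item[0], item[1]))


# Hypothetical proteins annotated with ordered domain lists.
proteins = {
    "p1": ["SH3", "SH2", "Pkinase"],
    "p2": ["SH3", "SH2", "Pkinase"],
    "p3": ["SH2", "Pkinase"],
    "p4": ["PDZ"],
}
for score, arch, ids in rank_architectures(["SH3", "SH2", "Pkinase"], proteins):
    print(score, arch, ids)
```

Comparing domain lists instead of raw sequences is what makes this kind of search cheap: two proteins can share an architecture even when their sequences have diverged too far for direct alignment.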
CDTree
CDTree is a helper application for your web browser that allows you to interactively view and examine conserved domain hierarchies curated at NCBI. CDTree works with Cn3D as its alignment viewer/editor; it is used in the CDD curation process and is both a classification and research tool for functional annotation and the study of protein and protein domain families.
Content
CDD content includes NCBI manually curated domain models and domain models imported from
a number of external source databases (Pfam, SMART, COG, PRK, TIGRFAMs). What is unique
about NCBI-curated domains is that they use 3D-structure information to explicitly define domain
boundaries, align blocks, amend alignment details, and provide insights into
sequence/structure/function relationships. Manually curated models are organized hierarchically if
they describe domain families that are clearly related by common descent. To provide a
non-redundant view of the data, CDD clusters similar domain models from various sources into
superfamilies.
Searching the database
The collection is also part of NCBI's Entrez query and retrieval system, cross-linked to numerous other resources. CDD provides annotation of domain footprints and conserved functional sites on protein sequences. Precalculated domain annotation can be retrieved for protein sequences tracked in NCBI's Entrez system, and CDD's collection of models can be queried with novel protein sequences via the CD-Search service, or via Batch CD-Search, which allows the computation and download of annotation for large sets of protein queries. CDD also contains data from additional research projects, such as KOGs (a eukaryotic counterpart to COGs) and the Library of Ancient Domains (LOAD).
Accessions that start with "cl" are for superfamily cluster records and can contain domain models from one or more source databases. When searching CDD, it is possible to limit search results to domains from any given source database by using the Database Search Field.
Phylogenetic organization: Based on evidence from sequence comparison, NCBI Conserved Domain Curators attempt to organize related domain models into phylogenetic family hierarchies.
Links to electronic literature resources: NCBI curated domains also provide links to citations in PubMed and NCBI Bookshelf that discuss the domain. These references are selected by curators and, whenever possible, include articles that provide evidence for the biological function of the domain and/or discuss the evolution and classification of a domain family.
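Telling superfamily clusters apart from other records by accession prefix, and limiting a result list to one source, might look like the following Python sketch. Only the "cl" prefix is stated in the text; every other prefix in the table is an assumption made for illustration, as are the function names and the accession numbers used in the example.

```python
# Only "cl" comes from the text; the remaining prefixes are assumptions.
SOURCE_PREFIXES = {
    "cl": "superfamily cluster",
    "cd": "NCBI-curated",   # assumption
    "pfam": "Pfam",         # assumption
    "smart": "SMART",       # assumption
    "COG": "COG",           # assumption
    "PRK": "PRK",           # assumption
    "TIGR": "TIGRFAMs",     # assumption
}


def classify_accession(accession):
    """Return the source label for a CDD-style accession, or 'unknown'."""
    # Try longer prefixes first so a short prefix cannot shadow a longer one.
    for prefix in sorted(SOURCE_PREFIXES, key=len, reverse=True):
        rest = accession[len(prefix):]
        if accession.startswith(prefix) and rest.isdigit():
            return SOURCE_PREFIXES[prefix]
    return "unknown"


def filter_by_source(accessions, source):
    """Keep only the accessions that come from the given source."""
    return [acc for acc in accessions if classify_accession(acc) == source]


hits = ["cl00001", "cd00180", "pfam00069", "cl99999"]
print(filter_by_source(hits, "superfamily cluster"))
```

This mimics, on a local list of hits, the kind of restriction the Database Search Field applies on the server side.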
PFAM
Pfam 27.0 (March 2013, 14831 families)
Proteins are generally composed of one or more functional regions, commonly termed domains. The presence of different domains in varying combinations in different proteins gives rise to the diverse repertoire of proteins found in nature. Identifying the domains present in a protein can provide insights into the function of that protein.
The Pfam database is a large collection of protein domain families. Each family is represented by multiple sequence alignments and hidden Markov models (HMMs).
There are two levels of quality to Pfam families: Pfam-A and Pfam-B. Pfam-A entries are derived from the underlying sequence database, known as Pfamseq, which is built from the most recent release of UniProtKB at a given time-point. Each Pfam-A family consists of a curated seed alignment containing a small set of representative members of the family, profile hidden Markov models (profile HMMs) built from the seed alignment, and an automatically generated full alignment, which contains all detectable protein sequences belonging to the family, as defined by profile HMM searches of primary sequence databases.
Pfam-B families are un-annotated and of lower quality, as they are generated automatically from the non-redundant clusters of the latest ADDA release. Although of lower quality, Pfam-B families can be useful for identifying functionally conserved regions when no Pfam-A entries are found.
Pfam entries are classified in one of four ways:
Family: A collection of related protein regions
Domain: A structural unit
Repeat: A short unit which is unstable in isolation but forms a stable structure when multiple copies are present
Motif: A short unit found outside globular domains
Related Pfam entries are grouped together into clans; the relationship may be defined by similarity of sequence, structure or profile-HMM.
Pfam also generates higher-level groupings of related families, known as clans. A clan is a collection of Pfam-A entries which are related by similarity of sequence, structure or profile-HMM.
YOU CAN FIND DATA IN PFAM IN VARIOUS WAYS
Analyze your protein sequence for Pfam matches
View Pfam family annotation and alignments
See groups of related families
Look at the domain organisation of a protein sequence
Find the domains on a PDB structure
Query Pfam by keywords
Enter any type of accession or ID to jump to the page for a Pfam family or clan, UniProt sequence, PDB structure, etc.
Browse Pfam
You can use the links on the Pfam site to find lists of families, clans or proteomes which begin with the chosen letter (or number). You can also see a list of Pfam families which are new to this release, or the list of the twenty largest families in terms of number of sequences.
Glossary of terms used in Pfam
These are some of the commonly used terms in the Pfam website
Alignment coordinates
HMMER3 reports two sets of domain coordinates for each profile HMM match. The envelope coordinates delineate the region on the sequence where the match has been probabilistically determined to lie, whereas the alignment coordinates delineate the region over which HMMER is confident that the alignment of the sequence to the profile HMM is correct. Our full alignments contain the envelope coordinates from HMMER3.
Architecture
The collection of domains that are present on a protein
Clan
A collection of related Pfam entries. The relationship may be defined by similarity of sequence, structure or profile-HMM.
Domain
A structural unit
Domain score
The score of a single domain aligned to an HMM. Note that, for HMMER2, if there was more than one domain, the sequence score was the sum of all the domain scores for that Pfam entry. This is not quite true for HMMER3.
DUF
Domain of unknown function
Envelope coordinates
See Alignment coordinates
Family
A collection of related protein regions
Full alignment
An alignment of the set of related sequences which score higher than the manually set threshold values for the HMMs of a particular Pfam entry.
Gathering threshold (GA)
Also called the gathering cut-off, this value is the search threshold used to build the full alignment. The gathering threshold is assigned by a curator when the family is built. The GA is the minimum score a sequence must attain in order to belong to the full alignment of a Pfam entry. For each Pfam HMM we have two GA cutoff values: a sequence cutoff and a domain cutoff.
HMMER
The suite of programs that Pfam uses to build and search HMMs. Since Pfam release 24.0 we have used HMMER version 3 to make Pfam. See the HMMER site for more information.
Hidden Markov model (HMM)
An HMM is a probabilistic model. In Pfam we use HMMs to transform the information contained within a multiple sequence alignment into a position-specific scoring system. We search our HMMs against the UniProt protein database to find homologous sequences.
HMMER3
The suite of programs that Pfam uses to build and search HMMs. See the HMMER site for more information.
iPfam
A resource that describes domain-domain interactions that are observed in PDB entries. Where two or more Pfam domains occur in a single structure, it analyses them to see if they are close enough to form an interaction. If they are close enough, it calculates the bonds forming the interaction.
Metaseq
A collection of sequences derived from various metagenomics datasets
Motif
A short unit found outside globular domains
Noise cutoff (NC)
The bit scores of the highest scoring match not in the full alignment
Pfam-A
An HMM-based, hand-curated Pfam entry which is built using a small number of representative sequences. We manually set a threshold value for each HMM and search our models against the UniProt database. All of the sequences which score above the threshold for a Pfam entry are included in the entry's full alignment.
Pfam-B
An automatically generated alignment which is formed by taking a cluster of sequences from the ADDA database and removing Pfam-A residues from them. Since Pfam-B families are automatically generated, we recommend that you verify that the sequences in a Pfam-B family are related using other methods, such as BLAST. For Pfam 24.0 we have made HMMs for the first (and therefore largest) 20,000 Pfam-B families. Users can search their sequences against the Pfam-B HMMs, in addition to the Pfam-A HMMs, when performing both single-sequence searches and batch searches on the website.
Posterior probability
HMMER3 reports a posterior probability for each residue that matches a match or insert state in the profile HMM. A high posterior probability shows that the alignment of the amino acid to the match/insert state is likely to be correct, whereas a low posterior probability indicates that there is alignment uncertainty. This is indicated on a scale with '*' being 10, the highest certainty, down to 1 being complete uncertainty. Within Pfam we display this information as a heat map view, where green residues indicate high posterior probability and red ones indicate a lower posterior probability.
Repeat
A short unit which is unstable in isolation but forms a stable structure when
multiple copies are present
Seed alignment
An alignment of a set of representative sequences for a Pfam entry. We use this alignment to construct the HMMs for the Pfam entry.
Sequence score
The total score of a sequence aligned to an HMM. If there is more than one domain, the sequence score is the sum of all the domain scores for that Pfam entry. If there is only a single domain, the sequence and the domain score for the protein will be identical. We use the sequence score to determine whether a sequence belongs to the full alignment of a particular Pfam entry.
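The relationship between domain scores, the sequence score and the two gathering cutoffs can be sketched as follows. This is a minimal illustration, not Pfam's own code: the DomainHit class and function names are hypothetical, the simple sum follows the glossary definition (which the Domain score entry notes is not quite true for HMMER3), and the way the sequence and domain cutoffs are combined here is an assumption.

```python
from dataclasses import dataclass


@dataclass
class DomainHit:
    """One domain match of a sequence to a profile HMM (hypothetical container)."""
    start: int
    end: int
    bit_score: float


def sequence_score(hits):
    """Sequence score as defined in the glossary: the sum of all domain scores."""
    return sum(h.bit_score for h in hits)


def hits_passing_gathering(hits, sequence_cutoff, domain_cutoff):
    """Return the domain hits that pass both GA cutoffs.

    Assumption: the sequence must reach the sequence cutoff first, and each
    reported domain must then reach the domain cutoff.
    """
    if sequence_score(hits) < sequence_cutoff:
        return []
    return [h for h in hits if h.bit_score >= domain_cutoff]


hits = [DomainHit(1, 50, 12.0), DomainHit(60, 110, 15.5)]
print(sequence_score(hits))
print(hits_passing_gathering(hits, sequence_cutoff=25.0, domain_cutoff=14.0))
```

With a single domain hit, the sequence score and the domain score coincide, as the definition above states.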
Trusted cutoff (TC)
The bit scores of the lowest scoring match in the full alignment
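As an illustration of how the trusted cutoff (TC) and noise cutoff (NC) relate to the gathering threshold, the sketch below derives both from a list of match bit scores. The function name and the single-threshold simplification (one score per match, one GA value) are assumptions made for the example.

```python
def trusted_and_noise_cutoffs(bit_scores, gathering_threshold):
    """Return (TC, NC) for a set of match scores.

    TC: lowest score among matches in the full alignment (score >= GA).
    NC: highest score among matches not in the full alignment (score < GA).
    Either value is None when the corresponding group is empty.
    """
    in_full = [s for s in bit_scores if s >= gathering_threshold]
    not_in_full = [s for s in bit_scores if s < gathering_threshold]
    tc = min(in_full) if in_full else None
    nc = max(not_in_full) if not_in_full else None
    return tc, nc


print(trusted_and_noise_cutoffs([31.2, 25.1, 19.8, 40.0], gathering_threshold=25.0))
```

By construction the gathering threshold always lies between NC and TC.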
SMART (a Simple Modular Architecture Research Tool) allows the identification and annotation of genetically mobile domains and the analysis of domain architectures. More than 500 domain families found in signalling, extracellular and chromatin-associated proteins are detectable. These domains are extensively annotated with respect to phyletic distributions, functional class, tertiary structures and functionally important residues. Each domain found in a non-redundant protein database, as well as search parameters and taxonomic information, are stored in a relational database system. User interfaces to this database allow searches for proteins containing specific combinations of domains in defined taxa.
Features
You can use SMART in two different modes: normal or genomic. The main difference is in the underlying protein database used. In Normal SMART, the database contains Swiss-Prot, SP-TrEMBL and stable Ensembl proteomes. In Genomic SMART, only the proteomes of completely sequenced genomes are used: Ensembl for metazoans and Swiss-Prot for the rest.
The protein database in Normal SMART has significant redundancy, even though identical proteins are removed. If you use SMART to explore domain architectures, or want to find exact domain counts in various genomes, consider switching to Genomic mode. The numbers in the domain annotation pages will be more accurate, and there will not be many protein fragments corresponding to the same gene in the architecture query results. We should remember that we are exploring a limited set of genomes, though.
Different color schemes are used to easily identify the mode we are in
Alignment
Representation of a prediction of the amino acids in tertiary structures of homologues that overlay in three dimensions. Alignments held by SMART are mostly based on published observations (see domain annotations for details), but are updated and edited manually.
Alignment block
Ungapped alignments that usually represent a single secondary structure.
Bits scores
Alignment scores are reported by HMMer and BLAST as bits scores. The likelihood that the query sequence is a bona fide homologue of the database sequence is compared to the likelihood that the sequence was instead generated by a "random" model. Taking the logarithm (to base 2) of this likelihood ratio gives the bits score.
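The definition above amounts to a log-odds score in base 2. A minimal sketch, assuming the two likelihoods are already available as plain probabilities (the function and parameter names are illustrative only, not part of HMMer or BLAST):

```python
import math


def bits_score(likelihood_homologue, likelihood_random):
    """Log-odds score in bits: log2 of the ratio of the two likelihoods."""
    if likelihood_homologue <= 0 or likelihood_random <= 0:
        raise ValueError("likelihoods must be positive")
    return math.log2(likelihood_homologue / likelihood_random)


# A sequence that is 8 times more likely under the homology model than under
# the random model scores 3 bits; equal likelihoods score 0 bits.
print(bits_score(0.5, 0.0625))
print(bits_score(0.25, 0.25))
```

Each additional bit therefore corresponds to a doubling of the likelihood ratio in favour of homology.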
BLAST: Basic local alignment search tool
An excellent database searching tool developed at the National Center for Biotechnology Information (NCBI) ([1], [2], [3], [4]). SMART uses NCBI-BLAST for detection of outlier homologues and homologues of known structure. WU-BLAST is used for nrdb searches with user-supplied sequences.
Cellular Role
Chromatin-associated: Chromatin is the tangled fibrous complex of DNA and protein within a eukaryotic nucleus.
Interaction (with the environment): Molecules that sense cellular environmental change, such as osmolarity, light flux, acidity, ion concentration, etc.
Metabolic: Enzymes that catalyze reactions in living cells that transform organic molecules.
Replication: The process of making an identical copy of a section of duplex (double-stranded) DNA, using existing DNA as a template for the synthesis of new DNA strands.
Signalling: Proteins that participate in a pathway that is initiated by a molecular stimulus and terminated by a cellular response.
Transport: Transporters (permeases) are proteins that straddle the cell membrane and carry specific nutrients, ions, etc. across the membrane.
Translation: The process in which the genetic code carried by messenger RNA directs the synthesis of proteins from amino acids.
Transcription: The synthesis of an RNA copy from a sequence of DNA (a gene); the first step in gene expression.
Coiled coils
Intimately-associated bundles of long alpha-helices ([1], [2], [3]). Coiled coils are detected in SMART using the method of Lupas et al. ([4], COILS home). Coiled coil predictions are indicated on the second line in SMART's graphical output.
Domain
Conserved structural entities with distinctive secondary structure content and a hydrophobic core. In small disulphide-rich and Zn2+-binding or Ca2+-binding domains, the hydrophobic core may be provided by cystines and metal ions, respectively. Homologous domains with common functions usually show sequence similarities.
Domain composition
Proteins with the same domain composition have at least one copy of each of the domains of the query.
Domain organisation
Proteins having all the domains of the query in the same order (additional domains are allowed).
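The difference between the two definitions can be made concrete with a small sketch. It assumes each protein is described simply as an ordered list of domain names; the function names are hypothetical and this is not SMART's implementation.

```python
def same_domain_composition(query_domains, protein_domains):
    """True if the protein has at least one copy of each domain of the query."""
    return set(query_domains) <= set(protein_domains)


def same_domain_organisation(query_domains, protein_domains):
    """True if all query domains occur in the protein in the same order.

    Additional domains in the protein are allowed, so this is a subsequence
    test: each query domain is searched for after the previous match.
    """
    remaining = iter(protein_domains)
    return all(domain in remaining for domain in query_domains)


query = ["SH3", "SH2", "TyrKc"]
shuffled = ["TyrKc", "SH2", "SH3", "PH"]
extended = ["PH", "SH3", "SH2", "C1", "TyrKc"]

print(same_domain_composition(query, shuffled), same_domain_organisation(query, shuffled))
print(same_domain_composition(query, extended), same_domain_organisation(query, extended))
```

The shuffled protein shares the composition but not the organisation of the query, while the extended protein satisfies both.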
Entrez
A WWW-based system that allows easy retrieval of sequence, structure, molecular biology and literature data (Entrez). SMART's domain annotation pages contain links to the Entrez system, thereby providing extensive literature, structure and sequence information.
E-value
This represents the number of sequences with a score greater than or equal to X expected absolutely by chance. The E-value connects the score (X) of an alignment between a user-supplied sequence and a database sequence, generated by any algorithm, with how many alignments with similar or greater scores would be expected from a search of a random sequence database of equivalent size. Since version 2.0, E-values are calculated using Hidden Markov Models, leading to more accurate estimates than before.
Extracellular Domains
Domain families that are most prevalent in proteins outside of the cytoplasm and the nucleus
Gap
A position in an alignment that represents a deletion within one sequence relative to another. Gap penalties are requirements for alignment algorithms in order to reduce excessively-gapped regions. Gaps in alignments represent insertions that usually occur in protruding loops or beta-bulges within protein structures.
Genomic database
Protein database used in SMART's Genomic mode. It contains data from completely sequenced genomes only. Ensembl data is used for Metazoan genomes and Swiss-Prot for others. A complete list of genomes in the database is available.
HMM: Hidden Markov model
HMMs are statistical models of the sequence consensus of a homologous family (see the docu). A particular class of HMMs has been shown to be equivalent to generalised profiles (8867839). Applications of HMMs to sequence analysis are nicely provided by HMMer and SAM.
HMM consensus
The HMM consensus is a one-line summary of the corresponding HMM. The amino acid shown for the consensus is the highest probability amino acid at that position according to the HMM. Capital letters mean highly conserved residues (probability > 0.5 for protein models) (modified from the HMMer User's Guide).
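The rule described above can be sketched in a few lines. The input format (one dictionary of amino acid probabilities per match position) and the function name are assumptions for illustration; HMMer derives the consensus from its own model files.

```python
def hmm_consensus(match_emissions, threshold=0.5):
    """Build a one-line consensus from per-position emission probabilities.

    match_emissions: list of dicts mapping amino acid letter -> probability.
    The most probable residue is reported at each position, in upper case
    when its probability exceeds the threshold and in lower case otherwise.
    """
    consensus = []
    for probabilities in match_emissions:
        residue, probability = max(probabilities.items(), key=lambda item: item[1])
        consensus.append(residue.upper() if probability > threshold else residue.lower())
    return "".join(consensus)


model = [
    {"A": 0.70, "G": 0.30},             # strongly conserved alanine
    {"L": 0.40, "I": 0.35, "V": 0.25},  # leucine is most likely, but below 0.5
    {"K": 0.51, "R": 0.49},             # lysine just above the threshold
]
print(hmm_consensus(model))
```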
HMMer
The HMMer package ([1], [2]) provides multiple alignment and database searching capabilities. There are several programs in the package (see the docu), including one (hmmfs) that searches databases for non-overlapping LOCAL similarities (i.e. that match across at least part of the HMM) and another (hmmls) that searches databases for non-overlapping GLOBAL similarities (i.e. that match across the full HMM). These correspond approximately to profile-based searches using negative and positive profiles, respectively (see WiseTools). Database searches using hmmls or hmmfs provide alignment scores as bits scores.
Homology
Evolutionary descent from a common ancestor due to gene duplication
Intracellular Domains
Domain families that are most prevalent in proteins within the cytoplasm
Localisation
Numbers of domains that are thought, from SwissProt annotations, to be present in different cellular compartments (cytoplasm, extracellular space, nucleus, and membrane-associated) are shown in annotation pages.
Motif
Sequence motifs are short conserved regions of polypeptides. Sets of sequence motifs need not necessarily represent homologues.
NRDB: non-redundant database
A database that contains no identical pairs of sequences. It can contain multiple sequences originating from the same gene (fragments, alternative splicing products). SMART in Standard mode uses such a database.
ORF
Open reading frame
Outlier homologues
These are often difficult to detect using HMM methodology. A complementary approach to their detection is to query a database of sequences taken from multiple sequence alignments using BLAST. Selecting this option will also activate searches against sequence databases derived from proteins of known structure. A simple BLAST search of the PDB is performed, together with a search of RPS-Blast profiles derived from SCOP. These profiles were kindly provided by Steffen Schmidt (see Schmidt et al., J Chem Inf Comput Sci 2002 (42) 405-7).
P-value
This represents a probability that, given a database of a particular size, random sequences score higher than a value X. P-values are generated by the BLAST algorithm that has been integrated into SMART.
PDB: protein data bank
PDB is an archive of experimentally determined three-dimensional structures (Brookhaven Nat. Labs or EBI). Domain families represented in SMART and in the PDB are annotated as being of known structure; links are provided in SMART to the PDB via PDBsum and MMDB. PDBsum links can be used to access a variety of sequence-based and structure-based tools, whereas MMDB provides access to literature information and structure similarities.
PFAM
Pfam is a database of protein domain families represented as (i) multiple alignments and (ii) HMM-profiles ([1], [2]). Pfam WWW servers allow comparison of user-supplied sequences with the Pfam database (Sanger Center and Washington Univ.). SMART contains a facility to search the Pfam collection using HMMer.
Prokaryotic domains
SMART now also searches for domains found in two-component regulatory systems. These can be found mainly in prokaryotes, but a few were also found in eukaryotes like yeast and plants.
Profile
A profile is a table of position-specific scores and gap penalties, representing a homologous family, that may be used to search sequence databases (Ref. [1], [2], [3]). In CLUSTAL-W-derived profiles, those sequences that are more distantly related are assigned higher weights ([4], [5], [6]). Issues in profile-based database searching are discussed in Bork & Gibson (1996) [7].
ProfileScan
An excellent WWW server that allows a user to compare a protein or DNA sequence against a database of profiles (located at the ISREC).
PROSITE
This is a dictionary of protein sites and motif patterns. Some SMART domain annotations contain links to PROSITE.
Schnipsel database domain sequence database
Schnipsel is a German word meaning snippet or fragment. The schnipsel database consists of the sequences of all domains found with SMART in NRDB. Outliers of a family often cannot be detected by a profile, yet are detectable by pairwise similarity to one or more established members of a sequence family. So searching against the schnipsel database gives complementary information to the profile searches.
Searched domains
In the first version of SMART, only eukaryotic signalling domains could be searched. In 1998 we extended this set with prokaryotic signalling and extracellular domains. In the input page you can choose whether you want to search only cytoplasmic (prokaryotic + eukaryotic), extracellular or all domains.
Secondary Literature
The secondary literature is derived by the following procedure: for each of the hand-selected papers referenced by a domain, 100 neighbouring papers are retrieved using Medline. If one of these neighbouring papers is referenced from more than two original papers, it is included in the secondary literature list.
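The procedure above is essentially a counting step over neighbour lists. The sketch below assumes the neighbours have already been retrieved and are supplied as a mapping from each hand-selected paper to its list of neighbouring paper identifiers; the function name, parameter names and identifiers are hypothetical, and no Medline query is performed.

```python
from collections import Counter


def secondary_literature(neighbours_by_paper, max_neighbours=100, min_sources=3):
    """Select neighbouring papers linked from more than two original papers.

    neighbours_by_paper: dict mapping an original (hand-selected) paper ID to
    an ordered list of neighbouring paper IDs.
    Only the first max_neighbours neighbours of each paper are considered, and
    a neighbour is counted at most once per original paper.
    """
    source_counts = Counter()
    for neighbours in neighbours_by_paper.values():
        source_counts.update(set(neighbours[:max_neighbours]))
    return sorted(paper for paper, count in source_counts.items() if count >= min_sources)


neighbours = {
    "orig1": ["a", "b", "c"],
    "orig2": ["a", "b"],
    "orig3": ["a", "c"],
    "orig4": ["b", "a"],
}
print(secondary_literature(neighbours))
```

Here "more than two original papers" is expressed as min_sources=3, so paper "c", which is a neighbour of only two originals, is left out.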
Seed Alignment
Alignment that contains only one of each pair of homologues that are represented in a CLUSTALW-derived phylogenetic tree linked by a branch of length less than a distance of 0.2 (see the related article).
SEG
A program of Wootton & Federhen [1] that detects regions of the query sequence that have low compositional complexity [2].
Sequence ID or ACC
Sequence identifiers or accession codes may be entered via the SMART homepage to initiate a query. You can use either Uniprot or Ensembl sequence identifiers.
Signalling Domains
The original set of domains used in SMART were collected as those that satisfied one or both of two criteria:
1. cytoplasmic domains that possess kinase, phosphatase, ubiquitin ligase or phospholipase enzymatic activities, or those that stimulate GTPase-activation or guanine nucleotide exchange
2. cytoplasmic domains that occur in at least two proteins with different domain organisations, of which one also contains a domain that satisfies criterion 1
These domains mostly mediate or regulate the transduction of an extracellular signal towards the nucleus, resulting in the initiation of a cellular response. More recently, prokaryotic two-component signalling domains have been added to the SMART set.
SignalP
This program predicts the presence and location of signal peptide cleavage sites in amino acid sequences (SignalP home page).
SMART: Simple Modular Architecture Research Tool
Our main goal in providing this tool is to allow automatic identification and annotation of domains in user-supplied protein sequences.
Species
Numbers of domains present in a variety of selected taxa (animal, archaea, bacteria, fungi, plants and protozoa) are shown in annotation pages.
SwissProt
The SwissProt database is an extensively annotated and non-redundant collection of protein sequences. SwissProt annotations have been mined for SMART-derived annotations of alignments.
TMHMM2
This program predicts the location and topology of transmembrane helices (TMHMM2).
Thresholds
For each of the domains found by SMART, a combination of thresholds is used to distinguish between true and false hits. The different thresholds are described in the SMART paper.
WiseTools
A package that is based on database searches using profiles. Profiles may be generated using PairWise and then compared with sequence databases using SearchWise. Scores are generated for alignments that match the whole of the profile (using a positive profile) or else that match at least part of the profile (using a negative profile). Only the top-scoring optimal alignment of each sequence is reported; hence SMART relies on re-iterating the search for new repeats until none are reported that score above threshold. Score thresholds have been set manually that are considered to represent a score just above the top-scoring true negative. Additional thresholds have been estimated for domains that are repeated in single polypeptides. A more recent package allows comparison of DNA sequences at the level of their conceptual translations, regardless of sequence error and introns (see Wise2).
What's new
Changes from version 6.0 to 7.0
Full text search engine
Perform a full text search of SMART and Pfam domain annotations, plus the complete protein descriptions for Uniprot and Ensembl proteins.
metaSMART
Explore and compare domain architectures in various publicly available metagenomics datasets.
iTOL export and visualization
Domain architecture analysis results can be exported and visualized in interactive Tree Of Life. The new option can be found in the protein list function select list.
User interface cleanup
Various small changes to the UI, resulting in faster and easier navigation.
Changes from version 5.1 to 6.0
Metabolic pathways information
SMART domains and proteins available in the genomic mode now have basic metabolic pathways information. It is generated by mapping our genomic mode protein database to the KEGG orthologous groups.
Updated genomic mode protein database
The greatly expanded genomic mode database now includes sequences from 630 completely sequenced species.
Changes from version 5.0 to 5.1
SMART webservice
You can access SMART using our webservice. Check the WSDL file for details. The webservice is still under active development and only works for sequences and IDs which are in our database (complete Uniref100 and stable Ensembl genomes).
SMART DAS server
The SMART DAS server is available at URL http://smart.embl.de/smart/das. It provides all the protein features from our database (SMART domains, signal peptides, transmembrane regions and coiled coils) for all Uniref100 and Ensembl proteins.
If you need help with these services, or have questions/feedback, please contact us.
Changes from version 4.1 to 5.0
New protein database in Normal mode
SMART now uses Uniprot as the main source of protein sequences. All Ensembl proteomes (except pre-releases) are also included. To lower redundancy in the database, the following procedure is used:
o only one copy of 100% identical proteins is kept (different IDs are still available)
o each species' proteins are separated
o CD-HIT clustering with 96% identity cutoff is performed on each species separately
o longest member of each protein cluster is used as the representative
o only representative cluster members and single proteins (i.e. proteins which are not members of any clusters) are used in all domain architecture queries and for domain counts in the annotation pages
Even though the number of proteins in the database is almost doubled (current version has around 2.9 million proteins), the redundancy should be minimal.
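Two steps of the redundancy-reduction procedure, collapsing identical sequences and choosing the longest cluster member as representative, can be sketched as below. The clustering itself is taken as given (in SMART it is done with CD-HIT, which is not called here), and the function names and input formats are assumptions made for illustration.

```python
def collapse_identical(proteins):
    """Keep one copy of each distinct sequence, remembering every ID that had it.

    proteins: iterable of (protein_id, sequence) pairs.
    Returns a dict mapping sequence -> list of IDs, so no ID is lost.
    """
    ids_by_sequence = {}
    for protein_id, sequence in proteins:
        ids_by_sequence.setdefault(sequence, []).append(protein_id)
    return ids_by_sequence


def pick_representatives(clusters):
    """Return the ID of the longest member of each cluster.

    clusters: list of clusters, each a list of (protein_id, sequence) pairs.
    A single protein outside any cluster is passed as a one-member cluster.
    """
    return [max(cluster, key=lambda member: len(member[1]))[0] for cluster in clusters]


proteins = [("P1", "MKV"), ("P2", "MKV"), ("P3", "MKVL")]
print(collapse_identical(proteins))

# Hypothetical clustering result for one species (e.g. parsed from CD-HIT output).
clusters = [[("P1", "MKV"), ("P3", "MKVL")], [("P9", "MA")]]
print(pick_representatives(clusters))
```

Only the returned representatives would then be used for domain architecture queries and domain counts, as described in the last step of the list.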
Domain architecture invention dating
As a further step from the single domain to the understanding of multi-domain proteins, SMART now predicts the taxonomic class where the concept of a protein, that is its domain architecture, was invented. The domain architecture is defined as the linear order of all SMART domains in the protein sequence. To derive the point of its invention, all proteins with the same domain architecture are mapped onto NCBI's taxonomy. The last common ancestor of all organisms containing at least one protein with the domain architecture is defined as the point of its origin.
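The last-common-ancestor step can be illustrated with a short sketch. It assumes each organism is given as a lineage listed from the root of the taxonomy downwards; the function name and the example lineages are illustrative only and are not taken from SMART or from the NCBI taxonomy files.

```python
def last_common_ancestor(lineages):
    """Return the deepest taxon shared by all lineages, or None if there is none.

    lineages: list of lists of taxon names, each ordered from the root down.
    """
    if not lineages:
        return None
    ancestor = None
    # Walk down all lineages in parallel until they first disagree.
    for taxa_at_depth in zip(*lineages):
        if all(taxon == taxa_at_depth[0] for taxon in taxa_at_depth):
            ancestor = taxa_at_depth[0]
        else:
            break
    return ancestor


# Organisms that each contain at least one protein with the same architecture.
lineages = [
    ["cellular organisms", "Eukaryota", "Metazoa", "Chordata"],
    ["cellular organisms", "Eukaryota", "Metazoa", "Arthropoda"],
    ["cellular organisms", "Eukaryota", "Fungi"],
]
print(last_common_ancestor(lineages))
```

In this example the architecture would be dated to Eukaryota, the deepest node shared by all three lineages.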
Changes from version 4.0 to 4.1
Two modes of operation: Normal and Genomic
For more details, visit the change mode page.
Intrinsic protein disorder prediction
You can now include DisEMBL prediction of protein disorder in your search parameters. The results are displayed as blue regions in protein schematics. DisEMBL's HOTLOOPS and REM465 methods are currently used. Visit the DisEMBL page for more information on the method.
Catalytic activity check
SMART now includes data on specific requirements for catalytic activity for some of our domains (50 catalytic domains at the moment). If the required amino acids are not present in the predicted domain, it will be marked as "Inactive". The domain annotation page will show you the details on which amino acids are missing, and the links to relevant literature (PubMed).
Taxonomic trees
Architecture query results are now displayed as simple taxonomic trees. In addition to individual proteins, you can select any taxonomic node (or multiple nodes) and display all the proteins in those nodes. The Evolution section of domain annotation pages is now also represented as a taxonomic tree. The basic tree contains only several hand-picked representative species, with a link to the full tree.
User interface redesigned
SMART has been completely rewritten and all pages conform to the XHTML 1.0 Strict and CSS Level 2 standards. We recommend a modern standards-compliant browser for the best experience. Mozilla Firefox and Opera are our favorites.
Changes from version 3.5 to 4.0
Intron positions are shown in protein schematics
For proteins that match any of the Ensembl predictions, SMART will show intron positions as vertical coloured lines in graphical representations. This information is retrieved from a pre-calculated mapping of Ensembl gene structures to protein sequences.
The vertical line at the end of the protein is not an actual intron, but a mark to show that intron mapping was performed. If that is the only line, there are no introns annotated. If there is no line at all, there is no data available in Ensembl for that particular sequence.
You can switch off intron display on your SMART preferences page.
Alternative splicing information
Since SMART now incorporates Ensembl genomes, the Additional information page shows a list of alternative splices of the gene encoding the analyzed protein (if there are any). It is possible to either display SMART protein annotation for any of the alternative splices, or get a graphical multiple sequence alignment of all of them.
Orthology information
There are 2 separate sets of orthologs for each Ensembl protein: 1:1 reciprocal best matches in other genomes, and orthologous groups with reciprocal best hits from all genomes analyzed (i.e. each of these proteins has exactly one ortholog in all 6 genomes). This data is displayed on the Additional information page.
Graphical multiple sequence alignments
Orthologous groups and different alternative splices can be displayed as graphical multiple sequence alignments. Proteins are aligned using ClustalW. Domains, intrinsic features and introns are mapped onto the alignment, with their positions adjusted according to gaps (black boxes).
Changes from version 3.4 to 3.5
Features of all Ensembl genomes are stored in SMART
You can use standard Ensembl protein identifiers (for example ENSP00000264122) and sequences in all queries.
Batch access
Quickly display results for hundreds of sequences. Currently limited to 500 sequences per access, 2000 per day.
Smarter handling of identical sequences
If there are multiple IDs associated with a particular sequence, an extra table will be displayed showing all of them.
Changes from version 3.3 to 3.4
Search structure based profiles using RPS-Blast
Clicking on the "search schnipsel and structures" checkbox will now also initiate a search of profiles based on SCOP domain families using RPS-Blast. These profiles were kindly provided by Steffen Schmidt (see Schmidt et al., J Chem Inf Comput Sci 2002 (42) 405-7).
Improved Architecture analysis
In addition to standard Domain selection querying, it is now possible to do queries based on GO (Gene Ontology) terms associated with domains. In the first step, you get a list of domains matching the GO terms entered. After selecting the domains of interest from the list, proteins containing those domains are displayed. Use the Taxonomic selection box to limit the results based on taxonomic ranges.
Pfam domains are stored in the database
The SMART database now contains precomputed results for all Pfam domains. To use Pfam domains in the architecture queries, prepend the domain name with "Pfam:" (for example, TyrKc AND Pfam:Fz AND TRANS).
Try our SMART Toolbar for the Mozilla web browser.
Changes from version 3.2 to 3.3
Fantastic new protein picture generator
Proteins are now displayed as dynamically generated PNG images. This means that you can download the entire protein representation as a single image. You have our permission to do this and use these diagrams in any way that you like; do acknowledge us, though. All domain bubbles have been script-generated using The Gimp and its Perl-Fu extension. The script is GPL'd and is available for download.
Fancy colour alignments
All alignments we generate are coloured in using Leo Goodstadt's excellent CHROMA program. You may experience some problems if you're using a klunky browser. This will be fixed when you change your browser.
Excellent transmembrane domains
These are now calculated using the fine TMHMM2 program, kindly provided by Anders Krogh and co-workers.
Selective SMART
We now store intrinsic features, such as transmembrane domains and signal peptides, in our database. This means that these can be queried for (e.g. SIGNAL AND TyrKc). The feature names are: signal peptide, SIGNAL; transmembrane domain, TRANS; coiled coil, COIL.
Technical changes
SMART code has been modified to run under the Apache mod_perl module. SMART is now using ApacheDBI for persistent database connections. The database engine has been updated to PostgreSQL 7.1. These changes resulted in significant speed improvements.
Bug fixes, as usual
The show_many_proteins script now uses the POST method, so there is no longer a limit on the number of proteins you can display. Fixed a display problem with proteins having thousands of different representations (if you try to display those, make sure you have a good browser, like Mozilla or Opera, and a bunch of RAM).
Changes from version 3.1 to 3.2
Start page
There is an indicator of the current database status in the header now. If the database is down or is being updated, you'll know immediately.
Literature
Changed literature identifiers to PMID. New secondary literature generator that parses all neighbouring papers, not just the first 100. Numerous small bug fixes and improvements.
Changes from version 3.0 to 3.1
Startup page
The start page now includes selective SMART and allows searching for keywords in the annotation of domains.
Schnipsel Blast
The results of a schnipsel Blast search are included in the bubble diagram. Additionally, PDB is searched.
Taxonomic breakdown
When selecting multiple proteins, e.g. via selective SMART or the annotation pages, an overview of the taxonomy of all species (tax break) is offered.
Links
Links don't contain version numbers. This allows stable links from external sources.
Selective SMART
Selective SMART now allows searching for multiple copies of domains and is case insensitive. You can now search for, e.g., Sh3 AND sH3 AND sh3.
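The query behaviour described above (repeated terms meaning multiple copies, with names compared case-insensitively) can be sketched in Python. This is only an illustration under stated assumptions: the function name `matches_query`, the representation of a protein as a list of domain names, and the example proteins are all hypothetical and are not taken from the SMART code.

```python
from collections import Counter

def matches_query(domains, query):
    """Return True if the protein has at least as many copies of each
    domain as the AND query asks for, ignoring letter case."""
    # Each repeated term in the query counts as one more required copy.
    required = Counter(term.strip().lower() for term in query.split(" AND "))
    present = Counter(name.lower() for name in domains)
    return all(present[name] >= count for name, count in required.items())

# Hypothetical proteins, written as ordered lists of domain names.
three_sh3 = ["SH3", "SH2", "SH3", "SH3"]
two_sh3 = ["SH3", "TyrKc", "SH3"]

print(matches_query(three_sh3, "Sh3 AND sH3 AND sh3"))
print(matches_query(two_sh3, "Sh3 AND sH3 AND sh3"))
```

Counting terms with a multiset keeps the "multiple copies" meaning of a repeated term, which a plain set of domain names would lose.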
Annotation
You can align your query sequence to the SMART alignment using hmmalign.
Update of underlying database
SMART now uses PostgreSQL 6.5.2.
Changes from version 2.0 to 3.0
Digest output
SMART now only produces a single diagram representing a best interpretation of all the annotation that has been performed. A comprehensive summary of the results is also provided in table format.
Selective SMART
Selective SMART allows you to look for proteins with combinations of specific domains in different species or taxonomic ranges.
Alert SMART
The SMART database gets updated about once a week. If you are interested in specific domains, or combinations of them, in specific taxonomic ranges, you can use the SMART alerting service. This provides the identities of newly deposited proteins that match your query.
Domain queries
You can ask for proteins having the same domain order or composition as your query protein.
SMARTed Genomes
You can get the result of SMARTing the genomes of Caenorhabditis elegans and Saccharomyces cerevisiae via the annotation pages.
Faster PFAM searches
The PFAM searches now run on a PVM cluster.
Changes from version 1.03 to 2.0
Domain coverage
The original set of signalling domains has now been extended to include extracellular domains.
Default search method is now HMMer
The previously used underlying methodology of SMART (i.e. SWise) tended to over-extend gaps, leading to problems in defining domain borders. Although a SWise search of the SMART database remains possible (using the 'Wise searching SMART database' option available from the Home Page), the default searching method is now HMMer2. This now allows improved statistical estimates (Expectation- or E-values) of the significance of a domain hit.
Complete rewrite of the annotation pages
As with the previous version (1.03), annotation pages provide information concerning domain functions and include hyperlinks to PubMed. However, we now offer automatically derived data showing the taxonomic range and the predicted cellular localisation of proteins containing the domain in question.
Literature database
In addition to the manually derived literature sources given in version 1.03, we now provide automatically derived secondary literature from the annotation pages. These are extracted from PubMed and are cross-linked to additional abstracts.
The CATH domain structure database: new protocols and classification levels give a more comprehensive resource for exploring evolution
Lesley H. Greene, Tony E. Lewis, Sarah Addou, Alison Cuff, Tim Dallman, Mark Dibley, Oliver Redfern, Frances Pearl, Rekha Nambudiry, Adam Reid, Ian Sillitoe, Corin Yeats, Janet M. Thornton and Christine A. Orengo
WHAT'S NEW
We report the latest release (version 3.0) of the CATH protein domain database (http://www.cathdb.info). There has been a 20% increase in the number of structural domains classified in CATH, up to 86 151 domains. Release 3.0 comprises 1110 fold groups and 2147 homologous superfamilies. To cope with the increases in diverse structural homologues being determined by the structural genomics initiatives, more sensitive methods have been developed for identifying boundaries in multi-domain proteins and for recognising homologues. The CATH classification update is now being driven by an integrated pipeline that links these automated procedures with validation steps that have been made easier by the provision of information-rich web pages summarising comparison scores and relevant links to external sites for each domain being classified. An analysis of the population of domains in the CATH hierarchy and several domain characteristics are presented for version 3.0. We also report an update of the CATH Dictionary of Homologous Structures (CATH-DHS), which now contains multiple structural alignments, consensus information and functional annotations for 1459 well-populated superfamilies in CATH. CATH is directly linked to the Gene3D database, which is a projection of CATH structural data onto ∼2 million sequences in completed genomes and UniProt.
GENERAL INTRODUCTION
The number of new structures being deposited in the Protein Data Bank (PDB) continues to grow at a considerable rate. In addition, structures being targeted by worldwide structural genomics initiatives are more likely to be novel or only very remotely related to domains previously classified. Only 2% of structures currently solved by conventional crystallography or NMR are likely to adopt novel folds (see Figures 1 and 2). A higher proportion of new folds is expected among structures solved by structural genomics. Although the influx of more diverse structures and subsequent analysis will inform our understanding of how domains evolve, it has resulted in increasing lags between the numbers of structures being deposited and classified. In response to this situation, we have significantly improved our automated and manual protocols for domain boundary assignment and homologue recognition.
Figure 1. Annual decrease in the percentage of new structures classified in CATH which are observed to possess a novel fold. The raw data for years 1972–2005 was fit to a single exponential equation by nonlinear regression using Sigma Plot (SPSS Version
Figure 2. Annual proportion of protein structures deposited in the PDB which are classified in CATH, rejected or pending classification. The colour scheme reflects different categories of PDB chains. Black: not accepted by the CATH criteria. Red: unprocessed chains
Significant changes have been implemented in the CATH classification protocol to achieve a more highly automated system. A seamless flow of structures between the constituent programs has been achieved by building a pipeline which integrates web services for each major comparison stage in the classification (see Figure 3). Secondly, completely automatic decisions are now being made for new protein chains with close relatives already assigned in the CATH database. There are two situations that preclude the CATH update process from being fully automated. We rely on expert manual curation for particularly challenging protein domain boundary assignments (DomChop stage) and also for classifications of remote folds and homologues (HomCheck stage). These two manual stages will remain an integral part of the system (Figure 3).
Figure 3. Flow diagram of the CATH classification pipeline. This schematic illustrates the processes involved in classifying newly determined structures in CATH. The CATH update protocol workflow from new chain to assigned domain is split into two main processes
In this paper we report our ongoing development of the automated procedures. These critical new features should better enable CATH to keep pace with the PDB (3) and facilitate its development. Key statistics on the domain structure populations and characteristics are also presented.
A REVISED CATH CLASSIFICATION HIERARCHY: CATHSOLID
In order to provide information on the sequence diversity between superfamily members, we have introduced additional levels into the CATH hierarchy. The CATH hierarchical classification scheme now consists of nine levels. Class is derived from secondary structure content, and Architecture describes the gross orientation of secondary structures, independent of connectivity. The Topology level clusters structures into fold groups according to their topological connections and numbers of secondary structures. The Homologous superfamilies cluster proteins with highly similar structures, sequences and/or functions (4,5). The new extension of the CATH classification system now includes five 'SOLID' sequence levels. The S, O, L and I levels further divide domains within the H-level using multi-linkage clustering based on similarities in sequence identity (35, 60, 95 and 100%) (see Table 1). The D-level acts as a counter within the I-level and is appended to the classification hierarchy to ensure that every domain in CATH has a unique CATHSOLID identification code (see Table 1). Specific details on the nature of the SOLID-levels can be found in the 'General Information' section of the CATH website, http://www.cathdb.info. CATH only includes experimentally determined protein structures with a 4 Å resolution or better, 40 residues in length or longer, and having 70% or more side chains resolved.
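To make the nine-level scheme concrete, the following Python sketch splits a CATHSOLID code into its levels. It assumes that a code is written as nine dot-separated integers in the order C, A, T, H, S, O, L, I, D, and that the four identity cut-offs quoted above map to S, O, L and I in the order listed. The example identifier and the function names are hypothetical.

```python
LEVELS = ("C", "A", "T", "H", "S", "O", "L", "I", "D")

# Sequence identity cut-offs (percent) quoted in the text for the
# four clustered sequence levels; D is only a counter within I.
IDENTITY_CUTOFFS = {"S": 35, "O": 60, "L": 95, "I": 100}

def parse_cathsolid(code):
    """Split a dot-separated CATHSOLID code into a dict keyed by level letter."""
    parts = code.split(".")
    if len(parts) != len(LEVELS):
        raise ValueError(f"expected {len(LEVELS)} levels, got {len(parts)}")
    return dict(zip(LEVELS, (int(p) for p in parts)))

def superfamily(code):
    """Return the C.A.T.H prefix, which names the homologous superfamily."""
    return ".".join(code.split(".")[:4])

example = "3.40.50.200.1.2.1.1.3"  # hypothetical identifier, for illustration only
levels = parse_cathsolid(example)
print(levels["H"], superfamily(example))
```

Two domains that share the first four numbers belong to the same superfamily; each further shared number means they also fall in the same cluster at the next, stricter identity cut-off.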
Table 1. CATH version 3.0 statistics
DOMAIN BOUNDARY ASSIGNMENTS
We have further improved our automated domain boundary prediction method, CATHEDRAL (4). This is used to search a newly determined multi-domain structure against a library of representative structures from different fold groups in the CATH database to recognise constituent domains. CATHEDRAL performs an initial rapid secondary structure comparison between structures, using graph theory to identify putative fold matches, which are then more carefully aligned using a slower, more accurate dynamic programming method. A new scoring scheme has been implemented which combines information on the class of the domains being compared, their sizes, similarity in the structural environments and the number of equivalent residues. A support vector machine is used to combine the different scores and select the best fold match for each putative domain in a new multi-domain protein.
Benchmarking against a set of 964 'difficult' multi-domain chains, whose 1593 constituent domains were remotely related to folds in CATH (<35% sequence identity) and originated from 245 distinct fold groups and 462 superfamilies, showed that 90% of domains within these chains could be assigned to the correct fold group, and for 78% of them the domain boundaries were within ±15 residues of boundaries assigned by careful manual validation. Larger variations in domain boundaries are often due to the fact that in many families significant structural variation can occur during evolution, so that distant relatives vary considerably in size. If no close relative has been classified in CATH, it is likely that the only CATHEDRAL match will be to a relative with significant structural embellishments, thus making it harder to determine the correct boundaries.
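The ±15 residue criterion used in the benchmark can be illustrated with a short Python check. This is a sketch under assumptions, not the benchmark code: it treats every domain as a single contiguous (start, end) segment and compares predicted and curated domains in chain order. The function name and the residue numbers are hypothetical.

```python
def boundaries_agree(predicted, curated, tolerance=15):
    """Return True if every predicted domain start and end lies within
    `tolerance` residues of the manually curated boundary."""
    if len(predicted) != len(curated):
        return False  # a different number of domains cannot agree
    pairs = zip(sorted(predicted), sorted(curated))
    return all(
        abs(p_start - c_start) <= tolerance and abs(p_end - c_end) <= tolerance
        for (p_start, p_end), (c_start, c_end) in pairs
    )

# Hypothetical two-domain chain: curated boundaries versus two predictions.
curated = [(1, 110), (111, 262)]
close_prediction = [(1, 120), (121, 260)]
poor_prediction = [(1, 150), (151, 260)]

print(boundaries_agree(close_prediction, curated))
print(boundaries_agree(poor_prediction, curated))
```

Discontinuous domains, which are made of several segments, would need a per-segment comparison and are left out of this sketch.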
Since domain boundary assignment of remote homologues is one of the most time-consuming stages in the classification, we combine multiple sources of information for each new structure on a web page to guide manual curation. Pages display scores from a range of algorithms, which include structure-based methods, CATHEDRAL (4), SSAP (6,7), DETECTIVE (8), PUU (9) and DOMAK (10); sequence-based methods such as hidden Markov models (HMMs) (11); and relevant literature. These pages are now viewable for information on putative boundaries for new multi-domain structures currently being classified in CATH (e.g. http://www.cathdb.info/cgi-bin/cath/Chain.pl?chain_id=2g3aA).
For protein chains which are closely related to chains that are already chopped in CATH, an automated protocol has been developed (ChopClose). ChopClose identifies any previously chopped chains that have sufficiently high sequence identity and overlap with the query chain. Using SSAP (6), the query is aligned against each of these chains in turn, and in each case the domain boundaries are inherited across the alignment. The process of inheriting the boundaries often requires some adjustments to be made to account for insertions, deletions or unresolved residues. If the inheritance from one of the chains meets various criteria (SSAP score ≥80, sequence identity >80%, RMSD ≤6.0 Å, longest end extension ≤10 residues, etc.), then the resulting boundaries are used to chop the chain automatically (AutoChop). For cases where ChopClose's best result does not meet all the criteria for automatic chopping, it is provided as support information for a manual domain boundary assignment.
Refer to Figure 3 for the location of AutoChop/ChopClose within the CATH update protocol.
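The AutoChop decision described above can be written as a simple threshold test. The Python sketch below covers only the four criteria named in the text; the original list ends with 'etc.', so the real protocol applies further checks that are not reproduced here. The class and field names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class InheritedChopping:
    """Scores for boundaries inherited from one previously chopped chain."""
    ssap_score: float
    sequence_identity: float    # percent
    rmsd: float                 # angstroms
    longest_end_extension: int  # residues

def passes_listed_criteria(result):
    """Check only the four thresholds quoted in the text."""
    return (
        result.ssap_score >= 80
        and result.sequence_identity > 80
        and result.rmsd <= 6.0
        and result.longest_end_extension <= 10
    )

good = InheritedChopping(ssap_score=92.5, sequence_identity=97.0,
                         rmsd=1.2, longest_end_extension=3)
borderline = InheritedChopping(ssap_score=85.0, sequence_identity=80.0,
                               rmsd=2.0, longest_end_extension=4)

print(passes_listed_criteria(good))
print(passes_listed_criteria(borderline))
```

The second case fails because the identity threshold is strict (greater than 80%), so in the workflow described above it would go to manual curation with the ChopClose result as supporting evidence.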
NEW HOMOLOGUE RECOGNITION METHODS
We have assessed a number of HMM-based protocols for improving homologue recognition. A new protocol (Samosa), exploiting models built using multiple structure alignments to improve accuracy, gives some improvements in sensitivity (4–5%). However, a protocol exploiting an 8-fold expanded HMM library based on sequence relatives of structural domains gives an increase of nearly 10% in sensitivity (12). In addition, HMM–HMM-based approaches have been implemented using the PRC protocol of Madera and co-workers (http://supfam.org/PRC). These allow recognition of extremely remote homologues, some of which are not easily detected by the structure comparison methods (discussed further below). HMM-based database scans developed for the CATH classification protocol are collectively referred to as HMMscan below.
For some very remotely related homologues, confidence in an assignment can be improved by combining information from multiple prediction methods. We have investigated the benefits of using machine learning methods to do this automatically. A neural network was trained using a dataset of 14 000 diverse homologues (<35% sequence identity) and 14 000 non-homologous pairs, with data from different homologue comparison methods including structure comparison (CATHEDRAL, SSAP), sequence comparison (HMM–HMM) and information on functional similarity. The latter was obtained by comparing EC classification codes between close relatives of the distant homologues and using a semantic similarity scoring scheme for comparing GO terms, based on a method developed by Lord et al. (13). On a separate validation set of 14 000 homologous pairs and 14 000 non-homologous pairs, 97% of the homologues can be recognised at an error rate of <4%.
NEW UPDATE PROTOCOL
Automatic methods
Previously, CATH data was generated using a group of independent programs and flat files. Over the past two years we have developed an update protocol for CATH that is driven by a suite of programs with a central library and a PostgreSQL database system. A classification pipeline has been established which links, in a completely automated fashion, the different programs that analyse the sequences and structures of both protein chains and domains. The CATH update protocol can essentially be divided into two parts: domain boundary assignment and domain homology classification (see Figure 3). The aim of the protocol is to minimise manual assignment and provide as much support as possible when manual validation is necessary.
Processing in both parts of the classification protocol is similar, requiring related meta-data and the triggering of the same automated algorithms. Methods include pairwise sequence similarity comparisons and scans by other homologue detection or fold recognition algorithms, such as HMMscan and CATHEDRAL, that provide data for either manual or automated assignment. Many of the automated steps in the protocol have been established as web services, and the pipeline integrates the automated steps together with 'holding stages' in which domains are held prior to processing and await the completion of manual validation of predictions (see below).
Web pages to support manual validation
For each manual stage in the classification (domain boundary assignment, DomChop, and homologue recognition, HomCheck) we have developed a suite of web pages bringing together all available meta-data from prediction algorithms (e.g. DBS, CATHEDRAL and HMMscan for DomChop; CATHEDRAL and HMMscan for HomCheck), information from the literature and information from other family classifications with relevant data (e.g. Pfam). For each protein or domain shown on the pages, information on the statistical significance of matches is presented. The web pages will shortly be made viewable and will provide interim data on protein chains and domains not fully classified in CATH for biologists interested in any entries pending classification.
OVERVIEW OF THE CURRENT RELEASE (VERSION 3.0)
Assigning domain boundaries and relationships between protein structures is computationally challenging. Since the last CATH release (version 2.6), the number of domains in the CATH database has increased by 20% in version 3.0 and now totals 86 151. This is a more than 10-fold increase in the number of domains classified in CATH since its creation. Improvements in automation, and also in the web-based resources used to aid manual validation, have allowed us to increase the proportion of hard-to-classify structures processed in CATH, and this is reflected in a significant increase in the number of new folds in the database, now more than 1000. The detailed breakdown of numbers of domains in the nine CATH levels is given in Table 1. We conducted an analysis of the domains in version 3.0 and have derived statistics for several fundamental features.
Percentage of new topologies
An analysis of the percentage of new folds arising from the early 1970s to the present is shown in Figure 1. The number of new folds has been decreasing over time with respect to the number of new structures being deposited, and it can be seen that currently approximately 2% of new structures classified in CATH are observed to be novel folds. For comparison, the number of domain structures solved over time is also graphically represented in Figure 1.
Number of domains within a protein chain
Integral to the construction of the CATH database is designating domain boundaries. We conducted an analysis of the number of chains versus the number of domains in a chain. It is interesting to note that 64% of all protein structures currently solved and classified in CATH are single-domain chains (data not shown). The next most prevalent are two-domain chains (27%), and following this we find that the number of chains containing three or more domains rapidly decreases. The average size of the single-domain chains is 159 residues.
The CATH Dictionary of Homologous Superfamilies (CATH-DHS)
The CATH-DHS has also been recently updated. Data on structural similarity and superfamily variability is presented as a significant update to the Dictionary of Homologous Superfamilies (DHS) web resource (14). The DHS also provides functional annotations of domains within each H-level (superfamily) in CATH v2.5.1.
For each superfamily, pair-wise structural similarity scores between relatives, measured by SSAP, are presented. The DHS now contains 3307 multiple structural alignments for 1459 superfamilies. For each superfamily, multiple alignments are generated for all the relatives and also for subgroups of structurally similar relatives and sequence-similar relatives. Alignments are performed using the residue-based CORA algorithm (15) and presented both as CORAPLOTS (14) and in the form of a 2DSEC diagram (16), alongside co-ordinate data of the superposed structures in PDB format. Sequence representations of the alignments are available to download in FASTA format. In the CORAPLOT images of the multiple alignment, residues in each domain are coloured according to ligand binding and residue type. EquivSEC plots are also shown that describe the variability in orientation and packing between equivalent secondary structures (16).
To identify sequence relatives for CATH superfamilies, sequences from UniProt (17) were scanned against HMMs of all CATH domains (12). Homologous sequences were identified as those hits with an E-value <0.01 and a 60% residue overlap with the CATH domain. This protocol recognised over one million domain sequences in UniProt which could be integrated in the CATH-DHS. The harvested sequences in each superfamily were compared against other relatives by BLAST (18) to determine the pair-wise sequence identity and then clustered at appropriate levels of sequence identity (35 and 95%) using multi-linkage clustering. Information and links to other functional databases, ENZYME (19), GO (Gene Ontology Consortium 2000), KEGG (20), COG (21) and SWISSPROT (22), are also included by BLASTing the sequences from each superfamily against sequences provided by these resources. Only hits with 95% sequence identity and an 80% residue overlap were used to annotate sequences.
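The hit selection step above amounts to an E-value cut-off combined with a minimum overlap. The Python sketch below shows one way to apply such a filter. It assumes that overlap means the percentage of the CATH domain length covered by the aligned region, which the text does not define, and it takes both thresholds as parameters because the exact values should be checked against the original paper. All names and numbers in the example are hypothetical.

```python
def overlap_percent(match_start, match_end, domain_length):
    """Percentage of the domain covered by the aligned region (inclusive ends)."""
    return 100.0 * (match_end - match_start + 1) / domain_length

def is_homologous_hit(evalue, match_start, match_end, domain_length,
                      max_evalue=0.01, min_overlap=60.0):
    """Accept a hit only if it is both significant and long enough."""
    return (
        evalue < max_evalue
        and overlap_percent(match_start, match_end, domain_length) >= min_overlap
    )

# Hypothetical hits against a 120-residue domain model.
print(is_homologous_hit(1e-5, 10, 100, 120))  # significant and well covered
print(is_homologous_hit(0.05, 10, 100, 120))  # covered but not significant
print(is_homologous_hit(1e-5, 10, 60, 120))   # significant but too short
```

The stricter annotation step described in the same paragraph could reuse the same pattern with an identity threshold in place of the E-value.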
Recent analysis of structural and functional divergence in highly populated CATH superfamilies (>5 structural relatives with <35% sequence identity) has been undertaken using data from the DHS. The 2DSEC algorithm was used to analyse multiple structural alignments of families and identify highly conserved structural cores and secondary structure embellishments or decorations to the common core. In some large superfamilies extensive embellishments were observed outside the core, and although these secondary structure insertions were frequently discontinuous in the protein chain, they were often co-located in 3D space (16). In many cases manual inspection revealed that the embellishment had aggregated to form a larger structural feature that was modifying the active site of the domain or creating new surfaces for domain or protein interactions. Data collected in the DHS clearly shows a relationship between structural divergence within a superfamily, sequence divergence of this superfamily amongst predicted domains in the genomes, and the number of distinct functional groups that can be identified for the superfamily (see Figure 4).
Figure 4. Relationship between sequence variability, structural variability and functional diversity in CATH superfamilies. Structural variation in a CATH superfamily as measured by the number of diverse structural subgroups (SSAP score <80 between groups)
LATERAL LINKS ACROSS THE CATH HIERARCHY TO CAPTURE EVOLUTIONARY DIVERGENCE AND EXPLORE THE STRUCTURAL CONTINUUM
Our analysis of structural divergence in CATH superfamilies (16) has revealed families where significant changes in the structures had occurred; in some cases 5-fold differences in the sizes of domains were identified, and sometimes it was apparent that the 'folds' of these very diverse relatives had effectively changed. Therefore, in these superfamilies more than one fold group can be identified, effectively breaking the hierarchical nature of the CATH classification, which implies that each relative within a CATH homologous superfamily should belong to the same CAT fold group.
In addition, an 'all versus all' HMM–HMM scan between all superfamily representatives revealed several cases of extremely remote homologues which had been classified into separate superfamilies and yet match with significant E-values. Structure comparison had failed to detect the relationship between these superfamilies because the structural divergence of the relatives was so extreme, sometimes constituting a change in architecture as well as fold group. In these cases homology was only suggested by the HMM-based scans and then manually validated by considering functional information and detailed evidence from the literature. In order to capture information on these distant homologies, links have been created between the superfamilies both on our web pages and in the CATH database. The data can now be found as a link from the CATH homepage (http://www.cathdb.info).
In the near future we also plan to provide web pages presenting cases of significant structural overlaps between superfamilies or fold groups. For these cases we are not currently able to find any additional evidence to support a distant evolutionary relationship, and these examples highlight the recurrence of large structural motifs between some folds and the existence of a structural continuum in some regions of fold space.
FEATURES
The CATH database can be accessed at http://www.cathdb.info. The web interface may be browsed or, alternatively, searched with PDB codes or CATH domain identifiers. There is also a facility for keyword searches. With the version 3.0 release we now make the raw and processed data files available, which include, for example, CATH domain PDB files, sequences and dssp files; they can be accessed through the CATH database main page. The Gene3D resource can be accessed through the CATH database or directly at http://www.cathdb.info/Gene3D. The DHS can be accessed through the CATH database or directly at http://www.cathdb.info/bsm/dhs.
Structural similarity is assessed using an automatic method (SSAP) (3,4), which scores 100 for identical proteins and generally returns scores above 80 for homologous proteins. More distantly related folds generally give scores above 70 (Topology or fold level), though in the absence of any sequence or functional similarity this may simply represent examples of convergent evolution, reinforcing the hypothesis that there exists a limited number of folds in nature (5,6).
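The SSAP score ranges quoted above can be turned into a rough interpretation helper. The Python sketch below treats the two figures (above 80 and above 70) as rules of thumb taken from the text; it is not the decision procedure used by CATH, which also weighs sequence and functional evidence. The function name and the returned labels are hypothetical.

```python
def interpret_ssap(score):
    """Map an SSAP score (0 to 100) to the level of similarity it suggests."""
    if not 0 <= score <= 100:
        raise ValueError("SSAP scores range from 0 to 100")
    if score > 80:
        return "likely homologous"
    if score > 70:
        return "similar fold"
    return "no clear structural relationship"

for score in (100, 85.3, 74.0, 52.1):
    print(score, interpret_ssap(score))
```

A score in the 'similar fold' range alone does not establish common ancestry, since, as the text notes, it may reflect convergent evolution.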
Abstract
We report the latest release (version 1.4) of the CATH protein domains database (http://www.biochem.ucl.ac.uk/bsm/cath). This is a hierarchical classification of 13 359 protein domain structures into evolutionary families and structural groupings. We currently identify 827 homologous families in which the proteins have both structural similarity and sequence and/or functional similarity. These can be further clustered into 593 fold groups and 32 distinct architectures. Using our structural classification and associated data on protein functions stored in the database (EC identifiers, SWISS-PROT keywords and information from the Enzyme database and literature), we have been able to analyse the correlation between the 3D structure and function. More than 96% of folds in the PDB are associated with a single homologous family. However, within the superfolds, three or more different functions are observed. Considering enzyme functions, more than 95% of clearly homologous families exhibit either single or closely related functions, as demonstrated by the EC identifiers of their relatives. Our analysis supports the view that determining structures, for example as part of a 'structural genomics' initiative, will make a major contribution to interpreting genome data.
Introduction FOR CATH
The CATH classification of protein domain structures was established in 1993 (1) as a hierarchical clustering of protein domain structures into evolutionary families and structural groupings, depending on sequence and structure similarity. There are four major levels, corresponding to protein class, architecture, topology or fold, and homologous family (Fig. 1). Since 1995, information about these structural groups and protein families has been accessible over the Web (http://www.biochem.ucl.ac.uk/bsm/cath), together with summary information about each individual protein structure (PDBsum) (2).
CATH consists of both phylogenetic and phenetic descriptors for protein domain relationships. At the lowest levels in the hierarchy, proteins are grouped into evolutionary families (Homologous families) for having either significant sequence similarity (≥35% identity) or high structural similarity and some sequence similarity (≥20% identity).
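The grouping rule for homologous families combines a sequence test with a structural test, and can be written as a short predicate. In the Python sketch below the two identity thresholds come from the text, while the SSAP cut-off standing in for 'high structural similarity' is an assumption supplied as a parameter. The function name is hypothetical.

```python
def same_homologous_family(sequence_identity, ssap_score, ssap_cutoff=80.0):
    """Apply the two alternative criteria for grouping domains into a family.

    sequence_identity is a percentage; ssap_cutoff is an assumed stand-in
    for 'high structural similarity', which the text does not quantify here.
    """
    if sequence_identity >= 35:
        return True  # significant sequence similarity alone is sufficient
    return ssap_score >= ssap_cutoff and sequence_identity >= 20

print(same_homologous_family(42.0, 65.0))  # sequence criterion
print(same_homologous_family(24.0, 88.0))  # structure plus some sequence similarity
print(same_homologous_family(12.0, 88.0))  # structure alone is not enough
```

The last case is the kind of pair that the text places at the fold level rather than in a single homologous family.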
The CATH database of protein structures contains approximately 18000 domains organized according to their (C)lass, (A)rchitecture, (T)opology and (H)omologous superfamily. Relationships between evolutionarily related structures (homologues) within the database have been used to test the sensitivity of various sequence search methods in order to identify relatives in Genbank and other sequence databases. Subsequent application of the most sensitive and efficient algorithms, gapped blast and the profile-based method Position Specific Iterated Basic Local Alignment Tool (PSI-BLAST), could be used to assign structural data to between 22 and 36% of microbial genomes in order to improve functional annotation and enhance understanding of biological mechanism. However, on a cautionary note, an analysis of functional conservation within fold groups and homologous superfamilies in the CATH database revealed that whilst function was conserved in nearly 55% of enzyme families, function had diverged considerably in some highly populated families. In these families, functional properties should be inherited far more cautiously, and the probable effects of substitutions in key functional residues must be carefully assessed.
Figure 1
Schematic representation of the (C)lass, (A)rchitecture and (T)opology/fold levels in the CATH database.
Figure 2
Snapshot of a web page showing data available in the CATH dictionary of homologous superfamilies for the subtilisin family (CATH id 3.40.50.200). Tables display the PDB codes for non-identical relatives in the family, together with EC identifier codes and information about the enzyme reactions. The multiple structural alignment shown has been coloured according to secondary structure assignments (red for helix, blue for strands).
The Architecture level in CATH groups proteins whose folds have similar 3D arrangements of secondary structures (e.g. barrel, sandwich or propellor), regardless of their connectivity, whilst the top level, Class, simply reflects the proportion of α-helix or β-strand secondary structures. Three major classes are recognised, mainly-α, mainly-β and α−β, since analysis revealed considerable overlap between the α+β and alternating α/β classes originally described by Levitt and Chothia (7).
Before classification, multidomain proteins are first separated into their constituent folds using a consensus method which seeks agreement between three independent algorithms (8). Whilst the protocol for updating CATH is largely automatic (9), several stages require manual validation, in particular establishing domain boundaries in proteins for which no consensus could be reached and checking the relationships of very distant homologues and proteins having borderline fold similarity. Although there are plans to assign the more regular architectures automatically, all architecture groupings are currently assigned manually.
A homologous family Dictionary is now available within CATH, which contains functional data, where available, for each protein within a homologous family. This includes EC identifiers, SWISS-PROT keywords and information from the Enzyme database or the literature (Fig. 2). Multiple structure-based alignments are also available, coloured according to secondary structure assignments or residue properties, and there are schematic plots showing domain representations annotated by protein-ligand interactions (DOMPLOTS) (A.E. Todd, C.A. Orengo and J.M. Thornton, submitted to Protein Engng). The topology of each domain is illustrated by schematic TOPS diagrams (http://www3.ebi.ac.uk/tops; 10).
Figure 3
CATH wheel plot showing the population of homologous families in different fold groups,
architectures and classes. The wheel is coloured according to protein class (red, mainly-α; green,
mainly-β; yellow, α-β; blue, few secondary structures). The size of the outer wheel represents the
number of homologous families in CATH, whilst each band in the outer wheel corresponds to a single
fold family. The size of each fold 'band' therefore reflects the number of homologous families having
that fold. It can be seen that most fold families contain a single homologous family. The superfold
families are shown as paler bands containing many homologous families. The inner wheel shows the
population of homologous families in the different architectures.
We have also recently set up a Web Server (11) which enables the user to scan the CATH database
with a newly determined protein structure and identify possible fold similarities or evolutionary
relationships. There are also plans to incorporate sequence searches (using BLAST or PSI-BLAST)
(12) to identify a probable fold for a new sequence.
The latest release of CATH (version 1.4, April 1998) contains 9342 protein chains from the PDB (13),
which divide into 13 359 domain folds. Currently 32 different architectures are recognised. Since the
last release, three new architectures have been described, including the five-bladed α-β propeller.
Grouping proteins on the basis of sequence, structure and functional similarity gives 827 evolutionary
homologous families (H-level), whilst recognising more distant structural similarity with no
accompanying sequence or function similarity gives rise to 593 different fold groups (T-level).
The population of the different levels in the CATH hierarchy is illustrated by the CATH wheel shown
in Figure 3. It can be seen that several highly populated fold families, which we describe as superfolds
(6) as they support a diverse range of sequences and more than three different functions, still account
for nearly 30% of non-homologous structures.
Implications for Structural Genomics
As the sequence databases grow rapidly, the need to interpret these sequences and assign functions to
specific genes becomes increasingly important. Many techniques exist for matching protein sequences
and thereby inheriting functional information. However, for very distant homologues there is often no
detectable sequence similarity, despite conservation of 3D structure and function. For these cases,
evolutionary relationships, and thereby functions, can only be assigned by comparing the structures.
Therefore a number of structural genomics initiatives are being proposed (14) which aim to identify
all the folds in nature, with the ultimate goal of being able to predict the function of a new protein from
its known or probable structure. The important questions to ask are: how many more folds do we need
to determine before we have the complete set, and how confident can we be in assigning function
between proteins having similar structures?
In the current genomes, on average only 30–46% of sequences can be assigned to a structural family
by recognising sequence similarity to a protein of known structure (15,16). With only ~600 unique
structures currently in the PDB, compared with ~20 000 sequence families, it is clear that we still
need to determine many more structures if we are to understand biology at the molecular level.
However, analysis of recently deposited structural data is very revealing. Figure 4a illustrates the
distribution of 2159 new structural domains classified in the 10 months from June 1997 to March
1998. A large proportion of these (79%) were clearly homologous (≥30% identity) to proteins of
known structure.
Of the remaining 443 structures (Fig. 4b), corresponding to 'new' sequences, we found only 8% were
novel folds, the remainder resembling a previously determined structure. Many of these, 199 (45%),
could be identified as clear homologues by having significant structure and sequence similarity (SSAP
≥80 and ≥20% sequence identity). A further 169 (38%) were probable homologues as, although the
sequence identity was below 20%, they had functional similarity and/or gave significant scores using
sequence search methods designed to detect very distant homologues (PSI-BLAST) (12). There
remained a further 40 (9%) proteins which were analogous, i.e. they had the same fold as a previous
entry but neither the sequence nor the function gave definite evidence of a common ancestor.
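The breakdown of the 443 new-sequence structures can be checked with a few lines of arithmetic. The sketch below uses only the counts quoted above; the number of novel folds is not stated in the text and is inferred here as the remainder, which is an assumption.

```python
# Counts quoted in the text for structures with 'new' sequences.
total_new = 443
counts = {
    "clear homologues": 199,
    "probable homologues": 169,
    "analogues": 40,
}
# Assumption: novel folds are whatever is left after the three groups above.
counts["novel folds"] = total_new - sum(counts.values())


def percentages(counts, total):
    """Return each category as a rounded percentage of the total."""
    return {name: round(100 * n / total) for name, n in counts.items()}


shares = percentages(counts, total_new)
for name, pct in shares.items():
    print(f"{name}: {counts[name]} ({pct}%)")

# Clear plus probable homologues, the structures assignable by homology.
homologue_share = round(100 * (199 + 169) / total_new)
print(f"homologues overall: {homologue_share}%")
```

Under that assumption the remainder is 35 structures, which rounds to the 8 per cent of novel folds quoted in the text, and the two homologue groups together give the 83 per cent figure used later in the discussion of function assignment.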
At the homologous superfamily level in CATH, a more detailed analysis of enzyme functions showed
that the majority of homologous enzyme families in CATH (>90%) contained proteins for which the
first three EC identifiers were the same. Considering those families where homologues have
significant sequence identity (≥20%) after structural alignment, 95% were found to have a single EC
identifier, whilst for families where proteins have more than 30% sequence similarity we observed
that 98% had a single EC code.
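The test applied to each family above (do all members share the first three EC identifiers?) can be sketched as follows. The function names are hypothetical, and the EC numbers in the example are illustrative inputs (trypsin and chymotrypsin, then an unrelated oxidoreductase), not data from the CATH analysis.

```python
def ec_fields(ec_number):
    """Split an EC number such as '3.4.21.4' into its four fields."""
    fields = ec_number.split(".")
    if len(fields) != 4:
        raise ValueError(f"expected four EC fields, got {ec_number!r}")
    return fields


def same_first_three(ec_numbers):
    """True if every EC number in the family shares its first three fields."""
    prefixes = {tuple(ec_fields(ec)[:3]) for ec in ec_numbers}
    return len(prefixes) == 1


# Illustrative input: trypsin (3.4.21.4) and chymotrypsin (3.4.21.1).
print(same_first_three(["3.4.21.4", "3.4.21.1"]))  # True
# Adding an alcohol dehydrogenase (1.1.1.1) breaks the agreement.
print(same_first_three(["3.4.21.4", "1.1.1.1"]))   # False
```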
Although assigning function on the basis of homology is common practice, it is clear that some
caution should be exercised, particularly where there is little or no sequence similarity. There are also
some clear examples where homologues with significant sequence similarity perform different
functions. The role of 'gene recruitment' is especially clear in the eye lens proteins, which function as
enzymes in other cellular environments but which are used as structural proteins in this context (17).
The extent of such 'gene recruitment' and context-sensitive function is really not known at this time.
For enzymes it is clear that catalytic function can change and evolve, usually to act on a different but
related substrate. Similarly, within the lipocalin family (CATH id 2.40.130.10) several proteins are
found with very similar structures which bind different fatty acids in the same region at the base of
the β-barrel (e.g. retinol, bilin, biotin).
Nearly half of the homologous families where two or more different EC numbers were observed
belong to the superfolds. This suggests that if a new protein is assigned to a superfold family, more
caution should be used when inheriting functional information, as there appears to be greater tolerance
to changes in sequence, and ultimately function, for these families. However, it is interesting to note
that many of these were TIM barrel or Rossmann folds. These are superfolds in which the substrate or
ligand commonly binds in the same place. This is in the base of the β-barrel for the TIMs and at the
crossover of the polypeptide chain for the doubly wound Rossmann structures.
Assignment of Function Through Structure
One of the reasons for determining structures is to derive more information to facilitate the assignment
of function. From our analysis of proteins in CATH, we suggest that structural data can help to assign
function in several ways:
(i) The structural data allow recognition of more distant homologues compared with
sequence data; in our analysis, 83% of structures with novel sequences could be assigned as
homologues in this way (note that such assignment of function is again subject to the caveats
imposed by 'gene recruitment' discussed above).
(ii) The structural data allow detailed inspection of the functional site, to suggest if and
how the function may have evolved. For example, if an enzyme has evolved to act on a
different substrate, the binding site may reveal, or at least suggest, possible changes in the
substrate.
(iii) For the superfolds, similarity of structure does not necessarily mean similarity of
function. However, the active site/binding sites are often conserved, e.g. in the TIM barrel or
Rossmann fold structures the ligand always binds at the same end of the barrel or sheet.
(iv) Some methods have already been developed, and will increasingly be the focus of
attention over the next few years, which aim to predict function ab initio from structure. For
example, enzymes can often be identified by the presence of a major cleft, which also locates
the active site (18). Similarly, critical surface patches which are used for molecular
recognition in binding other proteins or ligands may be identified using knowledge-based
approaches (19,20).
In summary, extrapolating the data from Figure 4 to a new genome, we can expect that of the 54–70%
of sequences which currently have no obvious sequence matches in the PDB, we will find nearly
80–90% to be homologous to a known family using the structural data alone. For the singlet folds this
will almost certainly reveal some clues to the function. For the superfolds, some folds will reveal
information on the functional class (e.g. enzyme for TIM barrels) or the location of the active site, if
not the specific function. Only 10–20% will be expected to be 'novel' folds. For these, the ab
initio methods referred to above may provide some clues to guide experiments. Therefore it is clear
that determining structures, as part of a 'structural genomics' initiative for example, will make a
major contribution to interpreting genome data.
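As a back-of-envelope illustration, the ranges in this summary can be combined to estimate what share of a whole genome falls into each group. This is a sketch only: pairing the low ends together and the high ends together is a crude assumption made here, not a calculation given in the text, and the function name is hypothetical.

```python
def share_range(outer, inner):
    """Combine two (low, high) percentage ranges into a percentage of the whole."""
    return (outer[0] * inner[0] / 100, outer[1] * inner[1] / 100)


unmatched = (54, 70)   # % of sequences with no obvious sequence match in the PDB
homologous = (80, 90)  # % of those expected to be homologous to a known family
novel = (10, 20)       # % of those expected to be novel folds

genome_homologous = share_range(unmatched, homologous)
genome_novel = share_range(unmatched, novel)
print("homologous via structure, % of genome:", genome_homologous)
print("novel folds, % of genome:", genome_novel)
```

Under these assumptions, structural data would assign roughly 43 to 63 per cent of a genome's sequences to known families, leaving something like 5 to 14 per cent as novel folds.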
JAIL
Just another interface library
Interfaces of macromolecules are a valuable basis to analyse the process of molecular recognition.
JAIL classifies not only the interfaces between domain architectures but also those between protein
chains and those between proteins and nucleic acids.
Of course not
Figure: Gt-alpha/Gi-alpha chimera (PDB-ID 1GOT) and the interfaces of 1GOT
Interacting proteins are difficult to crystallize and are rarely present within the Protein Data Bank.
Nevertheless, it is essential to analyse the interacting parts of the proteins to understand the
process of protein-protein docking.
To overcome this problem we have built up the JAIL database. Since interacting domains
exhibit structural features similar to those of proteins, all known interfaces between interacting
domains of the SCOP database were extracted and classified in JAIL.
Only a part of all protein structures is included in SCOP. In particular, new PDB entries are
not yet annotated. To overcome this problem, all interfaces between protein
chains were additionally calculated and included in the database. This type of interface also comprises
the interacting parts of the assumed biological units. The last important type of interface
provided here is composed of the interacting parts between proteins and nucleic acids.
Overall, the data set consists of about 180 000 interfaces.
JAIL is a convenient tool to browse through the interface library and to analyze single
interfaces. However, more general questions require large-scale analysis. For this purpose,
a detailed form enables the compiling of comprehensive non-redundant data sets for
download.
How is an interface defined?
A complete residue is part of an interface if at least one atom of the amino acid is located
within a range of 4.5 Ångström of any atom of the interacting domain or chain. One part of
an interface must consist of at least 5 C-alpha atoms in the case of protein chains. In the
case of nucleic acids, the relevant atoms are calculated on the basis of the phosphorus atoms
of the RNA/DNA backbone.
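The distance criterion above can be sketched in a few lines. This is a minimal illustration, taking the cutoff as 4.5 Ångström and using a hypothetical atom layout of (residue id, coordinates); it is not JAIL's own implementation and uses toy coordinates rather than a real PDB entry.

```python
import math

CUTOFF = 4.5  # Angstrom, the distance threshold assumed from the definition


def interface_residues(atoms_a, atoms_b, cutoff=CUTOFF):
    """Residues of A with at least one atom within `cutoff` of any atom of B.

    Each atom is a tuple (residue_id, (x, y, z)); this layout is hypothetical.
    """
    hits = set()
    for res_id, pos_a in atoms_a:
        if res_id in hits:
            continue  # the whole residue already counts as interface
        if any(math.dist(pos_a, pos_b) <= cutoff for _, pos_b in atoms_b):
            hits.add(res_id)
    return hits


# Toy coordinates, not taken from any PDB entry.
chain_a = [(1, (0.0, 0.0, 0.0)), (1, (1.5, 0.0, 0.0)), (2, (20.0, 0.0, 0.0))]
chain_b = [(101, (5.0, 0.0, 0.0)), (102, (30.0, 0.0, 0.0))]

print(interface_residues(chain_a, chain_b))  # {1}
print(interface_residues(chain_b, chain_a))  # {101}
```

Residue 1 qualifies because one of its two atoms lies 3.5 Ångström from an atom of the partner chain, even though its other atom is 5.0 Ångström away; this is what "a complete residue is part of an interface" means in practice. The all-against-all loop is fine for a sketch; a real calculation over many structures would normally use a spatial index.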
What are biounits?
The primary coordinate file deposited in the PDB generally contains one asymmetric unit.
The asymmetric unit is the smallest portion of a crystal structure to which crystallographic
symmetry can be applied to generate one unit cell. The biological molecule (biounit) is believed
to be the functional unit of the protein. Frequently those units can be assumed or calculated
when additional information is available. The biological units of many proteins are deposited
in a separate section of the PDB database and can be used for interface calculations.
In which way are redundant interfaces excluded in the download section?
Redundancy is excluded in two different ways: by structure and by sequence. The
sequence clustering is based on the CD-HIT program. The structural clustering is defined by
the protein families and superfamilies of the SCOP classification, which classifies
proteins by domain architecture.
Which settings in the download section are best for my own research?
The selection of the datasets depends on the type of interactions (protein-protein or
protein-nucleic acid) and the level of diversity that is desired. A maximum sequence identity of 50%
results in a higher diversity than a setting of 95%. The default settings include interfaces of
domain-domain interactions as well as interfaces between interacting chains. All interfaces of
chains that were already covered by the SCOP domain interfaces are excluded by default.
This procedure results in a high number of interfaces that are still diverse enough for
statistical analysis.
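The effect of the identity threshold on diversity can be illustrated with a toy greedy filter. This is a sketch only: it is not the algorithm used by the clustering program named above, the identity measure is a crude ungapped comparison chosen for brevity, and the sequences and function names are hypothetical.

```python
def identity(a, b):
    """Crude ungapped identity: matching positions over the shorter length."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / n


def representatives(sequences, max_identity):
    """Greedy filter: keep a sequence only if it is no more than
    `max_identity` identical to every representative kept so far."""
    kept = []
    for seq in sorted(sequences, key=len, reverse=True):  # longest first
        if all(identity(seq, rep) <= max_identity for rep in kept):
            kept.append(seq)
    return kept


# Toy sequences (hypothetical, not real proteins).
seqs = ["ACDEFGHIKL", "ACDEFGHIKV", "ACDEFGAAAA", "WWWWWYYYYY"]
print(len(representatives(seqs, 0.95)))  # 4
print(len(representatives(seqs, 0.50)))  # 2
```

With the 95% setting all four toy sequences survive, including two that differ at a single position. With the 50% setting only two mutually dissimilar sequences remain, so the resulting set is smaller but more diverse.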
What is meant by "show conservation" in Jmol?
The conservation of protein sequences is defined by the mutation rates at each amino acid
position. For JAIL this information was retrieved from ConSurf. ConSurf is a derived
database merging structural and sequence information.
Database scheme
Search for:
PDB-Id (e.g. 1aay)
SCOP-Id (e.g. d1az0a_)
EC-Number (e.g. 311)
Accession number (e.g. P03697)
Protein name (e.g. capsid protein)
Search in the following interface types:
Domain/Domain (SCOP)
Chain/Chain
Protein/Nucleic
Biounit/Biounit
None
Fulltext search: Keyword
SCOP text search: Keyword
Consider only interfaces having the following location: intra, inter, don't care
MMDB
Experimentally resolved structures of proteins, RNA and DNA, derived from the Protein Data
Bank (PDB), with value-added features such as explicit chemical graphs, computationally
identified 3D domains (compact substructures) that are used to identify similar 3D structures, as
well as links to literature, similar sequences, information about chemicals bound to the
structures, and more. These connections make it possible, for example, to find 3D structures
for homologs of a protein sequence of interest, then interactively view the sequence-structure
relationships, active sites, bound chemicals, journal articles and more.
Three-dimensional structures are now known within many protein families and it is
quite likely, in searching a sequence database, that one will encounter a homolog
with known structure. The goal of Entrez's 3D-structure database is to make this
information, and the functional annotation it can provide, easily accessible to
molecular biologists. To this end, Entrez's search engine provides three powerful
features: (i) Sequence and structure neighbors; one may select all sequences similar
to one of interest, for example, and link to any known 3D structures. (ii) Links
between databases; one may search by term matching in MEDLINE, for example, and
link to 3D structures reported in these articles. (iii) Sequence and structure
visualization; identifying a homolog with known structure, one may view
molecular-graphic and alignment displays to infer approximate 3D structure. In this article we
focus on two features of Entrez's Molecular Modeling Database (MMDB) not
described previously: links from individual biopolymer chains within 3D structures
to a systematic taxonomy of organisms represented in molecular databases, and
links from individual chains (and compact 3D domains within them) to structure
neighbors, other chains (and 3D domains) with similar 3D structure. MMDB may be
accessed at http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Structure
SUPERFAMILY is a database of structural and functional annotation for all
proteins and genomes [1][2][3][4][5].
The SUPERFAMILY annotation is based on a collection of hidden Markov models which
represent structural protein domains at the SCOP superfamily level [6]. A superfamily groups
together domains which have an evolutionary relationship. The annotation is produced by
scanning protein sequences from completely sequenced genomes against the hidden Markov
models.
For each protein you can:
  Submit sequences for SCOP classification
  View domain organisation, sequence alignments and protein sequence details
For each genome you can:
  Examine superfamily assignments, phylogenetic trees, domain organisation lists and
  networks
  Check for over- and under-represented superfamilies within a genome
For each superfamily you can:
  Inspect SCOP classification, functional annotation, Gene Ontology annotation, InterPro
  abstract and genome assignments
  Explore taxonomic distribution of a superfamily across the tree of life
All annotation, models and the database dump are freely available for download to everyone.
Purpose
SUPERFAMILY classifies amino acid sequences into known structural domains especially
into SCOP superfamilies The superfamilies are groups of proteins which have structural
evidence to support a common evolutionary ancestor but may not have detectable
sequence homology
Major Features
Sequence search: Submit your protein or DNA sequence for SCOP superfamily and family
level classification.
Keyword search: Search for superfamily, family or species names plus sequence, SCOP, PDB
or hidden Markov model IDs.
Domain assignments: Domain assignments, alignments and architectures for completely
sequenced eukaryotic and prokaryotic organisms, plus sequence collections.
Comparative genomics tools: Browse unusual (over- and under-represented) superfamilies and
families, adjacent domain pair lists and graphs, unique domain pairs, domain combinations,
domain architecture co-occurrence networks and domain distribution across taxonomic
kingdoms for each organism.
Genome statistics: For each genome: number of sequences, number of sequences with
assignment, percentage of sequences with assignment, percentage total sequence coverage,
number of domains assigned, number of superfamilies assigned, number of families assigned,
average superfamily size, percentage produced by duplication, average sequence length,
average length matched, number of domain pairs and number of unique domain architectures.
Gene Ontology: Domain-centric Gene Ontology (GO), automatically annotated, by Hai Fang.
Phenotype Ontology: Domain-centric phenotype/anatomy ontology including Disease Ontology,
Human Phenotype, Mouse Phenotype, Worm Phenotype, Yeast Phenotype, Fly Phenotype,
Fly Anatomy, Zebrafish Anatomy, Xenopus Anatomy, Arabidopsis Plant.
Superfamily annotation: InterPro abstracts for 1052 superfamilies and Gene Ontology (GO)
annotation for 763 superfamilies.
Functional annotation: Functional annotation of SCOP 1.73 superfamilies, by Christine Vogel.
Phylogenetic trees: Trees are generated using heuristic parsimony methods and are based on
protein domain architecture data for all genomes in SUPERFAMILY. Genome combinations or
specific clades can be displayed as individual trees.
Similar domain architectures: Find the 10 domain architectures which are most similar to a
domain architecture of interest.
Hidden Markov models: Produce SCOP domain assignments for your sequences using the
SUPERFAMILY models. HMM visualisation by Martin Madera, e.g. model 0045110.
Profile comparison: Find remote domain matches when the HMM search fails to find a
significant match. Profile comparison (PRC) for aligning and scoring two profile hidden
Markov models, by Martin Madera.
Web services: Distributed Annotation Server and linking to SUPERFAMILY.
Downloads: Sequences, assignments, models, MySQL database and scripts, updated weekly.
The SUPERFAMILY database provides protein domain assignments, at the SCOP superfamily
level, for the predicted protein sequences in over 400 completed genomes. A superfamily groups
together domains of different families which have a common evolutionary ancestor based on
structural, functional and sequence data. SUPERFAMILY domain assignments are generated
using an expert curated set of profile hidden Markov models. All models and structural
assignments are available for browsing and download from http://supfam.org. The web interface
includes services such as domain architectures and alignment details for all protein assignments,
searchable domain combinations, domain occurrence network visualization, detection of over- or
under-represented superfamilies for a given genome by comparison with other genomes,
assignment of manually submitted sequences and keyword searches. In this update we describe
the SUPERFAMILY database and outline two major developments: (i) incorporation of family
level assignments and (ii) a superfamily-level functional annotation. The SUPERFAMILY
database can be used for general protein evolution and superfamily-specific studies, genomic
annotation, and structural genomics target suggestion and assessment.
Identifier
An identifier ("locus" in GenBank, "entry name" in SWISS-PROT) is a string of
letters and digits that generally is interpretable in some meaningful way by a human,
for instance as a recognizable abbreviation of the full protein or gene name.
SWISS-PROT uses a system where the entry name consists of two parts: the first
denotes the protein and the second part denotes the species it is found in. For
example, KRAF_HUMAN is the entry name for the Raf-1 oncogene from Homo
sapiens.
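The two-part entry name can be taken apart mechanically. The sketch below assumes the parts are separated by an underscore, as in the KRAF_HUMAN example; the function name is hypothetical.

```python
def split_entry_name(entry_name):
    """Split a SWISS-PROT entry name into (protein, species) parts."""
    protein, sep, species = entry_name.partition("_")
    if not sep or not protein or not species:
        raise ValueError(f"not a two-part entry name: {entry_name!r}")
    return protein, species


print(split_entry_name("KRAF_HUMAN"))  # ('KRAF', 'HUMAN')
```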
An identifier can change. For example, the database curators may decide
that the identifier for an entry is no longer appropriate. However, this does not
happen very often; in fact, it happens so rarely that it is not really a big problem.
Accession code (number)
An accession code (or number) is a number (possibly with a few characters in
front) that uniquely identifies an entry in its database. For example, the accession
code for KRAF_HUMAN in SWISS-PROT is P04049.
The main conceptual difference from the identifier is that it is supposed to be stable:
any given accession code will, as soon as it has been issued, always refer to that
entry or its ancestors. It is often called the primary key for the entry. The
accession code, once issued, must always point to its entry, even after large changes
have been made to the entry. This means that in discussions about specific database
entries (e.g. an article about a specific protein), one should always give the
accession code for the entry in the relevant database.
In the case where two entries are merged into a single entry, the new entry
will have both accession codes, where one will be the primary and the other
the secondary accession code. When an entry is split into two, both new entries
will get new accession codes, but will also have the old accession code as a secondary
code.
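The merge and split rules can be modelled with a small data structure. This is an illustrative sketch, not the scheme of any particular database: the class, the function names and the accession codes are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    primary: str
    secondary: list = field(default_factory=list)

    def accessions(self):
        """All codes that must keep pointing at this entry."""
        return [self.primary] + self.secondary


def merge(kept, absorbed):
    """Merge two entries: one code stays primary, the others become secondary."""
    return Entry(kept.primary, kept.secondary + absorbed.accessions())


def split(entry, new_a, new_b):
    """Split an entry: each part gets a new primary and inherits the old codes."""
    old = entry.accessions()
    return Entry(new_a, list(old)), Entry(new_b, list(old))


# Hypothetical accession codes, for illustration only.
merged = merge(Entry("A00001"), Entry("A00002"))
print(merged.accessions())  # ['A00001', 'A00002']

part1, part2 = split(merged, "A00003", "A00004")
print(part1.accessions())   # ['A00003', 'A00001', 'A00002']
print(part2.accessions())   # ['A00004', 'A00001', 'A00002']
```

In both operations no code is ever discarded, which is what makes it safe to cite an accession code in an article: a lookup by any previously issued code still finds an entry.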
NUCLEOTIDE DATABASES
NCBI's sequence databases accept genome data from sequencing projects
from around the world and serve as the cornerstone of bioinformatics
research
GenBank An annotated collection of all publicly available nucleotide and amino acid
sequences
EST database A collection of expressed sequence tags or short single-pass
sequence reads from mRNA (cDNA)
GSS database A database of genome survey sequences or short single-pass
genomic sequences
HomoloGene A gene homology tool that compares nucleotide sequences between
pairs of organisms in order to identify putative orthologs
HTG database A collection of high-throughput genome sequences from large-scale
genome sequencing centers including unfinished and finished sequences
SNPs database A central repository for both single-base nucleotide substitutions
and short deletion and insertion polymorphisms
RefSeq A database of non-redundant reference sequence standards, including
genomic DNA contigs, mRNAs and proteins for known genes. Multiple collaborations,
both within NCBI and with external groups, support our data-gathering efforts
STS database A database of sequence tagged sites or short sequences that are
operationally unique in the genome
UniSTS A unified non-redundant view of sequence tagged sites (STSs)
UniGene A collection of ESTs and full-length mRNA sequences organized into
clusters each representing a unique known or putative human gene annotated with
mapping and expression information and cross-references to other sources
DNA & RNA Databases
Major Sequence Repositories – Human Chromosome Information – Organelle
Genome Databases – RNA Databases – Comparative & Phylogenetic Databases –
SNPs, Mutations and Variations Databases – Alternative Splicing Databases –
Specialized Databases
Major Sequence Repositories
DDBJ DNA databank of Japan
EMBL Maintained by EMBL
GenBank Maintained by NCBI
Human Chromosome Information
Chromosome information is available for chromosomes 1–22, X and Y
Organelle Genome Databases
OGMP Organelle genome megasequencing program
GOBASE An organelle genome database
MitoMap Human mitochondrial genome database
RNA Databases
Rfam RNA family database
RNA base Database of RNA structures
tRNA database Database of tRNAs
tRNA tRNA sequences and genes
sRNA Small RNA database
Comparative & Phylogenetic Databases
COG Phylogenetic classification of proteins
DHMHD Human-mouse homology database
HomoloGene Gene homologies across species
Homophila Human disease to Drosophila gene database
HOVERGEN Database of homologous vertebrate genes
TreeBase A database of phylogenetic knowledge
XREF Cross-referencing with model organisms
SNPs, Mutations & Variations Databases
ALPSbase Database of mutations causing human ALPS
dbSNP Single nucleotide polymorphism database at NCBI
HGVbase Human Genome Variation database
Alternative Splicing Databases
ASAP Alternate splicing analysis tool at UCLA
ASG Alternate splicing gallery
HASDB Human alternative splicing database at UCLA
AsMamDB alternatively spliced genes in human mouse and rat
ASD Alternative splicing database at CSHL
Specialised Databases
ABIM Links to several genomics database
ACUTS Ancient conserved untranslated sequences
AGSD Animal genome size database
AmiGO The Gene Ontology database
ARGH The acronym database
ASDB Database of alternatively spliced genes
BACPAC BAC and PAC genomic DNA library info
BBID Biological Biochemical image database
Cardiac gene database
CHLC Genetic markers on chromosomes
COGENT Complete genome tracking database
COMPEL Composite regulatory elements in eukaryotes
CUTG Codon usage database
dbEST Database of expressed sequences or mRNA
dbGSS Genome survey sequence database
dbSTS Sequence tagged sites (STS)
DBTSS Database of transcriptional start sites
DOGS Database of genome sizes
EID The exon-intron database – Harvard
Exon-Intron Exon-Intron database – Singapore
EPD Eukaryotic promoter database
FlyTrap HTML based gene expression database
GDB The genome database
GenLink Resources for human genetic and telomere research
GeneKnockouts Gene knockout information
GENOTK Human cDNA database
GEO Gene expression omnibus NCBI
GOLD Information on genome projects around the world
GSDB The Genome Sequence DataBase
HGI TIGR human gene index
HTGS High-throughput genomic sequence at NCBI
IMAGE The largest collection of DNA sequences clones
IMGT The international ImMunoGeneTics information system
IPCN Index to Plant Chromosome Numbers database
LocusLink Single query interface to sequence and genetic loci
TelDB The telomere database
MitoDat Mitochondrial nuclear genes
Mouse EST NIA mouse cDNA project
MPSS Searchable databases of several species
NDB Nucleic acid database
NEDO Human cDNA sequence database
NPD Nuclear protein database
Oomycetes DB Oomycetes database at Virginia Bioinformatics Institute
PLACE Database of plant cis-acting regulatory DNA elements
RDP Ribosomal database project
RDB Receptor database at NIHS Japan
Refseq The NCBI reference sequence project
RHdb Radiation hybrid physical map of chromosomes
SHIGAN SHared Information of GENetic resources Japan
SpliceDB Canonical and non-canonical splice site sequences
STACK Consensus human EST database
TAED The adaptive evolution database
TIGR Curated databases of microbes plants and humans
TRANSFAC The Transcription Factor Database
TRRD Transcription Regulatory region database
UniGene Cluster of sequences for unique genes at NCBI
UniSTS Nonredundant collection of STS
Protein Databases
Protein Sequence Databases – Protein Structure Databases – Protein Domains, Motifs
and Signatures – Others
Protein Sequence Databases
Antibodies Sequence and Structure
BRENDA Enzyme database
CD Antigens Database of CD antigens
dbCFC Cytokine family database
Histons Histone sequence database
HPRD Human protein reference database
InterPro Integrated documentation resources for protein families
iProClass An integrated protein classification database
KIND A non-redundant protein sequence database
MHCPEP Database of MHC binding peptides
MIPS Munich information centre for protein sequences
PIR Annotated and non-redundant protein sequence database
PIR-ALN Curated database of protein sequence alignments
PIR-NREF PIR nonredundant reference protein database
PMD Protein mutant database
PRF Protein research foundation Japan
ProClass Non-redundant protein database
ProtoMap Hierarchical classification of swissprot proteins
REBASE Restriction enzyme database
RefSeq Reference sequence database at NCBI
SwissProt Curated protein sequence database
SPTR Comprehensive protein sequence database
Transfac Transcription factor database
TrEMBL Annotated translations of EMBL nucleotide sequences
Tumor gene database Genes with cancer-causing mutations
WD repeats WD-repeat family of proteins
Protein Structure Databases
Cath Protein structure classification
HIV Protease HIV protease database 3D structure
PDB 3-D macromolecular structure data
PSI Protein structure initiative
S2F Structure to function project
Scop Structural Classification of Proteins
Protein Domains, Motifs & Signatures
BLOCKS Multiply aligned segments of conserved protein regions
CDD Conserved domain database and search service
DOMO Homologous protein domain families
Pfam Database of protein domains and HMMs
ProDom Protein domain database
Prints Protein motif fingerprint database
Prosite Database of protein families and domains
SMART Simple modular architecture research tool
TIGRFAM Protein families based on HMMs
Others
Phospho Site Database of phosphorylation sites
PROW Protein reviews on the web
Protein Lounge Complete systems biology
Other Databases
Carbohydrate Databases
Carb DB Carbohydrate Sequence and Structure Database
GlycoWord Glycoscience related information
SPECARB Raman Spectra of carbohydrates
Other Databases
AlzGene Alzheimer's disease
Polygenic pathways Alzheimer's disease, Bipolar disorder or Schizophrenia
Model Organism Databases and Resources
Arabidopsis – Bacteria – Sea Bass – Cat – Cattle – Chicken – Cotton –
CyanoBacteria – Daphnia – Deer – Dictyostelium – Dog – Frog – Fruit Fly – Fungus –
Goat – Horse – Medaka Fish – Maize – Malaria – Mosquito – Mouse – Pig – Plants –
Protozoa – Puffer Fish – Rat – Rice – Rickettsia – Salmon – Sheep – Soy –
Sorghum – Tetraodon – Tilapia – Turkey – Viruses – Worm – Yeast – Zebra Fish
General Information
GMOD Generic Model Organism Database
Model Organisms The WWW virtual library of model organisms
Arabidopsis thaliana
ABRC Arabidopsis biological resource center
AGI Arabidopsis genome initiative
AREX Arabidopsis gene expression database
Arabinet Arabidopsis information on the www
AtGDB An Arabidopsis thaliana plant genome database
AtGI TIGR Arabidopsis thaliana gene index
ATGC Genome sequencing at ATGC
ATIDB Arabidopsis insertion database
CSHL Arabidopsis genome analysis at Cold Spring Harbor
ESSA Arabidopsis thaliana project at MIPS
Genoscope AGI in France
Kazusa Arabidopsis thaliana genome info Japan
MPSS Massively parallel signature sequencing
NASC Nottingham Arabidopsis stock center
Stanford Sequencing of the Arabidopsis genome at Stanford
TAIR Arabidopsis information resource
TIGR TIGR Arabidopsis genome annotation database
Wustl Arabidopsis genome at Washington university
Trees A forest tree genome database
Bacterial genomes
B. subtilis Bacillus subtilis database
Chlamydomonas Chlamydomonas genetics center
E. coli E. coli genome project
MGD Microbial germplasm database
Microbial Microbial Genome Gateway
Microbial Microbial genomes
Micado Genetic maps of B. subtilis and E. coli
MycDB An integrated Mycobacterial database
Neisseria Neisseria meningitidis genome
Neurospora Neurospora crassa database
OralGen Oral pathogen database
Salmonella Salmonella information
STDGen Sexually transmitted disease database
Bass
Bass Sea Bass Mapping project
Cat (Felis catus)
Cat ArkDB Cat mapping database
Cattle (Bos taurus)
ARK Farm animals
BoLA Bovine MHC information
Bovin Bovine genome database
BovMap Mapping the bovine genome
CaDBase Genetic diversity in cattle
ComRad Comparative radiation hybrid mapping
Cow ArkDB Bovine ArkDB
GemQual Genetics of meat quality
Chicken (Gallus gallus)
Chicken Poultry gene mapping project
ChickMap Chicken genome project
Chicken ArkDB Chicken database
ChickEST Chick EST database
Poultry Poultry genome project
Cotton
Cotton Cotton data collection site
Cyanobacteria (blue-green algae)
Cyanobacteria Anabaena genome
Daphnia (Crustacea)
Daphnia pulex Daphnia genomics consortium
Deer
Deer ArkDB Deer mapping database
Dictyostelium discoideum
Dicty_cDB Dictyostelium discoideum cDNA project
DGP Dictyostelium discoideum genome project
Dictybase Online informatics resources for Dictyostelium
Dog (Canis familiaris)
Dog Dog genome project
Dog genome project
Frog (Xenopus)
Xenbase A Xenopus web resource
Xenopus Xenopus tropicalis genome
Fruit fly (Drosophila melanogaster)
ENSEMBL Drosophila Genome Browser at ENSEMBL
Fruitfly Drosophila genome project at Berkeley
FlyBase A Database of the Drosophila Genome
FlyMove A Drosophila multimedia database
FlyView A Drosophila image database
Fungus
Aspergillus Aspergillus Genomics
Candida Candida albicans information page
FungalWeb Fungi database
FGSC Fungal genetic stocks center
Goat (Capra hircus)
Goat GoatMap mapping the caprine genome
Horse (Equus caballus)
Horse ArkDB Horse mapping database
Medaka Fish
Medaka Medaka fish home page
Maize
Maize Maize genome database
Malaria (Plasmodium spp)
Malaria Malaria genetics and genomics
PlasmoDB Plasmodium falciparum genome database
Parasites Parasite databases of clustered ESTs
Parasite Genome Parasite genome databases
Mosquito
Mosquito Mosquito genome web server
Mouse (Mus musculus)
ENSEMBL Mouse genome server at ENSEMBL
Jackson Lab Mouse Resources
MRC Mouse genome center at MRC UK
MGI Mouse genome informatics at Jackson Labs
MGD Mouse genome database
MGS Mouse genome sequencing at NIH
MIT Genetic and physical maps of the mouse genome
Mouse SNP Mouse SNP database
NCI Mouse repository
NIH NIH mouse initiative
ORNL Mutant mouse database
RIKEN Mouse resources
Rodentia The whole mouse catalog
Pig (Sus scrofa)
INCO Pig trait gene mapping
Pig Pig EST database
Pig Pig gene mapping project
PiGBase Pig genome mapping
Pig ArkDB Pig Ark DB
Plants
PlantGDB Resources for plant comparative genomics
Protozoa
Protozoa Protozoan genomes
Pufferfish
Fugu Puffer fish project UK site
Fugu Fugu genome project Singapore
Fugu Puffer fish project USA
Rat (Rattus norvegicus)
MIT Genetic maps of the Rat genome
NIH Rat genomics and genetics
Rat RatMap
RGD Rat genome database
Rice (Oryza sativa)
MPSS Massively parallel signature sequencing
Rice-research Rice genome sequence database
Rice Rice genome project
Rickettsia
RicBase Rickettsia genome database
Salmon
Salmon ArkDB Salmon mapping database
Sheep (Ovis aries)
Sheep Sheep gene mapping
SheepBase Sheep gene mapping
Sheep ArkDB Sheep mapping database
Soy
Soy Soybeans database
Sorghum
Sorghum Sorghum Genomics
Tetraodon
Tetraodon Tetraodon nigroviridis genome
Tetraodon Tetraodon nigroviridis genome at Whitehead
Tilapia
HCGS Tilapia genome
Tilapia ArkDB Tilapia mapping database
Turkey
Turkey ArkDB Turkey mapping database
Viruses
HIV HIV sequence database
Herpes Human herpes virus 5 database
Worm (Caenorhabditis elegans)
C. elegans C. elegans genome sequencing project
NemBase Resource for nematode sequence and functional data
WormAtlas Anatomy of C. elegans
WormBase The genome and biology of C. elegans
ACEDB A C. elegans database
WWW Server C. elegans web server
Yeast
SCPD The promoter database of Saccharomyces cerevisiae
SGD Saccharomyces genome database
S. pombe Schizosaccharomyces pombe genome project
TRIPLES Functional analysis of Yeast genome at Yale
Yeast Intron database Spliceosomal introns of the yeast
Zebra fish (Danio rerio)
ZFIN Zebrafish information network
ZGR Zebrafish genome resources
ZIS Zebrafish information server
Zebrafish Zebrafish webserver
DOMAIN DATABASE
Domains can be thought of as distinct functional and/or structural units of a protein. These two classifications coincide rather often, as a matter of fact, and what is found as an independently folding unit of a polypeptide chain often also carries a specific function. Domains are often identified as recurring (sequence or structure) units, which may exist in various contexts. In molecular evolution such domains may have been utilized as building blocks, and may have been recombined in different arrangements to modulate protein function. We can define conserved domains as recurring units in molecular evolution, the extents of which can be determined by sequence and structure analysis. Conserved domains contain conserved sequence patterns or motifs, which allow for their detection in polypeptide sequences.
The goal of the NCBI conserved domain curation project is to provide database users with insights into how patterns of residue conservation and divergence in a family relate to functional properties, and to provide useful links to more detailed information that may help to understand those sequence/structure/function relationships. To do this, CDD curators include the following types of information in order to supplement and enrich the traditional multiple sequence alignments that form the foundation of domain models: 3-dimensional structures and conserved core motifs, conserved features/sites, phylogenetic organization, and links to electronic literature resources.
Conserved Domain Database (CDD)
CDD is a protein annotation resource that consists of a collection of well-annotated multiple sequence alignment models for ancient domains and full-length proteins. These are available as position-specific score matrices (PSSMs) for fast identification of conserved domains in protein sequences via RPS-BLAST. CDD content includes NCBI manually curated domains, which use 3D-structure information to explicitly define domain boundaries and provide insights into sequence/structure/function relationships, as well as domain models imported from a number of external source databases (Pfam, SMART, COG, PRK, TIGRFAM).
CD-Search & Batch CD-Search
CD-Search is NCBI's interface to searching the Conserved Domain Database with protein or nucleotide query sequences. It uses RPS-BLAST, a variant of PSI-BLAST, to quickly scan a set of pre-calculated position-specific scoring matrices (PSSMs) with a protein query. The results of CD-Search are presented as an annotation of protein domains on the user query sequence, and can be visualized as domain multiple sequence alignments with embedded user queries. High-confidence associations between a query sequence and conserved domains are shown as specific hits. The CD-Search Help provides additional details, including information about running CD-Search locally.
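To make the idea of scanning a query with a position-specific scoring matrix concrete, here is a minimal sketch. It is not RPS-BLAST and does not reproduce its statistics; the matrix values, the default score for unlisted residues, and the function names are all hypothetical and chosen only for illustration.

```python
# Toy PSSM: one dict per model position, mapping residue -> score.
# All values are made up for illustration; real PSSMs cover all 20 residues.
TOY_PSSM = [
    {"G": 3, "A": 1},
    {"K": 2, "R": 2},
    {"S": 3, "T": 2},
]
DEFAULT_SCORE = -1  # assumed score for any residue not listed at a position


def score_window(window, pssm):
    """Sum the position-specific scores of one ungapped window."""
    return sum(column.get(residue, DEFAULT_SCORE)
               for residue, column in zip(window, pssm))


def best_hit(sequence, pssm):
    """Slide the PSSM along the sequence; return (start, score) of the best window."""
    width = len(pssm)
    if len(sequence) < width:
        return None
    scored = [(score_window(sequence[i:i + width], pssm), i)
              for i in range(len(sequence) - width + 1)]
    score, start = max(scored, key=lambda item: item[0])
    return start, score
```

For example, `best_hit("MAGKSL", TOY_PSSM)` picks the window `GKS` starting at index 2. A real search would also allow gaps and would convert raw scores into E-values.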
Batch CD-Search serves as both a web application and a script interface for a conserved domain search on multiple protein sequences, accepting up to 100,000 proteins in a single job. It enables you to view a graphical display of the concise or full search result for any individual protein from your input list, or to download the results for the complete set of proteins. The Batch CD-Search Help provides additional details.
CDART: Domain Architectures
Conserved Domain Architecture Retrieval Tool (CDART) performs similarity searches of the Entrez Protein database based on domain architecture, defined as the sequential order of conserved domains in protein queries. CDART finds protein similarities across significant evolutionary distances using sensitive domain profiles rather than direct sequence similarity. Proteins similar to the query are grouped and scored by architecture. You can search CDART directly with a query protein sequence or, if a sequence of interest is already in the Entrez Protein database, simply retrieve the record, open its Links menu and select Domain Relatives to see the precalculated CDART results. Relying on domain profiles allows CDART to be fast and, because it relies on annotated functional domains, informative.
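The notion of a domain architecture as the sequential order of conserved domains can be sketched in a few lines. This is only an illustration of grouping proteins by architecture; it is not CDART's scoring method, and the protein IDs and domain names below are hypothetical.

```python
from collections import defaultdict


def group_by_architecture(proteins):
    """Group protein IDs by architecture, i.e. the ordered tuple of their domains.

    `proteins` maps a protein ID to its list of domain names in N-to-C order.
    """
    groups = defaultdict(list)
    for protein_id, domains in proteins.items():
        groups[tuple(domains)].append(protein_id)
    return dict(groups)


# Hypothetical example data: P1 and P2 share an architecture, P3 has the
# same domains in a different order and therefore a different architecture.
EXAMPLE = {
    "P1": ["SH3", "SH2", "Kinase"],
    "P2": ["SH3", "SH2", "Kinase"],
    "P3": ["SH2", "SH3", "Kinase"],
}
```

Calling `group_by_architecture(EXAMPLE)` puts P1 and P2 in one group and P3 in another, which shows why order matters in this definition.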
CDTree
CDTree is a helper application for your web browser that allows you to interactively view and examine conserved domain hierarchies curated at NCBI. CDTree works with Cn3D as its alignment viewer/editor; it is used in the CDD curation process and is both a classification and research tool for functional annotation and the study of protein and protein domain families.
Content
CDD content includes NCBI manually curated domain models and domain models imported from a number of external source databases (Pfam, SMART, COG, PRK, TIGRFAMs). What is unique about NCBI-curated domains is that they use 3D-structure information to explicitly define domain boundaries, align blocks, amend alignment details, and provide insights into sequence/structure/function relationships. Manually curated models are organized hierarchically if they describe domain families that are clearly related by common descent. To provide a non-redundant view of the data, CDD clusters similar domain models from various sources into superfamilies.
Searching the database
The collection is also part of NCBI's Entrez query and retrieval system, crosslinked to numerous other resources. CDD provides annotation of domain footprints and conserved functional sites on protein sequences. Precalculated domain annotation can be retrieved for protein sequences tracked in NCBI's Entrez system, and CDD's collection of models can be queried with novel protein sequences via the CD-Search service, or via Batch CD-Search, which allows the computation and download of annotation for large sets of protein queries. CDD also contains data from additional research projects, such as KOGs (a eukaryotic counterpart to COGs) and the Library of Ancient Domains (LOAD). Accessions that start with cl are for superfamily cluster records and can contain domain models from one or more source databases. When searching CDD, it is possible to limit search results to domains from any given source database by using the Database Search Field.
Phylogenetic organization: Based on evidence from sequence comparison, NCBI Conserved Domain Curators attempt to organize related domain models into phylogenetic family hierarchies.
Links to electronic literature resources: NCBI curated domains also provide links to citations in PubMed and NCBI Bookshelf that discuss the domain. These references are selected by curators and, whenever possible, include articles that provide evidence for the biological function of the domain and/or discuss the evolution and classification of a domain family.
PFAM
Pfam 27.0 (March 2013, 14,831 families)
Proteins are generally comprised of one or more functional regions, commonly termed domains. The presence of different domains in varying combinations in different proteins gives rise to the diverse repertoire of proteins found in nature. Identifying the domains present in a protein can provide insights into the function of that protein.
The Pfam database is a large collection of protein domain families. Each family is represented by multiple sequence alignments and hidden Markov models (HMMs).
There are two levels of quality to Pfam families: Pfam-A and Pfam-B. Pfam-A entries are derived from the underlying sequence database, known as Pfamseq, which is built from the most recent release of UniProtKB at a given time-point. Each Pfam-A family consists of a curated seed alignment containing a small set of representative members of the family, profile hidden Markov models (profile HMMs) built from the seed alignment, and an automatically generated full alignment, which contains all detectable protein sequences belonging to the family, as defined by profile HMM searches of primary sequence databases.
Pfam-B families are un-annotated and of lower quality, as they are generated automatically from the non-redundant clusters of the latest ADDA release. Although of lower quality, Pfam-B families can be useful for identifying functionally conserved regions when no Pfam-A entries are found.
Pfam entries are classified in one of four ways:
Family: A collection of related protein regions
Domain: A structural unit
Repeat: A short unit which is unstable in isolation but forms a stable structure when multiple copies are present
Motifs: A short unit found outside globular domains
Related Pfam entries are grouped together into clans; the relationship may be defined by similarity of sequence, structure or profile-HMM.
Pfam also generates higher-level groupings of related families, known as clans. A clan is a collection of Pfam-A entries which are related by similarity of sequence, structure or profile-HMM.
You can find data in Pfam in various ways:
Analyze your protein sequence for Pfam matches
View Pfam family annotation and alignments
See groups of related families
Look at the domain organisation of a protein sequence
Find the domains on a PDB structure
Query Pfam by keywords
Enter any type of accession or ID to jump to the page for a Pfam family or clan, UniProt sequence, PDB structure, etc.
Browse Pfam
You can use the links below to find lists of families, clans or proteomes which begin with the chosen letter (or number). You can also see a list of Pfam families which are new to this release, or the list of the twenty largest families in terms of number of sequences.
Glossary of terms used in Pfam
These are some of the commonly used terms in the Pfam website
Alignment coordinates
HMMER3 reports two sets of domain coordinates for each profile HMM match. The envelope coordinates delineate the region on the sequence where the match has been probabilistically determined to lie, whereas the alignment coordinates delineate the region over which HMMER is confident that the alignment of the sequence to the profile HMM is correct. Our full alignments contain the envelope coordinates from HMMER3.
Architecture
The collection of domains that are present on a protein
Clan
A collection of related Pfam entries. The relationship may be defined by similarity of sequence, structure or profile-HMM.
Domain
A structural unit
Domain score
The score of a single domain aligned to an HMM. Note that, for HMMER2, if there was more than one domain, the sequence score was the sum of all the domain scores for that Pfam entry. This is not quite true for HMMER3.
DUF
Domain of unknown function
Envelope coordinates
See Alignment coordinates
Family
A collection of related protein regions
Full alignment
An alignment of the set of related sequences which score higher than the manually set threshold values for the HMMs of a particular Pfam entry.
Gathering threshold (GA)
Also called the gathering cut-off, this value is the search threshold used to build the full alignment. The gathering threshold is assigned by a curator when the family is built. The GA is the minimum score a sequence must attain in order to belong to the full alignment of a Pfam entry. For each Pfam HMM we have two GA cutoff values: a sequence cutoff and a domain cutoff.
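As a rough sketch of how the two GA cutoffs might be applied, the function below first checks the whole-sequence score against the sequence cutoff and then keeps only the domains that reach the domain cutoff. The glossary does not spell out how the two values are combined, so this two-step order is an assumption, as are the function name and the example numbers.

```python
def domains_passing_ga(sequence_score, domain_scores, seq_cutoff, dom_cutoff):
    """Return the domain scores that would be kept under the GA cutoffs.

    Assumption: a sequence must reach the sequence cutoff before any of its
    domains are considered, and each domain must then reach the domain cutoff.
    """
    if sequence_score < seq_cutoff:
        return []  # the sequence as a whole does not make the full alignment
    return [score for score in domain_scores if score >= dom_cutoff]
```

With hypothetical cutoffs of 25.0 (sequence) and 15.0 (domain), a sequence scoring 30.0 with domains scoring 18.0 and 12.0 keeps only the first domain, while a sequence scoring 20.0 keeps none.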
HMMER
The suite of programs that Pfam uses to build and search HMMs. Since Pfam release 24.0 we have used HMMER version 3 to make Pfam. See the HMMER site for more information.
Hidden Markov model (HMM)
An HMM is a probabilistic model. In Pfam we use HMMs to transform the information contained within a multiple sequence alignment into a position-specific scoring system. We search our HMMs against the UniProt protein database to find homologous sequences.
HMMER3
The suite of programs that Pfam uses to build and search HMMs. See the HMMER site for more information.
iPfam
A resource that describes domain-domain interactions that are observed in PDB entries. Where two or more Pfam domains occur in a single structure, it analyses them to see if they are close enough to form an interaction. If they are close enough, it calculates the bonds forming the interaction.
Metaseq
A collection of sequences derived from various metagenomics datasets
Motif
A short unit found outside globular domains
Noise cutoff (NC)
The bit scores of the highest scoring match not in the full alignment
Pfam-A
An HMM-based, hand-curated Pfam entry which is built using a small number of representative sequences. We manually set a threshold value for each HMM and search our models against the UniProt database. All of the sequences which score above the threshold for a Pfam entry are included in the entry's full alignment.
Pfam-B
An automatically generated alignment which is formed by taking a cluster of sequences from the ADDA database and removing Pfam-A residues from them. Since Pfam-B families are automatically generated, we recommend that you verify that the sequences in a Pfam-B family are related using other methods, such as BLAST. For Pfam 24.0 we have made HMMs for the first (and therefore largest) 20,000 Pfam-B families. Users can search their sequences against the Pfam-B HMMs in addition to the Pfam-A HMMs when performing both single-sequence searches and batch searches on the website.
Posterior probability
HMMER3 reports a posterior probability for each residue that matches a match or insert state in the profile HMM. A high posterior probability shows that the alignment of the amino acid to the match/insert state is likely to be correct, whereas a low posterior probability indicates that there is alignment uncertainty. This is indicated on a scale with 10 being the highest certainty, down to 1 being complete uncertainty. Within Pfam we display this information as a heat map view, where green residues indicate high posterior probability and red ones indicate a lower posterior probability.
Repeat
A short unit which is unstable in isolation but forms a stable structure when
multiple copies are present
Seed alignment
An alignment of a set of representative sequences for a Pfam entry. We use this alignment to construct the HMMs for the Pfam entry.
Sequence score
The total score of a sequence aligned to an HMM. If there is more than one domain, the sequence score is the sum of all the domain scores for that Pfam entry. If there is only a single domain, the sequence and the domain scores for the protein will be identical. We use the sequence score to determine whether a sequence belongs to the full alignment of a particular Pfam entry.
Trusted cutoff (TC)
The bit scores of the lowest scoring match in the full alignment
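The relationship between the gathering threshold, the trusted cutoff and the noise cutoff can be illustrated with a small sketch. It assumes that a match belongs to the full alignment exactly when its bit score is at least the gathering threshold; the function name and the example scores are hypothetical.

```python
def trusted_and_noise_cutoffs(bit_scores, gathering_threshold):
    """Return (TC, NC) for a list of match bit scores.

    TC is the lowest score among matches at or above the gathering threshold;
    NC is the highest score among matches below it. Either may be None when
    the corresponding group is empty.
    """
    included = [s for s in bit_scores if s >= gathering_threshold]
    excluded = [s for s in bit_scores if s < gathering_threshold]
    trusted = min(included) if included else None
    noise = max(excluded) if excluded else None
    return trusted, noise
```

For hypothetical scores 55.2, 31.0, 27.4, 19.8 and 8.1 with a gathering threshold of 25.0, the trusted cutoff is 27.4 and the noise cutoff is 19.8, so the gathering threshold sits between the two.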
SMART (a Simple Modular Architecture Research Tool) allows the identification and annotation of genetically mobile domains and the analysis of domain architectures. More than 500 domain families found in signalling, extracellular and chromatin-associated proteins are detectable. These domains are extensively annotated with respect to phyletic distributions, functional class, tertiary structures and functionally important residues. Each domain found in a non-redundant protein database, as well as search parameters and taxonomic information, are stored in a relational database system. User interfaces to this database allow searches for proteins containing specific combinations of domains in defined taxa.
Features
You can use SMART in two different modes: normal or genomic. The main difference is in the underlying protein database used. In Normal SMART, the database contains Swiss-Prot, SP-TrEMBL and stable Ensembl proteomes. In Genomic SMART, only the proteomes of completely sequenced genomes are used: Ensembl for metazoans and Swiss-Prot for the rest.
The protein database in Normal SMART has significant redundancy, even though identical proteins are removed. If you use SMART to explore domain architectures, or want to find exact domain counts in various genomes, consider switching to Genomic mode. The numbers in the domain annotation pages will be more accurate, and there will not be many protein fragments corresponding to the same gene in the architecture query results. We should remember that we are exploring a limited set of genomes, though.
Different color schemes are used to easily identify the mode we are in.
Alignment
Representation of a prediction of the amino acids in tertiary structures of homologues that overlay in three dimensions. Alignments held by SMART are mostly based on published observations (see domain annotations for details), but are updated and edited manually.
Alignment block
Ungapped alignments that usually represent a single secondary structure
Bits scores
Alignment scores are reported by HMMer and BLAST as bits scores. The likelihood that the query sequence is a bona fide homologue of the database sequence is compared to the likelihood that the sequence was instead generated by a random model. Taking the logarithm (to base 2) of this likelihood ratio gives the bits score.
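The definition above is a base-2 log-likelihood ratio, which can be written out directly. The sketch below only illustrates the arithmetic; it does not show how HMMer or BLAST compute the two likelihoods, and the function and parameter names are hypothetical.

```python
import math


def bits_score(likelihood_homologue, likelihood_random):
    """Base-2 logarithm of the likelihood ratio (homologue model vs random model)."""
    if likelihood_homologue <= 0 or likelihood_random <= 0:
        raise ValueError("likelihoods must be positive")
    return math.log2(likelihood_homologue / likelihood_random)
```

A sequence that is eight times more likely under the homologue model than under the random model scores 3 bits; equal likelihoods give 0 bits, and a negative score means the random model explains the sequence better.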
BLAST Basic local alignment search tool
An excellent database searching tool developed at the National Center for Biotechnology Information (NCBI) ([1] [2] [3] [4]). SMART uses NCBI-BLAST for detection of outlier homologues and homologues of known structure. WU-BLAST is used for nrdb searches with user-supplied sequences.
Cellular Role
Chromatin-associated: Chromatin is the tangled fibrous complex of DNA and protein within a eukaryotic nucleus.
Interaction (with the environment): Molecules that sense cellular environmental change, such as osmolarity, light flux, acidity, ion concentration, etc.
Metabolic: Enzymes that catalyze reactions in living cells that transform organic molecules.
Replication: The process of making an identical copy of a section of duplex (double-stranded) DNA, using existing DNA as a template for the synthesis of new DNA strands.
Signalling: Proteins that participate in a pathway that is initiated by a molecular stimulus and terminated by a cellular response.
Transport: Transporters (permeases) are proteins that straddle the cell membrane and carry specific nutrients, ions, etc. across the membrane.
Translation: The process in which the genetic code carried by messenger RNA directs the synthesis of proteins from amino acids.
Transcription: The synthesis of an RNA copy from a sequence of DNA (a gene); the first step in gene expression.
Coiled coils
Intimately associated bundles of long alpha-helices ([1] [2] [3]). Coiled coils are detected in SMART using the method of Lupas et al. ([4], COILS home). Coiled coil predictions are indicated on the second line in SMART's graphical output.
Domain
Conserved structural entities with distinctive secondary structure content and a hydrophobic core. In small disulphide-rich and Zn2+-binding or Ca2+-binding domains, the hydrophobic core may be provided by cystines and metal ions, respectively. Homologous domains with common functions usually show sequence similarities.
Domain composition
Proteins with the same domain composition have at least one copy of each of the domains of the query.
Domain organisation
Proteins having all the domains of the query in the same order. (Additional domains are allowed.)
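The difference between domain composition and domain organisation can be expressed as two small checks on ordered domain lists. This is a sketch of the definitions as stated, not SMART's implementation; in particular, it assumes that additional domains may appear anywhere, including between query domains. The function names are hypothetical.

```python
def same_composition(query, candidate):
    """True if the candidate has at least one copy of every query domain."""
    return set(query) <= set(candidate)


def same_organisation(query, candidate):
    """True if the query domains occur in the candidate in the same order.

    Extra candidate domains are allowed (assumed: at any position), so this
    is a subsequence test. `in` on an iterator consumes it up to the match,
    which enforces the left-to-right order.
    """
    remaining = iter(candidate)
    return all(domain in remaining for domain in query)
```

For a query of `["SH3", "SH2"]`, a candidate `["SH3", "PH", "SH2"]` satisfies both checks, whereas `["SH2", "SH3"]` has the same composition but not the same organisation.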
Entrez
A WWW-based system that allows easy retrieval of sequence, structure, molecular biology and literature data (Entrez). SMART's domain annotation pages contain links to the Entrez system, thereby providing extensive literature, structure and sequence information.
E-value
This represents the number of sequences with a score greater than or equal to X expected absolutely by chance. The E-value connects the score (X) of an alignment between a user-supplied sequence and a database sequence, generated by any algorithm, with how many alignments with similar or greater scores would be expected from a search of a random sequence database of equivalent size. Since version 2.0, E-values are calculated using Hidden Markov Models, leading to more accurate estimates than before.
Extracellular Domains
Domain families that are most prevalent in proteins outside of the cytoplasm and the nucleus
Gap
A position in an alignment that represents a deletion within one sequence relative to another. Gap penalties are requirements for alignment algorithms in order to reduce excessively gapped regions. Gaps in alignments represent insertions that usually occur in protruding loops or beta-bulges within protein structures.
Genomic database
Protein database used in SMART's Genomic mode. It contains data from completely sequenced genomes only. Ensembl data is used for metazoan genomes and Swiss-Prot for others. A complete list of genomes in the database is available.
HMM Hidden Markov model
HMMs are statistical models of the sequence consensus of a homologous family (see the documentation). A particular class of HMMs has been shown to be equivalent to generalised profiles (8867839). Applications of HMMs to sequence analysis are nicely provided by HMMer and SAM.
HMM consensus
The HMM consensus is a one-line summary of the corresponding HMM. The amino acid shown for the consensus is the highest-probability amino acid at that position according to the HMM. Capital letters mean highly conserved residues (probability > 0.5 for protein models) (modified from the HMMer User's Guide).
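The consensus rule described above can be sketched as follows. The input format (one dict of residue probabilities per position) and the example probabilities are assumptions made for illustration; this is not how HMMer stores its models.

```python
def hmm_consensus(position_probabilities, threshold=0.5):
    """Build a one-line consensus string from per-position residue probabilities.

    The most probable residue is reported at each position, in upper case when
    its probability exceeds the threshold and in lower case otherwise.
    """
    letters = []
    for probabilities in position_probabilities:
        residue, probability = max(probabilities.items(), key=lambda item: item[1])
        letters.append(residue.upper() if probability > threshold else residue.lower())
    return "".join(letters)


# Hypothetical three-position model.
EXAMPLE_MODEL = [
    {"G": 0.9, "A": 0.1},
    {"K": 0.4, "R": 0.35, "Q": 0.25},
    {"S": 0.6, "T": 0.4},
]
```

Here `hmm_consensus(EXAMPLE_MODEL)` gives `GkS`: the first and third positions are highly conserved, while the second is only weakly so.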
HMMer
The HMMer package ([1] [2]) provides multiple alignment and database searching capabilities. There are several programs in the package (see the documentation), including one (hmmfs) that searches databases for non-overlapping LOCAL similarities (i.e. that match across at least part of the HMM) and another (hmmls) that searches databases for non-overlapping GLOBAL similarities (i.e. that match across the full HMM). These correspond approximately to profile-based searches using negative and positive profiles, respectively (see WiseTools). Database searches using hmmls or hmmfs provide alignment scores as bits scores.
Homology
Evolutionary descent from a common ancestor due to gene duplication
Intracellular Domains
Domain families that are most prevalent in proteins within the cytoplasm
Localisation
Numbers of domains that are thought, from SwissProt annotations, to be present in different cellular compartments (cytoplasm, extracellular space, nucleus and membrane-associated) are shown in annotation pages.
Motif
Sequence motifs are short conserved regions of polypeptides. Sets of sequence motifs need not necessarily represent homologues.
NRDB non-redundant database
A database that contains no identical pairs of sequences. It can contain multiple sequences originating from the same gene (fragments, alternative splicing products). SMART in Standard mode uses such a database.
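A minimal sketch of the non-redundancy rule: only exact duplicates are removed, so fragments of the same protein survive. The record format and function name are hypothetical, and treating sequences case-insensitively is an assumption.

```python
def make_non_redundant(records):
    """Keep the first record for each distinct sequence.

    `records` is a list of (name, sequence) pairs. Only identical sequences
    are collapsed; a fragment of a longer sequence is a different string and
    is therefore retained.
    """
    seen = set()
    unique = []
    for name, sequence in records:
        key = sequence.upper()  # assumed: case differences are not meaningful
        if key not in seen:
            seen.add(key)
            unique.append((name, sequence))
    return unique
```

Given `[("a", "MKT"), ("b", "MKT"), ("c", "MK")]`, the result keeps `a` and `c`. The fragment `MK` stays in the database, which is why a non-redundant database can still overcount domains per gene.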
ORF
Open reading frame
Outlier homologues
These are often difficult to detect using HMM methodology. A complementary approach to their detection is to query a database of sequences taken from multiple sequence alignments using BLAST. Selecting this option will also activate searches against sequence databases derived from proteins of known structure. A simple BLAST search of the PDB is performed, together with a search of RPS_Blast profiles derived from SCOP. These profiles were kindly provided by Steffen Schmidt (see Schmidt et al., J Chem Inf Comput Sci 2002 (42) 405-7).
P-value
This represents the probability that, given a database of a particular size, random sequences score higher than a value X. P-values are generated by the BLAST algorithm that has been integrated into SMART.
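E-values and P-values describe the same chance expectation in two ways. The notes do not give a conversion formula; the sketch below uses the relation P = 1 - exp(-E) commonly quoted for BLAST statistics, which assumes that the number of chance hits follows a Poisson distribution. The function names are hypothetical.

```python
import math


def p_value_from_e_value(e_value):
    """Probability of at least one chance hit, assuming Poisson-distributed hits."""
    if e_value < 0:
        raise ValueError("E-value must be non-negative")
    return -math.expm1(-e_value)  # 1 - exp(-E), computed accurately for small E


def e_value_from_p_value(p_value):
    """Inverse of the relation above: E = -ln(1 - P)."""
    if not 0 <= p_value < 1:
        raise ValueError("P-value must be in [0, 1)")
    return -math.log1p(-p_value)
```

For small values the two are nearly equal (an E-value of 0.01 corresponds to a P-value of about 0.00995), but they diverge as E grows because a probability cannot exceed 1.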
PDB protein data bank
PDB is an archive of experimentally determined three-dimensional structures (Brookhaven National Labs or EBI). Domain families represented in SMART and in the PDB are annotated as being of known structure; links are provided in SMART to the PDB via PDBsum and MMDB. PDBsum links can be used to access a variety of sequence-based and structure-based tools, whereas MMDB provides access to literature information and structure similarities.
PFAM
Pfam is a database of protein domain families represented as (i) multiple alignments and (ii) HMM-profiles ([1] [2]). Pfam WWW servers allow comparison of user-supplied sequences with the Pfam database (Sanger Center and Washington Univ). SMART contains a facility to search the Pfam collection using HMMer.
Prokaryotic domains
SMART now also searches for domains found in two-component regulatory systems. These can be found mainly in prokaryotes, but a few were also found in eukaryotes like yeast and plants.
Profile
A profile is a table of position-specific scores and gap penalties, representing a homologous family, that may be used to search sequence databases (Ref [1] [2] [3]). In CLUSTAL-W-derived profiles, those sequences that are more distantly related are assigned higher weights ([4] [5] [6]). Issues in profile-based database searching are discussed in Bork & Gibson (1996) [7].
ProfileScan
An excellent WWW server that allows a user to compare a protein or DNA sequence against a database of profiles (located at the ISREC).
PROSITE
This is a dictionary of protein sites and motif patterns. Some SMART domain annotations contain links to PROSITE.
Schnipsel database (domain sequence database)
Schnipsel is a German word meaning snippet or fragment. The schnipsel database consists of the sequences of all domains found with SMART in NRDB. Outliers of a family often cannot be detected by a profile, yet are detectable by pairwise similarity to one or more established members of a sequence family. Searching against the schnipsel database therefore gives information complementary to the profile searches.
Searched domains
In the first version of SMART only eukaryotic signalling domains could be searched. In 1998 we extended this set with prokaryotic signalling and extracellular domains. On the input page you can choose whether to search only cytoplasmic (prokaryotic + eukaryotic), only extracellular, or all domains.
Secondary Literature
The secondary literature is derived by the following procedure. For each of the hand-selected papers referenced by a domain, 100 neighbouring papers are retrieved using Medline. If one of these neighbouring papers is referenced from more than two original papers, it is included in the secondary literature list.
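The selection rule above can be sketched as a simple count. This is a minimal illustration under stated assumptions: the function name and input layout are hypothetical, the neighbour lists are assumed to have been retrieved already, and excluding the hand-selected papers themselves from the result is an assumption not stated in the source.

```python
from collections import Counter

def secondary_literature(neighbours_by_paper, min_sources=3):
    """Return neighbouring papers reached from at least `min_sources`
    hand-selected papers ("more than two" in the text, hence 3).

    `neighbours_by_paper` maps each hand-selected paper ID to the list of
    its neighbouring paper IDs.
    """
    counts = Counter()
    for source, neighbours in neighbours_by_paper.items():
        # Count each neighbour once per source paper.
        for paper in set(neighbours) - {source}:
            counts[paper] += 1
    originals = set(neighbours_by_paper)  # assumption: originals are not re-listed
    return sorted(paper for paper, n in counts.items()
                  if n >= min_sources and paper not in originals)

example = {"p1": ["n1", "n2"], "p2": ["n1", "n2", "p1"], "p3": ["n1"]}
print(secondary_literature(example))
```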
Seed Alignment
Alignment that contains only one of each pair of homologues that are represented in a CLUSTALW-derived phylogenetic tree linked by a branch of length less than a distance of 0.2 (see the related article).
SEG
A program of Wootton & Federhen [1] that detects regions of the query sequence that have low compositional complexity [2].
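What "low compositional complexity" means can be illustrated with a sliding-window Shannon entropy. This is a simplified stand-in and not the SEG algorithm itself; the function names, the window width and the cutoff are assumptions chosen for illustration.

```python
import math
from collections import Counter

def shannon_entropy(window):
    """Shannon entropy (bits per residue) of the residue composition."""
    counts = Counter(window)
    n = len(window)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def low_complexity_windows(seq, width=12, cutoff=2.2):
    """Start positions of windows whose entropy is at or below `cutoff`.

    width and cutoff are illustrative defaults (assumptions).
    """
    return [start for start in range(len(seq) - width + 1)
            if cutoff >= shannon_entropy(seq[start:start + width])]

print(low_complexity_windows("QQQQQQQQQQQQ"))   # a homopolymer run
print(low_complexity_windows("ACDEFGHIKLMN"))   # twelve different residues
```

A window dominated by one or two residue types has entropy near zero, whereas a window of twelve distinct residues reaches log2(12), about 3.58 bits.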
Sequence ID or ACC
Sequence identifiers or accession codes may be entered via the SMART homepage to initiate a query. You can use either Uniprot or Ensembl sequence identifiers.
Signalling Domains
The original set of domains used in SMART were collected as those that satisfied one or both of two criteria:
1. cytoplasmic domains that possess kinase, phosphatase, ubiquitin ligase or phospholipase enzymatic activities, or those that stimulate GTPase activation or guanine nucleotide exchange
2. cytoplasmic domains that occur in at least two proteins with different domain organisations, of which one also contains a domain that satisfies criterion 1
These domains mostly mediate or regulate the transduction of an extracellular signal towards the nucleus, resulting in the initiation of a cellular response. More recently, prokaryotic two-component signalling domains have been added to the SMART set.
SignalP
This program predicts the presence and location of signal peptide cleavage sites in amino acid sequences (SignalP home page).
SMART (Simple Modular Architecture Research Tool)
Our main goal in providing this tool is to allow automatic identification and annotation of domains in user-supplied protein sequences.
Species
Numbers of domains present in a variety of selected taxa (animal, archaea, bacteria, fungi, plants and protozoa) are shown in annotation pages.
SwissProt
The SwissProt database is an extensively annotated and non-redundant collection of protein sequences. SwissProt annotations have been mined for SMART-derived annotations of alignments.
TMHMM2
This program predicts the location and topology of transmembrane helices (TMHMM2)
Thresholds
For each of the domains found by SMART, a combination of thresholds is used to distinguish between true and false hits. The different thresholds are described in the SMART paper.
WiseTools
A package that is based on database searches using profiles. Profiles may be generated using PairWise and then compared with sequence databases using SearchWise. Scores are generated for alignments that match the whole of the profile (using a positive profile) or else that match at least part of the profile (using a negative profile). Only the top-scoring optimal alignment of each sequence is reported; hence SMART relies on re-iterating the search for new repeats until none are reported that score above threshold. Score thresholds have been set manually and are considered to represent a score just above the top-scoring true negative. Additional thresholds have been estimated for domains that are repeated in single polypeptides. A more recent package allows comparison of DNA sequences at the level of their conceptual translations, regardless of sequence error and introns (see Wise2).
What's new
Changes from version 6.0 to 7.0
Full text search engine
Perform a full text search of SMART and Pfam domain annotations, plus the complete protein descriptions for Uniprot and Ensembl proteins.
metaSMART
Explore and compare domain architectures in various publicly available metagenomics datasets.
iTOL export and visualization
Domain architecture analysis results can be exported and visualized in interactive Tree Of Life. The new option can be found in the protein list function select list.
User interface cleanup
Various small changes to the UI, resulting in faster and easier navigation.
Changes from version 5.1 to 6.0
Metabolic pathways information
SMART domains and proteins available in the genomic mode now have basic metabolic pathways information. It is generated by mapping our genomic mode protein database to the KEGG orthologous groups.
Updated genomic mode protein database
The greatly expanded genomic mode database now includes sequences from 630 completely sequenced species.
Changes from version 5.0 to 5.1
SMART webservice
You can access SMART using our webservice. Check the WSDL file for details. The webservice is still under active development and only works for sequences and IDs which are in our database (complete Uniref100 and stable Ensembl genomes).
SMART DAS server
The SMART DAS server is available at http://smart.embl.de/smart/das. It provides all the protein features from our database (SMART domains, signal peptides, transmembrane regions and coiled coils) for all Uniref100 and Ensembl proteins.
If you need help with these services or have questions/feedback, please contact us.
Changes from version 4.1 to 5.0
New protein database in Normal mode
SMART now uses Uniprot as the main source of protein sequences. All Ensembl proteomes (except pre-releases) are also included. To lower redundancy in the database, the following procedure is used:
o only one copy of 100% identical proteins is kept (different IDs are still available)
o each species' proteins are separated
o CD-HIT clustering with a 96% identity cutoff is performed on each species separately
o the longest member of each protein cluster is used as the representative
o only representative cluster members and single proteins (i.e. proteins which are not members of any cluster) are used in all domain architecture queries and for domain counts in the annotation pages
Even though the number of proteins in the database has almost doubled (the current version has around 2.9 million proteins), the redundancy should be minimal.
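Two steps of this redundancy reduction, collapsing identical sequences and choosing the longest cluster member, can be sketched as below. The clustering itself (CD-HIT) is not reproduced here; the function names and data layout are hypothetical and the clusters are assumed to be given.

```python
from collections import defaultdict

def collapse_identical(proteins):
    """Group protein IDs that share exactly the same sequence.

    `proteins` is an iterable of (protein_id, sequence) pairs. One copy of
    each sequence is kept as the key; all of its IDs remain available.
    """
    ids_by_sequence = defaultdict(list)
    for protein_id, sequence in proteins:
        ids_by_sequence[sequence].append(protein_id)
    return dict(ids_by_sequence)

def pick_representatives(clusters, lengths):
    """Return the longest member of each cluster as its representative."""
    return [max(cluster, key=lambda pid: lengths[pid]) for cluster in clusters]

proteins = [("A1", "MKV"), ("A2", "MKV"), ("B1", "MKVLT")]
print(collapse_identical(proteins))

clusters = [["x1", "x2", "x3"], ["y1"]]
lengths = {"x1": 210, "x2": 340, "x3": 335, "y1": 98}
print(pick_representatives(clusters, lengths))
```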
Domain architecture invention dating
As a further step from the single domain to the understanding of multi-domain proteins, SMART now predicts the taxonomic class where the concept of a protein, that is its domain architecture, was invented. The domain architecture is defined as the linear order of all SMART domains in the protein sequence. To derive the point of its invention, all proteins with the same domain architecture are mapped onto NCBI's taxonomy. The last common ancestor of all organisms containing at least one protein with the domain architecture is defined as the point of its origin.
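Finding the last common ancestor amounts to taking the deepest node shared by all lineages. The sketch below assumes each organism's lineage is already available as a root-to-leaf list of taxon names; the function name and the example lineages are illustrative, not SMART's implementation.

```python
def last_common_ancestor(lineages):
    """Return the deepest taxon shared by all root-to-leaf lineages,
    or None if there is no shared taxon (or no lineage at all)."""
    lineages = list(lineages)
    if not lineages:
        return None
    shared = None
    for ranks in zip(*lineages):          # walk down all lineages in parallel
        if all(rank == ranks[0] for rank in ranks):
            shared = ranks[0]
        else:
            break
    return shared

# Organisms containing a protein with the same (hypothetical) architecture.
lineages = [
    ["cellular organisms", "Eukaryota", "Metazoa", "Chordata"],
    ["cellular organisms", "Eukaryota", "Fungi"],
]
print(last_common_ancestor(lineages))
```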
Changes from version 4.0 to 4.1
Two modes of operation: Normal and Genomic
For more details visit the change mode page.
Intrinsic protein disorder prediction
You can now include DisEMBL prediction of protein disorder in your search parameters. The results are displayed as blue regions in protein schematics. DisEMBL's HOTLOOPS and REM465 methods are currently used. Visit the DisEMBL page for more information on the method.
Catalytic activity check
SMART now includes data on specific requirements for catalytic activity for some of our domains (50 catalytic domains at the moment). If the required amino acids are not present in the predicted domain, it will be marked as Inactive. The domain annotation page will show you the details on which amino acids are missing and links to the relevant literature in PubMed.
Taxonomic trees
Architecture query results are now displayed as simple taxonomic trees. In addition to individual proteins, you can select any taxonomic node (or multiple nodes) and display all the proteins in those nodes. The Evolution section of domain annotation pages is now also represented as a taxonomic tree. The basic tree contains only several hand-picked representative species, with a link to the full tree.
User interface redesigned
SMART has been completely rewritten and all pages conform to the XHTML 1.0 Strict and CSS Level 2 standards. We recommend a modern standards-compliant browser for the best experience. Mozilla Firefox and Opera are our favorites.
Changes from version 3.5 to 4.0
Intron positions are shown in protein schematics
For proteins that match any of the Ensembl predictions, SMART will show intron positions as vertical coloured lines in graphical representations. This information is retrieved from a pre-calculated mapping of Ensembl gene structures to protein sequences.
The vertical line at the end of the protein is not an actual intron but a mark to show that intron mapping was performed. If that is the only line, there are no introns annotated. If there is no line at all, there is no data available in Ensembl for that particular sequence.
You can switch off intron display on your SMART preferences page.
Alternative splicing information
Since SMART now incorporates Ensembl genomes, the Additional information page shows a list of alternative splices of the gene encoding the analyzed protein (if there are any). It is possible either to display the SMART protein annotation for any of the alternative splices or to get a graphical multiple sequence alignment of all of them.
Orthology information
There are 2 separate sets of orthologs for each Ensembl protein: 1:1 reciprocal best matches in other genomes, and orthologous groups with reciprocal best hits from all genomes analyzed (i.e. each of these proteins has exactly one ortholog in all 6 genomes). This data is displayed on the Additional information page.
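The notion of a 1:1 reciprocal best match can be sketched as follows: a pair counts only if each protein is the other's best hit. The function name and the best-hit tables are hypothetical inputs (for example, precomputed from pairwise searches between two genomes); this is not the procedure SMART itself documents.

```python
def reciprocal_best_hits(best_a_to_b, best_b_to_a):
    """Return sorted (a, b) pairs where a's best hit in genome B is b
    and b's best hit in genome A is a."""
    return sorted((a, b) for a, b in best_a_to_b.items()
                  if best_b_to_a.get(b) == a)

# Hypothetical best-hit tables between two genomes.
human_to_mouse = {"h1": "m1", "h2": "m2", "h3": "m2"}
mouse_to_human = {"m1": "h1", "m2": "h3"}
print(reciprocal_best_hits(human_to_mouse, mouse_to_human))
```

Here h2 is dropped because its best hit m2 points back to h3 rather than to h2.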
Graphical multiple sequence alignments
Orthologous groups and different alternative splices can be displayed as graphical multiple sequence alignments. Proteins are aligned using ClustalW. Domains, intrinsic features and introns are mapped onto the alignment with their positions adjusted according to gaps (black boxes).
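Adjusting feature positions according to gaps can be sketched like this: each ungapped residue index is translated to its column in the aligned row. The function names are illustrative, and positions are assumed to be 0-based and inclusive.

```python
def residue_columns(aligned_row, gap="-"):
    """Alignment column of every residue in the row, in sequence order."""
    return [col for col, char in enumerate(aligned_row) if char != gap]

def map_feature(aligned_row, start, end):
    """Translate a feature given in ungapped sequence coordinates
    (0-based, inclusive) into alignment columns."""
    columns = residue_columns(aligned_row)
    return columns[start], columns[end]

row = "MK--LV-A"                      # ungapped sequence is MKLVA
print(map_feature(row, 2, 4))         # feature covering residues L..A
```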
Changes from version 3.4 to 3.5
Features of all Ensembl genomes are stored in SMART
You can use standard Ensembl protein identifiers (for example ENSP00000264122) and sequences in all queries.
Batch access
Quickly display results for hundreds of sequences. Currently limited to 500 sequences per access, 2000 per day.
Smarter handling of identical sequences
If there are multiple IDs associated with a particular sequence, an extra table will be displayed showing all of them.
Changes from version 3.3 to 3.4
Search structure-based profiles using RPS-Blast
Clicking on the search schnipsel and structures checkbox will now also initiate a search of profiles based on SCOP domain families using RPS-Blast. These profiles were kindly provided by Steffen Schmidt (see Schmidt et al., J. Chem. Inf. Comput. Sci. 2002 (42): 405-7).
Improved Architecture analysis
In addition to standard Domain selection querying, it is now possible to do queries based on GO (Gene Ontology) terms associated with domains. In the first step you get a list of domains matching the GO terms entered. After selecting the domains of interest from the list, proteins containing those domains are displayed. Use the Taxonomic selection box to limit the results based on taxonomic ranges.
Pfam domains are stored in the database
The SMART database now contains precomputed results for all Pfam domains. To use Pfam domains in the architecture queries, prepend the domain name with Pfam: (for example TyrKc AND Pfam:Fz AND TRANS).
Try our SMART Toolbar for the Mozilla web browser.
Changes from version 3.2 to 3.3
Fantastic new protein picture generator
Proteins are now displayed as dynamically generated PNG images. This means that you can download the entire protein representation as a single image. You have our permission to do this and use these diagrams in any way that you like; do acknowledge us, though. All domain bubbles have been script-generated using The Gimp and its Perl-Fu extension. The script is GPL'd and is available for download.
Fancy colour alignments
All alignments we generate are coloured in using Leo Goodstadt's excellent CHROMA program. CHROMA is available online and it's great. You may experience some problems if you're using a klunky browser. This will be fixed when you change your browser.
Excellent transmembrane domains
These are now calculated using the fine TMHMM2 program, kindly provided by Anders Krogh and co-workers.
Selective SMART
We now store intrinsic features such as transmembrane domains and signal peptides in our database. This means that these can be queried for (e.g. SIGNAL AND TyrKc). The feature names are: signal peptide, SIGNAL; transmembrane domain, TRANS; coiled coil, COIL.
Technical changes
SMART code has been modified to run under the Apache mod_perl module. SMART is now using Apache::DBI for persistent database connections. The database engine has been updated to PostgreSQL 7.1. These changes resulted in significant speed improvements.
Bug fixes, as usual
The show_many_proteins script now uses the POST method, so there is no longer a limit on the number of proteins you can display. Fixed a display problem with proteins having thousands of different representations (if you try to display those, make sure you have a good browser, like Mozilla or Opera, and a bunch of RAM).
Changes from version 3.1 to 3.2
Start page
There is now an indicator of the current database status in the header. If the database is down or is being updated, you'll know immediately.
Literature
Changed literature identifiers to PMID. New secondary literature generator that parses all neighbouring papers, not just the first 100.
Numerous small bug fixes and improvements.
Changes from version 3.0 to 3.1
Startup page
The start page now includes selective SMART and allows searching for keywords in the annotation of domains.
Schnipsel Blast
The results of a schnipsel Blast search are included in the bubble diagram. Additionally, the PDB is searched.
Taxonomic breakdown
When selecting multiple proteins, e.g. via selective SMART or the annotation pages, an overview of the taxonomy of all species (tax break) is offered.
Links
Links don't contain version numbers. This allows stable links from external sources.
Selective SMART
Allows searching for multiple copies of domains and is case-insensitive. You can now search for e.g. Sh3 AND sH3 AND sh3.
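The behaviour described for selective SMART queries (AND-combined terms, case-insensitive names, repeated terms asking for multiple copies) can be sketched as below. This is a toy matcher written for illustration only; it handles just the AND operator seen in the examples, and the function name and inputs are hypothetical.

```python
import re
from collections import Counter

def matches_query(query, domains):
    """True if the protein's domain list satisfies an AND-only query.

    Names are compared case-insensitively, and a term repeated n times
    requires at least n copies of that domain or feature.
    """
    terms = re.split(r"\s+AND\s+", query.strip())
    required = Counter(term.lower() for term in terms)
    present = Counter(domain.lower() for domain in domains)
    return all(present[name] >= copies for name, copies in required.items())

print(matches_query("Sh3 AND sH3 AND sh3", ["SH3", "SH3", "SH3", "TyrKc"]))
print(matches_query("Sh3 AND sH3 AND sh3", ["SH3", "SH3", "TyrKc"]))
print(matches_query("SIGNAL AND TyrKc", ["SIGNAL", "TyrKc", "TRANS"]))
```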
Annotation
You can align your query sequence to the SMART alignment using hmmalign.
Update of underlying database
SMART now uses PostgreSQL 6.5.2.
Changes from version 2.0 to 3.0
Digest output
SMART now only produces a single diagram representing a best interpretation of all the annotation that has been performed. A comprehensive summary of the results is also provided in table format.
Selective SMART
Selective SMART allows you to look for proteins with combinations of specific domains in different species or taxonomic ranges.
Alert SMART
The SMART database gets updated about once a week. If you are interested in specific domains, or combinations of them, in specific taxonomic ranges, you can use the SMART alerting service. This provides the identities of newly deposited proteins that match your query.
Domain queries
You can ask for proteins having the same domain order or composition as your query protein.
SMARTed Genomes
You can get the result of SMARTing the genomes of Caenorhabditis elegans and Saccharomyces cerevisiae via the annotation pages.
Faster PFAM searches
The PFAM searches now run on a PVM cluster.
Changes from version 1.03 to 2.0
Domain coverage
The original set of signalling domains has now been extended to include extracellular domains.
Default search method is now HMMer
The previously used underlying methodology of SMART (i.e. SWise) tended to over-extend gaps, leading to problems in defining domain borders. Although a SWise search of the SMART database remains possible (using the Wise searching SMART database option available from the Home Page), the default searching method is now HMMer2. This allows improved statistical estimates (Expectation or E-values) of the significance of a domain hit.
Complete rewrite of the annotation pages
As with the previous version (1.03), annotation pages provide information concerning domain functions and include hyperlinks to PubMed. However, we now offer automatically derived data showing the taxonomic range and the predicted cellular localisation of proteins containing the domain in question.
Literature database
In addition to the manually derived literature sources given in version 1.03, we now provide automatically derived secondary literature from the annotation pages. These are extracted from PubMed and are cross-linked to additional abstracts.
The CATH domain structure database: new protocols and classification levels give a more comprehensive resource for exploring evolution
Lesley H. Greene, Tony E. Lewis, Sarah Addou, Alison Cuff, Tim Dallman, Mark Dibley, Oliver Redfern, Frances Pearl, Rekha Nambudiry, Adam Reid, Ian Sillitoe, Corin Yeats, Janet M. Thornton and Christine A. Orengo
WHAT'S NEW
We report the latest release (version 3.0) of the CATH protein domain database (http://www.cathdb.info). There has been a 20% increase in the number of structural domains classified in CATH, up to 86 151 domains. Release 3.0 comprises 1110 fold groups and 2147 homologous superfamilies. To cope with the increases in diverse structural homologues being determined by the structural genomics initiatives, more sensitive methods have been developed for identifying boundaries in multi-domain proteins and for recognising homologues. The CATH classification update is now being driven by an integrated pipeline that links these automated procedures with validation steps that have been made easier by the provision of information-rich web pages summarising comparison scores and relevant links to external sites for each domain being classified. An analysis of the population of domains in the CATH hierarchy and several domain characteristics are presented for version 3.0. We also report an update of the CATH Dictionary of homologous structures (CATH-DHS), which now contains multiple structural alignments, consensus information and functional annotations for 1459 well-populated superfamilies in CATH. CATH is directly linked to the Gene3D database, which is a projection of CATH structural data onto ~2 million sequences in completed genomes and UniProt.
GENERAL INTRODUCTION
The number of new structures being deposited in the Protein Data Bank (PDB) continues to grow at a considerable rate. In addition, structures being targeted by worldwide structural genomics initiatives are more likely to be novel or only very remotely related to domains previously classified. Only 2% of structures currently solved by conventional crystallography or NMR are likely to adopt novel folds (see Figures 1 and 2). A higher proportion of new folds is expected among structures solved by structural genomics. Although the influx of more diverse structures and subsequent analysis will inform our understanding of how domains evolve, it has resulted in increasing lags between the numbers of structures being deposited and classified. In response to this situation we have significantly improved our automated and manual protocols for domain boundary assignment and homologue recognition.
Figure 1. Annual decrease in the percentage of new structures classified in CATH which are observed to possess a novel fold. The raw data for years 1972-2005 was fit to a single exponential equation by nonlinear regression using Sigma Plot (SPSS Version
Figure 2. Annual proportion of protein structures deposited in the PDB which are classified in CATH, rejected or pending classification. The colour scheme reflects different categories of PDB chains. Black: not accepted by the CATH criteria. Red: unprocessed chains
Significant changes have been implemented in the CATH classification protocol to achieve a more highly automated system. A seamless flow of structures between the constituent programs has been achieved by building a pipeline which integrates web services for each major comparison stage in the classification (see Figure 3). Secondly, completely automatic decisions are now being made for new protein chains with close relatives already assigned in the CATH database. There are two situations that preclude the CATH update process from being fully automated. We rely on expert manual curation for particularly challenging protein domain boundary assignments (DomChop stage) and also for classifications of remote folds and homologues (HomCheck stage). These two manual stages will remain an integral part of the system (Figure 3).
Figure 3. Flow diagram of the CATH classification pipeline. This schematic illustrates the processes involved in classifying newly determined structures in CATH. The CATH update protocol workflow, from new chain to assigned domain, is split into two main processes
In this paper we report our ongoing development of the automated procedures. These critical new features should better enable CATH to keep pace with the PDB (3) and facilitate its development. Key statistics on the domain structure populations and characteristics are also presented.
A REVISED CATH CLASSIFICATION HIERARCHY: CATHSOLID
In order to provide information on the sequence diversity between superfamily members, we have introduced additional levels into the CATH hierarchy. The CATH hierarchical classification scheme now consists of nine levels. Class is derived from secondary structure content, and Architecture describes the gross orientation of secondary structures, independent of connectivity. The Topology level clusters structures into fold groups according to their topological connections and numbers of secondary structures. The Homologous superfamilies cluster proteins with highly similar structures, sequences and/or functions (4,5). The new extension of the CATH classification system now includes five 'SOLID' sequence levels. S, O, L and I further divide domains within the H-level using multi-linkage clustering based on similarities in sequence identity (35, 60, 95 and 100%) (see Table 1). The D-level acts as a counter within the I-level and is appended to the classification hierarchy to ensure that every domain in CATH has a unique CATHsolid identification code (see Table 1). Specific details on the nature of the SOLID-levels can be found in the 'General Information' section of the CATH website, http://www.cathdb.info. CATH only includes experimentally determined protein structures with a 4 Å resolution or better, 40 residues in length or longer, and having 70% or more side chains resolved.
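The inclusion criteria and the S, O, L, I identity cutoffs described above can be written down as a short sketch. The function and constant names are hypothetical. Treating each cutoff as inclusive is an assumption, and the pairwise lookup is only a simplification of the multi-linkage clustering that CATH actually performs.

```python
MAX_RESOLUTION = 4.0        # angstroms ("4 Å resolution or better")
MIN_LENGTH = 40             # residues
MIN_SIDE_CHAINS = 0.70      # fraction of side chains resolved

# Sequence-identity cutoffs (percent) for the S, O, L and I levels.
SOLID_CUTOFFS = (("S", 35.0), ("O", 60.0), ("L", 95.0), ("I", 100.0))

def meets_cath_criteria(resolution, n_residues, side_chain_fraction):
    """Apply the three stated inclusion criteria to one structure."""
    return (MAX_RESOLUTION >= resolution
            and n_residues >= MIN_LENGTH
            and side_chain_fraction >= MIN_SIDE_CHAINS)

def deepest_shared_level(identity_percent):
    """Deepest S/O/L/I level whose cutoff a pair of domains reaches,
    or None if the identity is below the S-level cutoff."""
    shared = None
    for level, cutoff in SOLID_CUTOFFS:
        if identity_percent >= cutoff:
            shared = level
    return shared

print(meets_cath_criteria(2.1, 150, 0.95))
print(deepest_shared_level(72.5))
```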
Table 1. CATH version 3.0 statistics
DOMAIN BOUNDARY ASSIGNMENTS
We have further improved our automated domain boundary prediction method, CATHEDRAL (4). This is used to search a newly determined multi-domain structure against a library of representative structures from different fold groups in the CATH database to recognise constituent domains. CATHEDRAL performs an initial rapid secondary structure comparison between structures, using graph theory to identify putative fold matches, which are then more carefully aligned using a slower, more accurate dynamic programming method. A new scoring scheme has been implemented which combines information on the class of the domains being compared, their sizes, similarity in the structural environments and the number of equivalent residues. A support vector machine is used to combine the different scores and select the best fold match for each putative domain in a new multi-domain protein.
Benchmarking against a set of 964 'difficult' multi-domain chains, whose 1593 constituent domains were remotely related to folds in CATH (under 35% sequence identity) and originated from 245 distinct fold groups and 462 superfamilies, showed that 90% of domains within these chains could be assigned to the correct fold group, and for 78% of them the domain boundaries were within ±15 residues of boundaries assigned by careful manual validation. Larger variations in domain boundaries are often due to the fact that in many families significant structural variation can occur during evolution, so that distant relatives vary considerably in size. If no close relative has been classified in CATH, it is likely that the only CATHEDRAL match will be to a relative with significant structural embellishments, thus making it harder to determine the correct boundaries.
Since domain boundary assignment of remote homologues is one of the most time-consuming stages in the classification, we combine multiple sources of information for each new structure on a web page to guide manual curation. Pages display scores from a range of algorithms, which include structure-based methods (CATHEDRAL (4), SSAP (6,7), DETECTIVE (8), PUU (9), DOMAK (10)), sequence-based methods such as hidden Markov models (HMMs) (11), and relevant literature. These pages are now viewable for information on putative boundaries for new multi-domain structures currently being classified in CATH (e.g. http://www.cathdb.info/cgi-bin/cath/Chain.pl?chain_id=2g3aA).
For protein chains which are closely related to chains that are already chopped in CATH, an automated protocol has been developed (ChopClose). ChopClose identifies any previously chopped chains that have sufficiently high sequence identity and overlap with the query chain. Using SSAP (6), the query is aligned against each of these chains in turn, and in each case the domain boundaries are inherited across the alignment. The process of inheriting the boundaries often requires some adjustments to be made to account for insertions, deletions or unresolved residues. If the inheritance from one of the chains meets various criteria (SSAP score ≥80, sequence identity >80%, RMSD ≤6.0 Å, longest end extension ≤10 residues, etc.), then the resulting boundaries are used to chop the chain automatically (AutoChop). For cases where ChopClose's best result does not meet all the criteria for automatic chopping, it is provided as support information for a manual domain boundary assignment.
Refer to Figure 3 for the location of AutoChop/ChopClose within the CATH update protocol.
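The four criteria listed above for automatic chopping can be expressed as a single check. This is only a sketch: the source says "etc.", so the real protocol applies further criteria that are not reproduced here, and the function and constant names are hypothetical.

```python
MIN_SSAP_SCORE = 80.0          # SSAP score must be at least 80
MIN_IDENTITY_EXCLUSIVE = 80.0  # sequence identity must exceed 80%
MAX_RMSD = 6.0                 # angstroms
MAX_END_EXTENSION = 10         # residues

def can_autochop(ssap_score, sequence_identity, rmsd, longest_end_extension):
    """True if an inherited chopping passes the four listed thresholds.

    Only the criteria named in the text are checked; others exist.
    """
    return (ssap_score >= MIN_SSAP_SCORE
            and sequence_identity > MIN_IDENTITY_EXCLUSIVE
            and MAX_RMSD >= rmsd
            and MAX_END_EXTENSION >= longest_end_extension)

print(can_autochop(85.0, 92.0, 2.1, 4))    # passes all four checks
print(can_autochop(85.0, 80.0, 2.1, 4))    # identity is not above 80%
```

A result that fails the check would, as described above, still be passed on as supporting evidence for manual boundary assignment.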
NEW HOMOLOGUE RECOGNITION METHODS
We have assessed a number of HMM-based protocols for improving homologue recognition. A new protocol (Samosa), exploiting models built using multiple structure alignments to improve accuracy, gives some improvements in sensitivity (4-5%). However, a protocol exploiting an 8-fold expanded HMM library based on sequence relatives of structural domains gives an increase of nearly 10% in sensitivity (12). In addition, HMM-HMM-based approaches have been implemented using the PRC protocol of Madera and co-workers (http://supfam.org/PRC). These allow recognition of extremely remote homologues, some of which are not easily detected by the structure comparison methods (discussed further below). HMM-based database scans developed for the CATH classification protocol are collectively referred to as HMMscan below.
For some very remotely related homologues, confidence in an assignment can be improved by combining information from multiple prediction methods. We have investigated the benefits of using machine learning methods to do this automatically. A neural network was trained using a dataset of 14 000 diverse homologues (under 35% sequence identity) and 14 000 non-homologous pairs, with data from different homologue comparison methods including structure comparison (CATHEDRAL, SSAP), sequence comparison (HMM-HMM) and information on functional similarity. The latter was obtained by comparing EC classification codes between close relatives of the distant homologues and using a semantic similarity scoring scheme for comparing GO terms, based on a method developed by Lord et al. (13). On a separate validation set of 14 000 homologous pairs and 14 000 non-homologous pairs, 97% of the homologues can be recognised at an error rate of under 4%.
NEW UPDATE PROTOCOL
Automatic methods
Previously, CATH data was generated using a group of independent programs and flat files. Over the past two years we have developed an update protocol for CATH that is driven by a suite of programs with a central library and a PostgreSQL database system. A classification pipeline has been established which links, in a completely automated fashion, the different programs that analyse the sequences and structures of both protein chains and domains. The CATH update protocol can essentially be divided into two parts: domain boundary assignment and domain homology classification (see Figure 3). The aim of the protocol is to minimise manual assignment and provide as much support as possible when manual validation is necessary.
Processing of both parts of the classification protocol is similar, requiring related meta-data and the triggering of the same automated algorithms. Methods include pairwise sequence similarity comparisons and scans by other homologue detection or fold recognition algorithms, such as HMMscan and CATHEDRAL, that provide data for either manual or automated assignment. Many of the automated steps in the protocol have been established as a web service, and the pipeline integrates automated steps together with 'holding stages' in which domains are held prior to processing and await the completion of manual validation of predictions (see below).
Web pages to support manual validation
For each manual stage in the classification (domain boundary assignment, DomChop, and homologue recognition, HomCheck) we have developed a suite of web pages bringing together all available meta-data from prediction algorithms (e.g. DBS, CATHEDRAL and HMMscan for DomChop; CATHEDRAL and HMMscan for HomCheck), information from the literature, and information from other family classifications with relevant data (e.g. Pfam). For each protein or domain shown on the pages, information on the statistical significance of matches is presented. The web pages will shortly be made viewable and will provide interim data on protein chains and domains not fully classified in CATH for biologists interested in any entries pending classification.
OVERVIEW OF THE CURRENT RELEASE (VERSION 3.0)
Assigning domain boundaries and relationships between protein structures is computationally challenging. Since the last CATH release (version 2.6), the number of domains in the CATH database has increased by 20% in version 3.0 and now totals 86 151. This is a more than 10-fold increase in the number of domains classified in CATH since its creation. Improvements in automation, and also in the web-based resources used to aid manual validation, have allowed us to increase the proportion of hard-to-classify structures processed in CATH, and this is reflected in a significant increase in the proportion of new folds in the database, now more than 1000. The detailed breakdown of numbers of domains in the nine CATH levels is given in Table 1. We conducted an analysis of the domains in version 3.0 and have derived statistics for several fundamental features.
Percentage of new topologies
An analysis of the percentage of new folds arising from the early 1970s to the present is shown in Figure 1. The number of new folds has been decreasing over time with respect to the number of new structures being deposited, and currently approximately 2% of new structures classified in CATH are observed to be novel folds. For comparison, the number of domain structures solved over time is also represented graphically in Figure 1.
Number of domains within a protein chain
Designating domain boundaries is integral to the construction of the CATH database. We conducted an analysis of the number of chains versus the number of domains in a chain. It is interesting to note that 64% of all protein structures currently solved and classified in CATH are single-domain chains (data not shown). The next most prevalent are two-domain chains (27%), after which the number of chains containing three or more domains rapidly decreases. The average length of the single-domain chains is 159 residues.
The CATH Dictionary of Homologous Superfamilies (CATH-DHS)
The CATH-DHS has also been recently updated. Data on structural similarity and superfamily variability are presented as a significant update to the Dictionary of Homologous Superfamilies (DHS) web resource (14). The DHS also provides functional annotations of domains within each H-level (superfamily) in CATH v2.5.1.
For each superfamily, pair-wise structural similarity scores between relatives, measured by SSAP, are presented. The DHS now contains 3307 multiple structural alignments for 1459 superfamilies. For each superfamily, multiple alignments are generated for all the relatives and also for subgroups of structurally similar relatives and sequence-similar relatives. Alignments are performed using the residue-based CORA algorithm (15) and presented both as CORAPLOTS (14) and in the form of a 2DSEC diagram (16), alongside co-ordinate data of the superposed structures in PDB format. Sequence representations of the alignments are available to download in FASTA format. In the CORAPLOT images of the multiple alignment, residues in each domain are coloured according to ligand binding and residue type. EquivSEC plots are also shown that describe the variability in orientation and packing between equivalent secondary structures (16).
To identify sequence relatives for CATH superfamilies, sequences from UniProt (17) were scanned against HMMs of all CATH domains (12). Homologous sequences were identified as those hits with an E-value < 0.01 and a 60% residue overlap with the CATH domain. This protocol recognised over one million domain sequences in UniProt which could be integrated in the CATH-DHS. The harvested sequences in each superfamily were compared against other relatives by BLAST (18) to determine the pair-wise sequence identity and then clustered at appropriate levels of sequence identity (35% and 95%) using multi-linkage clustering. Information and links to other functional databases, ENZYME (19), GO (Gene Ontology Consortium 2000), KEGG (20), COG (21) and SWISS-PROT (22), are also included by BLASTing the sequences from each superfamily against sequences provided by these resources. Only hits with 95% sequence identity and an 80% residue overlap were used to annotate sequences.
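The acceptance rule for sequence relatives can be sketched in a few lines of Python. This is only an illustration, assuming the thresholds read as an E-value below 0.01 and at least 60% overlap, and assuming overlap is measured as the fraction of the CATH domain covered by the match. The Hit record and its field names are hypothetical and are not part of the CATH pipeline.

```python
from dataclasses import dataclass

@dataclass
class Hit:
    """Hypothetical record for one HMM match of a UniProt sequence to a CATH domain."""
    sequence_id: str
    e_value: float
    aligned_residues: int  # residues of the CATH domain covered by the match
    domain_length: int     # total length of the CATH domain

E_VALUE_CUTOFF = 0.01  # threshold as read from the text
MIN_OVERLAP = 0.60     # 60% residue overlap with the domain (assumed definition)

def is_homologue(hit):
    """Accept a hit only if it passes both the E-value and the overlap criterion."""
    overlap = hit.aligned_residues / hit.domain_length
    return hit.e_value < E_VALUE_CUTOFF and overlap >= MIN_OVERLAP

def select_homologues(hits):
    """Keep the hits that satisfy both criteria, preserving input order."""
    return [h for h in hits if is_homologue(h)]

# Made-up example hits, for illustration only.
hits = [
    Hit("seqA", 1e-20, 90, 100),  # passes both criteria
    Hit("seqB", 1e-5, 40, 100),   # good E-value, overlap too short
    Hit("seqC", 0.5, 95, 100),    # good overlap, E-value too weak
]
accepted = [h.sequence_id for h in select_homologues(hits)]
```

Both conditions must hold together, so a very significant match that covers only part of the domain is still rejected.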
Recent analysis of structural and functional divergence in highly populated CATH superfamilies (>5 structural relatives with <35% sequence identity) has been undertaken using data from the DHS. The 2DSEC algorithm was used to analyse multiple structural alignments of families and identify highly conserved structural cores and secondary structure embellishments, or decorations, to the common core. In some large superfamilies extensive embellishments were observed outside the core, and although these secondary structure insertions were frequently discontinuous in the protein chain, they were often co-located in 3D space (16). In many cases manual inspection revealed that the embellishments had aggregated to form a larger structural feature that was modifying the active site of the domain or creating new surfaces for domain or protein interactions. Data collected in the DHS clearly show a relationship between structural divergence within a superfamily, sequence divergence of this superfamily amongst predicted domains in the genomes, and the number of distinct functional groups that can be identified for the superfamily (see Figure 4).
Figure 4. Relationship between sequence variability, structural variability and functional diversity in CATH superfamilies. Structural variation in a CATH superfamily as measured by the number of diverse structural subgroups (SSAP score <80 between groups).
LATERAL LINKS ACROSS THE CATH HIERARCHY TO CAPTURE EVOLUTIONARY DIVERGENCE AND EXPLORE THE STRUCTURAL CONTINUUM
Our analysis of structural divergence in CATH superfamilies (16) has revealed families where significant changes in the structures had occurred. In some cases 5-fold differences in the sizes of domains were identified, and sometimes it was apparent that the 'folds' of these very diverse relatives had effectively changed. Therefore, in these superfamilies more than one fold group can be identified, effectively breaking the hierarchical nature of the CATH classification, which implies that each relative within a CATH homologous superfamily should belong to the same CAT fold group.
In addition, an 'all versus all' HMM-HMM scan between all superfamily representatives revealed several cases of extremely remote homologues which had been classified into separate superfamilies and yet match with significant E-values. Structure comparison had failed to detect the relationship between these superfamilies because the structural divergence of the relatives was so extreme, sometimes constituting a change in architecture as well as fold group. In these cases homology was only suggested by the HMM-based scans and then manually validated by considering functional information and detailed evidence from the literature. In order to capture information on these distant homologies, links have been created between the superfamilies, both on our web pages and in the CATH database. The data can now be found as a link from the CATH homepage (http://www.cathdb.info).
In the near future we also plan to provide web pages presenting cases of significant structural overlaps between superfamilies or fold groups. For these cases we are not currently able to find any additional evidence to support a distant evolutionary relationship, and these examples highlight the recurrence of large structural motifs between some folds and the existence of a structural continuum in some regions of fold space.
FEATURES
The CATH database can be accessed at http://www.cathdb.info. The web interface may be browsed or, alternatively, searched with PDB codes or CATH domain identifiers. There is also a facility for keyword searches. With the version 3.0 release we now make the raw and processed data files available, including, for example, CATH domain PDB files, sequences and DSSP files; they can be accessed through the CATH database main page. The Gene3D resource can be accessed through the CATH database or directly at http://www.cathdb.info/Gene3D. The DHS can be accessed through the CATH database or directly at http://www.cathdb.info/bsm/dhs.
Structural similarity is assessed using an automatic method (SSAP) (3,4) which scores 100 for
identical proteins and generally returns scores above 80 for homologous proteins. More distantly
related folds generally give scores above 70 (Topology or fold level), though in the absence of any
sequence or functional similarity this may simply represent examples of convergent evolution,
reinforcing the hypothesis that there exists a limited number of folds in nature (5,6).
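As a rough illustration of how these SSAP score bands might be read, the sketch below maps a score to a qualitative label. The function name and labels are hypothetical, and treating the boundaries 80 and 70 as inclusive is an assumption; the text only says that scores are "generally" above these values, so the bands are guidelines and not strict rules.

```python
def interpret_ssap(score):
    """Map an SSAP score (0-100) to the qualitative band described in the text."""
    if not 0 <= score <= 100:
        raise ValueError("SSAP scores are expected to lie between 0 and 100")
    if score == 100:
        return "identical"
    if score >= 80:  # inclusive boundary is an assumption
        return "typical of homologues"
    if score >= 70:  # inclusive boundary is an assumption
        return "similar fold (possibly convergent)"
    return "no clear structural similarity"

# Illustrative scores, not taken from any real comparison.
labels = [interpret_ssap(s) for s in (100, 86.5, 73, 55)]
```

A score in the fold band on its own does not establish homology; as the text notes, sequence or functional evidence is needed before an evolutionary relationship is inferred.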
Abstract
We report the latest release (version 1.4) of the CATH protein domains database
(http://www.biochem.ucl.ac.uk/bsm/cath). This is a hierarchical classification of 13 359 protein
domain structures into evolutionary families and structural groupings. We currently identify 827
homologous families in which the proteins have both structural similarity and sequence and/or
functional similarity. These can be further clustered into 593 fold groups and 32 distinct architectures.
Using our structural classification and associated data on protein functions stored in the database (EC
identifiers, SWISS-PROT keywords and information from the Enzyme database and literature) we
have been able to analyse the correlation between the 3D structure and function. More than 96% of
folds in the PDB are associated with a single homologous family. However, within the superfolds,
three or more different functions are observed. Considering enzyme functions, more than 95% of
clearly homologous families exhibit either single or closely related functions, as demonstrated by the
EC identifiers of their relatives. Our analysis supports the view that determining structures, for
example as part of a 'structural genomics' initiative, will make a major contribution to interpreting
genome data.
Introduction FOR CATH
The CATH classification of protein domain structures was established in 1993 (1) as a
hierarchical clustering of protein domain structures into evolutionary families and structural
groupings depending on sequence and structure similarity There are four major levels
corresponding to protein class architecture topology or fold and homologous family (Fig 1)
Since 1995 information about these structural groups and protein families has been accessible
over the Web (http://www.biochem.ucl.ac.uk/bsm/cath) together with summary information
about each individual protein structure (PDBsum) (2)
CATH consists of both phylogenetic and phenetic descriptors for protein domain relationships
At the lowest levels in the hierarchy proteins are grouped into evolutionary families
(homologous families) on the basis of either significant sequence similarity (≥35% identity) or high
structural similarity and some sequence similarity (≥20% identity).
The CATH database of protein structures contains approximately 18000 domains
organized according to their (C)lass, (A)rchitecture, (T)opology and (H)omologous
superfamily. Relationships between evolutionarily related structures (homologues) within the
database have been used to test the sensitivity of various sequence search methods in order to
identify relatives in GenBank and other sequence databases. Subsequent application of the most
sensitive and efficient algorithms, gapped BLAST and the profile-based method Position-Specific
Iterated Basic Local Alignment Search Tool (PSI-BLAST), could be used to assign structural data to
between 22% and 36% of microbial genomes in order to improve functional annotation and
enhance understanding of biological mechanism. However, on a cautionary note, an analysis of
functional conservation within fold groups and homologous superfamilies in the CATH database
revealed that whilst function was conserved in nearly 55% of enzyme families, function had
diverged considerably in some highly populated families. In these families functional properties
should be inherited far more cautiously, and the probable effects of substitutions in key functional
residues must be carefully assessed.
Figure 1
Schematic representation of the (C)lass, (A)rchitecture and (T)opology/fold levels in the CATH
database.
Figure 2
Snapshot of a web page showing data available in the CATH dictionary of homologous superfamilies
for the subtilisin family (CATH id 3.40.50.200). Tables display the PDB codes for non-identical
relatives in the family together with EC identifier codes and information about the enzyme reactions.
The multiple structural alignment shown has been coloured according to secondary structure
assignments (red for helix, blue for strands).
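A CATH identifier such as the one quoted for the subtilisin family encodes the four main levels of the hierarchy as dot-separated numbers (Class.Architecture.Topology.Homologous family). The sketch below assumes the dotted form 3.40.50.200 and uses hypothetical helper names to show how the levels can be read off and how two domains can be tested for a shared fold group.

```python
from typing import NamedTuple

class CathCode(NamedTuple):
    """The four main CATH levels of one domain (field names are illustrative)."""
    class_: int
    architecture: int
    topology: int
    homologous_family: int

    def fold_group(self):
        """Return the C.A.T prefix that identifies the fold group."""
        return f"{self.class_}.{self.architecture}.{self.topology}"

def parse_cath_code(code):
    """Split a dotted identifier such as '3.40.50.200' into its four levels."""
    parts = code.strip().split(".")
    if len(parts) != 4 or not all(p.isdigit() for p in parts):
        raise ValueError(f"expected four dot-separated numbers, got {code!r}")
    return CathCode(*(int(p) for p in parts))

def same_fold(code_a, code_b):
    """Two domains share a fold group when their C, A and T levels all agree."""
    return parse_cath_code(code_a)[:3] == parse_cath_code(code_b)[:3]

subtilisin = parse_cath_code("3.40.50.200")
# '3.40.50.300' is used only as a second example identifier within the same fold group.
shares_fold = same_fold("3.40.50.200", "3.40.50.300")
```

Comparing only the first three numbers mirrors the hierarchy: homologous families sit inside a fold group, so relatives of one family are expected to share the same C.A.T prefix.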
The Architecture level in CATH groups proteins whose folds have similar 3D arrangements of
secondary structures (e.g. barrel, sandwich or propeller) regardless of their connectivity, whilst the
top level, Class, simply reflects the proportion of α-helix or β-strand secondary structures. Three
major classes are recognised, mainly-α, mainly-β and α-β, since analysis revealed considerable
overlap between the α+β and alternating α/β classes originally described by Levitt and Chothia (7).
Before classification, multidomain proteins are first separated into their constituent folds using a
consensus method which seeks agreement between three independent algorithms (8). Whilst the
protocol for updating CATH is largely automatic (9), several stages require manual validation, in
particular establishing domain boundaries in proteins for which no consensus could be reached and
checking the relationships of very distant homologues and proteins having borderline fold similarity.
Although there are plans to assign the more regular architectures automatically, all architecture
groupings are currently assigned manually.
A homologous family Dictionary is now available within CATH which contains functional data,
where available, for each protein within a homologous family. This includes EC identifiers,
SWISS-PROT keywords and information from the Enzyme database or the literature (Fig. 2). Multiple
structure-based alignments are also available, coloured according to secondary structure assignments
or residue properties, and there are schematic plots showing domain representations annotated by
protein-ligand interactions (DOMPLOTS) (A.E. Todd, C.A. Orengo and J.M. Thornton, submitted
to Protein Engng). The topology of each domain is illustrated by schematic TOPS diagrams
(http://www3.ebi.ac.uk/tops; 10).
Figure 3
CATH wheel plot showing the population of homologous families in different fold groups,
architectures and classes. The wheel is coloured according to protein class (red, mainly-α; green,
mainly-β; yellow, αβ; blue, few secondary structures). The size of the outer wheel represents the
number of homologous families in CATH, whilst each band in the outer wheel corresponds to a single
fold family. The size of each fold 'band' therefore reflects the number of homologous families having
that fold. It can be seen that most fold families contain a single homologous family. The superfold
families are shown as paler bands containing many homologous families. The inner wheel shows the
population of homologous families in the different architectures.
We have also recently set up a Web Server (11) which enables the user to scan the CATH database
with a newly determined protein structure and identify possible fold similarities or evolutionary
relationships. There are also plans to incorporate sequence searches (using BLAST or PSI-BLAST)
(12) to identify a probable fold for a new sequence.
The latest release of CATH (version 1.4, April 1998) contains 9342 protein chains from the PDB (13)
which divide into 13 359 domain folds. Currently 32 different architectures are recognised. Since the
last release three new architectures have been described, including the five-bladed α-β propeller.
Grouping proteins on the basis of sequence, structure and functional similarity gives 827 evolutionary
homologous families (H-level), whilst recognising more distant structural similarity with no
accompanying sequence or function similarity gives rise to 593 different fold groups (T-level).
The population of the different levels in the CATH hierarchy is illustrated by the CATH wheel shown
in Figure 3. It can be seen that several highly populated fold families, which we describe as superfolds
(6) as they support a diverse range of sequences and more than three different functions, still account
for nearly 30% of non-homologous structures.
Implications for Structural Genomics
As the sequence databases grow rapidly, the need to interpret these sequences and assign functions to
specific genes becomes increasingly important. Many techniques exist for matching protein sequences
and thereby inheriting functional information. However, for very distant homologues there is often no
detectable sequence similarity despite conservation of 3D structure and function. For these cases
evolutionary relationships, and thereby functions, can only be assigned by comparing the structures.
Therefore a number of structural genomics initiatives are being proposed (14) which aim to identify
all the folds in nature, with the ultimate goal of being able to predict the function of a new protein from
its known or probable structure. The important questions to ask are how many more folds we need
to determine before we have the complete set, and how confident we can be in assigning function
between proteins having similar structures.
In the current genomes, on average only 30–46% of sequences can be assigned to a structural family
by recognising sequence similarity to a protein of known structure (15,16). With only ~600 unique
structures currently in the PDB, compared with ~20 000 sequence families, it is clear that we still
need to determine many more structures if we are to understand biology at the molecular level.
However, analysis of recently deposited structural data is very revealing. Figure 4a illustrates the
distribution of 2159 new structural domains classified in the 10 months from June 1997 to March
1998. A large proportion of these (79%) were clearly homologous (≥30% identity) to proteins of
known structure.
Of the remaining 443 structures (Fig. 4b), corresponding to 'new' sequences, we found only 8% were
novel folds, the remainder resembling a previously determined structure. Many of these, 199 (45%),
could be identified as clear homologues by having significant structure and sequence similarity (SSAP
≥80 and ≥20% sequence identity). A further 169 (38%) were probable homologues: although the
sequence identity was below 20%, they had functional similarity and/or gave significant scores using
sequence search methods designed to detect very distant homologues (PSI-BLAST) (12). There
remained a further 40 (9%) proteins which were analogous, i.e. they had the same fold as a previous
entry but neither the sequence nor the function gave definite evidence of a common ancestor.
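The four outcomes described for the 443 structures can be written as a small decision function, followed by a check of the quoted percentages. This is a sketch only: it reads the thresholds as SSAP ≥ 80 and ≥ 20% sequence identity, the argument names are hypothetical, and the count of novel folds is not given in the text but derived here by assuming the four categories partition the 443 structures.

```python
def classify_new_structure(same_fold, ssap, identity_pct,
                           functional_similarity=False, psiblast_hit=False):
    """Assign a new structure to one of the four categories used in the text."""
    if not same_fold:
        return "novel fold"
    if ssap >= 80 and identity_pct >= 20:
        return "clear homologue"
    # Simplification (assumption): any functional or PSI-BLAST evidence is
    # treated as sufficient for a probable homologue.
    if functional_similarity or psiblast_hit:
        return "probable homologue"
    return "analogue"

TOTAL = 443  # structures with 'new' sequences
counts = {"clear homologue": 199, "probable homologue": 169, "analogue": 40}
# Derived value, assuming the categories cover all 443 structures.
counts["novel fold"] = TOTAL - sum(counts.values())
percentages = {name: round(100 * n / TOTAL) for name, n in counts.items()}
```

Under that assumption the remainder is 35 structures, which rounds to the 8% of novel folds quoted above, and the clear and probable homologues together give the 83% figure used later in the text.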
At the homologous superfamily level in CATH, a more detailed analysis of enzyme functions showed
that the majority of homologous enzyme families in CATH (>90%) contained proteins for which the
first three EC identifiers were the same. Considering those families where homologues have
significant sequence identity (≥20%) after structural alignment, 95% were found to have a single EC
identifier, whilst for families where proteins have more than 30% sequence similarity we observed
that 98% had a single EC code.
Although assigning function on the basis of homology is common practice, it is clear that some
caution should be exercised, particularly where there is little or no sequence similarity. There are also
some clear examples where homologues with significant sequence similarity perform different
functions. The role of 'gene recruitment' is especially clear in the eye lens proteins, which function as
enzymes in other cellular environments but which are used as structural proteins in this context (17).
The extent of such 'gene recruitment' and context-sensitive function is really not known at this time.
For enzymes it is clear that catalytic function can change and evolve, usually to act on a different but
related substrate. Similarly, within the lipocalin family (CATH id 2.40.130.10) several proteins are
found with very similar structures which bind different fatty acids in the same region at the base of
the β-barrel (e.g. retinol, bilin, biotin).
Nearly half of the homologous families where two or more different EC numbers were observed
belong to the superfolds. This suggests that if a new protein is assigned to a superfold family, more
caution should be used when inheriting functional information, as there appears to be greater tolerance
to changes in sequence and ultimately function for these families. However, it is interesting to note
that many of these were TIM barrel or Rossmann folds. These are superfolds in which the substrate or
ligand commonly binds in the same place: in the base of the β-barrel for the TIMs and at the
crossover of the polypeptide chain for the doubly wound Rossmann structures.
Assignment of Function Through Structure
One of the reasons for determining structures is to derive more information to facilitate the assignment
of function. From our analysis of proteins in CATH we suggest that structural data can help to assign
function in several ways:
i. The structural data allow recognition of more distant homologues compared with
sequence data; in our analysis 83% of structures with novel sequences could be assigned as
homologues in this way (note that such assignment of function is again subject to the caveats
imposed by 'gene recruitment' discussed above).
ii. The structural data allow detailed inspection of the functional site, to suggest if and
how the function may have evolved. For example, if an enzyme has evolved to act on a
different substrate, the binding site may reveal or at least suggest possible changes in the
substrate.
iii. For the superfolds, similarity of structure does not necessarily mean similarity of
function. However, the active site/binding sites are often conserved, e.g. in the TIM barrel or
Rossmann fold structures the ligand always binds at the same end of the barrel or sheet.
iv. Some methods have already been developed, and will increasingly be the focus of
attention over the next few years, which aim to predict function ab initio from structure. For
example, enzymes can often be identified by the presence of a major cleft which also locates
the active site (18). Similarly, critical surface patches which are used for molecular
recognition in binding other proteins or ligands may be identified using knowledge-based
approaches (19,20).
In summary, extrapolating the data from Figure 4 to a new genome, we can expect that of the 54–70%
of sequences which currently have no obvious sequence matches in the PDB, we will find nearly
80–90% to be homologous to a known family using the structural data alone. For the singlet folds this
will almost certainly reveal some clues to the function. For the superfolds, some folds will reveal
information on the functional class (e.g. enzyme for TIM barrels) or the location of the active site, if
not the specific function. Only 10–20% will be expected to be 'novel' folds. For these the ab
initio methods referred to above may provide some clues to guide experiments. Therefore it is clear
that determining structures, as part of a 'structural genomics' initiative for example, will make a
major contribution to interpreting genome data.
Jail
Just another interface library
Interfaces of macromolecules are a valuable basis for analysing the process of molecular recognition. JAIL classifies not only the interfaces between domain architectures but also those between protein chains and those between proteins and nucleic acids.
Of course not
Gt-alpha/Gi-alpha chimera (PDB-ID 1GOT); interfaces of 1GOT
Interacting proteins are difficult to crystallize and are rarely present in the Protein Data Bank.
Nevertheless, it is essential to analyse the interacting parts of the proteins to understand the
process of protein-protein docking.
To overcome this problem we have built the JAIL database. Since interacting domains
exhibit structural features similar to those of proteins, all known interfaces between interacting
domains of the SCOP database were extracted and classified in JAIL.
Only a part of all protein structures is included in SCOP. In particular, new PDB entries are
not yet annotated. To overcome this problem, all interfaces between protein
chains were additionally calculated and included in the database. This type of interface also comprises
the interacting parts of the assumed biological units. The last important type of interface
provided here is composed of the interacting parts between proteins and nucleic acids.
Overall the data set consists of about 180000 interfaces.
JAIL is a convenient tool for browsing the interface library and analysing single
interfaces. However, more general questions require large-scale analysis. For this purpose
a detailed form enables the compilation of comprehensive non-redundant data sets for
download.
How is an interface defined?
A complete residue is part of an interface if at least one atom of the amino acid is located
within 4.5 Ångström of any atom of the interacting domain or chain. One part of
an interface must consist of at least 5 C-alpha atoms in the case of protein chains. In the
case of nucleic acids the relevant atoms are calculated on the basis of the phosphorus atoms
of the RNA/DNA backbone.
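The distance rule for protein chains can be sketched as follows. This is a minimal brute-force illustration and not the JAIL implementation: the Atom record is hypothetical, the cutoff is read as 4.5 Ångström, and the requirement of at least 5 C-alpha atoms is interpreted as at least 5 interface residues on each side, which is an assumption.

```python
import math
from collections import namedtuple

# Hypothetical minimal atom record; real coordinates would come from a PDB file.
Atom = namedtuple("Atom", "residue_id x y z")

DISTANCE_CUTOFF = 4.5  # Angstrom, as read from the text
MIN_RESIDUES = 5       # assumed reading of 'at least 5 C-alpha atoms' per side

def interface_residues(own_atoms, partner_atoms, cutoff=DISTANCE_CUTOFF):
    """Return ids of residues with at least one atom within `cutoff` of any partner atom."""
    selected = set()
    for atom in own_atoms:
        if atom.residue_id in selected:
            continue  # the complete residue already belongs to the interface
        for other in partner_atoms:
            if math.dist(atom[1:], other[1:]) <= cutoff:
                selected.add(atom.residue_id)
                break
    return selected

def find_interface(chain_a, chain_b):
    """Return both sides of the interface, or None if either side is too small."""
    side_a = interface_residues(chain_a, chain_b)
    side_b = interface_residues(chain_b, chain_a)
    if len(side_a) < MIN_RESIDUES or len(side_b) < MIN_RESIDUES:
        return None
    return side_a, side_b

# Toy coordinates: two parallel rows of single-atom residues, 4.0 Angstrom apart.
chain_a = [Atom(i + 1, 3.8 * i, 0.0, 0.0) for i in range(6)]
chain_b = [Atom(i + 1, 3.8 * i, 4.0, 0.0) for i in range(5)]
interface = find_interface(chain_a, chain_b)
```

In the toy data the sixth residue of the first chain has no partner atom within the cutoff, so it is left out while the other five residues on each side form the interface. For real structures a spatial index would replace the double loop, which scales with the product of the two atom counts.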
What are biounits?
The primary coordinate file deposited in the PDB generally contains one asymmetric unit.
The asymmetric unit is the smallest portion of a crystal structure to which crystallographic
symmetry can be applied to generate one unit cell. The biological molecule (biounit) is believed
to be the functional unit of the protein. Frequently those units can be assumed or calculated
when additional information is available. The biological units of many proteins are deposited
in a separate section of the PDB database and can be used for interface calculations.
In which way are redundant interfaces excluded in the download section?
Redundancy is excluded in two different ways: by structure and by sequence. The
sequence clustering is based on the CD-HIT program. The structural clustering is defined by
the protein families and superfamilies of the SCOP classification. The database classifies
proteins by domain architecture.
Which settings in the download section are best for my own research?
The selection of the datasets depends on the type of interactions (protein-protein or
protein-nucleic acid) and the level of diversity that is desired. A maximal sequence identity of 50%
results in a higher diversity than a setting of 95%. The default settings include interfaces of
domain-domain interactions as well as interfaces between interacting chains. All interfaces of
chains that are already covered by the SCOP domain interfaces are excluded by default.
This procedure results in a high number of interfaces that are still diverse enough for
statistical analysis.
What is meant by 'show conservation' in Jmol?
The conservation of protein sequences is defined by the mutation rates at each amino acid
position. For JAIL this information was retrieved from ConSurf. ConSurf is a derived
database merging structural and sequence information.
Database scheme
Search for:
PDB-Id, e.g. 1aay
SCOP-Id, e.g. d1az0a_
EC-Number, e.g. 3.1.1
Accession number, e.g. P03697
Protein name, e.g. capsid protein
Search in the following interface types:
Domain/Domain (SCOP)
Chain/Chain
Protein/Nucleic
Biounit/Biounit
None
Full-text search: keyword
SCOP text search: keyword
Consider only interfaces having the following location: intra, inter, don't care
MMDB
Experimentally resolved structures of proteins, RNA and DNA, derived from the Protein Data Bank (PDB), with value-added features such as explicit chemical graphs, computationally identified 3D domains (compact substructures) that are used to identify similar 3D structures, as well as links to literature, similar sequences, information about chemicals bound to the structures, and more. These connections make it possible, for example, to find 3D structures for homologs of a protein sequence of interest, then interactively view the sequence-structure relationships, active sites, bound chemicals, journal articles and more.
Three-dimensional structures are now known within many protein families, and it is
quite likely, in searching a sequence database, that one will encounter a homolog
with known structure. The goal of Entrez's 3D-structure database is to make this
information, and the functional annotation it can provide, easily accessible to
molecular biologists. To this end Entrez's search engine provides three powerful
features: (i) Sequence and structure neighbors: one may select all sequences similar
to one of interest, for example, and link to any known 3D structures. (ii) Links
between databases: one may search by term matching in MEDLINE, for example, and
link to 3D structures reported in these articles. (iii) Sequence and structure
visualization: identifying a homolog with known structure, one may view
molecular-graphic and alignment displays to infer approximate 3D structure. In this article we
focus on two features of Entrez's Molecular Modeling Database (MMDB) not
described previously: links from individual biopolymer chains within 3D structures
to a systematic taxonomy of organisms represented in molecular databases, and
links from individual chains (and compact 3D domains within them) to structure
neighbors, other chains (and 3D domains) with similar 3D structure. MMDB may be
accessed at http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Structure.
SUPERFAMILY is a database of structural and functional annotation for all
proteins and genomes.[1][2][3][4][5]
The SUPERFAMILY annotation is based on a collection of hidden Markov models which
represent structural protein domains at the SCOP superfamily level.[6] A superfamily groups
together domains which have an evolutionary relationship. The annotation is produced by
scanning protein sequences from completely sequenced genomes against the hidden Markov
models.
For each protein you can:
Submit sequences for SCOP classification
View domain organisation, sequence alignments and protein sequence details
For each genome you can:
Examine superfamily assignments, phylogenetic trees, domain organisation lists and networks
Check for over- and under-represented superfamilies within a genome
For each superfamily you can:
Inspect SCOP classification, functional annotation, Gene Ontology annotation, InterPro abstract and genome assignments
Explore taxonomic distribution of a superfamily across the tree of life
All annotation, models and the database dump are freely available for download to everyone.
Purpose
SUPERFAMILY classifies amino acid sequences into known structural domains, especially
into SCOP superfamilies. The superfamilies are groups of proteins which have structural
evidence to support a common evolutionary ancestor but may not have detectable
sequence homology.
Major Features
Sequence search: Submit your protein or DNA sequence for SCOP superfamily and family level classification.
Keyword search: Search for superfamily, family or species names plus sequence, SCOP, PDB or hidden Markov model IDs.
Domain assignments: Domain assignments, alignments and architectures for completely sequenced eukaryotic and prokaryotic organisms, plus sequence collections.
Comparative genomics tools: Browse unusual (over- and under-represented) superfamilies and families, adjacent domain pair lists and graphs, unique domain pairs, domain combinations, domain architecture co-occurrence networks and domain distribution across taxonomic kingdoms for each organism.
Genome statistics: For each genome: number of sequences, number of sequences with assignment, percentage of sequences with assignment, percentage total sequence coverage, number of domains assigned, number of superfamilies assigned, number of families assigned, average superfamily size, percentage produced by duplication, average sequence length, average length matched, number of domain pairs and number of unique domain architectures.
Gene Ontology: Domain-centric Gene Ontology (GO) automatically annotated by Hai Fang.
Phenotype Ontology: Domain-centric phenotype/anatomy ontology including Disease Ontology, Human Phenotype, Mouse Phenotype, Worm Phenotype, Yeast Phenotype, Fly Phenotype, Fly Anatomy, Zebrafish Anatomy, Xenopus Anatomy, Arabidopsis Plant.
Superfamily annotation: InterPro abstracts for 1052 superfamilies and Gene Ontology (GO) annotation for 763 superfamilies.
Functional annotation: Functional annotation of SCOP 1.73 superfamilies by Christine Vogel.
Phylogenetic trees: Trees are generated using heuristic parsimony methods and are based on protein domain architecture data for all genomes in SUPERFAMILY. Genome combinations or specific clades can be displayed as individual trees.
Similar domain architectures: Find the 10 domain architectures which are most similar to a domain architecture of interest.
Hidden Markov models: Produce SCOP domain assignments for your sequences using the SUPERFAMILY models. HMM visualisation by Martin Madera, e.g. model 0045110.
Profile comparison: Find remote domain matches when the HMM search fails to find a significant match. Profile comparison (PRC) for aligning and scoring two profile hidden Markov models by Martin Madera.
Web services: Distributed Annotation Server and linking to SUPERFAMILY.
Downloads: Sequences, assignments, models, MySQL database and scripts, updated weekly.
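Several of the per-genome statistics listed among the features above are simple ratios over the domain assignments. The following is a minimal sketch of how a few of them could be computed. It is not SUPERFAMILY's own code: the `Protein` record, the `genome_statistics` function and the toy data are hypothetical, and the coverage figure assumes that assigned domains on one protein do not overlap.

```python
from dataclasses import dataclass, field


@dataclass
class Protein:
    """Hypothetical record: one protein and its domain assignments."""
    length: int                                   # sequence length in residues
    domains: list = field(default_factory=list)   # (superfamily_id, start, end), 1-based inclusive


def genome_statistics(proteins):
    """Compute a few of the listed per-genome statistics for a non-empty protein list."""
    n = len(proteins)
    with_assignment = sum(1 for p in proteins if p.domains)
    total_length = sum(p.length for p in proteins)
    # Residues covered by assigned domains (assumes domains do not overlap).
    matched = sum(end - start + 1 for p in proteins for _, start, end in p.domains)
    superfamilies = {sf for p in proteins for sf, _, _ in p.domains}
    return {
        "sequences": n,
        "sequences_with_assignment": with_assignment,
        "percent_with_assignment": 100.0 * with_assignment / n,
        "percent_sequence_coverage": 100.0 * matched / total_length,
        "domains_assigned": sum(len(p.domains) for p in proteins),
        "superfamilies_assigned": len(superfamilies),
        "average_sequence_length": total_length / n,
    }


# Toy genome of three proteins; the superfamily identifiers are made up.
toy_genome = [
    Protein(100, [("sf1", 1, 50)]),
    Protein(200, [("sf1", 11, 60), ("sf2", 101, 150)]),
    Protein(100, []),
]
stats = genome_statistics(toy_genome)
for name, value in stats.items():
    print(name, value)
```

In this toy genome two of the three proteins carry an assignment, and 150 of 400 residues are covered by a domain.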
Recent news
3rd June 2013 genetrainer being launched at LeWeb conference
The SUPERFAMILY database provides protein domain assignments, at the SCOP superfamily
level, for the predicted protein sequences in over 400 completed genomes. A superfamily groups
together domains of different families which have a common evolutionary ancestor based on
structural, functional and sequence data. SUPERFAMILY domain assignments are generated
using an expert curated set of profile hidden Markov models. All models and structural
assignments are available for browsing and download from http://supfam.org. The web interface
includes services such as domain architectures and alignment details for all protein assignments,
searchable domain combinations, domain occurrence network visualization, detection of over- or
under-represented superfamilies for a given genome by comparison with other genomes,
assignment of manually submitted sequences and keyword searches. In this update we describe
the SUPERFAMILY database and outline two major developments: (i) incorporation of family
level assignments and (ii) a superfamily-level functional annotation. The SUPERFAMILY
database can be used for general protein evolution and superfamily-specific studies, genomic
annotation, and structural genomics target suggestion and assessment.
HomoloGene: A gene homology tool that compares nucleotide sequences between
pairs of organisms in order to identify putative orthologs.
HTG database: A collection of high-throughput genome sequences from large-scale
genome sequencing centers, including unfinished and finished sequences.
SNPs database: A central repository for both single-base nucleotide substitutions
and short deletion and insertion polymorphisms.
RefSeq: A database of non-redundant reference sequence standards, including
genomic DNA contigs, mRNAs and proteins for known genes. Multiple collaborations,
both within NCBI and with external groups, support our data-gathering efforts.
STS database: A database of sequence tagged sites, or short sequences that are
operationally unique in the genome.
UniSTS: A unified, non-redundant view of sequence tagged sites (STSs).
UniGene: A collection of ESTs and full-length mRNA sequences organized into
clusters, each representing a unique known or putative human gene, annotated with
mapping and expression information and cross-references to other sources.
DNA & RNA Databases
Major Sequence Repositories – Human Chromosome Information – Organelle
Genome Databases – RNA Databases – Comparative & Phylogenetic Databases –
SNPs, Mutations and Variations Databases – Alternative Splicing Databases –
Specialized Databases
Major Sequence Repositories
DDBJ DNA databank of Japan
EMBL Maintained by EMBL
GenBank Maintained by NCBI
Human Chromosome Information
Chromosome information is available for each human chromosome: 1 to 22, X and Y.
Organelle Genome Databases
OGMP Organelle genome megasequencing program
GOBASE An organelle genome database
MitoMap Human mitochondrial genome database
RNA Databases
Rfam RNA family database
RNA base Database of RNA structures
tRNA database Database of tRNAs
tRNA tRNA sequences and genes
sRNA Small RNA database
Comparative & Phylogenetic Databases
COG Phylogenetic classification of proteins
DHMHD Human-mouse homology database
HomoloGene Gene homologies across species
Homophila Human disease to Drosophila gene database
HOVERGEN Database of homologous vertebrate genes
TreeBase A database of phylogenetic knowledge
XREF Cross-referencing with model organisms
SNPs, Mutations & Variations Databases
ALPSbase Database of mutations causing human ALPS
dbSNP Single nucleotide polymorphism database at NCBI
HGVbase Human Genome Variation database
Alternative Splicing Databases
ASAP Alternate splicing analysis tool at UCLA
ASG Alternate splicing gallery
HASDB Human alternative splicing database at UCLA
AsMamDB alternatively spliced genes in human mouse and rat
ASD Alternative splicing database at CSHL
Specialised Databases
ABIM Links to several genomics databases
ACUTS Ancient conserved untranslated sequences
AGSD Animal genome size database
AmiGO The Gene Ontology database
ARGH The acronym database
ASDB Database of alternatively spliced genes
BACPAC BAC and PAC genomic DNA library info
BBID Biological Biochemical image database
Cardiac gene database
CHLC Genetic markers on chromosomes
COGENT Complete genome tracking database
COMPEL Composite regulatory elements in eukaryotes
CUTG Codon usage database
dbEST Database of expressed sequences or mRNA
dbGSS Genome survey sequence database
dbSTS Sequence tagged sites (STS)
DBTSS Database of transcriptional start sites
DOGS Database of genome sizes
EID The exon-intron database ndash Harvard
Exon-Intron Exon-Intron database ndash Singapore
EPD Eukaryotic promoter database
FlyTrap HTML based gene expression database
GDB The genome database
GenLink Resources for human genetic and telomere research
GeneKnockouts Gene knockout information
GENOTK Human cDNA database
GEO Gene expression omnibus NCBI
GOLD Information on genome projects around the world
GSDB The Genome Sequence DataBase
HGI TIGR human gene index
HTGS High-throughput genomic sequence at NCBI
IMAGE The largest collection of DNA sequences clones
IMGT The international ImMunoGeneTics information system
IPCN Index to Plant Chromosome Numbers database
LocusLink Single query interface to sequence and genetic loci
TelDB The telomere database
MitoDat Mitochondrial nuclear genes
Mouse EST NIA mouse cDNA project
MPSS Searchable databases of several species
NDB Nucleic acid database
NEDO Human cDNA sequence database
NPD Nuclear protein database
Oomycetes DB Oomycetes database at Virginia Bioinformatics Institute
PLACE Database of plant cis-acting regulatory DNA elements
RDP Ribosomal database project
RDB Receptor database at NIHS Japan
Refseq The NCBI reference sequence project
RHdb Radiation hybrid physical map of chromosomes
SHIGAN SHared Information of GENetic resources Japan
SpliceDB Canonical and non-canonical splice site sequences
STACK Consensus human EST database
TAED The adaptive evolution database
TIGR Curated databases of microbes plants and humans
TRANSFAC The Transcription Factor Database
TRRD Transcription Regulatory region database
UniGene Cluster of sequences for unique genes at NCBI
UniSTS Nonredundant collection of STS
Protein Databases
Protein Sequence Databases – Protein Structure Databases – Protein Domains, Motifs
and Signatures – Others
Protein Sequence Databases
Antibodies Sequence and Structure
BRENDA Enzyme database
CD Antigens Database of CD antigens
dbCFC Cytokine family database
Histons Histone sequence database
HPRD Human protein reference database
InterPro Integrated documentation resources for protein families
iProClass An integrated protein classification database
KIND A non-redundant protein sequence database
MHCPEP Database of MHC binding peptides
MIPS Munich information centre for protein sequences
PIR Annotated and non-redundant protein sequence database
PIR-ALN Curated database of protein sequence alignments
PIR-NREF PIR nonredundant reference protein database
PMD Protein mutant database
PRF Protein research foundation Japan
ProClass Non-redundant protein database
ProtoMap Hierarchical classification of swissprot proteins
REBASE Restriction enzyme database
RefSeq Reference sequence database at NCBI
SwissProt Curated protein sequence database
SPTR Comprehensive protein sequence database
Transfac Transcription factor database
TrEMBL Annotated translations of EMBL nucleotide sequences
Tumor gene database Genes with cancer-causing mutations
WD repeats WD-repeat family of proteins
Protein Structure Databases
Cath Protein structure classification
HIV Protease HIV protease database 3D structure
PDB 3-D macromolecular structure data
PSI Protein structure initiative
S2F Structure to function project
Scop Structural Classification of Proteins
Protein Domains, Motifs & Signatures
BLOCKS Multiple aligned segments of conserved protein regions
CCD Conserved domain database and search service
DOMO Homologous protein domain families
Pfam Database of protein domains and HMMs
ProDom Protein domain database
Prints Protein motif fingerprint database
Prosite Database of protein families and domains
SMART Simple modular architecture research tool
TIGRFAM Protein families based on HMMs
Others
Phospho Site Database of phosphorylation sites
PROW Protein reviews on the web
Protein Lounge Complete systems biology
Other Databases
Carbohydrate Databases
Carb DB Carbohydrate Sequence and Structure Database
GlycoWord Glycoscience related information
SPECARB Raman Spectra of carbohydrates
Other Databases
AlzGene Alzheimer's disease
Polygenic pathways Alzheimer's disease, Bipolar disorder or Schizophrenia
Model Organism Databases and Resources
Arabidopsis – Bacteria – Sea Bass – Cat – Cattle – Chicken – Cotton – Cyano Bacteria – Daphnia – Deer – Dictyostelium – Dog – Frog – Fruit Fly – Fungus –
Goat – Horse – Medaka Fish – Maize – Malaria – Mosquito – Mouse – Pig – Plants –
Protozoa – Puffer Fish – Rat – Rice – Rickettsia – Salmon – Sheep – Soy –
Sorghum – Tetraodon – Tilapia – Turkey – Viruses – Worm – Yeast – Zebra Fish
General Information
GMOD Generic Model Organism Database
Model Organisms The WWW virtual library of model organisms
Arabidopsis thaliana
ABRC Arabidopsis biological resource center
AGI Arabidopsis genome initiative
AREX Arabidopsis gene expression database
Arabinet Arabidopsis information on the www
AtGDB An Arabidopsis thaliana plant genome database
AtGI TIGR Arabidopsis thaliana gene index
ATGC Genome sequencing at ATGC
ATIDB Arabidopsis insertion database
CSHL Arabidopsis genome analysis at Cold Spring
ESSA Arabidopsis thaliana project at MIPS
Genoscope AGI in France
Kazusa Arabidopsis thaliana genome info Japan
MPSS Massively parallel signature sequencing
NASC Nottingham Arabidopsis stock center
Stanford Sequencing of the Arabidopsis genome at Stanford
TAIR Arabidopsis information resource
TIGR TIGR Arabidopsis genome annotation database
Wustl Arabidopsis genome at Washington university
Trees A forest tree genome database
Bacterial genomes
B subtilis Bacillus subtilis database
Chlamydomonas Chlamydomonas genetics center
E coli E coli genome project
MGD Microbial germ plasm database
Microbial Microbial Genome Gateway
Microbial Microbial genomes
Micado Genetics maps of B subtilis and E coli
MycDB An integrated Mycobacterial database
Neisseria Neisseria meningitidis genome
Neurospora Neurospora crassa database
OralGen Oral pathogen database
Salmonella Salmonella information
STDGen Sexually transmitted disease database
Bass
Bass Sea Bass Mapping project
Cat (Felis catus)
Cat ArkDB Cat mapping database
Cattle (Bos taurus)
ARK Farm animals
BoLA Bovine MHC information
Bovin Bovine genome database
BovMap Mapping the bovine genome
CaDBase Genetic diversity in cattle
ComRad Comparative radiation hybrid mapping
Cow ArkDB Bovine ArkDB
GemQual Genetics of meat quality
Chicken (Gallus gallus)
Chicken Poultry gene mapping project
ChickMap Chicken genome project
Chicken ArkDB Chicken database
ChickEST Chick EST database
Poultry Poultry genome project
Cotton
Cotton Cotton data collection site
Cyano Bacteria (Blue green algae)
Cyano Bacteria Anabaena genome
Daphnia (Crustacea)
Daphnia pulex Daphnia genomics consortium
Deer
Deer ArkDB Deer mapping database
Dictyostelium discoideum
Dicty_cDB Dictyostelium discoideum cDNA project
DGP Dictyostelium discoideum genome project
Dictybase Online informatics resources for Dictyostelium
Dog (Canis familiaris)
Dog Dog genome project
Dog genome project
Frog (Xenopus)
Xenbase A Xenopus web resource
Xenopus Xenopus tropicalis genome
Fruit fly (Drosophila melanogaster)
ENSEMBL Drosophila Genome Browser at ENSEMBL
Fruitfly Drosophila genome project at Berkeley
FlyBase A Database of the Drosophila Genome
FlyMove A Drosophila multimedia database
FlyView A Drosophila image database
Fungus
Aspergillus Aspergillus Genomics
Candida Candida albicans information page
FungalWeb Fungi database
FGSC Fungal genetic stocks center
Goat (Capra hircus)
Goat GoatMap mapping the caprine genome
Horse (Equus caballus)
Horse ArkDB Horse mapping database
Medaka Fish
Medaka Medaka fish home page
Maize
Maize Maize genome database
Malaria (Plasmodium spp)
Malaria Malaria genetics and genomics
PlasmoDB Plasmodium falciparum genome database
Parasites Parasite databases of clustered ESTs
Parasite Genome Parasite genome databases
Mosquito
Mosquito Mosquito genome web server
Mouse (Mus musculus)
ENSEMBL Mouse genome server at ENSEMBL
Jackson Lab Mouse Resources
MRC Mouse genome center at MRC UK
MGI Mouse genome informatics at Jackson Labs
MGD Mouse genome database
MGS Mouse genome sequencing at NIH
MIT Genetic and physical maps of the mouse genome
Mouse SNP Mouse SNP database
NCI Mouse repository
NIH NIH mouse initiative
ORNL Mutant mouse database
RIKEN Mouse resources
Rodentia The whole mouse catalog
Pig (Sus scrofa)
INCO Pig trait gene mapping
Pig Pig EST database
Pig Pig gene mapping project
PiGBase Pig genome mapping
Pig ArkDB Pig Ark DB
Plants
PlantGDB Resources for plant comparative genomics
Protozoa
Protozoa Protozoan genomes
Pufferfish
Fugu Puffer fish project UK site
Fugu Fugu genome project Singapore
Fugu Puffer fish project USA
Rat (Rattus norvegicus)
MIT Genetic maps of the Rat genome
NIH Rat genomics and genetics
Rat RatMap
RGD Rat genome database
Rice (Oryza sativa)
MPSS Massively parallel signature sequencing
Rice-research Rice genome sequence database
Rice Rice genome project
Rickettsia
RicBase Rickettsia genome database
Salmon
Salmon ArkDB Salmon mapping database
Sheep (Ovis aries)
Sheep Sheep gene mapping
SheepBase Sheep gene mapping
Sheep ArkDB Sheep mapping database
Soy
Soy Soybeans database
Sorghum
Sorghum Sorghum Genomics
Tetraodon
Tetraodon Tetraodon nigroviridis genome
Tetraodon Tetraodon nigroviridis genome at Whitehead
Tilapia
HCGS Tilapia genome
Tilapia ArkDB Tilapia mapping database
Turkey
Turkey ArkDB Turkey mapping database
Viruses
HIV HIV sequence database
Herpes Human herpes virus 5 database
Worm (Caenorhabditis elegans)
C elegans C elegans genome sequencing project
NemBase Resource for nematode sequence and functional data
WormAtlas Anatomy of C elegans
WormBase The Genome and biology of C elegans
ACEDB A C elegans database
WWW Server C elegans web server
Yeast
SCPD The promoter database of Saccharomyces cerevisiae
SGD Saccharomyces genome database
S Pombe Schizosaccharomyces pombe genome project
TRIPLES Functional analysis of Yeast genome at Yale
Yeast Intron database Spliceosomal introns of the yeast
Zebra fish (Danio rerio)
ZFIN Zebrafish information network
ZGR Zebrafish genome resources
ZIS Zebrafish information server
Zebrafish Zebrafish webserver
DOMAIN DATABASE
Domains can be thought of as distinct functional and/or structural units of a
protein. These two classifications coincide rather often: what is
found as an independently folding unit of a polypeptide chain often also carries a specific
function. Domains are often identified as recurring (sequence or structure) units,
which may exist in various contexts. In molecular evolution such domains may have
been utilized as building blocks, and may have been recombined in different
arrangements to modulate protein function. We can define conserved domains as
recurring units in molecular evolution, the extents of which can be determined by
sequence and structure analysis. Conserved domains contain conserved
sequence patterns or motifs, which allow for their detection in polypeptide sequences.
The goal of the NCBI conserved domain curation project is to provide database users with
insights into how patterns of residue conservation and divergence in a family relate to functional
properties, and to provide useful links to more detailed information that may help to understand
those sequence/structure/function relationships. To do this, CDD Curators include the following
types of information in order to supplement and enrich the traditional multiple sequence
alignments that form the foundation of domain models: 3-dimensional structures and conserved
core motifs, conserved features/sites, phylogenetic organization, and links to electronic literature
resources.
Conserved Domain Database (CDD)
CDD is a protein annotation resource that consists of a collection of well-annotated multiple sequence alignment models for ancient domains and full-length proteins. These are available as position-specific score matrices (PSSMs) for fast identification of conserved domains in protein sequences via RPS-BLAST. CDD content includes NCBI manually curated domains, which use 3D-structure information to explicitly define domain boundaries and provide insights into sequence/structure/function relationships, as well as domain models imported from a number of external source databases (Pfam, SMART, COG, PRK, TIGRFAM).
CD-Search & Batch CD-Search
CD-Search is NCBI's interface to searching the Conserved Domain Database with protein or nucleotide query sequences. It uses RPS-BLAST, a variant of PSI-BLAST, to quickly scan a set of pre-calculated position-specific scoring matrices (PSSMs) with a protein query. The results of CD-Search are presented as an annotation of protein domains on the user query sequence and can be visualized as domain multiple sequence alignments with embedded user queries. High confidence associations between a query sequence and conserved domains are shown as specific hits. The CD-Search Help provides additional details, including information about running CD-Search locally.
Batch CD-Search serves as both a web application and a script interface for a conserved domain search on multiple protein sequences, accepting up to 100,000 proteins in a single job. It enables you to view a graphical display of the concise or full search result for any individual protein from your input list, or to download the results for the complete set of proteins. The Batch CD-Search Help provides additional details.
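To make the idea of scanning a query with a position-specific scoring matrix concrete, here is a deliberately simplified sketch. It is not RPS-BLAST: it scores only ungapped windows, and the function name `best_pssm_hit`, the default penalty of -4 and the three-column toy matrix are illustrative assumptions rather than values from any real CDD model.

```python
def best_pssm_hit(sequence, pssm, default=-4):
    """Slide a PSSM along a sequence; return (best_score, start) or None.

    pssm is a list with one dict per model position, mapping residue -> score.
    Residues missing from a column receive the assumed `default` penalty.
    """
    width = len(pssm)
    if len(sequence) < width:
        return None  # query shorter than the model: no ungapped window fits
    best = None
    for start in range(len(sequence) - width + 1):
        # Sum the position-specific score of each residue in this window.
        score = sum(pssm[i].get(sequence[start + i], default) for i in range(width))
        if best is None or score > best[0]:
            best = (score, start)
    return best


# Toy three-column matrix (hypothetical scores).
toy_pssm = [
    {"G": 5, "A": 1},
    {"K": 4, "R": 3},
    {"S": 4, "T": 4},
]
hit = best_pssm_hit("MAGKTLL", toy_pssm)
print(hit)  # (13, 2): the window "GKT" starting at index 2
```

The point of the position-specific matrix is that the same residue can score differently at different model positions, which a single substitution matrix cannot express.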
CDART: Domain Architectures
Conserved Domain Architecture Retrieval Tool (CDART) performs similarity searches of the Entrez Protein database based on domain architecture, defined as the sequential order of conserved domains in protein queries. CDART finds protein similarities across significant evolutionary distances using sensitive domain profiles rather than direct sequence similarity. Proteins similar to the query are grouped and scored by architecture. You can search CDART directly with a query protein sequence or, if a sequence of interest is already in the Entrez Protein database, simply retrieve the record, open its Links menu and select Domain Relatives to see the precalculated CDART results. Relying on domain profiles allows CDART to be fast and, because it relies on annotated functional domains, informative.
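The notion of a domain architecture as the sequential order of domains along a protein can be illustrated with a short sketch. This is not CDART's algorithm (CDART also scores the groups, which is omitted here); the function names, protein identifiers and domain hits below are hypothetical.

```python
from collections import defaultdict


def architecture(domain_hits):
    """Return domain names ordered by start position along the protein.

    domain_hits is a list of (start, domain_name) pairs in any order.
    """
    return tuple(name for _, name in sorted(domain_hits))


def group_by_architecture(proteins):
    """Group protein identifiers that share the same ordered domain list."""
    groups = defaultdict(list)
    for protein_id, hits in proteins.items():
        groups[architecture(hits)].append(protein_id)
    return dict(groups)


# Toy input: the same two domains, in different orders.
toy_proteins = {
    "protA": [(120, "Kinase"), (10, "SH3")],
    "protB": [(5, "SH3"), (90, "Kinase")],
    "protC": [(15, "Kinase"), (200, "SH3")],
}
groups = group_by_architecture(toy_proteins)
for arch, members in groups.items():
    print(" - ".join(arch), members)
```

Here protA and protB fall into one group (SH3 followed by Kinase) while protC, which carries the same domains in the opposite order, forms its own group.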
CDTree: CDTree is a helper application for your web browser that allows you to interactively view and examine conserved domain hierarchies curated at NCBI. CDTree works with Cn3D as its alignment viewer/editor; it is used in the CDD curation process and is both a classification and a research tool for functional annotation and the study of protein and protein domain families.
Content
CDD content includes NCBI manually curated domain models and domain models imported from
a number of external source databases (Pfam, SMART, COG, PRK, TIGRFAMs). What is unique
about NCBI-curated domains is that they use 3D-structure information to explicitly define domain
boundaries, align blocks, amend alignment details and provide insights into
sequence/structure/function relationships. Manually curated models are organized hierarchically if
they describe domain families that are clearly related by common descent. To provide a
non-redundant view of the data, CDD clusters similar domain models from various sources into
superfamilies.
Searching the database
The collection is also part of NCBI's Entrez query and retrieval system, crosslinked to numerous other resources. CDD provides annotation of domain footprints and conserved functional sites on protein sequences. Precalculated domain annotation can be retrieved for protein sequences tracked in NCBI's Entrez system, and CDD's collection of models can be queried with novel protein sequences via the CD-Search service, or via the Batch CD-Search service, which allows the computation and download of annotation for large sets of protein queries.
CDD also contains data from additional research projects, such as KOGs (a eukaryotic counterpart to COGs) and the Library of Ancient Domains (LOAD). Accessions that start with cl are for superfamily cluster records and can contain domain models from one or more source databases. When searching CDD, it is possible to limit search results to domains from any given source database by using the Database Search Field.
Phylogenetic organization: Based on evidence from sequence comparison, NCBI Conserved Domain Curators attempt to organize related domain models into phylogenetic family hierarchies.
Links to electronic literature resources: NCBI curated domains also provide links to citations in PubMed and NCBI Bookshelf that discuss the domain. These references are selected by curators and, whenever possible, include articles that provide evidence for the biological function of the domain and/or discuss the evolution and classification of a domain family.
PFAM
Pfam 27.0 (March 2013, 14831 families)
Proteins are generally comprised of one or more functional regions, commonly termed domains. The presence of different domains in varying combinations in different proteins gives rise to the diverse repertoire of proteins found in nature. Identifying the domains present in a protein can provide insights into the function of that protein.
The Pfam database is a large collection of protein domain families. Each family is represented by multiple sequence alignments and hidden Markov models (HMMs).
There are two levels of quality to Pfam families: Pfam-A and Pfam-B. Pfam-A entries are derived from the underlying sequence database, known as Pfamseq, which is built from the most recent release of UniProtKB at a given time-point. Each Pfam-A family consists of a curated seed alignment containing a small set of representative members of the family, profile hidden Markov models (profile HMMs) built from the seed alignment, and an automatically generated full alignment, which contains all detectable protein sequences belonging to the family, as defined by profile HMM searches of primary sequence databases.
Pfam-B families are un-annotated and of lower quality, as they are generated automatically from the non-redundant clusters of the latest ADDA release. Although of lower quality, Pfam-B families can be useful for identifying functionally conserved regions when no Pfam-A entries are found.
Pfam entries are classified in one of four ways:
Family: A collection of related protein regions
Domain: A structural unit
Repeat: A short unit which is unstable in isolation but forms a stable structure when multiple copies are present
Motifs: A short unit found outside globular domains
Related Pfam entries are grouped together into clans; the relationship may be defined by similarity of sequence, structure or profile-HMM.
Pfam also generates higher-level groupings of related families, known
as clans. A clan is a collection of Pfam-A entries which are related by
similarity of sequence, structure or profile-HMM.
You can find data in Pfam in various ways:
Sequence search: Analyze your protein sequence for Pfam matches
View a Pfam family: View Pfam family annotation and alignments
View a clan: See groups of related families
View a sequence: Look at the domain organisation of a protein sequence
View a structure: Find the domains on a PDB structure
Keyword search: Query Pfam by keywords
Jump to: Enter any type of accession or ID to jump to the page for a Pfam family or clan, UniProt sequence, PDB structure, etc.
Browse Pfam: Lists are available of families, clans or proteomes which begin with a chosen letter (or number), of Pfam families which are new to this release, and of the twenty largest families in terms of number of sequences.
Glossary of terms used in Pfam
These are some of the commonly used terms in the Pfam website
Alignment coordinates
HMMER3 reports two sets of domain coordinates for each profile HMM match. The envelope coordinates delineate the region on the sequence where the match has been probabilistically determined to lie, whereas the alignment coordinates delineate the region over which HMMER is confident that the alignment of the sequence to the profile HMM is correct. Our full alignments contain the envelope coordinates from HMMER3.
Architecture
The collection of domains that are present on a protein
Clan
A collection of related Pfam entries. The relationship may be defined by similarity of sequence, structure or profile-HMM.
Domain
A structural unit
Domain score
The score of a single domain aligned to an HMM. Note that, for HMMER2, if
there was more than one domain, the sequence score was the sum of all the domain scores for that Pfam entry. This is not quite true for HMMER3.
DUF
Domain of unknown function
Envelope coordinates
See Alignment coordinates
Family
A collection of related protein regions
Full alignment
An alignment of the set of related sequences which score higher than the manually set threshold values for the HMMs of a particular Pfam entry.
Gathering threshold (GA)
Also called the gathering cut-off, this value is the search threshold used to build the full alignment. The gathering threshold is assigned by a curator when the family is built. The GA is the minimum score a sequence must attain in order to belong to the full alignment of a Pfam entry. For each Pfam HMM we have two GA cutoff values, a sequence cutoff and a domain cutoff.
HMMER
The suite of programs that Pfam uses to build and search HMMs. Since Pfam release 24.0 we have used HMMER version 3 to make Pfam. See the HMMER site for more information.
Hidden Markov model (HMM)
An HMM is a probabilistic model. In Pfam we use HMMs to transform the information contained within a multiple sequence alignment into a position-specific scoring system. We search our HMMs against the UniProt protein database to find homologous sequences.
HMMER3
The suite of programs that Pfam uses to build and search HMMs. See the HMMER site for more information.
iPfam
A resource that describes domain-domain interactions that are observed in PDB entries. Where two or more Pfam domains occur in a single structure, it analyses them to see if they are close enough to form an interaction. If they are close enough, it calculates the bonds forming the interaction.
Metaseq
A collection of sequences derived from various metagenomics datasets
Motif
A short unit found outside globular domains
Noise cutoff (NC)
The bit scores of the highest scoring match not in the full alignment
Pfam-A
An HMM-based, hand-curated Pfam entry which is built using a small number of
representative sequences. We manually set a threshold value for each HMM and search our models against the UniProt database. All of the sequences which score above the threshold for a Pfam entry are included in the entry's full alignment.
Pfam-B
An automatically generated alignment which is formed by taking a cluster of sequences from the ADDA database and removing Pfam-A residues from them. Since Pfam-B families are automatically generated, we recommend that you verify that the sequences in a Pfam-B family are related using other methods, such as BLAST. For Pfam 24.0 we have made HMMs for the first (and therefore largest) 20,000 Pfam-B families. Users can search their sequences against the Pfam-B HMMs in addition to the Pfam-A HMMs when performing both single-sequence searches and batch searches on the website.
Posterior probability
HMMER3 reports a posterior probability for each residue that matches a match or insert state in the profile HMM. A high posterior probability shows that the alignment of the amino acid to the match/insert state is likely to be correct, whereas a low posterior probability indicates that there is alignment uncertainty. This is indicated on a scale running from 10, the highest certainty, down to 1, complete uncertainty. Within Pfam we display this information as a heat map view, where green residues indicate high posterior probability and red ones indicate a lower posterior probability.
Repeat
A short unit which is unstable in isolation but forms a stable structure when
multiple copies are present
Seed alignment
An alignment of a set of representative sequences for a Pfam entry. We use this alignment to construct the HMMs for the Pfam entry.
Sequence score
The total score of a sequence aligned to an HMM. If there is more than one domain, the sequence score is the sum of all the domain scores for that Pfam entry. If there is only a single domain, the sequence score and the domain score for the protein will be identical. We use the sequence score to determine whether a sequence belongs to the full alignment of a particular Pfam entry.
Trusted cutoff (TC)
The bit scores of the lowest scoring match in the full alignment
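The sequence score, trusted cutoff and noise cutoff entries above fit together in a simple way, which the following minimal Python sketch tries to illustrate. The function names, the example scores and the use of "greater than or equal to" for the threshold test are assumptions made for illustration only; this is not Pfam's own code.

```python
def sequence_score(domain_scores):
    """Sequence score as described above: the sum of the domain scores."""
    return sum(domain_scores)


def cutoffs(sequence_scores, threshold):
    """Split scores by a manually chosen threshold and derive TC and NC.

    TC (trusted cutoff): lowest score that is in the full alignment.
    NC (noise cutoff): highest score that is not in the full alignment.
    Using >= for membership is an assumption of this sketch.
    """
    in_full = [s for s in sequence_scores if s >= threshold]
    not_in_full = [s for s in sequence_scores if s < threshold]
    tc = min(in_full) if in_full else None
    nc = max(not_in_full) if not_in_full else None
    return tc, nc


# Hypothetical bit scores for four sequences searched with one HMM.
scores = [sequence_score([18.0, 12.0]), 25.5, 19.0, 12.0]
tc, nc = cutoffs(scores, threshold=20.0)
```

With these made-up numbers the threshold of 20.0 sits between the noise cutoff and the trusted cutoff, which is the arrangement the glossary definitions imply.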
SMART (a Simple Modular Architecture Research Tool) allows the identification and annotation of genetically mobile domains and the analysis of domain architectures. More than 500 domain families found in signalling, extracellular and chromatin-associated proteins are detectable. These domains are extensively annotated with respect to phyletic distributions, functional class, tertiary structures and functionally important residues. Each domain found in a non-redundant protein database, as well as search parameters and taxonomic information, is stored in a relational database system. User interfaces to this database allow searches for proteins containing specific combinations of domains in defined taxa.
Features
You can use SMART in two different modes: normal or genomic. The main difference is in the underlying protein database used. In Normal SMART, the database contains Swiss-Prot, SP-TrEMBL and stable Ensembl proteomes. In Genomic SMART, only the proteomes of completely sequenced genomes are used: Ensembl for metazoans and Swiss-Prot for the rest.
The protein database in Normal SMART has significant redundancy, even though identical proteins are removed. If you use SMART to explore domain architectures, or want to find exact domain counts in various genomes, consider switching to Genomic mode. The numbers in the domain annotation pages will be more accurate, and there will not be many protein fragments corresponding to the same gene in the architecture query results. Remember, though, that you are then exploring a limited set of genomes.
Different color schemes are used to easily identify the mode we are in
Normal mode Genomic mode
Alignment
Representation of a prediction of the amino acids in tertiary structures of homologues that overlay in three dimensions. Alignments held by SMART are mostly based on published observations (see domain annotations for details), but are updated and edited manually.
Alignment block
Ungapped alignments that usually represent a single secondary structure
Bits scores
Alignment scores are reported by HMMer and BLAST as bits scores. The likelihood that the query sequence is a bona fide homologue of the database sequence is compared to the likelihood that the sequence was instead generated by a "random" model. Taking the logarithm (to base 2) of this likelihood ratio gives the bits score.
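The definition above is a log-odds ratio, and a short Python sketch may make it concrete. The function name and the example likelihoods are hypothetical; real tools work with log-probabilities internally rather than raw likelihoods.

```python
import math


def bits_score(likelihood_homologue, likelihood_random):
    """Base-2 logarithm of the likelihood ratio described above."""
    return math.log2(likelihood_homologue / likelihood_random)


# Hypothetical likelihoods: the homology model explains the sequence
# 1024 times better than the random model, which gives 10 bits.
score = bits_score(2.0 ** -10, 2.0 ** -20)
```

Each additional bit therefore means the homology model is twice as favoured over the random model, and a score of 0 bits means the two models explain the sequence equally well.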
BLAST Basic local alignment search tool
An excellent database searching tool developed at the National Center for Biotechnology Information (NCBI) ([1] [2] [3] [4]). SMART uses NCBI-BLAST for detection of outlier homologues and homologues of known structure. WU-BLAST is used for nrdb searches with user-supplied sequences.
Cellular Role
Chromatin-associated
Chromatin is the tangled fibrous complex of DNA and protein within a eukaryotic nucleus.
Interaction (with the environment)
Molecules that sense cellular environmental change, such as osmolarity, light flux, acidity, ion concentration, etc.
Metabolic
Enzymes that catalyze reactions in living cells that transform organic molecules.
Replication
The process of making an identical copy of a section of duplex (double-stranded) DNA, using existing DNA as a template for the synthesis of new DNA strands.
Signalling
Proteins that participate in a pathway that is initiated by a molecular stimulus and terminated by a cellular response.
Transport
Transporters (permeases) are proteins that straddle the cell membrane and carry specific nutrients, ions, etc. across the membrane.
Translation
The process in which the genetic code carried by messenger RNA directs the synthesis of proteins from amino acids.
Transcription
The synthesis of an RNA copy from a sequence of DNA (a gene); the first step in gene expression.
Coiled coils
Intimately associated bundles of long alpha-helices ([1] [2] [3]). Coiled coils are detected in SMART using the method of Lupas et al ([4], COILS home). Coiled-coil predictions are indicated on the second line in SMART's graphical output.
Domain
Conserved structural entities with distinctive secondary structure content and a hydrophobic core. In small disulphide-rich and Zn2+-binding or Ca2+-binding domains, the hydrophobic core may be provided by cystines and metal ions, respectively. Homologous domains with common functions usually show sequence similarities.
Domain composition
Proteins with the same domain composition have at least one copy of each of the domains of the query.
Domain organisation
Proteins having all the domains of the query in the same order (additional domains are allowed).
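The difference between domain composition and domain organisation can be shown with a small Python sketch. It assumes each protein is represented as a list of domain names in N-to-C order; the function names and the example architectures are hypothetical and are not SMART's implementation.

```python
def same_composition(query, target):
    """True if target has at least one copy of each query domain."""
    return set(query) <= set(target)


def same_organisation(query, target):
    """True if the query domains occur in target in the same order.

    Additional domains in target are allowed, so this is a
    subsequence test: each 'in' consumes the iterator up to the match.
    """
    remaining = iter(target)
    return all(domain in remaining for domain in query)


query = ["SH3", "SH2"]
shuffled = ["SH2", "TyrKc", "SH3"]        # same domains, different order
extended = ["SH3", "PH", "SH2", "TyrKc"]  # same order, extra domains
```

Here `shuffled` matches the query by composition only, while `extended` matches by both composition and organisation.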
Entrez
A WWW-based system that allows easy retrieval of sequence, structure, molecular biology and literature data (Entrez). SMART's domain annotation pages contain links to the Entrez system, thereby providing extensive literature, structure and sequence information.
E-value
This represents the number of sequences with a score greater than or equal to X that are expected purely by chance. The E-value connects the score (X) of an alignment between a user-supplied sequence and a database sequence, generated by any algorithm, with the number of alignments with similar or greater scores that would be expected from a search of a random sequence database of equivalent size. Since version 2.0, E-values are calculated using Hidden Markov Models, leading to more accurate estimates than before.
Extracellular Domains
Domain families that are most prevalent in proteins outside of the cytoplasm and the nucleus
Gap
A position in an alignment that represents a deletion within one sequence relative to another. Gap penalties are required by alignment algorithms in order to reduce excessively gapped regions. Gaps in alignments represent insertions that usually occur in protruding loops or beta-bulges within protein structures.
Genomic database
Protein database used in SMART's Genomic mode. It contains data from completely sequenced genomes only. Ensembl data is used for metazoan genomes and Swiss-Prot for others. A complete list of genomes in the database is available.
HMM Hidden Markov model
HMMs are statistical models of the sequence consensus of a homologous family (see the docu). A particular class of HMMs has been shown to be equivalent to generalised profiles (8867839). Applications of HMMs to sequence analysis are provided by HMMer and SAM.
HMM consensus
The HMM consensus is a one-line summary of the corresponding HMM. The amino acid shown for the consensus is the highest-probability amino acid at that position according to the HMM. Capital letters mean highly conserved residues (probability > 0.5 for protein models) (modified from the HMMer User's Guide).
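The consensus rule above can be sketched in a few lines of Python. The sketch assumes each model position is given as a dictionary of amino acid probabilities; that representation, the function name and the example values are assumptions for illustration and do not reflect how HMMer stores a model.

```python
def hmm_consensus(position_probabilities, threshold=0.5):
    """Build a one-line consensus from per-position probabilities.

    The most probable amino acid is shown at each position, in upper
    case if its probability exceeds the threshold (0.5 for protein
    models, as stated above) and in lower case otherwise.
    """
    letters = []
    for probabilities in position_probabilities:
        residue, probability = max(probabilities.items(), key=lambda item: item[1])
        letters.append(residue.upper() if probability > threshold else residue.lower())
    return "".join(letters)


# Hypothetical two-position model: a well conserved alanine followed by
# a position where leucine is only the most likely of several residues.
model = [{"A": 0.7, "G": 0.3}, {"L": 0.4, "I": 0.35, "V": 0.25}]
consensus = hmm_consensus(model)
```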
HMMer
The HMMer package ([1] [2]) provides multiple alignment and database searching capabilities. There are several programs in the package (see the docu), including one (hmmfs) that searches databases for non-overlapping LOCAL similarities (i.e. that match across at least part of the HMM) and another (hmmls) that searches databases for non-overlapping GLOBAL similarities (i.e. that match across the full HMM). These correspond approximately to profile-based searches using negative and positive profiles, respectively (see WiseTools). Database searches using hmmls or hmmfs provide alignment scores as bits scores.
Homology
Evolutionary descent from a common ancestor due to gene duplication
Intracellular Domains
Domain families that are most prevalent in proteins within the cytoplasm
Localisation
Numbers of domains that are thought, from SwissProt annotations, to be present in different cellular compartments (cytoplasm, extracellular space, nucleus, and membrane-associated) are shown in annotation pages.
Motif
Sequence motifs are short conserved regions of polypeptides. Sets of sequence motifs need not necessarily represent homologues.
NRDB non-redundant database
A database that contains no identical pairs of sequences. It can contain multiple sequences originating from the same gene (fragments, alternative splicing products). SMART in Standard mode uses such a database.
ORF
Open reading frame
Outlier homologues
These are often difficult to detect using HMM methodology. A complementary approach to their detection is to query a database of sequences taken from multiple sequence alignments using BLAST. Selecting this option will also activate searches against sequence databases derived from proteins of known structure. A simple BLAST search of the PDB is performed, together with a search of RPS_Blast profiles derived from SCOP. These profiles were kindly provided by Steffen Schmidt (see Schmidt et al, J Chem Inf Comput Sci 2002 (42) 405-7).
P-value
This represents the probability that, given a database of a particular size, random sequences score higher than a value X. P-values are generated by the BLAST algorithm that has been integrated into SMART.
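The P-value entry above and the E-value entry earlier describe related quantities. A commonly used conversion, which is not stated in these notes and is added here as an assumption, treats the number of chance hits scoring at least X as Poisson-distributed with mean E, so that P = 1 - exp(-E). The function name below is hypothetical.

```python
import math


def p_value_from_e_value(e_value):
    """Probability of at least one chance hit, given the expected count.

    Assumes chance hits follow a Poisson distribution with mean
    e_value. expm1 keeps precision when e_value is very small.
    """
    return -math.expm1(-e_value)


# For small E-values the two numbers are nearly identical; they only
# diverge once about one chance hit or more is expected.
small = p_value_from_e_value(1e-6)
large = p_value_from_e_value(1.0)
```

Under this assumption an E-value of 1 corresponds to a P-value of roughly 0.63, which is why E-values rather than P-values are usually quoted for weak hits.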
PDB protein data bank
PDB is an archive of experimentally determined three-dimensional structures (Brookhaven National Laboratory or EBI). Domain families represented in SMART and in the PDB are annotated as being of known structure; links are provided in SMART to the PDB via PDBsum and MMDB. PDBsum links can be used to access a variety of sequence-based and structure-based tools, whereas MMDB provides access to literature information and structure similarities.
PFAM
Pfam is a database of protein domain families represented as (i) multiple alignments and (ii) HMM-profiles ([1] [2]). Pfam WWW servers allow comparison of user-supplied sequences with the Pfam database (Sanger Center and Washington Univ). SMART contains a facility to search the Pfam collection using HMMer.
Prokaryotic domains
SMART now also searches for domains found in two-component regulatory systems. These can be found mainly in prokaryotes, but a few were also found in eukaryotes like yeast and plants.
Profile
A profile is a table of position-specific scores and gap penalties, representing a homologous family, that may be used to search sequence databases (Ref [1] [2] [3]). In CLUSTAL-W-derived profiles, those sequences that are more distantly related are assigned higher weights ([4] [5] [6]). Issues in profile-based database searching are discussed in Bork & Gibson (1996) [7].
ProfileScan
An excellent WWW server that allows a user to compare a protein or DNA sequence against a database of profiles (located at the ISREC).
PROSITE
This is a dictionary of protein sites and motif patterns. Some SMART domain annotations contain links to PROSITE.
Schnipsel database domain sequence database
Schnipsel is a German word meaning snippet or fragment. The schnipsel database consists of the sequences of all domains found with SMART in NRDB. Outliers of a family often cannot be detected by a profile, yet are detectable by pairwise similarity to one or more established members of a sequence family. So searching against the schnipsel database gives complementary information to the profile searches.
Searched domains
In the first version of SMART, only eukaryotic signalling domains could be searched. In 1998 we extended this set with prokaryotic signalling and extracellular domains. On the input page you can choose whether you want to search only cytoplasmic (prokaryotic + eukaryotic), only extracellular, or all domains.
Secondary Literature
The secondary literature is derived by the following procedure: for each of the hand-selected papers referenced by a domain, 100 neighbouring papers are retrieved using Medline. If one of these neighbouring papers is referenced from more than two original papers, it is included in the secondary literature list.
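The counting step of the procedure above can be sketched in Python. The Medline neighbour lookup itself is not reproduced: the sketch assumes the neighbour lists have already been retrieved and are passed in as a dictionary, and the function name and paper identifiers are hypothetical.

```python
from collections import Counter


def secondary_literature(neighbours_by_paper, max_neighbours=100, more_than=2):
    """Select neighbouring papers shared by several original papers.

    neighbours_by_paper maps each hand-selected paper to its list of
    neighbouring papers (assumed to be pre-fetched). A neighbour is
    kept if it is a neighbour of more than `more_than` original papers.
    """
    counts = Counter()
    for neighbours in neighbours_by_paper.values():
        # A set ensures each original paper votes once per neighbour.
        counts.update(set(neighbours[:max_neighbours]))
    return sorted(paper for paper, n in counts.items() if n > more_than)


# Hypothetical identifiers: "a" neighbours three original papers,
# "b" only two, "c" only one.
example = {"p1": ["a", "b"], "p2": ["a", "c"], "p3": ["a", "b"]}
selected = secondary_literature(example)
```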
Seed Alignment
Alignment that contains only one of each pair of homologues that are represented in a CLUSTALW-derived phylogenetic tree linked by a branch of length less than a distance of 0.2 (see the related article).
SEG
A program of Wootton & Federhen [1] that detects regions of the query sequence that have low compositional complexity [2].
Sequence ID or ACC
Sequence identifiers or accession codes may be entered via the SMART homepage to initiate a query. You can use either Uniprot or Ensembl sequence identifiers.
Signalling Domains
The original set of domains used in SMART was collected as those that satisfied one or both of two criteria:
1. cytoplasmic domains that possess kinase, phosphatase, ubiquitin ligase or phospholipase enzymatic activities, or those that stimulate GTPase activation or guanine nucleotide exchange
2. cytoplasmic domains that occur in at least two proteins with different domain organisations, of which one also contains a domain that satisfies criterion 1
These domains mostly mediate or regulate the transduction of an extracellular signal towards the nucleus, resulting in the initiation of a cellular response. More recently, prokaryotic two-component signalling domains have been added to the SMART set.
SignalP
This program predicts the presence and location of signal peptide cleavage sites in amino acid sequences (SignalP home page).
SMART Simple Modular Architecture Research Tool
Our main goal in providing this tool is to allow automatic identification and annotation of domains in user-supplied protein sequences.
Species
Numbers of domains present in a variety of selected taxa (animal, archaea, bacteria, fungi, plants and protozoa) are shown in annotation pages.
SwissProt
The SwissProt database is an extensively annotated and non-redundant collection of protein sequences. SwissProt annotations have been mined for SMART-derived annotations of alignments.
TMHMM2
This program predicts the location and topology of transmembrane helices (TMHMM2)
Thresholds
For each of the domains found by SMART, a combination of thresholds is used to distinguish between true and false hits. The different thresholds are described in the SMART paper.
WiseTools
A package that is based on database searches using profiles. Profiles may be generated using PairWise and then compared with sequence databases using SearchWise. Scores are generated for alignments that match the whole of the profile (using a positive profile) or else that match at least part of the profile (using a negative profile). Only the top-scoring optimal alignment of each sequence is reported; hence SMART relies on re-iterating the search for new repeats until none are reported that score above threshold. Score thresholds have been set manually that are considered to represent a score just above the top-scoring true negative. Additional thresholds have been estimated for domains that are repeated in single polypeptides. A more recent package allows comparison of DNA sequences at the level of their conceptual translations, regardless of sequence error and introns (see Wise2).
What's new
Changes from version 6.0 to 7.0
Full text search engine
Perform a full text search of SMART and Pfam domain annotations, plus the complete protein descriptions for Uniprot and Ensembl proteins.
metaSMART
Explore and compare domain architectures in various publicly available metagenomics datasets.
iTOL export and visualization
Domain architecture analysis results can be exported and visualized in interactive Tree Of Life. The new option can be found in the protein list function select list.
User interface cleanup
Various small changes to the UI, resulting in faster and easier navigation.
Changes from version 5.1 to 6.0
Metabolic pathways information
SMART domains and proteins available in the genomic mode now have basic metabolic pathways information. It is generated by mapping our genomic mode protein database to the KEGG orthologous groups.
Updated genomic mode protein database
The greatly expanded genomic mode database now includes sequences from 630 completely sequenced species.
Changes from version 5.0 to 5.1
SMART webservice
You can access SMART using our webservice. Check the WSDL file for details. The webservice is still under active development and only works for sequences and IDs which are in our database (complete Uniref100 and stable Ensembl genomes).
SMART DAS server
The SMART DAS server is available at URL http://smart.embl.de/smart/das. It provides all the protein features from our database (SMART domains, signal peptides, transmembrane regions and coiled coils) for all Uniref100 and Ensembl proteins.
If you need help with these services, or have questions or feedback, please contact us.
Changes from version 4.1 to 5.0
New protein database in Normal mode
SMART now uses Uniprot as the main source of protein sequences. All Ensembl proteomes (except pre-releases) are also included. To lower redundancy in the database, the following procedure is used:
o only one copy of 100% identical proteins is kept (different IDs are still available)
o each species' proteins are separated
o CD-HIT clustering with a 96% identity cutoff is performed on each species separately
o the longest member of each protein cluster is used as the representative
o only representative cluster members and single proteins (i.e. proteins which are not members of any cluster) are used in all domain architecture queries and for domain counts in the annotation pages
Even though the number of proteins in the database has almost doubled (the current version has around 2.9 million proteins), the redundancy should be minimal.
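Two steps of the redundancy procedure above, collapsing identical proteins while keeping their IDs and picking the longest member of each cluster, can be sketched in Python. The CD-HIT clustering itself is not reproduced; the sketch assumes clusters are supplied as lists of protein IDs, and all names and sequences are hypothetical.

```python
from collections import defaultdict


def collapse_identical(proteins):
    """Group IDs by sequence so one copy is kept and all IDs survive.

    proteins maps a protein ID to its sequence.
    """
    ids_by_sequence = defaultdict(list)
    for protein_id, sequence in proteins.items():
        ids_by_sequence[sequence].append(protein_id)
    return dict(ids_by_sequence)


def pick_representatives(clusters, proteins):
    """Return the longest member of each cluster.

    clusters is assumed to come from an external clustering step run
    per species; a single protein is a cluster of size one. On ties,
    the first member listed wins (a choice made for this sketch).
    """
    return [max(cluster, key=lambda pid: len(proteins[pid])) for cluster in clusters]


proteins = {"P1": "MKT", "P2": "MKT", "P3": "MKTAYIAK", "P4": "MKTAYI"}
identical = collapse_identical(proteins)
representatives = pick_representatives([["P3", "P4"], ["P1"]], proteins)
```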
Domain architecture invention dating
As a further step from the single domain to the understanding of multi-domain proteins, SMART now predicts the taxonomic class where the concept of a protein, that is, its domain architecture, was invented. The domain architecture is defined as the linear order of all SMART domains in the protein sequence. To derive the point of its invention, all proteins with the same domain architecture are mapped onto NCBI's taxonomy. The last common ancestor of all organisms containing at least one protein with the domain architecture is defined as the point of its origin.
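The last-common-ancestor step described above can be sketched in Python. The sketch assumes each organism's taxonomy is available as a root-to-leaf list of names; the function name and the simplified lineages are hypothetical and are not taken from the NCBI taxonomy files.

```python
def last_common_ancestor(lineages):
    """Deepest taxon shared by every lineage, or None if there is none.

    Each lineage is a list of taxon names ordered from the root down.
    """
    ancestor = None
    for taxa_at_depth in zip(*lineages):
        if len(set(taxa_at_depth)) != 1:
            break
        ancestor = taxa_at_depth[0]
    return ancestor


# Hypothetical, shortened lineages of three organisms that all contain
# a protein with the same domain architecture.
lineages = [
    ["cellular organisms", "Eukaryota", "Metazoa", "Chordata"],
    ["cellular organisms", "Eukaryota", "Metazoa", "Arthropoda"],
    ["cellular organisms", "Eukaryota", "Fungi"],
]
origin = last_common_ancestor(lineages)
```

In this example the architecture would be dated to Eukaryota, since that is the deepest node shared by the animal and fungal lineages.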
Changes from version 4.0 to 4.1
Two modes of operation: Normal and Genomic
For more details, visit the change mode page.
Intrinsic protein disorder prediction
You can now include DisEMBL prediction of protein disorder in your search parameters. The results are displayed as blue regions in protein schematics. DisEMBL's HOTLOOPS and REM465 methods are currently used. Visit the DisEMBL page for more information on the method.
Catalytic activity check
SMART now includes data on specific requirements for catalytic activity for some of our domains (50 catalytic domains at the moment; full list here). If the required amino acids are not present in the predicted domain, it will be marked as "Inactive". The domain annotation page will show you the details on which amino acids are missing and the links to relevant literature (Pubmed link).
Taxonomic trees
Architecture query results are now displayed as simple taxonomic trees. In addition to individual proteins, you can select any taxonomic node (or multiple nodes) and display all the proteins in those nodes. The Evolution section of domain annotation pages is now also represented as a taxonomic tree. The basic tree contains only several hand-picked representative species, with a link to the full tree.
User interface redesigned
SMART has been completely rewritten, and all pages conform to the XHTML 1.0 Strict and CSS Level 2 standards. We recommend a modern standards-compliant browser for the best experience. Mozilla Firefox and Opera are our favorites.
Changes from version 3.5 to 4.0
Intron positions are shown in protein schematics
For proteins that match any of the Ensembl predictions, SMART will show intron positions as vertical coloured lines in graphical representations (see example). This information is retrieved from a pre-calculated mapping of Ensembl gene structures to protein sequences.
The vertical line at the end of the protein is not an actual intron, but a mark to show that intron mapping was performed. If that is the only line, there are no introns annotated. If there is no line at all, there is no data available in Ensembl for that particular sequence.
You can switch off intron display on your SMART preferences page.
Alternative splicing information
Since SMART now incorporates Ensembl genomes, the Additional information page shows a list of alternative splices of the gene encoding the analyzed protein (if there are any). It is possible either to display SMART protein annotation for any of the alternative splices or to get a graphical multiple sequence alignment of all of them.
Orthology information
There are 2 separate sets of orthologs for each Ensembl protein: 1:1 reciprocal best matches in other genomes, and orthologous groups with reciprocal best hits from all genomes analyzed (i.e. each of these proteins has exactly one ortholog in all 6 genomes). This data is displayed on the Additional information page.
Graphical multiple sequence alignments
Orthologous groups and different alternative splices can be displayed as graphical multiple sequence alignments. Proteins are aligned using ClustalW. Domains, intrinsic features and introns are mapped onto the alignment, with their positions adjusted according to gaps (black boxes).
Changes from version 3.4 to 3.5
Features of all Ensembl genomes are stored in SMART
You can use standard Ensembl protein identifiers (for example ENSP00000264122) and sequences in all queries.
Batch access
Quickly display results for hundreds of sequences. Currently limited to 500 sequences per access, 2000 per day.
Smarter handling of identical sequences
If there are multiple IDs associated with a particular sequence, an extra table will be displayed showing all of them.
Changes from version 3.3 to 3.4
Search structure-based profiles using RPS-Blast
Clicking on the search schnipsel and structures checkbox will now also initiate a search of profiles based on scop domain families using RPS-Blast. These profiles were kindly provided by Steffen Schmidt (see Schmidt et al, J Chem Inf Comput Sci 2002 (42) 405-7).
Improved Architecture analysis
In addition to standard Domain selection querying, it is now possible to do queries based on GO (Gene ontology; click here for more info) terms associated with domains. In the first step, you get a list of domains matching the GO terms entered. After selecting the domains of interest from the list, proteins containing those domains are displayed. Use the Taxonomic selection box to limit the results based on taxonomic ranges.
Pfam domains are stored in the database
The SMART database now contains precomputed results for all Pfam domains. To use Pfam domains in the architecture queries, prepend the domain name with Pfam: (for example, TyrKc AND Pfam:Fz AND TRANS).
Try our SMART Toolbar for the Mozilla web browser. Click here for more info.
Changes from version 3.2 to 3.3
Fantastic new protein picture generator
Proteins are now displayed as dynamically generated PNG images. This means that you can download the entire protein representation as a single image. You have our permission to do this and use these diagrams in any way that you like; do acknowledge us, though. All domain bubbles have been script-generated using The Gimp and its Perl-Fu extension. The script is GPL'd and is available for download.
Fancy colour alignments
All alignments we generate are coloured in using Leo Goodstadt's excellent CHROMA program. CHROMA is available from here, and it's great. You may experience some problems if you're using a klunky browser. This will be fixed when you change your browser.
Excellent transmembrane domains
These are now calculated using the fine TMHMM2 program (the main web site is here), kindly provided by Anders Krogh and co-workers. You can read about the method here.
Selective SMART
We now store intrinsic features such as transmembrane domains and signal peptides in our database. This means that these can be queried for (e.g. SIGNAL AND TyrKc). The feature names are: signal peptide, SIGNAL; transmembrane domain, TRANS; coiled coil, COIL.
Technical changes
SMART code has been modified to run under the Apache mod_perl module. SMART is now using Apache::DBI for persistent database connections. The database engine has been updated to PostgreSQL 7.1. These changes resulted in significant speed improvements.
Bug fixes, as usual
The show_many_proteins script now uses the POST method, so there is no longer a limit on the number of proteins you can display. Fixed a display problem with proteins having thousands of different representations (if you try to display those, make sure you have a good browser, like Mozilla or Opera, and a bunch of RAM).
Changes from version 3.1 to 3.2
Start page
There is an indicator of the current database status in the header now. If the database is down or is being updated, you'll know immediately.
Literature
Changed literature identifiers to PMID. New secondary literature generator that parses all neighbouring papers, not just the first 100.
Numerous small bug fixes and improvements.
Changes from version 3.0 to 3.1
Startup page
The start page now includes selective SMART and allows searching for keywords in the annotation of domains.
Schnipsel Blast
The results of a schnipsel Blast search are included in the bubble diagram. Additionally, PDB is searched.
Taxonomic breakdown
When selecting multiple proteins, e.g. via selective SMART or the annotation pages, an overview of the taxonomy of all species (tax break) is offered.
Links
Links don't contain version numbers. This allows stable links from external sources.
Selective SMART
Allows searching for multiple copies of domains and is case-insensitive. You can now search for e.g. Sh3 AND sH3 AND sh3.
Annotation
You can align your query sequence to the SMART alignment using hmmalign.
Update of underlying database
SMART now uses PostgreSQL 6.5.2.
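The Selective SMART behaviour listed above (repeated terms asking for multiple copies, case-insensitive matching) can be illustrated with a small Python sketch. It supports only AND, which is the only operator shown in these notes; the function name and the example proteins are hypothetical, and this is not SMART's query engine.

```python
import re
from collections import Counter


def matches_query(query, features):
    """Check a protein's domains and features against an AND query.

    Each repetition of a term requires one more copy of that domain,
    and comparison ignores case, so "Sh3 AND sH3 AND sh3" asks for at
    least three SH3 domains.
    """
    terms = re.split(r"\s+AND\s+", query.strip(), flags=re.IGNORECASE)
    required = Counter(term.upper() for term in terms)
    present = Counter(feature.upper() for feature in features)
    return all(present[term] >= copies for term, copies in required.items())


three_sh3 = ["SH3", "SH2", "SH3", "SH3"]
receptor_kinase = ["SIGNAL", "TRANS", "TyrKc"]
hit = matches_query("Sh3 AND sH3 AND sh3", three_sh3)
```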
Changes from version 2.0 to 3.0
Digest output
SMART now only produces a single diagram, representing a best interpretation of all the annotation that has been performed. A comprehensive summary of the results is also provided in table format.
selective SMART
Selective SMART allows you to look for proteins with combinations of specific domains in different species or taxonomic ranges.
alert SMART
The SMART database gets updated about once a week. If you are interested in specific domains, or combinations of them, in specific taxonomic ranges, you can use the SMART alerting service. This provides the identities of newly deposited proteins that match your query.
Domain queries
You can ask for proteins having the same domain order or composition as your query protein.
SMARTed Genomes
You can get the result of SMARTing the genomes of Caenorhabditis elegans and Saccharomyces cerevisiae via the annotation pages.
Faster PFAM searches
The PFAM searches now run on a PVM cluster.
Changes from version 1.03 to 2.0
Domain coverage
The original set of signalling domains has now been extended to include extracellular domains.
Default search method is now HMMer
The previously used underlying methodology of SMART (i.e. SWise) tended to over-extend gaps, leading to problems in defining domain borders. Although a SWise search of the SMART database remains possible (using the Wise searching SMART database option available from the Home Page), the default searching method is now HMMer2. This now allows improved statistical estimates (Expectation- or E-values) of the significance of a domain hit.
Complete rewrite of the annotation pages
As with the previous version (1.03), annotation pages provide information concerning domain functions and include hyperlinks to PubMed. However, we now offer automatically derived data showing the taxonomic range and the predicted cellular localisation of proteins containing the domain in question.
Literature database
In addition to the manually derived literature sources given in version 1.03, we now provide automatically derived secondary literature from the annotation pages. These are extracted from PubMed and are cross-linked to additional abstracts.
The CATH domain structure database: new protocols and classification levels give a more comprehensive resource for exploring evolution
Lesley H Greene, Tony E Lewis, Sarah Addou, Alison Cuff, Tim Dallman, Mark Dibley, Oliver Redfern, Frances Pearl, Rekha Nambudiry, Adam Reid, Ian Sillitoe, Corin Yeats, Janet M Thornton and Christine A Orengo
WHAT'S NEW
We report the latest release (version 3.0) of the CATH protein domain database (http://www.cathdb.info). There has been a 20% increase in the number of structural domains classified in CATH, up to 86 151 domains. Release 3.0 comprises 1110 fold groups and 2147 homologous superfamilies. To cope with the increases in diverse structural homologues being determined by the structural genomics initiatives, more sensitive methods have been developed for identifying boundaries in multi-domain proteins and for recognising homologues. The CATH classification update is now being driven by an integrated pipeline that links these automated procedures with validation steps that have been made easier by the provision of information-rich web pages summarising comparison scores and relevant links to external sites for each domain being classified. An analysis of the population of domains in the CATH hierarchy and several domain characteristics are presented for version 3.0. We also report an update of the CATH Dictionary of homologous structures (CATH-DHS), which now contains multiple structural alignments, consensus information and functional annotations for 1459 well populated superfamilies in CATH. CATH is directly linked to the Gene3D database, which is a projection of CATH structural data onto ~2 million sequences in completed genomes and UniProt.
GENERAL INTRODUCTION
The number of new structures being deposited in the Protein Data Bank (PDB) continues to grow at a considerable rate. In addition, structures being targeted by worldwide structural genomics initiatives are more likely to be novel, or only very remotely related to domains previously classified. Only 2% of structures currently solved by conventional crystallography or NMR are likely to adopt novel folds (see Figures 1 and 2). A higher proportion of new folds is expected among structures solved by structural genomics. Although the influx of more diverse structures and their subsequent analysis will inform our understanding of how domains evolve, it has resulted in increasing lags between the numbers of structures being deposited and classified. In response to this situation, we have significantly improved our automated and manual protocols for domain boundary assignment and homologue recognition.
Figure 1. Annual decrease in the percentage of new structures classified in CATH which are observed to possess a novel fold. The raw data for years 1972–2005 was fit to a single exponential equation by nonlinear regression using Sigma Plot (SPSS Version
Figure 2. Annual proportion of protein structures deposited in the PDB which are classified in CATH, rejected or pending classification. The colour scheme reflects different categories of PDB chains. Black: not accepted by the CATH criteria. Red: unprocessed chains
Significant changes have been implemented in the CATH classification protocol to achieve a more highly automated system. Firstly, a seamless flow of structures between the constituent programs has been achieved by building a pipeline which integrates web services for each major comparison stage in the classification (see Figure 3). Secondly, completely automatic decisions are now being made for new protein chains with close relatives already assigned in the CATH database. There are two situations that preclude the CATH update process from being fully automated. We rely on expert manual curation for particularly challenging protein domain boundary assignments (DomChop stage) and also for classifications of remote folds and homologues (HomCheck stage). These two manual stages will remain an integral part of the system (Figure 3).
Figure 3. Flow diagram of the CATH classification pipeline. This schematic illustrates the processes involved in classifying newly determined structures in CATH. The CATH update protocol workflow from new chain to assigned domain is split into two main processes
In this paper we report our ongoing development of the automated procedures. These critical new features should better enable CATH to keep pace with the PDB (3) and facilitate its development. Key statistics on the domain structure populations and characteristics are also presented.
A REVISED CATH CLASSIFICATION HIERARCHY: CATHSOLID
In order to provide information on the sequence diversity between superfamily members, we have introduced additional levels into the CATH hierarchy. The CATH hierarchical classification scheme now consists of nine levels. Class is derived from secondary structure content, and Architecture describes the gross orientation of secondary structures, independent of connectivity. The Topology level clusters structures into fold groups according to their topological connections and numbers of secondary structures. The Homologous superfamilies cluster proteins with highly similar structures, sequences and/or functions (4,5). The new extension of the CATH classification system now includes five 'SOLID' sequence levels. S, O, L and I further divide domains within the H-level using multi-linkage clustering based on similarities in sequence identity (35, 60, 95 and 100%) (see Table 1). The D-level acts as a counter within the I-level and is appended to the classification hierarchy to ensure that every domain in CATH has a unique CATHsolid identification code (see Table 1). Specific details on the nature of the SOLID-levels can be found in the 'General Information' section of the CATH website, http://www.cathdb.info. CATH only includes experimentally determined protein structures with a 4 Å resolution or better, 40 residues in length or longer and having 70% or more side chains resolved.
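As a rough illustration of the nine-level scheme, the sketch below splits an identifier into its C, A, T, H, S, O, L, I and D components. The dot-separated layout, the function names and the example identifier are assumptions made for illustration; the text only states that every domain receives a unique CATHsolid code.

```python
from typing import Dict

# Level names in hierarchy order, as listed in the text.
LEVELS = ("C", "A", "T", "H", "S", "O", "L", "I", "D")


def parse_cathsolid(code: str) -> Dict[str, int]:
    """Split a dot-separated CATHSOLID code into its nine numeric levels.

    The dot-separated layout is an assumption of this sketch.
    """
    parts = code.strip().split(".")
    if len(parts) != len(LEVELS):
        raise ValueError(f"expected {len(LEVELS)} levels, got {len(parts)}")
    return {level: int(value) for level, value in zip(LEVELS, parts)}


def superfamily(code: str) -> str:
    """Return the C.A.T.H prefix, i.e. the homologous superfamily."""
    return ".".join(code.strip().split(".")[:4])


# Hypothetical identifier, not taken from the CATH database.
example = "3.40.50.200.1.2.1.1.3"
levels = parse_cathsolid(example)
print(levels["H"], superfamily(example))
```

Two domains share a superfamily when their first four levels match; the S, O, L and I levels then record how close they are in sequence, and D only distinguishes otherwise identical entries.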
Table 1. CATH version 3.0 statistics
DOMAIN BOUNDARY ASSIGNMENTS
We have further improved our automated domain boundary prediction method, CATHEDRAL (4). This is used to search a newly determined multi-domain structure against a library of representative structures from different fold groups in the CATH database to recognise constituent domains. CATHEDRAL performs an initial rapid secondary structure comparison between structures, using graph theory to identify putative fold matches, which are then more carefully aligned using a slower, more accurate dynamic programming method. A new scoring scheme has been implemented which combines information on the class of the domains being compared, their sizes, similarity in the structural environments and the number of equivalent residues. A support vector machine is used to combine the different scores and select the best fold match for each putative domain in a new multi-domain protein.
Benchmarking against a set of 964 'difficult' multi-domain chains, whose 1593 constituent domains were remotely related to folds in CATH (<35% sequence identity) and originated from 245 distinct fold groups and 462 superfamilies, showed that 90% of domains within these chains could be assigned to the correct fold group, and for 78% of them the domain boundaries were within ±15 residues of boundaries assigned by careful manual validation. Larger variations in domain boundaries are often due to the fact that in many families significant structural variation can occur during evolution, so that distant relatives vary considerably in size. If no close relative has been classified in CATH, it is likely that the only CATHEDRAL match will be to a relative with significant structural embellishments, thus making it harder to determine the correct boundaries.
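The ±15 residue criterion can be made concrete with a small check such as the one below. This is only a sketch: the text does not say how predicted and manually assigned boundaries were paired, so matching segments by sorted start position is an assumption, as are the function and type names.

```python
from typing import List, Tuple

Segment = Tuple[int, int]  # (start, end) residue numbers of one domain


def boundaries_agree(predicted: List[Segment],
                     reference: List[Segment],
                     tolerance: int = 15) -> bool:
    """True if every predicted boundary lies within `tolerance` residues
    of the corresponding reference boundary.

    Segments are paired by sorted order (an assumption of this sketch).
    """
    if len(predicted) != len(reference):
        return False
    for (p_start, p_end), (r_start, r_end) in zip(sorted(predicted),
                                                  sorted(reference)):
        if abs(p_start - r_start) > tolerance or abs(p_end - r_end) > tolerance:
            return False
    return True


# Hypothetical two-domain chain of 260 residues.
manual = [(1, 110), (111, 260)]
print(boundaries_agree([(1, 120), (121, 260)], manual))  # boundary off by 10
print(boundaries_agree([(1, 150), (151, 260)], manual))  # boundary off by 40
```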
Since domain boundary assignment of remote homologues is one of the most time-consuming stages in the classification, we combine multiple sources of information for each new structure on a web page to guide manual curation. Pages display scores from a range of algorithms, which include structure-based methods: CATHEDRAL (4), SSAP (6,7), DETECTIVE (8), PUU (9), DOMAK (10); sequence-based methods such as hidden Markov Models (HMMs) (11); and relevant literature. These pages are now viewable for information on putative boundaries for new multi-domain structures currently being classified in CATH (e.g. http://www.cathdb.info/cgi-bin/cath/Chain.pl?chain_id=2g3aA).
For protein chains which are closely related to chains that are already chopped in CATH, an automated protocol has been developed (ChopClose). ChopClose identifies any previously chopped chains that have sufficiently high sequence identity and overlap with the query chain. Using SSAP (6), the query is aligned against each of these chains in turn, and in each case the domain boundaries are inherited across the alignment. The process of inheriting the boundaries often requires some adjustments to be made to account for insertions, deletions or unresolved residues. If the inheritance from one of the chains meets various criteria (SSAP score ≥80, sequence identity >80%, RMSD ≤6.0 Å, longest end extension ≤10 residues, etc.), then the resulting boundaries are used to chop the chain automatically (AutoChop). For cases where ChopClose's best result does not meet all the criteria for automatic chopping, it is provided as support information for a manual domain boundary assignment.
Refer to Figure 3 for the location of AutoChop/ChopClose within the CATH update protocol.
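The decision between automatic and manual chopping can be sketched as a simple threshold test. The code below is illustrative only: it covers just the four criteria named in the text (the text's "etc" indicates there are others), it reads the RMSD threshold as 6.0 Å and the identity threshold as 80%, and the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class InheritedChop:
    """Scores for boundaries inherited from one previously chopped chain."""
    ssap_score: float
    sequence_identity: float      # percent
    rmsd: float                   # Angstroms
    longest_end_extension: int    # residues


def passes_autochop(chop: InheritedChop) -> bool:
    """Apply the four criteria listed in the text (others are omitted)."""
    return (chop.ssap_score >= 80
            and chop.sequence_identity > 80
            and chop.rmsd <= 6.0
            and chop.longest_end_extension <= 10)


def route(chop: InheritedChop) -> str:
    """Return which stage would handle the chain in this sketch."""
    return "AutoChop" if passes_autochop(chop) else "manual (DomChop)"


# Hypothetical scores for two query chains.
print(route(InheritedChop(92.5, 97.0, 1.2, 3)))
print(route(InheritedChop(84.0, 81.0, 7.5, 3)))
```

A chain that fails any single criterion falls through to manual assignment, with the inherited boundaries kept as supporting evidence.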
NEW HOMOLOGUE RECOGNITION METHODS
We have assessed a number of HMM-based protocols for improving homologue recognition. A new protocol (Samosa), exploiting models built using multiple structure alignments to improve accuracy, gives some improvements in sensitivity (4–5%). However, a protocol exploiting an 8-fold expanded HMM library based on sequence relatives of structural domains gives an increase of nearly 10% in sensitivity (12). In addition, HMM–HMM-based approaches have been implemented using the PRC protocol of Madera and co-workers (http://supfam.org/PRC). These allow recognition of extremely remote homologues, some of which are not easily detected by the structure comparison methods (discussed further below). HMM-based database scans developed for the CATH classification protocol are collectively referred to as HMMscan below.
For some very remotely related homologues, confidence in an assignment can be improved by combining information from multiple prediction methods. We have investigated the benefits of using machine learning methods to do this automatically. A neural network was trained using a dataset of 14 000 diverse homologues (<35% sequence identity) and 14 000 non-homologous pairs, with data from different homologue comparison methods including structure comparison (CATHEDRAL, SSAP), sequence comparison (HMM–HMM) and information on functional similarity. The latter was obtained by comparing EC classification codes between close relatives of the distant homologues and using a semantic similarity scoring scheme for comparing GO terms based on a method developed by Lord et al. (13). On a separate validation set of 14 000 homologous pairs and 14 000 non-homologous pairs, 97% of the homologues can be recognised at an error rate of <4%.
NEW UPDATE PROTOCOL
Automatic methods
Previously, CATH data was generated using a group of independent
programs and flat files. Over the past two years we have developed an
update protocol for CATH that is driven by a suite of programs with a
central library and a PostgreSQL database system. A classification
pipeline has been established which links, in a completely automated
fashion, the different programs that analyse the sequences and structures of both protein chains and domains. The CATH update protocol can
essentially be divided into two parts: domain boundary assignment and
domain homology classification (see Figure 3). The aim of the protocol is
to minimise manual assignment and provide as much support as
possible when manual validation is necessary.
Processing of both parts of the classification protocol is similar, requiring related meta-data and the triggering of the same automated algorithms. Methods include pairwise sequence similarity comparisons and scans by other homologue detection or fold recognition algorithms, such as HMMscan and CATHEDRAL, that provide data for either manual or automated assignment. Many of the automated steps in the protocol have been established as a web service, and the pipeline integrates automated steps together with 'holding stages' in which domains are held prior to processing and await the completion of manual validation of predictions (see below).
Web pages to support manual validation
For each manual stage in the classification (domain boundary assignment, DomChop, and homologue recognition, HomCheck) we have developed a suite of web pages bringing together all available meta-data from prediction algorithms (e.g. DBS, CATHEDRAL, HMMscan for DomChop; CATHEDRAL, HMMscan for HomCheck) and information from the literature and from other family classifications with relevant data (e.g. Pfam). For each protein or domain shown on the pages, information on the statistical significance of matches is presented. The web pages will shortly be made viewable and will provide interim data on protein chains and domains not fully classified in CATH for biologists interested in any entries pending classification.
OVERVIEW OF THE CURRENT RELEASE (VERSION 3.0)
Assigning domain boundaries and relationships between protein structures is computationally challenging. Since the last CATH release (version 2.6), the number of domains in the CATH database has increased by 20% in version 3.0 and now totals 86 151. This is a more than 10-fold increase in the number of domains classified in CATH since its creation. Improvements in automation, and also in the web-based resources used to aid manual validation, have allowed us to increase the proportion of hard-to-classify structures processed in CATH, and this is reflected in a significant increase in the number of folds in the database, now more than 1000. The detailed breakdown of numbers of domains in the nine CATH levels is given in Table 1. We conducted an analysis of the domains in version 3.0 and have derived statistics for several fundamental features.
Percentage of new topologies
An analysis of the percentage of new folds arising from the early 1970s to the present is shown in Figure 1. The number of new folds has been decreasing over time with respect to the number of new structures being deposited, and it can be seen that currently approximately 2% of new structures classified in CATH are observed to be novel folds. For comparison, the number of domain structures solved over time is also graphically represented in Figure 1.
Number of domains within a protein chain
Integral to the construction of the CATH database is designating domain boundaries. We conducted an analysis of the number of chains versus the number of domains in a chain. It is interesting to note that 64% of all protein structures currently solved and classified in CATH are single-domain chains (data not shown). The next most prevalent are two-domain chains (27%), and following this we find that the number of chains containing three or more domains rapidly decreases. The average size of the single-domain chains is 159 residues in length.
The CATH Dictionary of Homologous Superfamilies (CATH-DHS)
The CATH-DHS has also been recently updated. Data on structural similarity and superfamily variability is presented as a significant update to the Dictionary of Homologous Superfamilies (DHS) web resource (14). The DHS also provides functional annotations of domains within each H-level (superfamily) in CATH v2.5.1.
For each superfamily, pair-wise structural similarity scores between relatives, measured by SSAP, are presented. The DHS now contains 3307 multiple structural alignments for 1459 superfamilies. For each superfamily, multiple alignments are generated for all the relatives and also for subgroups of structurally similar relatives and sequence-similar relatives. Alignments are performed using the residue-based CORA algorithm (15) and presented both as CORAPLOTS (14) and in the form of a 2DSEC diagram (16), alongside co-ordinate data of the superposed structures in PDB format. Sequence representations of the alignments are available to download in FASTA format. In the CORAPLOT images of the multiple alignment, residues in each domain are coloured according to ligand binding and residue type. EquivSEC plots are also shown that describe the variability in orientation and packing between equivalent secondary structures (16).
To identify sequence relatives for CATH superfamilies, sequences from UniProt (17) were scanned against HMMs of all CATH domains (12). Homologous sequences were identified as those hits with an E-value < 0.01 and a 60% residue overlap with the CATH domain. This protocol recognised over one million domain sequences in UniProt which could be integrated in the CATH-DHS. The harvested sequences in each superfamily were compared against other relatives by BLAST (18) to determine the pair-wise sequence identity and then clustered at appropriate levels of sequence identity (35 and 95%) using multi-linkage clustering. Information and links to other functional databases, ENZYME (19), GO (Gene Ontology Consortium 2000), KEGG (20), COG (21) and SWISSPROT (22), are also included by BLASTing the sequences from each superfamily against sequences provided by these resources. Only hits with 95% sequence identity and an 80% residue overlap were used to annotate sequences.
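The hit-selection rule can be sketched as a filter over HMM scan results. This is a minimal sketch that reads the thresholds as an E-value below 0.01 and at least 60% overlap; defining overlap as the aligned fraction of the CATH domain length is an assumption, and the `Hit` record and its field names are hypothetical rather than the output format of any real HMM tool.

```python
from typing import List, NamedTuple


class Hit(NamedTuple):
    """One sequence-versus-domain-HMM match (hypothetical record layout)."""
    sequence_id: str
    e_value: float
    aligned_residues: int   # domain positions covered by the alignment
    domain_length: int      # length of the CATH domain


def is_homologue(hit: Hit,
                 max_e_value: float = 0.01,
                 min_overlap: float = 0.60) -> bool:
    """Keep hits that are both significant and cover enough of the domain."""
    overlap = hit.aligned_residues / hit.domain_length
    return hit.e_value < max_e_value and overlap >= min_overlap


def select_homologues(hits: List[Hit]) -> List[str]:
    """Return the identifiers of hits passing both thresholds."""
    return [h.sequence_id for h in hits if is_homologue(h)]


# Hypothetical scan results against a 120-residue domain.
hits = [
    Hit("seqA", 1e-5, 90, 120),   # significant, 75% overlap
    Hit("seqB", 0.05, 110, 120),  # good overlap, not significant
    Hit("seqC", 1e-8, 50, 120),   # significant, only ~42% overlap
]
print(select_homologues(hits))
```

The stricter annotation rule in the same paragraph (95% identity, 80% overlap) could be expressed the same way with different thresholds.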
Recent analysis of structural and functional divergence in highly populated CATH superfamilies (>5 structural relatives with <35% sequence identity) has been undertaken using data from the DHS. The 2DSEC algorithm was used to analyse multiple structural alignments of families and identify highly conserved structural cores and secondary structure embellishments or decorations to the common core. In some large superfamilies, extensive embellishments were observed outside the core, and although these secondary structure insertions were frequently discontinuous in the protein chain, they were often co-located in 3D space (16). In many cases, manual inspection revealed that the embellishment had aggregated to form a larger structural feature that was modifying the active site of the domain or creating new surfaces for domain or protein interactions. Data collected in the DHS clearly shows a relationship between structural divergence within a superfamily, sequence divergence of this superfamily amongst predicted domains in the genomes, and the number of distinct functional groups that can be identified for the superfamily (see Figure 4).
Figure 4. Relationship between sequence variability, structural variability and functional diversity in CATH superfamilies. Structural variation in a CATH superfamily, as measured by the number of diverse structural subgroups (SSAP score <80 between groups)
LATERAL LINKS ACROSS THE CATH HIERARCHY TO CAPTURE EVOLUTIONARY DIVERGENCE AND EXPLORE THE STRUCTURAL CONTINUUM
Our analysis of structural divergence in CATH superfamilies (16) has revealed families where significant changes in the structures had occurred. In some cases 5-fold differences in the sizes of domains were identified, and sometimes it was apparent that the 'folds' of these very diverse relatives had effectively changed. Therefore, in these superfamilies more than one fold group can be identified, effectively breaking the hierarchical nature of the CATH classification, which implies that each relative within a CATH homologous superfamily should belong to the same CAT fold group.
In addition, an 'all versus all' HMM–HMM scan between all superfamily representatives revealed several cases of extremely remote homologues which had been classified into separate superfamilies and yet match with significant E-values. Structure comparison had failed to detect the relationship between these superfamilies because the structural divergence of the relatives was so extreme, sometimes constituting a change in architecture as well as fold group. In these cases homology was only suggested by the HMM-based scans and then manually validated by considering functional information and detailed evidence from the literature. In order to capture information on these distant homologies, links have been created between the superfamilies, both on our web pages and in the CATH database. The data can now be found as a link from the CATH homepage (http://www.cathdb.info).
In the near future we also plan to provide web pages presenting cases of significant structural overlaps between superfamilies or fold groups. For these cases we are not currently able to find any additional evidence to support a distant evolutionary relationship, and these examples highlight the recurrence of large structural motifs between some folds and the existence of a structural continuum in some regions of fold space.
FEATURES
The CATH database can be accessed at http://www.cathdb.info. The web interface may be browsed or alternatively searched with PDB codes or CATH domain identifiers. There is also a facility for keyword searches. With the version 3.0 release we now make the raw and processed data files available, which include, for example, CATH domain PDB files, sequences and dssp files, and they can be accessed through the CATH database main page. The Gene3D resource can be accessed through the CATH database or directly at http://www.cathdb.info/Gene3D. The DHS can be accessed through the CATH database or directly at http://www.cathdb.info/bsm/dhs.
Structural similarity is assessed using an automatic method (SSAP) (3,4), which scores 100 for
identical proteins and generally returns scores above 80 for homologous proteins. More distantly
related folds generally give scores above 70 (Topology or fold level), though in the absence of any
sequence or functional similarity this may simply represent examples of convergent evolution, reinforcing the hypothesis that there exists a limited number of folds in nature (5,6).
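The score bands quoted above can be written as a small lookup. This is a sketch only: the text says the bands hold "generally", so the cut-offs are rules of thumb rather than hard limits, and the function name and band labels are hypothetical.

```python
def ssap_band(score: float) -> str:
    """Map an SSAP score (0-100) onto the rough bands described in the text."""
    if not 0 <= score <= 100:
        raise ValueError("SSAP scores lie between 0 and 100")
    if score == 100:
        return "identical"
    if score > 80:
        return "likely homologous"
    if score > 70:
        return "similar fold (Topology level)"
    return "no clear structural similarity"


for s in (100, 85, 75, 50):
    print(s, ssap_band(s))
```

A score in the 70 to 80 band on its own does not establish common ancestry; as the text notes, sequence or functional evidence is needed to separate homologues from convergent folds.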
Abstract
We report the latest release (version 1.4) of the CATH protein domains database
(http://www.biochem.ucl.ac.uk/bsm/cath). This is a hierarchical classification of 13 359 protein domain structures into evolutionary families and structural groupings. We currently identify 827
homologous families in which the proteins have both structural similarity and sequence and/or
functional similarity. These can be further clustered into 593 fold groups and 32 distinct architectures.
Using our structural classification and associated data on protein functions stored in the database (EC
identifiers, SWISS-PROT keywords and information from the Enzyme database and literature), we have been able to analyse the correlation between the 3D structure and function. More than 96% of
folds in the PDB are associated with a single homologous family. However, within the superfolds,
three or more different functions are observed. Considering enzyme functions, more than 95% of clearly homologous families exhibit either single or closely related functions, as demonstrated by the
EC identifiers of their relatives. Our analysis supports the view that determining structures, for
example as part of a 'structural genomics' initiative, will make a major contribution to interpreting genome data.
Introduction FOR CATH
The CATH classification of protein domain structures was established in 1993 (1) as a
hierarchical clustering of protein domain structures into evolutionary families and structural
groupings, depending on sequence and structure similarity. There are four major levels,
corresponding to protein class, architecture, topology or fold, and homologous family (Fig. 1).
Since 1995, information about these structural groups and protein families has been accessible
over the Web (http://www.biochem.ucl.ac.uk/bsm/cath), together with summary information
about each individual protein structure (PDBsum) (2).
CATH consists of both phylogenetic and phenetic descriptors for protein domain relationships.
At the lowest levels in the hierarchy, proteins are grouped into evolutionary families
(Homologous families) for having either significant sequence similarity (≥35% identity) or high
structural similarity and some sequence similarity (≥20% identity).
The CATH database of protein structures contains approximately 18 000 domains
organized according to their (C)lass, (A)rchitecture, (T)opology and (H)omologous
superfamily. Relationships between evolutionarily related structures (homologues) within the
database have been used to test the sensitivity of various sequence search methods in order to
identify relatives in GenBank and other sequence databases. Subsequent application of the most
sensitive and efficient algorithms, gapped BLAST and the profile-based method Position Specific
Iterated Basic Local Alignment Tool (PSI-BLAST), could be used to assign structural data to
between 22 and 36% of microbial genomes in order to improve functional annotation and
enhance understanding of biological mechanism. However, on a cautionary note, an analysis of
functional conservation within fold groups and homologous superfamilies in the CATH database
revealed that whilst function was conserved in nearly 55% of enzyme families, function had
diverged considerably in some highly populated families. In these families, functional properties
should be inherited far more cautiously, and the probable effects of substitutions in key functional
residues must be carefully assessed.
Figure 1. Schematic representation of the (C)lass, (A)rchitecture and (T)opology/fold levels in the CATH
database.
Figure 2. Snapshot of a web page showing data available in the CATH dictionary of homologous superfamilies
for the subtilisin family (CATH id 3.40.50.200). Tables display the PDB codes for non-identical relatives in the family, together with EC identifier codes and information about the enzyme reactions.
The multiple structural alignment shown has been coloured according to secondary structure
assignments (red for helix, blue for strands).
The Architecture level in CATH groups proteins whose folds have similar 3D arrangements of secondary structures (e.g. barrel, sandwich or propellor), regardless of their connectivity, whilst the
top level, Class, simply reflects the proportion of α-helix or β-strand secondary structures. Three
major classes are recognised: mainly-α, mainly-β and α−β, since analysis revealed considerable overlap between the α+β and alternating α/β classes originally described by Levitt and Chothia (7).
Before classification, multidomain proteins are first separated into their constituent folds using a
consensus method which seeks agreement between three independent algorithms (8). Whilst the protocol for updating CATH is largely automatic (9), several stages require manual validation, in
particular establishing domain boundaries in proteins for which no consensus could be reached and checking the relationships of very distant homologues and proteins having borderline fold similarity.
Although there are plans to assign the more regular architectures automatically, all architecture
groupings are currently assigned manually.
A homologous family Dictionary is now available within CATH, which contains functional data, where available, for each protein within a homologous family. This includes EC identifiers, SWISS-PROT keywords and information from the Enzyme database or the literature (Fig. 2). Multiple
structure-based alignments are also available, coloured according to secondary structure assignments
or residue properties, and there are schematic plots showing domain representations annotated by protein-ligand interactions (DOMPLOTS) (A.E. Todd, C.A. Orengo and J.M. Thornton, submitted
to Protein Engng). The topology of each domain is illustrated by schematic TOPS diagrams
(http://www3.ebi.ac.uk/tops) (10).
Figure 3. CATH wheel plot showing the population of homologous families in different fold groups, architectures and classes. The wheel is coloured according to protein class (red, mainly-α; green,
mainly-β; yellow, αβ; blue, few secondary structures). The size of the outer wheel represents the
number of homologous families in CATH, whilst each band in the outer wheel corresponds to a single
fold family. The size of each fold 'band' therefore reflects the number of homologous families having
that fold. It can be seen that most fold families contain a single homologous family. The superfold families are shown as paler bands containing many homologous families. The inner wheel shows the
population of homologous families in the different architectures.
We have also recently set up a Web Server (11) which enables the user to scan the CATH database with a newly determined protein structure and identify possible fold similarities or evolutionary relationships. There are also plans to incorporate sequence searches (using BLAST or PSI-BLAST)
(12) to identify a probable fold for a new sequence.
The latest release of CATH (version 1.4, April 1998) contains 9342 protein chains from the PDB (13),
which divide into 13 359 domain folds. Currently 32 different architectures are recognised. Since the
last release, three new architectures have been described, including the five-bladed α-β propellor. Grouping proteins on the basis of sequence, structure and functional similarity gives 827 evolutionary
homologous families (H-level), whilst recognising more distant structural similarity with no
accompanying sequence or function similarity gives rise to 593 different fold groups (T-level).
The population of the different levels in the CATH hierarchy is illustrated by the CATH wheel shown
in Figure 3. It can be seen that several highly populated fold families, which we describe as superfolds (6) as they support a diverse range of sequences and more than three different functions, still account
for nearly 30% of non-homologous structures.
Implications for Structural Genomics
As the sequence databases grow rapidly, the need to interpret these sequences and assign functions to
specific genes becomes increasingly important. Many techniques exist for matching protein sequences and thereby inheriting functional information. However, for very distant homologues there is often no
detectable sequence similarity, despite conservation of 3D structure and function. For these cases,
evolutionary relationships, and thereby functions, can only be assigned by comparing the structures.
Therefore a number of structural genomics initiatives are being proposed (14), which aim to identify
all the folds in nature, with the ultimate goal of being able to predict the function of a new protein from
its known or probable structure. The important questions to ask are: how many more folds do we need
to determine before we have the complete set, and how confident can we be in assigning function between proteins having similar structures?
In the current genomes, on average only 30–46% of sequences can be assigned to a structural family
by recognising sequence similarity to a protein of known structure (15,16). With only ∼600 unique
structures currently in the PDB, compared with ∼20 000 sequence families, it is clear that we still
need to determine many more structures if we are to understand biology at the molecular level.
However, analysis of recently deposited structural data is very revealing. Figure 4a illustrates the
distribution of 2159 new structural domains classified in the 10 months from June 1997 to March
1998. A large proportion of these (79%) were clearly homologous (≥30% identity) to proteins of known structure.
Of the remaining 443 structures (Fig. 4b), corresponding to 'new' sequences, we found only 8% were
novel folds, the remainder resembling a previously determined structure. Many of these, 199 (45%),
could be identified as clear homologues by having significant structure and sequence similarity (SSAP
≥80 and ≥20% sequence identity). A further 169 (38%) were probable homologues, as although the sequence identity was below 20% they had functional similarity and/or gave significant scores using
sequence search methods designed to detect very distant homologues (PSI-BLAST) (12). There
remained a further 40 (9%) proteins which were analogous, i.e. they had the same fold as a previous
entry but neither the sequence nor the function gave definite evidence of a common ancestor.
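The percentages quoted for the 443 structures with new sequences can be checked with a few lines of arithmetic. In the sketch below the counts 2159, 443, 199, 169 and 40 come from the text; the count of novel folds is not stated and is derived here on the assumption that the four categories partition the 443 structures.

```python
total_new_domains = 2159   # domains classified June 1997 to March 1998
new_sequences = 443        # domains not clearly homologous by sequence

counts = {
    "clear homologues": 199,
    "probable homologues": 169,
    "analogues": 40,
}
# Derived, not stated in the text: whatever is left must be novel folds.
counts["novel folds"] = new_sequences - sum(counts.values())

# Percentage of the 443 new-sequence structures in each category.
percentages = {name: round(100 * n / new_sequences)
               for name, n in counts.items()}

# Share of all 2159 domains that were clearly homologous by sequence.
homologous_share = round(
    100 * (total_new_domains - new_sequences) / total_new_domains)

# Structures assigned as homologues once structure is taken into account.
assigned_by_structure = (percentages["clear homologues"]
                         + percentages["probable homologues"])

print(counts["novel folds"], percentages, homologous_share,
      assigned_by_structure)
```

Under that assumption the remainder is 35 structures, which rounds to the quoted 8%, and the clear and probable homologues together account for 83% of the new sequences.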
At the homologous superfamily level in CATH, a more detailed analysis of enzyme functions showed
that the majority of homologous enzyme families in CATH (>90%) contained proteins for which the
first three EC identifiers were the same. Considering those families where homologues have
significant sequence identity (≥20%) after structural alignment, 95% were found to have a single EC
identifier, whilst for families where proteins have more than 30% sequence similarity we observed
that 98% had a single EC code.
Although assigning function on the basis of homology is common practice, it is clear that some
caution should be exercised, particularly where there is little or no sequence similarity. There are also
some clear examples where homologues with significant sequence similarity perform different
functions. The role of 'gene recruitment' is especially clear in the eye lens proteins, which function as
enzymes in other cellular environments but which are used as structural proteins in this context (17).
The extent of such 'gene recruitment' and context-sensitive function is really not known at this time.
For enzymes it is clear that catalytic function can change and evolve, usually to act on a different but
related substrate. Similarly, within the lipocalin family (CATH id 24013010) several proteins are found with very similar structures which bind different fatty acids in the same region at the base of
the β-barrel (e.g. retinol, bilin, biotin).
Nearly half of the homologous families where two or more different EC numbers were observed
belong to the superfolds. This suggests that if a new protein is assigned to a superfold family, more
caution should be used when inheriting functional information, as there appears to be greater tolerance to changes in sequence and ultimately function for these families. However, it is interesting to note
that many of these were TIM barrel or Rossmann folds. These are superfolds in which the substrate or
ligand commonly binds in the same place. This is in the base of the β-barrel for the TIMs and at the
crossover of the polypeptide chain for the doubly wound Rossmann structures.
Assignment of Function Through Structure
One of the reasons for determining structures is to derive more information to facilitate the assignment of function. From our analysis of proteins in CATH we suggest that structural data can help to assign
function in several ways:
i. The structural data allow recognition of more distant homologues compared with sequence data; in our analysis 83% of structures with novel sequences could be assigned as homologues in this way (note that such assignment of function is again subject to the caveats
imposed by 'gene recruitment' discussed above).
ii. The structural data allow detailed inspection of the functional site, to suggest if and
how the function may have evolved. For example, if an enzyme has evolved to act on a
different substrate, the binding site may reveal or at least suggest possible changes in the
substrate.
iii. For the superfolds, similarity of structure does not necessarily mean similarity of
function. However, the active site/binding sites are often conserved, e.g. in the TIM barrel or
Rossmann fold structures the ligand always binds at the same end of the barrel or sheet.
iv. Some methods have already been developed, and will increasingly be the focus of attention over the next few years, which aim to predict function ab initio from structure. For
example, enzymes can often be identified by the presence of a major cleft, which also locates
the active site (18). Similarly, critical surface patches which are used for molecular
recognition in binding other proteins or ligands may be identified using knowledge-based approaches (19,20).
In summary, extrapolating the data from Figure 4 to a new genome, we can expect that of the 54–70%
of sequences which currently have no obvious sequence matches in the PDB, we will find nearly 80–90% to be homologous to a known family using the structural data alone. For the singlet folds this
will almost certainly reveal some clues to the function. For the superfolds, some folds will reveal
information on the functional class (e.g. enzyme for TIM barrels) or the location of the active site, if
not the specific function. Only 10–20% will be expected to be 'novel' folds. For these, the ab
initio methods referred to above may provide some clues to guide experiments. Therefore it is clear
that determining structures, as part of a 'structural genomics' initiative for example, will make a
major contribution to interpreting genome data.
JAIL
Just another interface library
Interfaces of macromolecules are a valuable basis for analysing the process of molecular recognition. JAIL classifies not only the interfaces between domain architectures but also those between protein chains and those between proteins and nucleic acids.
Of course not
Figure: Gt-alpha/Gi-alpha chimera (PDB-ID 1GOT) and the interfaces of 1GOT
Interacting proteins are difficult to crystallize and are rarely present in the Protein Data Bank.
Nevertheless, it is essential to analyse the interacting parts of the proteins to understand the
process of protein-protein docking.
To overcome this problem we have built up the JAIL database. Since interacting domains
exhibit structural features similar to those of interacting proteins, all known interfaces between interacting
domains of the SCOP database were extracted and classified in JAIL.
Only a part of all protein structures is included in SCOP. New PDB entries in particular are
not yet annotated. To overcome this problem, all interfaces between protein
chains were additionally calculated and included in the database. This type of interface also comprises
the interacting parts of the assumed biological units. The last important type of interface
provided here is composed of the interacting parts between proteins and nucleic acids.
Overall, the data set consists of about 180,000 interfaces.
JAIL is a convenient tool to browse through the interface library and to analyse single
interfaces. However, more general questions require large-scale analysis. For this purpose
a detailed form enables the compiling of comprehensive non-redundant data sets for
download.
How is an interface defined?
A complete residue is part of an interface if at least one atom of the amino acid is located
within a range of 4.5 Ångström of any atom of the interacting domain or chain. One part of
an interface must consist of at least 5 C-alpha atoms in the case of protein chains. In the case of nucleic acids, the relevant atoms are calculated on the basis of the phosphorus atoms
of the RNA/DNA backbone.
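The definition above can be sketched in a few lines. This is a minimal illustration, not JAIL's own code. It reads the cutoff as 4.5 Å, represents each chain as a dictionary from residue number to a list of atom coordinates, and interprets "at least 5 C-alpha atoms" as at least five interface residues on each side. All names and the toy coordinates are assumptions.

```python
from math import dist

CUTOFF = 4.5       # Å, assumed reading of the distance threshold
MIN_RESIDUES = 5   # one C-alpha per residue, so 5 residues per side


def interface_residues(chain, partner, cutoff=CUTOFF):
    """Residues of `chain` with at least one atom within `cutoff` of `partner`."""
    partner_atoms = [xyz for atoms in partner.values() for xyz in atoms]
    return {
        res
        for res, atoms in chain.items()
        if any(dist(a, b) <= cutoff for a in atoms for b in partner_atoms)
    }


def is_interface(chain_a, chain_b, cutoff=CUTOFF, min_residues=MIN_RESIDUES):
    """True if both sides contribute enough residues to count as an interface."""
    side_a = interface_residues(chain_a, chain_b, cutoff)
    side_b = interface_residues(chain_b, chain_a, cutoff)
    return len(side_a) >= min_residues and len(side_b) >= min_residues


# Toy data: two parallel strands 4.0 Å apart, and a third strand far away.
chain_a = {i: [(3.8 * i, 0.0, 0.0)] for i in range(1, 7)}
chain_b = {i: [(3.8 * i, 4.0, 0.0)] for i in range(1, 7)}
chain_c = {i: [(3.8 * i, 50.0, 0.0)] for i in range(1, 7)}

print(sorted(interface_residues(chain_a, chain_b)))  # [1, 2, 3, 4, 5, 6]
print(is_interface(chain_a, chain_b))                # True
print(is_interface(chain_a, chain_c))                # False
```

The all-against-all distance check is quadratic in the number of atoms. For real structures a spatial index would be the usual choice, but the brute-force form keeps the definition visible.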
What are biounits?
The primary coordinate file deposited in the PDB generally contains one asymmetric unit.
The asymmetric unit is the smallest portion of a crystal structure to which crystallographic
symmetry can be applied to generate one unit cell. The biological molecule (biounit) is believed
to be the functional unit of the protein. Frequently those units can be assumed or calculated when additional information is available. The biological units of many proteins are deposited
in a separate section of the PDB database and can be used for interface calculations.
In which way are redundant interfaces excluded in the download section?
Redundancy is excluded in two different ways: by structure and by sequence. The
sequence clustering is based on the CD-HIT program. The structural clustering is defined by
the protein families and superfamilies of the SCOP classification. The database classifies
proteins by domain architecture.
Which settings in the download section are best for my own research?
The selection of the datasets depends on the type of interactions (protein-protein or protein-nucleic
acids) and the level of diversity that is desired. A maximum sequence identity of 50%
results in a higher diversity than a setting of 95%. The default settings include interfaces of
domain-domain interactions as well as interfaces between interacting chains. All interfaces of
chains that were already treated by the SCOP domain interfaces are excluded by default.
This procedure results in a high number of interfaces that are still diverse enough for
statistical analysis.
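The effect of the identity threshold can be illustrated with a greedy clustering sketch in the spirit of CD-HIT: sequences are visited from longest to shortest, and each one either joins the first representative it resembles closely enough or becomes a new representative. The identity measure here is a crude position-by-position comparison without alignment, chosen only to keep the example short. It is an assumption and not what CD-HIT actually computes, and the sequences are made up.

```python
def crude_identity(a, b):
    """Fraction of matching positions over the shorter sequence (no alignment)."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / n


def representatives(sequences, max_identity):
    """Greedy selection: keep a sequence only if it is below `max_identity`
    to every representative chosen so far."""
    reps = []
    for seq in sorted(sequences, key=len, reverse=True):
        if all(crude_identity(seq, rep) < max_identity for rep in reps):
            reps.append(seq)
    return reps


seqs = ["ACDEFGHIKL", "ACDEFGHIKV", "WWWWWWWWWW"]  # first two are 90% identical

print(representatives(seqs, 0.95))  # ['ACDEFGHIKL', 'ACDEFGHIKV', 'WWWWWWWWWW']
print(representatives(seqs, 0.50))  # ['ACDEFGHIKL', 'WWWWWWWWWW']
```

At the 95% setting both near-identical sequences survive. At 50% only one of them is kept, so the resulting set is smaller but more diverse.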
What is meant by "show conservation" in Jmol?
The conservation of protein sequences is defined by the mutation rates at each amino acid
position. For JAIL this information was retrieved from ConSurf. ConSurf is a derived
database merging structural and sequence information.
Database scheme
Search for
PDB-Id, e.g. 1aay
SCOP-Id, e.g. d1az0a_
EC-Number, e.g. 3.1.1
Accession number, e.g. P03697
Protein name, e.g. capsid protein
Search in the following interface types
Domain/Domain (SCOP)
Chain/Chain
Protein/Nucleic
Biounit/Biounit
None
Full-text search by keyword
SCOP text search by keyword
Consider only interfaces having the following location
intra, inter, don't care
MMDB
Experimentally resolved structures of proteins, RNA and DNA, derived from the Protein Data Bank (PDB), with value-added features such as explicit chemical graphs, computationally
identified 3D domains (compact substructures) that are used to identify similar 3D structures, as well as links to literature, similar sequences, information about chemicals bound to the structures, and more. These connections make it possible, for example, to find 3D structures for homologs of a protein sequence of interest, then interactively view the sequence-structure relationships, active sites, bound chemicals, journal articles and more.
Three-dimensional structures are now known within many protein families, and it is
quite likely, in searching a sequence database, that one will encounter a homolog
with known structure. The goal of Entrez's 3D-structure database is to make this
information, and the functional annotation it can provide, easily accessible to molecular biologists. To this end Entrez's search engine provides three powerful
features: (i) Sequence and structure neighbors: one may select all sequences similar
to one of interest, for example, and link to any known 3D structures. (ii) Links
between databases: one may search by term matching in MEDLINE, for example, and
link to 3D structures reported in these articles. (iii) Sequence and structure
visualization: identifying a homolog with known structure, one may view molecular-graphic
and alignment displays to infer approximate 3D structure. In this article we
focus on two features of Entrez's Molecular Modeling Database (MMDB) not
described previously: links from individual biopolymer chains within 3D structures to a systematic taxonomy of organisms represented in molecular databases, and
links from individual chains (and compact 3D domains within them) to structure
neighbors, other chains (and 3D domains) with similar 3D structure. MMDB may be
accessed at http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Structure
SUPERFAMILY is a database of structural and functional annotation for all
proteins and genomes.[1][2][3][4][5]
The SUPERFAMILY annotation is based on a collection of hidden Markov models, which
represent structural protein domains at the SCOP superfamily level.[6] A superfamily groups
together domains which have an evolutionary relationship. The annotation is produced by
scanning protein sequences from completely sequenced genomes against the hidden Markov
models.
For each protein you can
Submit sequences for SCOP classification
View domain organisation sequence alignments and protein sequence details
For each genome you can
Examine superfamily assignments phylogenetic trees domain organisation lists and
networks
Check for over- and under-represented superfamilies within a genome
For each superfamily you can
Inspect SCOP classification, functional annotation, Gene Ontology annotation, InterPro
abstract and genome assignments
Explore taxonomic distribution of a superfamily across the tree of life
All annotation models and the database dump are freely available for download to everyone
Purpose
SUPERFAMILY classifies amino acid sequences into known structural domains especially
into SCOP superfamilies The superfamilies are groups of proteins which have structural
evidence to support a common evolutionary ancestor but may not have detectable
sequence homology
Major Features
Sequence search: Submit your protein or DNA sequence for SCOP superfamily and family level classification.
Keyword search: Search for superfamily, family or species names, plus sequence, SCOP, PDB or hidden Markov model IDs.
Domain assignments: Domain assignments, alignments and architectures for completely sequenced eukaryotic and prokaryotic organisms, plus sequence collections.
Comparative genomics tools: Browse unusual (over- and under-represented) superfamilies and families, adjacent domain pair lists and graphs, unique domain pairs, domain combinations, domain architecture co-occurrence networks and domain distribution across taxonomic kingdoms for each organism.
Genome statistics: For each genome: number of sequences, number of sequences with assignment, percentage of sequences with assignment, percentage total sequence coverage, number of domains assigned, number of superfamilies assigned, number of families assigned, average superfamily size, percentage produced by duplication, average sequence length, average length matched, number of domain pairs and number of unique domain architectures.
Gene Ontology: Domain-centric Gene Ontology (GO), automatically annotated by Hai Fang.
Phenotype Ontology: Domain-centric phenotype/anatomy ontology including Disease Ontology, Human Phenotype, Mouse Phenotype, Worm Phenotype, Yeast Phenotype, Fly Phenotype, Fly Anatomy, Zebrafish Anatomy, Xenopus Anatomy, Arabidopsis Plant.
Superfamily annotation: InterPro abstracts for 1052 superfamilies and Gene Ontology (GO) annotation for 763 superfamilies.
Functional annotation: Functional annotation of SCOP 1.73 superfamilies by Christine Vogel.
Phylogenetic trees: Trees are generated using heuristic parsimony methods and are based on protein domain architecture data for all genomes in SUPERFAMILY. Genome combinations or specific clades can be displayed as individual trees.
Similar domain architectures: Find the 10 domain architectures which are most similar to a domain architecture of interest.
Hidden Markov models: Produce SCOP domain assignments for your sequences using the SUPERFAMILY models. HMM visualisation by Martin Madera, e.g. model 0045110.
Profile comparison: Find remote domain matches when the HMM search fails to find a significant match. Profile comparison (PRC) for aligning and scoring two profile hidden Markov models, by Martin Madera.
Web services: Distributed Annotation Server and linking to SUPERFAMILY.
Downloads: Sequences, assignments, models, MySQL database and scripts, updated weekly.
The SUPERFAMILY database provides protein domain assignments, at the SCOP superfamily
level, for the predicted protein sequences in over 400 completed genomes. A superfamily groups
together domains of different families which have a common evolutionary ancestor based on
structural, functional and sequence data. SUPERFAMILY domain assignments are generated
using an expert curated set of profile hidden Markov models. All models and structural
assignments are available for browsing and download from http://supfam.org. The web interface
includes services such as domain architectures and alignment details for all protein assignments,
searchable domain combinations, domain occurrence network visualization, detection of over- or
under-represented superfamilies for a given genome by comparison with other genomes,
assignment of manually submitted sequences and keyword searches. In this update we describe
the SUPERFAMILY database and outline two major developments: (i) incorporation of family
level assignments and (ii) a superfamily-level functional annotation. The SUPERFAMILY
database can be used for general protein evolution and superfamily-specific studies, genomic
annotation, and structural genomics target suggestion and assessment.
RNA Databases
Rfam RNA family database
RNA base Database of RNA structures
tRNA database Database of tRNAs
tRNA tRNA sequences and genes
sRNA Small RNA database
Comparative & Phylogenetic Databases
COG Phylogenetic classification of proteins
DHMHD Human-mouse homology database
HomoloGene Gene homologies across species
Homophila Human disease to Drosophila gene database
HOVERGEN Database of homologous vertebrate genes
TreeBase A database of phylogenetic knowledge
XREF Cross-referencing with model organisms
SNPs, Mutations & Variations Databases
ALPSbase Database of mutations causing human ALPS
dbSNP Single nucleotide polymorphism database at NCBI
HGVbase Human Genome Variation database
Alternative Splicing Databases
ASAP Alternate splicing analysis tool at UCLA
ASG Alternate splicing gallery
HASDB Human alternative splicing database at UCLA
AsMamDB alternatively spliced genes in human mouse and rat
ASD Alternative splicing database at CSHL
Specialised Databases
ABIM Links to several genomics database
ACUTS Ancient conserved untranslated sequences
AGSD Animal genome size database
AmiGO The Gene Ontology database
ARGH The acronym database
ASDB Database of alternatively spliced genes
BACPAC BAC and PAC genomic DNA library info
BBID Biological Biochemical image database
Cardiac gene database
CHLC Genetic markers on chromosomes
COGENT Complete genome tracking database
COMPEL Composite regulatory elements in eukaryotes
CUTG Codon usage database
dbEST Database of expressed sequences or mRNA
dbGSS Genome survey sequence database
dbSTS Sequence tagged sites (STS)
DBTSS Database of transcriptional start sites
DOGS Database of genome sizes
EID The exon-intron database – Harvard
Exon-Intron Exon-Intron database – Singapore
EPD Eukaryotic promoter database
FlyTrap HTML based gene expression database
GDB The genome database
GenLink Resources for human genetic and telomere research
GeneKnockouts Gene knockout information
GENOTK Human cDNA database
GEO Gene expression omnibus NCBI
GOLD Information on genome projects around the world
GSDB The Genome Sequence DataBase
HGI TIGR human gene index
HTGS High-throughput genomic sequence at NCBI
IMAGE The largest collection of DNA sequences clones
IMGT The international ImMunoGeneTics information system
IPCN Index to Plant Chromosome Numbers database
LocusLink Single query interface to sequence and genetic loci
TelDB The telomere database
MitoDat Mitochondrial nuclear genes
Mouse EST NIA mouse cDNA project
MPSS Searchable databases of several species
NDB Nucleic acid database
NEDO Human cDNA sequence database
NPD Nuclear protein database
Oomycetes DB Oomycetes database at Virginia Bioinformatics Institute
PLACE Database of plant cis-acting regulatory DNA elements
RDP Ribosomal database project
RDB Receptor database at NIHS Japan
Refseq The NCBI reference sequence project
RHdb Radiation hybrid physical map of chromosomes
SHIGAN SHared Information of GENetic resources Japan
SpliceDB Canonical and non-canonical splice site sequences
STACK Consensus human EST database
TAED The adaptive evolution database
TIGR Curated databases of microbes plants and humans
TRANSFAC The Transcription Factor Database
TRRD Transcription Regulatory region database
UniGene Cluster of sequences for unique genes at NCBI
UniSTS Non-redundant collection of STS
Protein Databases
Protein Sequence Databases – Protein Structure Databases – Protein Domains, Motifs
and Signatures – Others
Protein Sequence Databases
Antibodies Sequence and Structure
BRENDA Enzyme database
CD Antigens Database of CD antigens
dbCFC Cytokine family database
Histones Histone sequence database
HPRD Human protein reference database
InterPro Integrated documentation resources for protein families
iProClass An integrated protein classification database
KIND A non-redundant protein sequence database
MHCPEP Database of MHC binding peptides
MIPS Munich information centre for protein sequences
PIR Annotated and non-redundant protein sequence database
PIR-ALN Curated database of protein sequence alignments
PIR-NREF PIR non-redundant reference protein database
PMD Protein mutant database
PRF Protein research foundation Japan
ProClass Non-redundant protein database
ProtoMap Hierarchical classification of swissprot proteins
REBASE Restriction enzyme database
RefSeq Reference sequence database at NCBI
SwissProt Curated protein sequence database
SPTR Comprehensive protein sequence database
Transfac Transcription factor database
TrEMBL Annotated translations of EMBL nucleotide sequences
Tumor gene database Genes with cancer-causing mutations
WD repeats WD-repeat family of proteins
Protein Structure Databases
Cath Protein structure classification
HIV Protease HIV protease database 3D structure
PDB 3-D macromolecular structure data
PSI Protein structure initiative
S2F Structure to function project
Scop Structural Classification of Proteins
Protein Domains, Motifs & Signatures
BLOCKS Multiple aligned segments of conserved protein regions
CDD Conserved domain database and search service
DOMO Homologous protein domain families
Pfam Database of protein domains and HMMs
ProDom Protein domain database
Prints Protein motif fingerprint database
Prosite Database of protein families and domains
SMART Simple modular architecture research tool
TIGRFAM Protein families based on HMMs
Others
Phospho Site Database of phosphorylation sites
PROW Protein reviews on the web
Protein Lounge Complete systems biology
Other Databases
Carbohydrate Databases
Carb DB Carbohydrate Sequence and Structure Database
GlycoWord Glycoscience related information
SPECARB Raman Spectra of carbohydrates
Other Databases
AlzGene Alzheimer's disease
Polygenic pathways Alzheimer's disease, Bipolar disorder or Schizophrenia
Model Organism Databases and Resources
Arabidopsis – Bacteria – Sea Bass – Cat – Cattle – Chicken – Cotton – CyanoBacteria – Daphnia – Deer – Dictyostelium – Dog – Frog – Fruit Fly – Fungus –
Goat – Horse – Medaka Fish – Maize – Malaria – Mosquito – Mouse – Pig – Plants –
Protozoa – Puffer Fish – Rat – Rice – Rickettsia – Salmon – Sheep – Soy –
Sorghum – Tetraodon – Tilapia – Turkey – Viruses – Worm – Yeast – Zebra Fish
General Information
GMOD Generic Model Organism Database
Model Organisms The WWW virtual library of model organisms
Arabidopsis thaliana
ABRC Arabidopsis biological resource center
AGI Arabidopsis genome initiative
AREX Arabidopsis gene expression database
Arabinet Arabidopsis information on the www
AtGDB An Arabidopsis thaliana plant genome database
AtGI TIGR Arabidopsis thaliana gene index
ATGC Genome sequencing at ATGC
ATIDB Arabidopsis insertion database
CSHL Arabidopsis genome analysis at Cold Spring
ESSA Arabidopsis thaliana project at MIPS
Genoscope AGI in France
Kazusa Arabidopsis thaliana genome info Japan
MPSS Massively parallel signature sequencing
NASC Nottingham Arabidopsis stock center
Stanford Sequencing of the Arabidopsis genome at Stanford
TAIR Arabidopsis information resource
TIGR TIGR Arabidopsis genome annotation database
Wustl Arabidopsis genome at Washington university
Trees A forest tree genome database
Bacterial genomes
B. subtilis Bacillus subtilis database
Chlamydomonas Chlamydomonas genetics center
E. coli E. coli genome project
MGD Microbial germ plasm database
Microbial Microbial Genome Gateway
Microbial Microbial genomes
Micado Genetic maps of B. subtilis and E. coli
MycDB An integrated Mycobacterial database
Neisseria Neisseria meningitidis genome
Neurospora Neurospora crassa database
OralGen Oral pathogen database
Salmonella Salmonella information
STDGen Sexually transmitted disease database
Bass
Bass Sea Bass Mapping project
Cat (Felis catus)
Cat ArkDB Cat mapping database
Cattle (Bos taurus)
ARK Farm animals
BoLA Bovine MHC information
Bovin Bovine genome database
BovMap Mapping the bovine genome
CaDBase Genetic diversity in cattle
ComRad Comparative radiation hybrid mapping
Cow ArkDB Bovine ArkDB
GemQual Genetics of meat quality
Chicken (Gallus gallus)
Chicken Poultry gene mapping project
ChickMap Chicken genome project
Chicken ArkDB Chicken database
ChickEST Chick EST database
Poultry Poultry genome project
Cotton
Cotton Cotton data collection site
Cyano Bacteria (Blue green algae)
Cyano Bacteria Anabaena genome
Daphnia (Crustacea)
Daphnia pulex Daphnia genomics consortium
Deer
Deer ArkDB Deer mapping database
Dictyostelium discoideum
Dicty_cDB Dictyostelium discoideum cDNA project
DGP Dictyostelium discoideum genome project
Dictybase Online informatics resources for Dictyostelium
Dog (Canis familiaris)
Dog Dog genome project
Dog genome project
Frog (Xenopus)
Xenbase A Xenopus web resource
Xenopus Xenopus tropicalis genome
Fruit fly (Drosophila melanogaster)
ENSEMBL Drosophila Genome Browser at ENSEMBL
Fruitfly Drosophila genome project at Berkeley
FlyBase A Database of the Drosophila Genome
FlyMove A Drosophila multimedia database
FlyView A Drosophila image database
Fungus
Aspergillus Aspergillus Genomics
Candida Candida albicans information page
FungalWeb Fungi database
FGSC Fungal genetic stocks center
Goat (Capra hircus)
Goat GoatMap mapping the caprine genome
Horse (Equus caballus)
Horse ArkDB Horse mapping database
Medaka Fish
Medaka Medaka fish home page
Maize
Maize Maize genome database
Malaria (Plasmodium spp)
Malaria Malaria genetics and genomics
PlasmoDB Plasmodium falciparum genome database
Parasites Parasite databases of clustered ESTs
Parasite Genome Parasite genome databases
Mosquito
Mosquito Mosquito genome web server
Mouse (Mus musculus)
ENSEMBL Mouse genome server at ENSEMBL
Jackson Lab Mouse Resources
MRC Mouse genome center at MRC UK
MGI Mouse genome informatics at Jackson Labs
MGD Mouse genome database
MGS Mouse genome sequencing at NIH
MIT Genetic and physical maps of the mouse genome
Mouse SNP Mouse SNP database
NCI Mouse repository
NIH NIH mouse initiative
ORNL Mutant mouse database
RIKEN Mouse resources
Rodentia The whole mouse catalog
Pig (Sus scrofa)
INCO Pig trait gene mapping
Pig Pig EST database
Pig Pig gene mapping project
PiGBase Pig genome mapping
Pig ArkDB Pig Ark DB
Plants
PlantGDB Resources for plant comparative genomics
Protozoa
Protozoa Protozoan genomes
Pufferfish
Fugu Puffer fish project UK site
Fugu Fugu genome project Singapore
Fugu Puffer fish project USA
Rat (Rattus norvegicus)
MIT Genetic maps of the Rat genome
NIH Rat genomics and genetics
Rat RatMap
RGD Rat genome database
Rice (Oryza sativa)
MPSS Massively parallel signature sequencing
Rice-research Rice genome sequence database
Rice Rice genome project
Rickettsia
RicBase Rickettsia genome database
Salmon
Salmon ArkDB Salmon mapping database
Sheep (Ovis aries)
Sheep Sheep gene mapping
SheepBase Sheep gene mapping
Sheep ArkDB Sheep mapping database
Soy
Soy Soybeans database
Sorghum
Sorghum Sorghum Genomics
Tetraodon
Tetraodon Tetraodon nigroviridis genome
Tetraodon Tetraodon nigroviridis genome at Whitehead
Tilapia
HCGS Tilapia genome
Tilapia ArkDB Tilapia mapping database
Turkey
Turkey ArkDB Turkey mapping database
Viruses
HIV HIV sequence database
Herpes Human herpes virus 5 database
Worm (Caenorhabditis elegans)
C. elegans C. elegans genome sequencing project
NemBase Resource for nematode sequence and functional data
WormAtlas Anatomy of C. elegans
WormBase The Genome and biology of C. elegans
ACEDB A C. elegans database
WWW Server C. elegans web server
Yeast
SCPD The promoter database of Saccharomyces cerevisiae
SGD Saccharomyces genome database
S. pombe Schizosaccharomyces pombe genome project
TRIPLES Functional analysis of Yeast genome at Yale
Yeast Intron database Spliceosomal introns of the yeast
Zebra fish (Danio rerio)
ZFIN Zebrafish information network
ZGR Zebrafish genome resources
ZIS Zebrafish information server
Zebrafish Zebrafish webserver
DOMAIN DATABASE
Domains can be thought of as distinct functional and/or structural units of a
protein. These two classifications coincide rather often, as a matter of fact, and what is
found as an independently folding unit of a polypeptide chain also carries a specific
function. Domains are often identified as recurring (sequence or structure) units,
which may exist in various contexts. In molecular evolution such domains may have been utilized as building blocks, and may have been recombined in different
arrangements to modulate protein function. We can define conserved domains as
recurring units in molecular evolution, the extents of which can be determined by
sequence and structure analysis. Conserved domains contain conserved sequence patterns or motifs, which allow for their detection in polypeptide sequences.
The goal of the NCBI conserved domain curation project is to provide database users with
insights into how patterns of residue conservation and divergence in a family relate to functional
properties, and to provide useful links to more detailed information that may help to understand
those sequence/structure/function relationships. To do this, CDD curators include the following
types of information in order to supplement and enrich the traditional multiple sequence
alignments that form the foundation of domain models: 3-dimensional structures and conserved core motifs, conserved features/sites, phylogenetic organization, and links to electronic literature
resources.
CDD
Conserved Domain Database (CDD)
CDD is a protein annotation resource that consists of a collection of well-annotated multiple sequence alignment models for ancient domains and full-length proteins. These are available as position-specific score matrices (PSSMs) for fast
identification of conserved domains in protein sequences via RPS-BLAST. CDD content includes NCBI manually curated domains, which use 3D-structure information to explicitly define domain boundaries and provide insights into sequence/structure/function relationships, as well as domain models imported from a number of external source databases (Pfam, SMART, COG, PRK, TIGRFAM).
CD-Search & Batch CD-Search
CD-Search is NCBI's interface to searching the Conserved
Domain Database with protein or nucleotide query sequences. It uses RPS-BLAST, a variant of PSI-BLAST, to
quickly scan a set of pre-calculated position-specific scoring matrices (PSSMs) with a protein query. The results of CD-Search are presented as an annotation of protein domains on the user query sequence and can be visualized as domain multiple sequence alignments with embedded user queries. High-confidence associations between a query sequence and conserved domains are shown as specific hits. The CD-Search Help provides additional details, including
information about running CD-Search locally.
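To make the idea of scanning a query against a position-specific scoring matrix concrete, here is a minimal sketch. It is not RPS-BLAST: it slides an ungapped window along the query and sums the per-position scores, whereas the real program uses seeded, gapped alignment and statistical significance estimates. The toy matrix, the default penalty for unlisted residues, and all names are assumptions made for illustration.

```python
# Toy PSSM: one dict per model position, mapping residue -> score.
TOY_PSSM = [
    {"G": 6, "A": 1},
    {"K": 5, "R": 3},
    {"S": 4, "T": 4},
]


def best_pssm_hit(query, pssm, default=-4):
    """Return (start, score) of the best ungapped window, or (None, None)
    if the query is shorter than the model."""
    width = len(pssm)
    best_start, best_score = None, None
    for start in range(len(query) - width + 1):
        # Sum the column scores for the residues in this window.
        score = sum(
            column.get(query[start + i], default)
            for i, column in enumerate(pssm)
        )
        if best_score is None or score > best_score:
            best_start, best_score = start, score
    return best_start, best_score


print(best_pssm_hit("MAGKSL", TOY_PSSM))  # (2, 15)
```

The window starting at index 2 ("GKS") matches the preferred residue in every column, so it scores 6 + 5 + 4 = 15. In a real search the same query would be scored against every model in the collection and only statistically significant hits reported.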
Batch CD-Search serves as both a web application and a script interface for a conserved domain search on multiple protein sequences, accepting up to 100,000 proteins in a single job. It enables you to view a graphical display of the concise or full search result for any individual
protein from your input list, or to download the results for the complete set of proteins. The Batch CD-Search Help provides additional details.
CDART: Domain Architectures
The Conserved Domain Architecture Retrieval Tool (CDART) performs similarity searches of the Entrez Protein database based on domain architecture, defined as the sequential order of conserved domains in protein queries. CDART finds protein similarities across significant evolutionary distances using sensitive domain profiles rather than direct sequence similarity. Proteins similar to the query are grouped and scored by architecture. You can search CDART directly with a query protein sequence or, if a sequence of interest is already in the Entrez Protein database, simply retrieve the record, open its Links menu and select Domain Relatives to see the precalculated CDART results. Relying on domain profiles allows CDART to be fast and, because it relies on annotated functional domains, informative.
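The notion of grouping and scoring proteins by architecture can be sketched in a few lines of Python. This is only an illustration of the idea, not CDART's actual algorithm or scoring: the protein IDs, the domain names and the "number of shared domains" score are assumptions.

```python
from collections import defaultdict


def group_by_architecture(proteins):
    """Group protein IDs by their ordered tuple of domain names."""
    groups = defaultdict(list)
    for protein_id, domains in proteins.items():
        groups[tuple(domains)].append(protein_id)
    return dict(groups)


def shared_domains(query_arch, other_arch):
    """Count distinct domains two architectures share (a crude score)."""
    return len(set(query_arch) & set(other_arch))


# Hypothetical proteins and domain names, for illustration only.
proteins = {
    "protA": ["SH3", "SH2", "Pkinase"],
    "protB": ["SH3", "SH2", "Pkinase"],
    "protC": ["SH2", "Pkinase"],
    "protD": ["PDZ"],
}
groups = group_by_architecture(proteins)
query = ("SH3", "SH2", "Pkinase")
ranked = sorted(groups, key=lambda arch: shared_domains(query, arch), reverse=True)
for arch in ranked:
    print(arch, groups[arch], shared_domains(query, arch))
```

Because the comparison works on domain names rather than residues, two proteins with little direct sequence similarity can still fall into the same group.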
CDTree
CDTree is a helper application for your web browser that allows you to interactively view and examine conserved domain hierarchies curated at NCBI. CDTree works with Cn3D as its alignment viewer/editor; it is used in the CDD curation process and is both a classification and research tool for functional annotation and the study of protein and protein domain families.
Content
CDD content includes NCBI manually curated domain models and domain models imported from
a number of external source databases (Pfam, SMART, COG, PRK, TIGRFAMs). What is unique
about NCBI-curated domains is that they use 3D-structure information to explicitly define domain
boundaries, align blocks, amend alignment details, and provide insights into
sequence/structure/function relationships. Manually curated models are organized hierarchically if
they describe domain families that are clearly related by common descent. To provide a
non-redundant view of the data, CDD clusters similar domain models from various sources into
superfamilies.
Searching the database
The collection is also part of NCBI's Entrez query and retrieval system, crosslinked to numerous other resources. CDD provides annotation of domain footprints and conserved functional sites on protein sequences. Precalculated domain annotation can be retrieved for protein sequences tracked in NCBI's Entrez system, and CDD's collection of models can be queried with novel protein sequences via the CD-Search service or via Batch CD-Search, which allows the computation and download of annotation for large sets of protein queries.
CDD also contains data from additional research projects, such as KOGs (a eukaryotic counterpart to COGs) and the Library of Ancient Domains (LOAD). Accessions that start with cl are for superfamily cluster records and can contain domain models from one or more source databases. When searching CDD, it is possible to limit search results to domains from any given source database by using the Database Search Field.
Phylogenetic organization: Based on evidence from sequence comparison, NCBI Conserved Domain Curators attempt to organize related domain models into phylogenetic family hierarchies.
Links to electronic literature resources: NCBI curated domains also provide links to citations in PubMed and NCBI Bookshelf that discuss the domain. These references are selected by curators and, whenever possible, include articles that provide evidence for the biological function of the domain and/or discuss the evolution and classification of a domain family.
PFAM
Pfam 27.0 (March 2013, 14,831 families)
Proteins are generally composed of one or more functional regions, commonly termed domains. The presence of different domains in varying combinations in different proteins gives rise to the diverse repertoire of proteins found in nature. Identifying the domains present in a protein can provide insights into the function of that protein.
The Pfam database is a large collection of protein domain families. Each family is represented by multiple sequence alignments and hidden Markov models (HMMs).
There are two levels of quality to Pfam families: Pfam-A and Pfam-B. Pfam-A entries are derived from the underlying sequence database, known as Pfamseq, which is built from the most recent release of UniProtKB at a given time-point. Each Pfam-A family consists of a curated seed alignment containing a small set of representative members of the family, profile hidden Markov models (profile HMMs) built from the seed alignment, and an automatically generated full alignment, which contains all detectable protein sequences belonging to the family, as defined by profile HMM searches of primary sequence databases.
Pfam-B families are un-annotated and of lower quality, as they are generated automatically from the non-redundant clusters of the latest ADDA release. Although of lower quality, Pfam-B families can be useful for identifying functionally conserved regions when no Pfam-A entries are found.
Pfam entries are classified in one of four ways:
Family: A collection of related protein regions
Domain: A structural unit
Repeat: A short unit which is unstable in isolation but forms a stable structure when multiple copies are present
Motif: A short unit found outside globular domains
Related Pfam entries are grouped together into clans; the relationship may be defined by similarity of sequence, structure or profile-HMM.
Pfam also generates higher-level groupings of related families, known
as clans. A clan is a collection of Pfam-A entries which are related by
similarity of sequence, structure or profile-HMM.
You can find data in Pfam in various ways:
Sequence search: Analyze your protein sequence for Pfam matches
View a Pfam family: View Pfam family annotation and alignments
View a clan: See groups of related families
View a sequence: Look at the domain organisation of a protein sequence
View a structure: Find the domains on a PDB structure
Keyword search: Query Pfam by keywords
Jump to: Enter any type of accession or ID to jump to the page for a Pfam family or clan, UniProt sequence, PDB structure, etc.
Browse Pfam
You can use the links below to find lists of families, clans or proteomes which begin with the chosen letter (or number). You can
also see a list of Pfam families which are new to this release, or the list of the twenty largest families in terms of number of sequences.
Glossary of terms used in Pfam
These are some of the commonly used terms in the Pfam website
Alignment coordinates
HMMER3 reports two sets of domain coordinates for each profile HMM match. The envelope coordinates delineate the region on the sequence where the match has been probabilistically determined to lie, whereas the alignment coordinates delineate the region over which HMMER is confident that the alignment of the sequence to the profile HMM is correct. Our full alignments contain the envelope coordinates from HMMER3.
Architecture
The collection of domains that are present on a protein
Clan
A collection of related Pfam entries. The relationship may be defined by similarity of sequence, structure or profile-HMM.
Domain
A structural unit
Domain score
The score of a single domain aligned to an HMM. Note that, for HMMER2, if there was more than one domain, the sequence score was the sum of all the domain scores for that Pfam entry. This is not quite true for HMMER3.
DUF
Domain of unknown function
Envelope coordinates
See Alignment coordinates
Family
A collection of related protein regions
Full alignment
An alignment of the set of related sequences which score higher than the manually set threshold values for the HMMs of a particular Pfam entry.
Gathering threshold (GA)
Also called the gathering cut-off, this value is the search threshold used to build the full alignment. The gathering threshold is assigned by a curator when the family is built. The GA is the minimum score a sequence must attain in order to belong to the full alignment of a Pfam entry. For each Pfam HMM we have two GA cutoff values: a sequence cutoff and a domain cutoff.
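As a rough illustration of how two cutoffs might be applied together, the Python sketch below filters a hit with a sequence cutoff and a domain cutoff. The rule used (the sequence score must reach the sequence cutoff, and each domain must individually reach the domain cutoff) and the function name are assumptions for this example, not a description of Pfam's pipeline.

```python
def passes_gathering_threshold(sequence_score, domain_scores, seq_ga, dom_ga):
    """Return the domain scores that pass, or [] if the sequence cutoff fails.

    Assumed rule for this sketch: the sequence must score >= seq_ga, and each
    reported domain must individually score >= dom_ga.
    """
    if sequence_score < seq_ga:
        return []
    return [score for score in domain_scores if score >= dom_ga]


# Hypothetical bit scores and cutoffs.
print(passes_gathering_threshold(45.0, [30.0, 15.0], seq_ga=25.0, dom_ga=20.0))
print(passes_gathering_threshold(20.0, [20.0], seq_ga=25.0, dom_ga=20.0))
```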
HMMER
The suite of programs that Pfam uses to build and search HMMs. Since Pfam release 24.0 we have used HMMER version 3 to make Pfam. See the HMMER site for more information.
Hidden Markov model (HMM)
An HMM is a probabilistic model. In Pfam we use HMMs to transform the information contained within a multiple sequence alignment into a position-specific scoring system. We search our HMMs against the UniProt protein database to find homologous sequences.
HMMER3
The suite of programs that Pfam uses to build and search HMMs. See the HMMER site for more information.
iPfam
A resource that describes domain-domain interactions that are observed in PDB entries. Where two or more Pfam domains occur in a single structure, it analyses them to see if they are close enough to form an interaction. If they are close enough, it calculates the bonds forming the interaction.
Metaseq
A collection of sequences derived from various metagenomics datasets
Motif
A short unit found outside globular domains
Noise cutoff (NC)
The bit scores of the highest scoring match not in the full alignment
Pfam-A
An HMM-based, hand-curated Pfam entry which is built using a small number of representative sequences. We manually set a threshold value for each HMM and search our models against the UniProt database. All of the sequences which score above the threshold for a Pfam entry are included in the entry's full alignment.
Pfam-B
An automatically generated alignment which is formed by taking a cluster of sequences from the ADDA database and removing Pfam-A residues from them. Since Pfam-B families are automatically generated, we recommend that you verify that the sequences in a Pfam-B family are related using other methods, such as BLAST. For Pfam 24.0 we have made HMMs for the first (and therefore largest) 20,000 Pfam-B families. Users can search their sequences against the Pfam-B HMMs, in addition to the Pfam-A HMMs, when performing both single-sequence searches and batch searches on the website.
Posterior probability
HMMER3 reports a posterior probability for each residue that matches a match or insert state in the profile HMM. A high posterior probability shows that the alignment of the amino acid to the match/insert state is likely to be correct, whereas a low posterior probability indicates that there is alignment uncertainty. This is indicated on a scale from 10 (the highest certainty) down to 1 (complete uncertainty). Within Pfam we display this information as a heat map view, where green residues indicate high posterior probability and red ones indicate a lower posterior probability.
Repeat
A short unit which is unstable in isolation but forms a stable structure when
multiple copies are present
Seed alignment
An alignment of a set of representative sequences for a Pfam entry. We use this alignment to construct the HMMs for the Pfam entry.
Sequence score
The total score of a sequence aligned to an HMM. If there is more than one domain, the sequence score is the sum of all the domain scores for that Pfam entry. If there is only a single domain, the sequence and the domain scores for the protein will be identical. We use the sequence score to determine whether a sequence belongs to the full alignment of a particular Pfam entry.
Trusted cutoff (TC)
The bit scores of the lowest scoring match in the full alignment
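The relationship between the gathering threshold, the trusted cutoff and the noise cutoff can be sketched as follows. The scores are invented, and the function name trusted_and_noise_cutoffs is an assumption for this example; it simply applies the glossary definitions to a list of bit scores.

```python
def trusted_and_noise_cutoffs(bit_scores, gathering_threshold):
    """Split scores at GA; TC is the lowest score kept, NC the highest rejected."""
    included = [s for s in bit_scores if s >= gathering_threshold]
    excluded = [s for s in bit_scores if s < gathering_threshold]
    tc = min(included) if included else None
    nc = max(excluded) if excluded else None
    return tc, nc


# Hypothetical bit scores for matches to one Pfam HMM, with GA = 25.0.
scores = [80.2, 31.5, 25.1, 24.3, 10.0]
print(trusted_and_noise_cutoffs(scores, 25.0))
```

By construction the noise cutoff lies below the gathering threshold and the trusted cutoff lies at or above it.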
SMART (a Simple Modular Architecture Research Tool) allows the identification and annotation of genetically mobile domains and the analysis of domain architectures. More than 500 domain families found in signalling, extracellular and chromatin-associated proteins are detectable. These domains are extensively annotated with respect to phyletic distributions, functional class, tertiary structures and functionally important residues. Each domain found in a non-redundant protein database, as well as search parameters and taxonomic information, are stored in a relational database system. User interfaces to this database allow searches for proteins containing specific combinations of domains in defined taxa.
Features
You can use SMART in two different modes: normal or genomic. The main difference is in the underlying
protein database used. In Normal SMART, the database contains Swiss-Prot, SP-TrEMBL and stable
Ensembl proteomes. In Genomic SMART, only the proteomes of completely sequenced genomes are used:
Ensembl for metazoans and Swiss-Prot for the rest.
The protein database in Normal SMART has significant redundancy, even though identical proteins are
removed. If you use SMART to explore domain architectures, or want to find exact domain counts in various
genomes, consider switching to Genomic mode. The numbers in the domain annotation pages will be more
accurate, and there will not be many protein fragments corresponding to the same gene in the architecture
query results. Remember that you are exploring a limited set of genomes, though.
Different color schemes are used to easily identify the mode you are in.
Alignment
Representation of a prediction of the amino acids in tertiary structures of homologues that overlay in three dimensions. Alignments held by SMART are mostly based on published observations (see domain annotations for details), but are updated and edited manually.
Alignment block
Ungapped alignments that usually represent a single secondary structure
Bits scores
Alignment scores are reported by HMMer and BLAST as bits scores. The likelihood that the query sequence is a bona fide homologue of the database sequence is compared to the likelihood that the sequence was instead generated by a random model. Taking the logarithm (to base 2) of this likelihood ratio gives the bits score.
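The definition above amounts to a log-odds score in base 2. A minimal Python sketch, with invented likelihood values and a hypothetical function name bits_score:

```python
import math


def bits_score(likelihood_homologue, likelihood_random):
    """log2 of the ratio between the model likelihood and the random likelihood."""
    return math.log2(likelihood_homologue / likelihood_random)


# A sequence that is 8 times more likely under the homology model scores 3 bits.
print(bits_score(8.0, 1.0))
# Equal likelihoods give 0 bits; a less likely sequence gives a negative score.
print(bits_score(1.0, 1.0), bits_score(1.0, 4.0))
```

Each additional bit therefore corresponds to a doubling of the likelihood ratio in favour of homology.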
BLAST: Basic local alignment search tool
An excellent database searching tool developed at the National Center for Biotechnology Information (NCBI) ([1] [2] [3] [4]). SMART uses NCBI-BLAST for detection of outlier homologues and homologues of known structure. WU-BLAST is used for nrdb searches with user-supplied sequences.
Cellular Role
Chromatin-associated: Chromatin is the tangled fibrous complex of DNA and protein within a eukaryotic nucleus.
Interaction (with the environment): Molecules that sense cellular environmental change, such as osmolarity, light flux, acidity, ion concentration, etc.
Metabolic: Enzymes that catalyze reactions in living cells that transform organic molecules.
Replication: The process of making an identical copy of a section of duplex (double-stranded) DNA, using existing DNA as a template for the synthesis of new DNA strands.
Signalling: Proteins that participate in a pathway that is initiated by a molecular stimulus and terminated by a cellular response.
Transport: Transporters (permeases) are proteins that straddle the cell membrane and carry specific nutrients, ions, etc. across the membrane.
Translation: The process in which the genetic code carried by messenger RNA directs the synthesis of proteins from amino acids.
Transcription: The synthesis of an RNA copy from a sequence of DNA (a gene); the first step in gene expression.
Coiled coils
Intimately-associated bundles of long alpha-helices ([1] [2] [3]). Coiled coils are detected in SMART using the method of Lupas et al ([4], COILS home). Coiled coil predictions are indicated on the second line in SMART's graphical output.
Domain
Conserved structural entities with distinctive secondary structure content and a hydrophobic core. In small disulphide-rich and Zn2+-binding or Ca2+-binding domains the hydrophobic core may be provided by cystines and metal ions, respectively. Homologous domains with common functions usually show sequence similarities.
Domain composition
Proteins with the same domain composition have at least one copy of each of the domains of the query.
Domain organisation
Proteins having all the domains of the query in the same order (additional domains are allowed).
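The difference between domain composition and domain organisation can be illustrated with a short Python sketch. The domain names are hypothetical, and the reading that additional domains may appear anywhere in the target (not only at the ends) is an assumption of this example.

```python
def same_composition(query, target):
    """True if the target has at least one copy of each query domain."""
    return set(query) <= set(target)


def same_organisation(query, target):
    """True if the query domains occur in the target in the same order.

    Extra target domains are allowed anywhere (subsequence match).
    """
    remaining = iter(target)
    # "domain in remaining" advances the iterator until the domain is found,
    # so later query domains can only match further along the target.
    return all(domain in remaining for domain in query)


query = ["SH3", "SH2"]
for target in (["SH3", "PH", "SH2"], ["SH2", "SH3"], ["SH3"]):
    print(target, same_composition(query, target), same_organisation(query, target))
```

A target with the query's domains in reversed order matches on composition but not on organisation.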
Entrez
A WWW-based system that allows easy retrieval of sequence, structure, molecular biology and literature data. SMART's domain annotation pages contain links to the Entrez system, thereby providing extensive literature, structure and sequence information.
E-value
This represents the number of sequences with a score greater than or equal to X expected absolutely by chance. The E-value connects the score (X) of an alignment between a user-supplied sequence and a database sequence, generated by any algorithm, with how many alignments with similar or greater scores would be expected from a search of a random sequence database of equivalent size. Since version 2.0, E-values are calculated using Hidden Markov Models, leading to more accurate estimates than before.
Extracellular Domains
Domain families that are most prevalent in proteins outside of the cytoplasm and the nucleus
Gap
A position in an alignment that represents a deletion within one sequence relative to another. Gap penalties are requirements for alignment algorithms in order to reduce excessively-gapped regions. Gaps in alignments represent insertions that usually occur in protruding loops or beta-bulges within protein structures.
Genomic database
Protein database used in SMART's Genomic mode. It contains data from completely sequenced genomes only. Ensembl data is used for Metazoan genomes and Swiss-Prot for others. A complete list of genomes in the database is available.
HMM: Hidden Markov model
HMMs are statistical models of the sequence consensus of a homologous family (see the documentation). A particular class of HMMs has been shown to be equivalent to generalised profiles (8867839). Applications of HMMs to sequence analysis are provided by HMMer and SAM.
HMM consensus
The HMM consensus is a one-line summary of the corresponding HMM. The amino acid shown for the consensus is the highest probability amino acid at that position according to the HMM. Capital letters mean highly conserved residues (probability > 0.5 for protein models) (modified from the HMMer User's Guide).
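The consensus rule described above can be sketched in Python. The per-position probabilities are invented, and the lower-case rendering of weakly conserved positions is an assumption of this example (the text only states that capitals mark probabilities above 0.5).

```python
def hmm_consensus(position_probs, threshold=0.5):
    """Build a one-line consensus from per-position residue probabilities.

    The most probable residue is shown at each position, in upper case when
    its probability exceeds the threshold and in lower case otherwise.
    """
    letters = []
    for probs in position_probs:
        residue, p = max(probs.items(), key=lambda item: item[1])
        letters.append(residue.upper() if p > threshold else residue.lower())
    return "".join(letters)


# Hypothetical emission probabilities for a three-position model.
model = [
    {"G": 0.9, "A": 0.1},
    {"L": 0.4, "I": 0.35, "V": 0.25},
    {"K": 0.6, "R": 0.4},
]
print(hmm_consensus(model))
```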
HMMer
The HMMer package ([1] [2]) provides multiple alignment and database searching capabilities. There are several programs in the package (see the documentation), including one (hmmfs) that searches databases for non-overlapping LOCAL similarities (i.e. that match across at least part of the HMM) and another (hmmls) that searches databases for non-overlapping GLOBAL similarities (i.e. that match across the full HMM). These correspond approximately to profile-based searches using negative and positive profiles, respectively (see WiseTools). Database searches using hmmls or hmmfs provide alignment scores as bits scores.
Homology
Evolutionary descent from a common ancestor due to gene duplication
Intracellular Domains
Domain families that are most prevalent in proteins within the cytoplasm
Localisation
Numbers of domains that are thought from SwissProt annotations to be present in different cellular compartments (cytoplasm, extracellular space, nucleus, and membrane-associated) are shown in annotation pages.
Motif
Sequence motifs are short conserved regions of polypeptides. Sets of sequence motifs need not necessarily represent homologues.
NRDB: non-redundant database
A database that contains no identical pairs of sequences. It can contain multiple sequences originating from the same gene (fragments, alternative splicing products). SMART in Standard mode uses such a database.
ORF
Open reading frame
Outlier homologues
These are often difficult to detect using HMM methodology. A complementary approach to their detection is to query a database of sequences taken from multiple sequence alignments using BLAST. Selecting this option will also activate searches against sequence databases derived from proteins of known structure. A simple BLAST search of the PDB is performed, together with a search of RPS-Blast profiles derived from SCOP. These profiles were kindly provided by Steffen Schmidt (see Schmidt et al., J Chem Inf Comput Sci 2002 (42) 405-7).
P-value
This represents a probability that, given a database of a particular size, random sequences score higher than a value X. P-values are generated by the BLAST algorithm that has been integrated into SMART.
PDB: protein data bank
PDB is an archive of experimentally-determined three-dimensional structures (Brookhaven Nat. Labs or EBI). Domain families represented in SMART and in the PDB are annotated as being of known structure; links are provided in SMART to the PDB via PDBsum and MMDB. PDBsum links can be used to access a variety of sequence-based and structure-based tools, whereas MMDB provides access to literature information and structure similarities.
PFAM
Pfam is a database of protein domain families represented as (i) multiple alignments and (ii) HMM-profiles ([1] [2]). Pfam WWW servers allow comparison of user-supplied sequences with the Pfam database (Sanger Center and Washington Univ). SMART contains a facility to search the Pfam collection using HMMer.
Prokaryotic domains
SMART now also searches for domains found in two-component regulatory systems. These can be found mainly in prokaryotes, but a few were also found in eukaryotes like yeast and plants.
Profile
A profile is a table of position-specific scores and gap penalties, representing a homologous family, that may be used to search sequence databases (Ref [1] [2] [3]). In CLUSTAL-W-derived profiles those sequences that are more distantly related are assigned higher weights ([4] [5] [6]). Issues in profile-based database searching are discussed in Bork & Gibson (1996) [7].
ProfileScan
An excellent WWW server that allows a user to compare a protein or DNA sequence against a database of profiles (located at the ISREC).
PROSITE
This is a dictionary of protein sites and motif patterns. Some SMART domain annotations contain links to PROSITE.
Schnipsel database: domain sequence database
Schnipsel is a German word meaning snippet or fragment. The schnipsel database consists of the sequences of all domains found with SMART in NRDB. Outliers of a family often cannot be detected by a profile, yet are detectable by pairwise similarity to one or more established members of a sequence family. So searching against the schnipsel database gives complementary information to the profile searches.
Searched domains
In the first version of SMART only eukaryotic signalling domains could be searched. In 1998 we extended this set with prokaryotic signalling and extracellular domains. In the input page you can choose whether you want to search only cytoplasmic (prokaryotic + eukaryotic), extracellular or all domains.
Secondary Literature
The secondary literature is derived by the following procedure: For each of the hand-selected papers referenced by a domain, 100 neighbouring papers are retrieved using Medline. If one of these neighbouring papers is referenced from more than two original papers, it is included in the secondary literature list.
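The selection rule in the procedure above can be sketched in Python. Retrieval of neighbouring papers from Medline is not shown; the sketch assumes the neighbour lists are already available, and the paper IDs and the function name secondary_literature are hypothetical.

```python
from collections import defaultdict


def secondary_literature(neighbours_by_paper, min_sources=3):
    """Return neighbour papers reached from at least `min_sources` originals.

    'More than two original papers' is read here as three or more.
    """
    sources = defaultdict(set)
    for original, neighbours in neighbours_by_paper.items():
        for paper in neighbours:
            sources[paper].add(original)
    return sorted(p for p, origins in sources.items() if len(origins) >= min_sources)


# Hypothetical hand-selected papers (P*) and their retrieved neighbours (N*).
neighbours = {
    "P1": ["N1", "N2"],
    "P2": ["N1", "N3"],
    "P3": ["N1", "N2"],
}
print(secondary_literature(neighbours))
```

Here only N1 is a neighbour of more than two original papers, so it is the only one added to the secondary list.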
Seed Alignment
Alignment that contains only one of each pair of homologues that are represented in a CLUSTALW-derived phylogenetic tree linked by a branch of length less than a distance of 0.2 (see the related article).
SEG
A program of Wootton & Federhen [1] that detects regions of the query sequence that have low compositional complexity [2].
Sequence ID or ACC
Sequence identifiers or accession codes may be entered via the SMART homepage to initiate aquery You can use either Uniprot or Ensembl sequence identifiers
Signalling Domains
The original set of domains used in SMART were collected as those that satisfied one or both of two criteria:
1. cytoplasmic domains that possess kinase, phosphatase, ubiquitin ligase or phospholipase enzymatic activities, or those that stimulate GTPase-activation or guanine nucleotide exchange
2. cytoplasmic domains that occur in at least two proteins with different domain organisations, of which one also contains a domain that satisfies criterion 1
These domains mostly mediate or regulate the transduction of an extracellular signal towards the nucleus, resulting in the initiation of a cellular response. More recently, prokaryotic two-component signalling domains have been added to the SMART set.
SignalP
This program predicts the presence and location of signal peptide cleavage sites in amino acid sequences (SignalP home page).
SMART: Simple Modular Architecture Research Tool
Our main goal in providing this tool is to allow automatic identification and annotation of domains in user-supplied protein sequences.
Species
Numbers of domains present in a variety of selected taxa (animal, archaea, bacteria, fungi, plants and protozoa) are shown in annotation pages.
SwissProt
The SwissProt database is an extensively annotated and non-redundant collection of protein sequences. SwissProt annotations have been mined for SMART-derived annotations of alignments.
TMHMM2
This program predicts the location and topology of transmembrane helices (TMHMM2)
Thresholds
For each of the domains found by SMART, a combination of thresholds is used to distinguish between true and false hits. The different thresholds are described in the SMART paper.
WiseTools
A package that is based on database searches using profiles. Profiles may be generated using PairWise and then compared with sequence databases using SearchWise. Scores are generated for alignments that match the whole of the profile (using a positive profile) or else that match at least part of the profile (using a negative profile). Only the top-scoring optimal alignment of each sequence is reported; hence, SMART relies on re-iterating the search for new repeats until none are reported that score above threshold. Score thresholds have been set manually that are considered to represent a score just above the top-scoring true negative. Additional thresholds have been estimated for domains that are repeated in single polypeptides. A more recent package allows comparison of DNA sequences at the level of their conceptual translations, regardless of sequence error and introns (see Wise2).
What's new
Changes from version 6.0 to 7.0
Full text search engine
Perform a full text search of SMART and Pfam domain annotations, plus the complete protein descriptions for Uniprot and Ensembl proteins.
metaSMART
Explore and compare domain architectures in various publicly available metagenomics datasets.
iTOL export and visualization
Domain architecture analysis results can be exported and visualized in interactive Tree Of Life. The new option can be found in the protein list function select list.
User interface cleanup
Various small changes to the UI resulting in faster and easier navigation.
Changes from version 5.1 to 6.0
Metabolic pathways information
SMART domains and proteins available in the genomic mode now have basic metabolic pathways information. It is generated by mapping our genomic mode protein database to the KEGG orthologous groups.
Updated genomic mode protein database
The greatly expanded genomic mode database now includes sequences from 630 completely sequenced species.
Changes from version 5.0 to 5.1
SMART webservice
You can access SMART using our webservice. Check the WSDL file for details. The webservice is still under active development and only works for sequences and IDs which are in our database (complete Uniref100 and stable Ensembl genomes).
SMART DAS server
The SMART DAS server is available at URL http://smart.embl.de/smart/das. It provides all the protein features from our database (SMART domains, signal peptides, transmembrane regions and coiled coils) for all Uniref100 and Ensembl proteins.
If you need help with these services or have questions/feedback, please contact us.
Changes from version 4.1 to 5.0
New protein database in Normal mode
SMART now uses Uniprot as the main source of protein sequences. All Ensembl proteomes (except pre-releases) are also included. To lower redundancy in the database, the following procedure is used:
o only one copy of 100% identical proteins is kept (different IDs are still available)
o each species' proteins are separated
o CD-HIT clustering with 96% identity cutoff is performed on each species separately
o longest member of each protein cluster is used as the representative
o only representative cluster members and single proteins (i.e. proteins which are not members of any clusters) are used in all domain architecture queries and for domain counts in the annotation pages
Even though the number of proteins in the database is almost doubled (current version has around 2.9 million proteins), the redundancy should be minimal.
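The redundancy-reduction steps listed above can be sketched in Python. The clustering itself (CD-HIT at a 96% identity cutoff, run per species) is not reproduced: the sketch assumes the cluster memberships are supplied by an external tool, and the function name, protein IDs and sequences are hypothetical.

```python
def pick_representatives(sequences, clusters):
    """sequences: {protein_id: sequence}; clusters: lists of protein IDs
    (assumed to come from an external clustering step).
    Returns the sorted IDs kept after redundancy reduction."""
    # Keep one ID per distinct sequence (the first ID in sorted order).
    first_id = {}
    for pid in sorted(sequences):
        first_id.setdefault(sequences[pid], pid)
    unique_ids = set(first_id.values())

    kept, clustered = set(), set()
    for cluster in clusters:
        clustered.update(cluster)
        members = [pid for pid in cluster if pid in unique_ids]
        if members:
            # The longest member represents the whole cluster.
            kept.add(max(members, key=lambda pid: len(sequences[pid])))
    # Proteins that belong to no cluster are kept as they are.
    kept.update(unique_ids - clustered)
    return sorted(kept)


sequences = {"a": "MKTAYIAK", "b": "MKTAYIAK", "c": "MKTAYIAKQR", "d": "MSSS"}
print(pick_representatives(sequences, clusters=[["a", "b", "c"]]))
```

In this toy input, "b" duplicates "a", "c" is the longest member of the cluster and "d" is a single protein, so only "c" and "d" remain.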
Domain architecture invention dating
As a further step from the single domain to the understanding of multi-domain proteins, SMART now predicts the taxonomic class where the concept of a protein, that is its domain architecture, was invented. The domain architecture is defined as the linear order of all SMART domains in the protein sequence. To derive the point of its invention, all proteins with the same domain architecture are mapped onto NCBI's taxonomy. The last common ancestor of all organisms containing at least one protein with the domain architecture is defined as the point of its origin.
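Finding the last common ancestor on a taxonomy can be sketched as follows. The tiny child-to-parent table is a simplified toy, not the NCBI taxonomy, and the function names are assumptions made for this example.

```python
def lineage(taxon, parent):
    """Return the path from a taxon up to the root, taxon first."""
    path = [taxon]
    while taxon in parent:
        taxon = parent[taxon]
        path.append(taxon)
    return path


def last_common_ancestor(taxa, parent):
    """Deepest node shared by the lineages of all given taxa."""
    taxa = list(taxa)
    common = set(lineage(taxa[0], parent))
    for taxon in taxa[1:]:
        common &= set(lineage(taxon, parent))
    # Walking up from any taxon, the first shared node is the deepest one.
    for node in lineage(taxa[0], parent):
        if node in common:
            return node
    return None


# Simplified toy taxonomy (child -> parent).
parent = {
    "Homo sapiens": "Mammalia",
    "Mus musculus": "Mammalia",
    "Mammalia": "Metazoa",
    "Drosophila melanogaster": "Metazoa",
    "Metazoa": "Eukaryota",
    "Saccharomyces cerevisiae": "Fungi",
    "Fungi": "Eukaryota",
}
print(last_common_ancestor(["Homo sapiens", "Mus musculus"], parent))
print(last_common_ancestor(["Homo sapiens", "Saccharomyces cerevisiae"], parent))
```

An architecture seen only in human and mouse would be dated to the mammalian node in this toy tree, while one also found in yeast would be dated to the eukaryotic root.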
Changes from version 4.0 to 4.1
Two modes of operation: Normal and Genomic
For more details visit the change mode page.
Intrinsic protein disorder prediction
You can now include DisEMBL prediction of protein disorder in your search parameters. The results are displayed as blue regions in protein schematics. DisEMBL's HOTLOOPS and REM465 methods are currently used. Visit the DisEMBL page for more information on the method.
Catalytic activity check
SMART now includes data on specific requirements for catalytic activity for some of our domains (50 catalytic domains at the moment). If the required amino acids are not present in the predicted domain, it will be marked as Inactive. The domain annotation page will show you the details on which amino acids are missing and the links to relevant literature.
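The catalytic activity check can be pictured as a lookup of required residues at fixed positions. The following Python sketch is an illustration only: the positions, the residues and the function name are invented, and SMART's real check works on positions within its own domain alignments.

```python
def check_catalytic_residues(domain_seq, required):
    """required: {0-based position in the domain: set of allowed residues}.

    Returns (is_active, positions whose requirement is not met).
    """
    missing = [
        pos
        for pos, allowed in sorted(required.items())
        if pos >= len(domain_seq) or domain_seq[pos] not in allowed
    ]
    return (not missing, missing)


# Hypothetical requirement: D at position 2, and K or R at position 5.
required = {2: {"D"}, 5: {"K", "R"}}
print(check_catalytic_residues("AADGGKL", required))
print(check_catalytic_residues("AANGGKL", required))
```

The second call reports position 2 as unmet, which corresponds to the domain being marked as Inactive.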
Taxonomic trees
Architecture query results are now displayed as simple taxonomic trees. In addition to individual proteins, you can select any taxonomic node (or multiple nodes) and display all the proteins in those nodes. The Evolution section of domain annotation pages is now also represented as a taxonomic tree. The basic tree contains only several hand-picked representative species, with a link to the full tree.
User interface redesigned
SMART has been completely rewritten and all pages conform to the XHTML 1.0 Strict and CSS Level 2 standards. We recommend a modern standards-compliant browser for the best experience. Mozilla Firefox and Opera are our favorites.
Changes from version 3.5 to 4.0
Intron positions are shown in protein schematics
For proteins that match any of the Ensembl predictions, SMART will show intron positions as vertical coloured lines in graphical representations. This information is retrieved from a pre-calculated mapping of Ensembl gene structures to protein sequences. The vertical line at the end of the protein is not an actual intron, but a mark to show that intron mapping was performed. If that is the only line, there are no introns annotated. If there is no line at all, there is no data available in Ensembl for that particular sequence. You can switch off intron display on your SMART preferences page.
Alternative splicing information
Since SMART now incorporates Ensembl genomes Additional information page shows a list of alternative splices of the gene encoding the analyzed protein (if there are any) It is possible toeither display SMART protein annotation for any of the alternative splices or get a graphicalmultiple sequence alignment of all of them
Orthology information
There are 2 separate sets of orthologs for each Ensembl protein 11 reciprocal best matches inother genomes and orthologous groups with reciprocal best hits from all genomes analyzed (ieeach of these proteins has exactly one ortholog in all 6 genomes)This data is displayed on Additional information page
Graphical multiple sequence alignments Orthologus groups and different aliternative splices can be displayed as graphical multiple
sequence alignments Proteins are aligned using ClustalW Domains intrinsic features and intronsare mapped onto the alignment with their positions adjusted according to gaps (black boxes)
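The adjustment of feature positions for gaps can be illustrated with a small sketch. This is not SMART's own code: the function names and the example row are hypothetical, and it simply assumes that gaps are written as '-' or '.' in an aligned row.

```python
def residue_columns(aligned_row, gap_chars="-."):
    """Return the 1-based alignment column occupied by each residue, in order."""
    return [col for col, ch in enumerate(aligned_row, start=1)
            if ch not in gap_chars]


def map_feature(aligned_row, start, end):
    """Map a feature given in 1-based residue numbers onto alignment columns."""
    columns = residue_columns(aligned_row)
    if not 1 <= start <= end <= len(columns):
        raise ValueError("feature lies outside the sequence")
    return columns[start - 1], columns[end - 1]


row = "MK--TAY-L"  # hypothetical aligned row; '-' marks a gap
print(map_feature(row, 3, 5))  # columns spanned by residues 3 to 5
```

A domain, intrinsic feature or intron annotated on the ungapped protein would be drawn at the returned columns, so features stay attached to the same residues after gaps are introduced.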
Changes from version 3.4 to 3.5
Features of all Ensembl genomes are stored in SMART
You can use standard Ensembl protein identifiers (for example ENSP00000264122) and sequences in all queries.
Batch access
Quickly display results for hundreds of sequences. Currently limited to 500 sequences per access, 2000 per day.
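A client that respects these limits could split its identifiers before submitting them. The sketch below only does the bookkeeping; the function name is hypothetical, no SMART interface is called, and it assumes the daily allowance counts sequences rather than requests.

```python
def plan_batches(identifiers, per_access=500, per_day=2000):
    """Split identifiers into request-sized batches for today.

    Returns (batches, deferred), where deferred holds the identifiers
    that exceed the assumed daily allowance and must wait.
    """
    today = identifiers[:per_day]
    deferred = identifiers[per_day:]
    batches = [today[i:i + per_access]
               for i in range(0, len(today), per_access)]
    return batches, deferred


ids = ["ID%d" % i for i in range(2300)]  # hypothetical identifiers
batches, deferred = plan_batches(ids)
print([len(b) for b in batches], len(deferred))
```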
Smarter handling of identical sequences
If there are multiple IDs associated with a particular sequence, an extra table will be displayed showing all of them.
Changes from version 3.3 to 3.4
Search structure based profiles using RPS-Blast
Clicking on the 'search schnipsel and structures' checkbox will now also initiate a search of profiles based on SCOP domain families using RPS-Blast. These profiles were kindly provided by Steffen Schmidt (see Schmidt et al., J Chem Inf Comput Sci 2002 (42) 405-7).
Improved Architecture analysis
In addition to standard Domain selection querying, it is now possible to do queries based on GO (Gene Ontology) terms associated with domains. In the first step, you get a list of domains matching the GO terms entered. After selecting the domains of interest from the list, proteins containing those domains are displayed. Use the Taxonomic selection box to limit the results based on taxonomic ranges.
Pfam domains are stored in the database
The SMART database now contains precomputed results for all Pfam domains. To use Pfam domains in the architecture queries, prepend the domain name with 'Pfam:' (for example: TyrKc AND Pfam:Fz AND TRANS).
Try our SMART Toolbar for the Mozilla web browser.
Changes from version 3.2 to 3.3
Fantastic new protein picture generator
Proteins are now displayed as dynamically generated PNG images. This means that you can download the entire protein representation as a single image. You have our permission to do this and use these diagrams in any way that you like; do acknowledge us, though. All domain bubbles have been script-generated using The Gimp and its Perl-Fu extension. The script is GPL'd and is available for download.
Fancy colour alignments
All alignments we generate are coloured in using Leo Goodstadt's excellent CHROMA program. CHROMA is available for download and it's great. You may experience some problems if you're using a klunky browser. This will be fixed when you change your browser.
Excellent transmembrane domains
These are now calculated using the fine TMHMM2 program, kindly provided by Anders Krogh and co-workers.
Selective SMART
We now store intrinsic features such as transmembrane domains and signal peptides in our database. This means that these can be queried for (e.g. SIGNAL AND TyrKc). The feature names are: signal peptide, SIGNAL; transmembrane domain, TRANS; coiled coil, COIL.
Technical changes
SMART code has been modified to run under the Apache mod_perl module. SMART is now using Apache::DBI for persistent database connections. The database engine has been updated to PostgreSQL 7.1. These changes resulted in significant speed improvements.
Bug fixes, as usual
The show_many_proteins script now uses the POST method, so there is no longer a limit on the number of proteins you can display. Fixed a display problem with proteins having thousands of different representations (if you try to display those, make sure you have a good browser, like Mozilla or Opera, and a bunch of RAM).
Changes from version 3.1 to 3.2
Start page
There is now an indicator of the current database status in the header. If the database is down or is being updated, you'll know immediately.
Literature
Changed literature identifiers to PMID. New secondary literature generator that parses all neighbouring papers, not just the first 100. Numerous small bug fixes and improvements.
Changes from version 3.0 to 3.1
Startup page
The start page now includes selective SMART and allows searching for keywords in the annotation of domains.
Schnipsel Blast
The results of a schnipsel Blast search are included in the bubble diagram. Additionally, PDB is searched.
Taxonomic breakdown
When selecting multiple proteins, e.g. via selective SMART or the annotation pages, an overview of the taxonomy of all species (tax break) is offered.
Links
Links don't contain version numbers. This allows stable links from external sources.
Selective SMART
Allows searching for multiple copies of domains and is case insensitive. You can now search for e.g. Sh3 AND sH3 AND sh3.
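One way to read such a query is sketched below. It is an illustration only, not SMART's query engine: the function names are hypothetical, and it assumes that repeating a name N times means 'at least N copies' and that terms are joined only by AND.

```python
import re
from collections import Counter


def parse_query(query):
    """Count how often each lower-cased domain or feature name is requested."""
    terms = re.split(r"\s+AND\s+", query.strip())
    return Counter(term.lower() for term in terms if term)


def matches(query, features):
    """True if the protein carries at least the requested number of copies."""
    required = parse_query(query)
    present = Counter(name.lower() for name in features)
    return all(present[name] >= count for name, count in required.items())


protein = ["SH3", "SH2", "SH3", "SH3"]  # hypothetical domain architecture
print(matches("Sh3 AND sH3 AND sh3", protein))
```

Lower-casing both the query and the stored names gives the case insensitivity described above, and the same matching would apply to intrinsic feature names such as SIGNAL or TRANS.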
Annotation
You can align your query sequence to the SMART alignment using hmmalign.
Update of underlying database
SMART now uses PostgreSQL 6.5.2.
Changes from version 2.0 to 3.0
Digest output
SMART now only produces a single diagram representing a best interpretation of all the annotation that has been performed. A comprehensive summary of the results is also provided in table format.
Selective SMART
Selective SMART allows you to look for proteins with combinations of specific domains in different species or taxonomic ranges.
Alert SMART
The SMART database gets updated about once a week. If you are interested in specific domains, or combinations of them, in specific taxonomic ranges, you can use the SMART alerting service. This provides the identities of newly-deposited proteins that match your query.
Domain queries
You can ask for proteins having the same domain order or composition as your query protein.
SMARTed Genomes
You can get the result of SMARTing the genomes of Caenorhabditis elegans and Saccharomyces cerevisiae via the annotation pages.
Faster PFAM searches
The PFAM searches now run on a PVM cluster.
Changes from version 1.03 to 2.0
Domain coverage
The original set of signalling domains has now been extended to include extracellular domains.
Default search method is now HMMer
The previously-used underlying methodology of SMART (i.e. SWise) tended to over-extend gaps, leading to problems in defining domain borders. Although a SWise search of the SMART database remains possible (using the 'Wise searching SMART database' option available from the Home Page), the default searching method is now HMMer2. This now allows improved statistical estimates (Expectation- or E-values) of the significance of a domain hit.
Complete rewrite of the annotation pages
As with the previous version (1.03), annotation pages provide information concerning domain functions and include hyperlinks to PubMed. However, we now offer automatically-derived data showing the taxonomic range and the predicted cellular localisation of proteins containing the domain in question.
Literature database
In addition to the manually derived literature sources given in version 1.03, we now provide automatically-derived secondary literature from the annotation pages. These are extracted from PubMed and are cross-linked to additional abstracts.
The CATH domain structure database: new protocols and classification levels give a more comprehensive resource for exploring evolution
Lesley H. Greene, Tony E. Lewis, Sarah Addou, Alison Cuff, Tim Dallman, Mark Dibley, Oliver Redfern, Frances Pearl, Rekha Nambudiry, Adam Reid, Ian Sillitoe, Corin Yeats, Janet M. Thornton and Christine A. Orengo
WHAT'S NEW
We report the latest release (version 3.0) of the CATH protein domain database (http://www.cathdb.info). There has been a 20% increase in the number of structural domains classified in CATH, up to 86 151 domains. Release 3.0 comprises 1110 fold groups and 2147 homologous superfamilies. To cope with the increases in diverse structural homologues being determined by the structural genomics initiatives, more sensitive methods have been developed for identifying boundaries in multi-domain proteins and for recognising homologues. The CATH classification update is now being driven by an integrated pipeline that links these automated procedures with validation steps that have been made easier by the provision of information-rich web pages summarising comparison scores and relevant links to external sites for each domain being classified. An analysis of the population of domains in the CATH hierarchy and several domain characteristics are presented for version 3.0. We also report an update of the CATH Dictionary of homologous structures (CATH-DHS), which now contains multiple structural alignments, consensus information and functional annotations for 1459 well-populated superfamilies in CATH. CATH is directly linked to the Gene3D database, which is a projection of CATH structural data onto ~2 million sequences in completed genomes and UniProt.
GENERAL INTRODUCTION
The number of new structures being deposited in the Protein Data Bank (PDB) continues to grow at a considerable rate. In addition, structures being targeted by world-wide structural genomics initiatives are more likely to be novel or only very remotely related to domains previously classified. Only 2% of structures currently solved by conventional crystallography or NMR are likely to adopt novel folds (see Figures 1 and 2). A higher proportion of new folds is expected among structures solved by structural genomics. Although the influx of more diverse structures and subsequent analysis will inform our understanding of how domains evolve, it has resulted in increasing lags between the numbers of structures being deposited and classified. In response to this situation, we have significantly improved our automated and manual protocols for domain boundary assignment and homologue recognition.
Figure 1. Annual decrease in the percentage of new structures classified in CATH which are observed to possess a novel fold. The raw data for years 1972–2005 was fit to a single exponential equation by nonlinear regression using Sigma Plot (SPSS Version
Figure 2. Annual proportion of protein structures deposited in the PDB which are classified in CATH, rejected or pending classification. The colour scheme reflects different categories of PDB chains. Black: not accepted by the CATH criteria. Red: unprocessed chains
Significant changes have been implemented in the CATH classification protocol to achieve a more highly automated system. First, a seamless flow of structures between the constituent programs has been achieved by building a pipeline which integrates web services for each major comparison stage in the classification (see Figure 3). Secondly, completely automatic decisions are now being made for new protein chains with close relatives already assigned in the CATH database. There are two situations that preclude the CATH update process from being fully automated: we rely on expert manual curation for particularly challenging protein domain boundary assignments (DomChop stage) and also for classifications of remote folds and homologues (HomCheck stage). These two manual stages will remain an integral part of the system (Figure 3).
Figure 3. Flow diagram of the CATH classification pipeline. This schematic illustrates the processes involved in classifying newly determined structures in CATH. The CATH update protocol workflow from new chain to assigned domain is split into two main processes
In this paper we report our ongoing development of the automated procedures. These critical new features should better enable CATH to keep pace with the PDB (3) and facilitate its development. Key statistics on the domain structure populations and characteristics are also presented.
A REVISED CATH CLASSIFICATION HIERARCHY: CATHSOLID
In order to provide information on the sequence diversity between superfamily members, we have introduced additional levels into the CATH hierarchy. The CATH hierarchical classification scheme now consists of nine levels. Class is derived from secondary structure content, and Architecture describes the gross orientation of secondary structures, independent of connectivity. The Topology level clusters structures into fold groups according to their topological connections and numbers of secondary structures. The Homologous superfamilies cluster proteins with highly similar structures, sequences and/or functions (4,5). The new extension of the CATH classification system now includes five 'SOLID' sequence levels. S, O, L and I further divide domains within the H-level using multi-linkage clustering based on similarities in sequence identity (35, 60, 95 and 100%) (see Table 1). The D-level acts as a counter within the I-level and is appended to the classification hierarchy to ensure that every domain in CATH has a unique CATHsolid identification code (see Table 1). Specific details on the nature of the SOLID-levels can be found in the 'General Information' section of the CATH website, http://www.cathdb.info. CATH only includes experimentally determined protein structures with a 4 Å resolution or better, 40 residues in length or longer, and having 70% or more side chains resolved.
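The nine levels can be pictured as a nine-field code. The sketch below is illustrative only: it assumes the code is written as nine dot-separated integers (the text above does not show the format), and the example codes and function names are hypothetical.

```python
LEVELS = (
    "Class", "Architecture", "Topology", "Homologous superfamily",
    "S (35% identity)", "O (60% identity)", "L (95% identity)",
    "I (100% identity)", "D (counter)",
)


def parse_cathsolid(code):
    """Split an assumed dot-separated nine-field code into named levels."""
    parts = code.split(".")
    if len(parts) != len(LEVELS):
        raise ValueError("expected %d fields, got %d" % (len(LEVELS), len(parts)))
    return dict(zip(LEVELS, (int(p) for p in parts)))


def shared_depth(code_a, code_b):
    """Number of leading levels two codes have in common."""
    depth = 0
    for a, b in zip(code_a.split("."), code_b.split(".")):
        if a != b:
            break
        depth += 1
    return depth


# Hypothetical codes: same superfamily (first four fields), different S cluster
print(shared_depth("3.40.50.720.1.1.1.1.1", "3.40.50.720.2.1.1.1.1"))
```

Under this reading, two domains that share the first four fields belong to the same homologous superfamily, and each further shared field places them in the same cluster at a higher sequence identity.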
Table 1. CATH version 3.0 statistics
DOMAIN BOUNDARY ASSIGNMENTS
We have further improved our automated domain boundary prediction method, CATHEDRAL (4). This is used to search a newly determined multi-domain structure against a library of representative structures from different fold groups in the CATH database to recognise constituent domains. CATHEDRAL performs an initial rapid secondary structure comparison between structures, using graph theory to identify putative fold matches, which are then more carefully aligned using a slower, more accurate dynamic programming method. A new scoring scheme has been implemented which combines information on the class of the domains being compared, their sizes, similarity in the structural environments and the number of equivalent residues. A support vector machine is used to combine the different scores and select the best fold match for each putative domain in a new multi-domain protein.
Benchmarking against a set of 964 'difficult' multi-domain chains, whose 1593 constituent domains were remotely related to folds in CATH (<35% sequence identity) and originated from 245 distinct fold groups and 462 superfamilies, showed that 90% of domains within these chains could be assigned to the correct fold group, and for 78% of them the domain boundaries were within ±15 residues of boundaries assigned by careful manual validation. Larger variations in domain boundaries are often due to the fact that in many families significant structural variation can occur during evolution, so that distant relatives vary considerably in size. If no close relative has been classified in CATH, it is likely that the only CATHEDRAL match will be to a relative with significant structural embellishments, thus making it harder to determine the correct boundaries.
Since domain boundary assignment of remote homologues is one of the most time-consuming stages in the classification, we combine multiple sources of information for each new structure on a web page to guide manual curation. Pages display scores from a range of algorithms, which include structure-based methods (CATHEDRAL (4), SSAP (6,7), DETECTIVE (8), PUU (9), DOMAK (10)), sequence-based methods such as hidden Markov models (HMMs) (11), and relevant literature. These pages are now viewable for information on putative boundaries for new multi-domain structures currently
being classified in CATH (e.g. http://www.cathdb.info/cgi-bin/cath/Chain.pl?chain_id=2g3aA).
For protein chains which are closely related to chains that are already chopped in CATH, an automated protocol has been developed (ChopClose). ChopClose identifies any previously chopped chains that have sufficiently high sequence identity and overlap with the query chain. Using SSAP (6), the query is aligned against each of these chains in turn, and in each case the domain boundaries are inherited across the alignment. The process of inheriting the boundaries often requires some adjustments to be made to account for insertions, deletions or unresolved residues. If the inheritance from one of the chains meets various criteria (SSAP score ≥80, sequence identity >80%, RMSD ≤6.0 Å, longest end extension ≤10 residues, etc.), then the resulting boundaries are used to chop the chain automatically (AutoChop). For cases where ChopClose's best result does not meet all the criteria for automatic chopping, it is provided as support information for a manual domain boundary assignment.
Refer to Figure 3 for the location of AutoChop/ChopClose within the CATH update protocol.
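The decision between automatic and manual chopping can be sketched as a simple threshold check. This is a hedged illustration, not the ChopClose code: the class and function names are hypothetical, the RMSD limit is read as 6.0 Å, and only the four criteria named in the text are tested (the text's 'etc.' implies there are others).

```python
from dataclasses import dataclass


@dataclass
class InheritedChopping:
    """Summary of boundaries inherited from one previously chopped chain."""
    ssap_score: float            # SSAP structural similarity score (0-100)
    seq_identity: float          # percent sequence identity to the template
    rmsd: float                  # RMSD of the superposition, in angstroms
    longest_end_extension: int   # residues extending beyond the template ends


def meets_autochop_criteria(result):
    """Check the four thresholds quoted in the text; the real list is longer."""
    return (result.ssap_score >= 80
            and result.seq_identity > 80
            and result.rmsd <= 6.0
            and result.longest_end_extension <= 10)


def route(result):
    """Return the stage that should handle the chain."""
    return "AutoChop" if meets_autochop_criteria(result) else "manual DomChop"


print(route(InheritedChopping(92.5, 98.0, 1.2, 3)))
print(route(InheritedChopping(92.5, 75.0, 1.2, 3)))
```

A chain that fails any one threshold is not discarded; its inherited boundaries travel with it to the manual stage as supporting evidence.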
NEW HOMOLOGUE RECOGNITION METHODS
We have assessed a number of HMM-based protocols for improving homologue recognition. A new protocol (Samosa), exploiting models built using multiple structure alignments to improve accuracy, gives some improvements in sensitivity (4–5%). However, a protocol exploiting an 8-fold expanded HMM library based on sequence relatives of structural domains gives an increase of nearly 10% in sensitivity (12). In addition, HMM–HMM-based approaches have been implemented using the PRC protocol of Madera and co-workers (http://supfam.org/PRC). These allow recognition of extremely remote homologues, some of which are not easily detected by the structure comparison methods (discussed further below). HMM-based database scans developed for the CATH classification protocol are collectively referred to as HMMscan below.
For some very remotely related homologues, confidence in an assignment can be improved by combining information from multiple prediction methods. We have investigated the benefits of using machine learning methods to do this automatically. A neural network was trained using a dataset of 14 000 diverse homologues (<35% sequence identity) and 14 000 non-homologous pairs, with data from different homologue comparison methods including structure comparison (CATHEDRAL, SSAP), sequence comparison (HMM–HMM) and information on functional similarity. The latter was obtained by comparing EC classification codes between close relatives of the distant homologues and using a semantic similarity scoring
scheme for comparing GO terms based on a method developed by Lord et al. (13). On a separate validation set of 14 000 homologous pairs and 14 000 non-homologous pairs, 97% of the homologues can be recognised at an error rate of <4%.
NEW UPDATE PROTOCOL
Automatic methods
Previously, CATH data was generated using a group of independent programs and flat files. Over the past two years we have developed an update protocol for CATH that is driven by a suite of programs with a central library and a PostgreSQL database system. A classification pipeline has been established which links, in a completely automated fashion, the different programs that analyse the sequences and structures of both protein chains and domains. The CATH update protocol can essentially be divided into two parts: domain boundary assignment and domain homology classification (see Figure 3). The aim of the protocol is to minimise manual assignment and provide as much support as possible when manual validation is necessary.
Processing of both parts of the classification protocol is similar, requiring related meta-data and the triggering of the same automated algorithms. Methods include pairwise sequence similarity comparisons and scans by other homologue detection or fold recognition algorithms, such as HMMscan and CATHEDRAL, that provide data for either manual or automated assignment. Many of the automated steps in the protocol have been established as web services, and the pipeline integrates the automated steps together with 'holding stages' in which domains are held prior to processing and await the completion of manual validation of predictions (see below).
Web pages to support manual validation
For each manual stage in the classification (domain boundary assignment, DomChop, and homologue recognition, HomCheck) we have developed a suite of web pages bringing together all available meta-data from prediction algorithms (e.g. DBS, CATHEDRAL, HMMscan for DomChop; CATHEDRAL, HMMscan for HomCheck), information from the literature, and information from other family classifications with relevant data (e.g. Pfam). For each protein or domain shown on the pages, information on the statistical significance of matches is presented. The web pages will shortly be made viewable and will provide interim data on protein chains and domains not fully classified in CATH, for biologists interested in any entries pending classification.
OVERVIEW OF THE CURRENT RELEASE (VERSION 3.0)
Assigning domain boundaries and relationships between protein structures is computationally challenging. Since the last CATH release (version 2.6), the number of domains in the CATH database has increased by 20% in version 3.0 and now totals 86 151. This is a more than 10-fold increase in the number of domains classified in CATH since its creation. Improvements in automation, and also in the web-based resources used to aid manual validation, have allowed us to increase the proportion of hard-to-classify structures processed in CATH, and this is reflected in a significant increase in the proportion of new folds in the database, now more than 1000. The detailed breakdown of numbers of domains in the nine CATH levels is given in Table 1. We conducted an analysis of the domains in version 3.0 and have derived statistics for several fundamental features.
Percentage of new topologies
An analysis of the percentage of new folds arising from the early 1970s to the present is shown in Figure 1. The number of new folds has been decreasing over time with respect to the number of new structures being deposited, and it can be seen that currently approximately 2% of new structures classified in CATH are observed to be novel folds. For comparison, the number of domain structures solved over time is also graphically represented in Figure 1.
Number of domains within a protein chain
Integral to the construction of the CATH database is designating domain boundaries. We conducted an analysis of the number of chains versus number of domains in a chain. It is interesting to note that 64% of all protein structures currently solved and classified in CATH are single-domain chains (data not shown). The next most prevalent are two-domain chains (27%), and following this we find that the number of chains containing three or more domains rapidly decreases. The average size of the single-domain chains is 159 residues.
The CATH Dictionary of Homologous Superfamilies (CATH-DHS)
The CATH-DHS has also been recently updated. Data on structural similarity and superfamily variability is presented as a significant update to the Dictionary of Homologous Superfamilies (DHS) web resource (14). The DHS also provides functional annotations of domains within each H-level (superfamily) in CATH v2.5.1.
For each superfamily, pair-wise structural similarity scores between relatives, measured by SSAP, are presented. The DHS now contains 3307 multiple structural alignments for 1459 superfamilies. For each superfamily, multiple alignments are generated for all the relatives and also for subgroups of structurally similar relatives and sequence-similar relatives. Alignments are performed using the residue-based CORA algorithm (15) and presented both as CORAPLOTS (14) and in the form of a 2DSEC diagram (16), alongside co-ordinate data of the superposed structures in PDB format. Sequence representations of the alignments are available to download in FASTA format. In the CORAPLOT images of the multiple alignment, residues in each domain are coloured according to ligand binding and residue type. EquivSEC plots are also shown that describe the variability in orientation and packing between equivalent secondary structures (16).
To identify sequence relatives for CATH superfamilies, sequences from UniProt (17) were scanned against HMMs of all CATH domains (12). Homologous sequences were identified as those hits with an E-value <0.01 and a 60% residue overlap with the CATH domain. This protocol recognised over one million domain sequences in UniProt which could be integrated in the CATH-DHS. The harvested sequences in each superfamily were compared against other relatives by BLAST (18) to determine the pair-wise sequence identity, and then clustered at appropriate levels of sequence identity (35% and 95%) using multi-linkage clustering. Information and links to other functional databases, ENZYME (19), GO (Gene Ontology Consortium, 2000), KEGG (20), COG (21) and SWISSPROT (22), are also included by BLASTing the sequences from each superfamily against sequences provided by these resources. Only hits with 95% sequence identity and an 80% residue overlap were used to annotate sequences.
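The hit-selection rule for harvesting sequence relatives can be written as a two-condition filter. The sketch below is an illustration under stated assumptions: the E-value cut-off is read as 0.01, 'a 60% residue overlap' is taken to mean at least 60% of the CATH domain length is covered by the hit, and the function names and example hits are hypothetical.

```python
def overlap_fraction(hit_start, hit_end, domain_length):
    """Fraction of the domain covered by a hit (1-based, inclusive ends)."""
    return (hit_end - hit_start + 1) / domain_length


def is_homologous_hit(e_value, coverage, max_e_value=0.01, min_coverage=0.60):
    """Apply the assumed E-value and overlap thresholds to one HMM hit."""
    return e_value < max_e_value and coverage >= min_coverage


# Hypothetical hits: (E-value, start, end) against a 150-residue domain model
hits = [(1e-20, 5, 140), (1e-5, 60, 120), (0.5, 1, 150)]
kept = [h for h in hits
        if is_homologous_hit(h[0], overlap_fraction(h[1], h[2], 150))]
print(kept)
```

The second hit is significant but covers too little of the domain, and the third covers the whole domain but is not significant, so only the first would be harvested.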
Recent analysis of structural and functional divergence in highly populated CATH superfamilies (>5 structural relatives with <35% sequence identity) has been undertaken using data from the DHS. The 2DSEC algorithm was used to analyse multiple structural alignments of families and identify highly conserved structural cores and secondary structure embellishments or decorations to the common core. In some large superfamilies, extensive embellishments were observed outside the core, and although these secondary structure insertions were frequently discontinuous in the protein chain, they were often co-located in 3D space (16). In many cases manual inspection revealed that the embellishment had aggregated to form a larger structural feature that was modifying the active site of the domain or creating new surfaces for domain or protein interactions. Data collected in the DHS clearly shows a relationship between structural divergence within a superfamily, sequence divergence of this superfamily amongst predicted domains in the genomes, and the number of distinct functional groups that can be identified for the superfamily (see Figure 4).
Figure 4. Relationship between sequence variability, structural variability and functional diversity in CATH superfamilies. Structural variation in a CATH superfamily, as measured by the number of diverse structural subgroups (SSAP score <80 between groups)
LATERAL LINKS ACROSS THE CATH HIERARCHY TO CAPTURE EVOLUTIONARY DIVERGENCE AND EXPLORE THE STRUCTURAL CONTINUUM
Our analysis of structural divergence in CATH superfamilies (16) has revealed families where significant changes in the structures had occurred. In some cases 5-fold differences in the sizes of domains were identified, and sometimes it was apparent that the 'folds' of these very diverse relatives had effectively changed. Therefore, in these superfamilies more than one fold group can be identified, effectively breaking the hierarchical nature of the CATH classification, which implies that each relative within a CATH homologous superfamily should belong to the same CAT fold group.
In addition, an 'all versus all' HMM–HMM scan between all superfamily representatives revealed several cases of extremely remote homologues which had been classified into separate superfamilies and yet match with significant E-values. Structure comparison had failed to detect the relationship between these superfamilies because the structural divergence of the relatives was so extreme, sometimes constituting a change in architecture as well as fold group. In these cases homology was only suggested by the HMM-based scans and then manually validated by considering functional information and detailed evidence from literature. In order to capture information on these distant homologies, links have been created between the superfamilies, both on our web pages and in the CATH database. The data can now be found as a link from the CATH homepage (http://www.cathdb.info).
In the near future we also plan to provide web pages presenting cases of significant structural overlaps between superfamilies or fold groups. For these cases we are not currently able to find any additional evidence to support a distant evolutionary relationship, and these examples highlight the recurrence of large structural motifs between some folds and the existence of a structural continuum in some regions of fold space.
FEATURES
The CATH database can be accessed at http://www.cathdb.info. The web interface may be browsed or alternatively searched with PDB codes or CATH domain identifiers. There is also a facility for keyword searches. With the version 3.0 release we now make the raw and processed data files available, which include, for example, CATH domain PDB files, sequences and dssp files; they can be accessed through the CATH database main page. The Gene3D resource can be accessed through the CATH database or directly at http://www.cathdb.info/Gene3D. The DHS can be accessed through the CATH database or directly at http://www.cathdb.info/bsm/dhs.
Structural similarity is assessed using an automatic method (SSAP) (3,4), which scores 100 for identical proteins and generally returns scores above 80 for homologous proteins. More distantly related folds generally give scores above 70 (Topology or fold level), though in the absence of any sequence or functional similarity this may simply represent examples of convergent evolution, reinforcing the hypothesis that there exists a limited number of folds in nature (5,6).
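The SSAP score bands quoted above can be turned into a rough lookup. This is a reading aid rather than a classification rule: the text says scores 'generally' fall in these ranges, and the function name and the wording of the labels are assumptions.

```python
def interpret_ssap(score):
    """Map an SSAP score (0-100) to the rough category described in the text."""
    if not 0 <= score <= 100:
        raise ValueError("SSAP scores lie between 0 and 100")
    if score == 100:
        return "identical structures"
    if score > 80:
        return "likely homologous"
    if score > 70:
        return "similar fold (Topology level)"
    return "no clear structural relationship"


for s in (100, 86.4, 73.0, 55.2):
    print(s, interpret_ssap(s))
```

A score in the 70 to 80 band only indicates a shared fold; as the text notes, without sequence or functional similarity it may reflect convergence rather than common ancestry.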
Abstract
We report the latest release (version 1.4) of the CATH protein domains database (http://www.biochem.ucl.ac.uk/bsm/cath). This is a hierarchical classification of 13 359 protein domain structures into evolutionary families and structural groupings. We currently identify 827 homologous families in which the proteins have both structural similarity and sequence and/or functional similarity. These can be further clustered into 593 fold groups and 32 distinct architectures. Using our structural classification and associated data on protein functions stored in the database (EC identifiers, SWISS-PROT keywords and information from the Enzyme database and literature), we have been able to analyse the correlation between the 3D structure and function. More than 96% of folds in the PDB are associated with a single homologous family. However, within the superfolds, three or more different functions are observed. Considering enzyme functions, more than 95% of clearly homologous families exhibit either single or closely related functions, as demonstrated by the EC identifiers of their relatives. Our analysis supports the view that determining structures, for example as part of a 'structural genomics' initiative, will make a major contribution to interpreting genome data.
Introduction to CATH
The CATH classification of protein domain structures was established in 1993 (1) as a
hierarchical clustering of protein domain structures into evolutionary families and structural
groupings, depending on sequence and structure similarity. There are four major levels,
corresponding to protein class, architecture, topology or fold, and homologous family (Fig. 1).
Since 1995, information about these structural groups and protein families has been accessible
over the Web (http://www.biochem.ucl.ac.uk/bsm/cath), together with summary information
about each individual protein structure (PDBsum) (2).
CATH consists of both phylogenetic and phenetic descriptors for protein domain relationships.
At the lowest levels in the hierarchy, proteins are grouped into evolutionary families
(Homologous families) for having either significant sequence similarity (≥35% identity) or high
structural similarity and some sequence similarity (≥20% identity).
The CATH database of protein structures contains approximately 18 000 domains
organized according to their (C)lass, (A)rchitecture, (T)opology and (H)omologous
superfamily. Relationships between evolutionarily related structures (homologues) within the
database have been used to test the sensitivity of various sequence search methods in order to
identify relatives in GenBank and other sequence databases. Subsequent application of the most
sensitive and efficient algorithms, gapped BLAST and the profile-based method Position-Specific
Iterated Basic Local Alignment Search Tool (PSI-BLAST), could be used to assign structural data to
between 22 and 36% of microbial genomes in order to improve functional annotation and
enhance understanding of biological mechanism. However, on a cautionary note, an analysis of
functional conservation within fold groups and homologous superfamilies in the CATH database
revealed that whilst function was conserved in nearly 55% of enzyme families, function had
diverged considerably in some highly populated families. In these families, functional properties
should be inherited far more cautiously, and the probable effects of substitutions in key functional
residues must be carefully assessed.
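The four-level hierarchy lends itself to a simple data structure. The sketch below assumes that a CATH identifier is written as four dot-separated numbers (class, architecture, topology, homologous superfamily), for example 3.40.50.200; the type name, function names and the second identifier used in the usage note are hypothetical and serve only to show how two domains can be compared level by level.

```python
from typing import NamedTuple

LEVELS = ("class", "architecture", "topology", "homologous superfamily")


class CathId(NamedTuple):
    c: int  # (C)lass
    a: int  # (A)rchitecture
    t: int  # (T)opology or fold
    h: int  # (H)omologous superfamily


def parse_cath_id(text: str) -> CathId:
    """Parse an identifier such as '3.40.50.200' into its four levels."""
    parts = text.strip().split(".")
    if len(parts) != 4 or not all(p.isdigit() for p in parts):
        raise ValueError(f"expected four dot-separated numbers, got {text!r}")
    return CathId(*map(int, parts))


def deepest_shared_level(x: CathId, y: CathId) -> str:
    """Name the deepest level of the hierarchy at which two identifiers agree."""
    shared = "none"
    for name, vx, vy in zip(LEVELS, x, y):
        if vx != vy:
            break
        shared = name
    return shared
```

For instance, `deepest_shared_level(parse_cath_id("3.40.50.200"), parse_cath_id("3.40.50.300"))` reports that the two (illustrative) domains share a topology but not a homologous superfamily.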
Figure 1
Schematic representation of the (C)lass, (A)rchitecture and (T)opology/fold levels in the CATH
database.
Figure 2
Snapshot of a web page showing data available in the CATH dictionary of homologous superfamilies
for the subtilisin family (CATH id 3.40.50.200). Tables display the PDB codes for non-identical
relatives in the family, together with EC identifier codes and information about the enzyme reactions.
The multiple structural alignment shown has been coloured according to secondary structure
assignments (red for helix, blue for strands).
The Architecture level in CATH groups proteins whose folds have similar 3D arrangements of
secondary structures (e.g. barrel, sandwich or propeller), regardless of their connectivity, whilst the
top level, Class, simply reflects the proportion of α-helix or β-strand secondary structures. Three
major classes are recognised, mainly-α, mainly-β and α−β, since analysis revealed considerable
overlap between the α+β and alternating α/β classes originally described by Levitt and Chothia (7).
Before classification, multidomain proteins are first separated into their constituent folds using a
consensus method which seeks agreement between three independent algorithms (8). Whilst the
protocol for updating CATH is largely automatic (9), several stages require manual validation, in
particular establishing domain boundaries in proteins for which no consensus could be reached and
checking the relationships of very distant homologues and proteins having borderline fold similarity.
Although there are plans to assign the more regular architectures automatically, all architecture
groupings are currently assigned manually.
A homologous family Dictionary is now available within CATH, which contains functional data,
where available, for each protein within a homologous family. This includes EC identifiers, SWISS-
PROT keywords and information from the Enzyme database or the literature (Fig. 2). Multiple
structure-based alignments are also available, coloured according to secondary structure assignments
or residue properties, and there are schematic plots showing domain representations annotated by
protein-ligand interactions (DOMPLOTS) (A.E. Todd, C.A. Orengo and J.M. Thornton, submitted
to Protein Engng). The topology of each domain is illustrated by schematic TOPS diagrams
(http://www3.ebi.ac.uk/tops; 10).
Figure 3
CATH wheel plot showing the population of homologous families in different fold groups,
architectures and classes. The wheel is coloured according to protein class (red, mainly-α; green,
mainly-β; yellow, αβ; blue, few secondary structures). The size of the outer wheel represents the
number of homologous families in CATH, whilst each band in the outer wheel corresponds to a single
fold family. The size of each fold 'band' therefore reflects the number of homologous families having
that fold. It can be seen that most fold families contain a single homologous family. The superfold
families are shown as paler bands containing many homologous families. The inner wheel shows the
population of homologous families in the different architectures.
We have also recently set up a Web Server (11) which enables the user to scan the CATH database
with a newly determined protein structure and identify possible fold similarities or evolutionary
relationships. There are also plans to incorporate sequence searches (using BLAST or PSI-BLAST)
(12) to identify a probable fold for a new sequence.
The latest release of CATH (version 1.4, April 1998) contains 9342 protein chains from the PDB (13),
which divide into 13 359 domain folds. Currently, 32 different architectures are recognised. Since the
last release, three new architectures have been described, including the five-bladed α-β propeller.
Grouping proteins on the basis of sequence, structure and functional similarity gives 827 evolutionary
homologous families (H-level), whilst recognising more distant structural similarity with no
accompanying sequence or function similarity gives rise to 593 different fold groups (T-level).
The population of the different levels in the CATH hierarchy is illustrated by the CATH wheel shown
in Figure 3. It can be seen that several highly populated fold families, which we describe as superfolds
(6) as they support a diverse range of sequences and more than three different functions, still account
for nearly 30% of non-homologous structures.
Implications for Structural Genomics
As the sequence databases grow rapidly, the need to interpret these sequences and assign functions to
specific genes becomes increasingly important. Many techniques exist for matching protein sequences
and thereby inheriting functional information. However, for very distant homologues there is often no
detectable sequence similarity, despite conservation of 3D structure and function. For these cases,
evolutionary relationships, and thereby functions, can only be assigned by comparing the structures.
Therefore, a number of structural genomics initiatives are being proposed (14) which aim to identify
all the folds in nature, with the ultimate goal of being able to predict the function of a new protein from
its known or probable structure. The important questions to ask are: how many more folds do we need
to determine before we have the complete set, and how confident can we be in assigning function
between proteins having similar structures?
In the current genomes, on average only 30–46% of sequences can be assigned to a structural family
by recognising sequence similarity to a protein of known structure (15,16). With only ∼600 unique
structures currently in the PDB, compared with ∼20 000 sequence families, it is clear that we still
need to determine many more structures if we are to understand biology at the molecular level.
However, analysis of recently deposited structural data is very revealing. Figure 4a illustrates the
distribution of 2159 new structural domains classified in the 10 months from June 1997 to March
1998. A large proportion of these (79%) were clearly homologous (≥30% identity) to proteins of
known structure.
Of the remaining 443 structures (Fig. 4b), corresponding to 'new' sequences, we found only 8% were
novel folds, the remainder resembling a previously determined structure. Many of these, 199 (45%),
could be identified as clear homologues by having significant structure and sequence similarity (SSAP
≥80 and ≥20% sequence identity). A further 169 (38%) were probable homologues as, although the
sequence identity was below 20%, they had functional similarity and/or gave significant scores using
sequence search methods designed to detect very distant homologues (PSI-BLAST) (12). There
remained a further 40 (9%) proteins which were analogous, i.e. they had the same fold as a previous
entry but neither the sequence nor the function gave definite evidence of a common ancestor.
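The percentages quoted for the 443 'new' structures can be checked with a few lines of arithmetic. In this sketch the counts 199, 169 and 40 are taken from the text; the count of novel folds is not stated there and is inferred by subtraction, on the assumption that the four categories together account for all 443 structures.

```python
def breakdown(total: int, counts: dict) -> dict:
    """Return each category's share of `total` as a whole-number percentage."""
    return {name: round(100 * n / total) for name, n in counts.items()}


TOTAL_NEW = 443
counts = {
    "clear homologues": 199,
    "probable homologues": 169,
    "analogues": 40,
}
# Inferred, not quoted in the text: whatever is left over is a novel fold.
counts["novel folds"] = TOTAL_NEW - sum(counts.values())

shares = breakdown(TOTAL_NEW, counts)
homologue_share = round(
    100 * (counts["clear homologues"] + counts["probable homologues"]) / TOTAL_NEW
)

if __name__ == "__main__":
    print(counts["novel folds"], shares, homologue_share)
```

Under that assumption the remainder is 35 structures, which rounds to the 8% of novel folds reported, and the clear and probable homologues together make up the 83% cited later for structures recognisable as homologues.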
At the homologous superfamily level in CATH, a more detailed analysis of enzyme functions showed
that the majority of homologous enzyme families in CATH (>90%) contained proteins for which the
first three EC identifiers were the same. Considering those families where homologues have
significant sequence identity (≥20%) after structural alignment, 95% were found to have a single EC
identifier, whilst for families where proteins have more than 30% sequence similarity we observed
that 98% had a single EC code.
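The test applied to each family, whether all members agree on the first three EC identifiers, can be sketched as follows. The function names are hypothetical, and the EC strings in the usage note are illustrative inputs in the standard four-field form rather than data from the analysis.

```python
def ec_prefix(ec: str, levels: int = 3) -> tuple:
    """Return the first `levels` fields of a four-field EC number such as '3.4.21.62'."""
    parts = ec.strip().split(".")
    if len(parts) != 4:
        raise ValueError(f"expected a four-field EC number, got {ec!r}")
    return tuple(parts[:levels])


def shares_ec_prefix(ec_numbers, levels: int = 3) -> bool:
    """True if every EC number in a (non-empty) family has the same leading fields."""
    prefixes = {ec_prefix(ec, levels) for ec in ec_numbers}
    return len(prefixes) == 1
```

With `levels=3`, a family annotated with `["3.4.21.62", "3.4.21.14"]` counts as functionally consistent, whereas requiring a single EC code (`levels=4`) is the stricter criterion used for the 95% and 98% figures.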
Although assigning function on the basis of homology is common practice, it is clear that some
caution should be exercised, particularly where there is little or no sequence similarity. There are also
some clear examples where homologues with significant sequence similarity perform different
functions. The role of 'gene recruitment' is especially clear in the eye lens proteins, which function as
enzymes in other cellular environments but which are used as structural proteins in this context (17).
The extent of such 'gene recruitment' and context-sensitive function is not known at this time.
For enzymes, it is clear that catalytic function can change and evolve, usually to act on a different but
related substrate. Similarly, within the lipocalin family (CATH id 2.40.130.10), several proteins are
found with very similar structures which bind different fatty acids in the same region at the base of
the β-barrel (e.g. retinol, bilin, biotin).
Nearly half of the homologous families where two or more different EC numbers were observed
belong to the superfolds. This suggests that if a new protein is assigned to a superfold family, more
caution should be used when inheriting functional information, as there appears to be greater tolerance
to changes in sequence, and ultimately function, for these families. However, it is interesting to note
that many of these were TIM barrel or Rossmann folds. These are superfolds in which the substrate or
ligand commonly binds in the same place: in the base of the β-barrel for the TIMs and at the
crossover of the polypeptide chain for the doubly wound Rossmann structures.
Assignment of Function Through Structure
One of the reasons for determining structures is to derive more information to facilitate the assignment
of function. From our analysis of proteins in CATH, we suggest that structural data can help to assign
function in several ways:
i. The structural data allow recognition of more distant homologues compared with
sequence data; in our analysis, 83% of structures with novel sequences could be assigned as
homologues in this way (note that such assignment of function is again subject to the caveats
imposed by 'gene recruitment' discussed above).
ii. The structural data allow detailed inspection of the functional site, to suggest if and
how the function may have evolved. For example, if an enzyme has evolved to act on a
different substrate, the binding site may reveal, or at least suggest, possible changes in the
substrate.
iii. For the superfolds, similarity of structure does not necessarily mean similarity of
function. However, the active site/binding sites are often conserved, e.g. in the TIM barrel or
Rossmann fold structures the ligand always binds at the same end of the barrel or sheet.
iv. Some methods have already been developed, and will increasingly be the focus of
attention over the next few years, which aim to predict function ab initio from structure. For
example, enzymes can often be identified by the presence of a major cleft, which also locates
the active site (18). Similarly, critical surface patches which are used for molecular
recognition in binding other proteins or ligands may be identified using knowledge-based
approaches (19,20).
In summary, extrapolating the data from Figure 4 to a new genome, we can expect that, of the 54–70%
of sequences which currently have no obvious sequence matches in the PDB, we will find nearly
80–90% to be homologous to a known family using the structural data alone. For the singlet folds this
will almost certainly reveal some clues to the function. For the superfolds, some folds will reveal
information on the functional class (e.g. enzyme for TIM barrels) or the location of the active site, if
not the specific function. Only 10–20% will be expected to be 'novel' folds. For these, the ab
initio methods referred to above may provide some clues to guide experiments. Therefore, it is clear
that determining structures, as part of a 'structural genomics' initiative for example, will make a
major contribution to interpreting genome data.
JAIL
Just Another Interface Library
Interfaces of macromolecules are a valuable basis for analysing the process of molecular
recognition. JAIL classifies not only the interfaces between domain architectures, but also those
between protein chains and those between proteins and nucleic acids.
Gt-alpha/Gi-alpha chimera (PDB-ID 1GOT). Interfaces of 1GOT.
Interacting proteins are difficult to crystallize and are rarely present in the Protein Data Bank.
Nevertheless, it is essential to analyse the interacting parts of proteins to understand the
process of protein-protein docking.
To overcome this problem, we have built the JAIL database. Since interacting domains
exhibit structural features similar to those of proteins, all known interfaces between interacting
domains of the SCOP database were extracted and classified in JAIL.
Only a part of all protein structures is included in SCOP; new PDB entries in particular are
not yet annotated. To overcome this problem, all interfaces between protein
chains were additionally calculated and included in the database. This type of interface also
comprises the interacting parts of the assumed biological units. The last important type of interface
provided here is composed of the interacting parts between proteins and nucleic acids.
Overall, the data set consists of about 180 000 interfaces.
JAIL is a convenient tool for browsing the interface library and analysing single
interfaces. However, more general questions require large-scale analysis. For this purpose,
a detailed form enables the compiling of comprehensive non-redundant data sets for
download.
How is an interface defined?
A complete residue is part of an interface if at least one atom of the amino acid is located
within 4.5 Ångström of any atom of the interacting domain or chain. One part of
an interface must consist of at least 5 C-alpha atoms in the case of protein chains. In the
case of nucleic acids, the relevant atoms are calculated on the basis of the phosphorus atoms
of the RNA/DNA backbone.
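The distance criterion above can be illustrated with a brute-force sketch. It is not JAIL's implementation: the atom representation (a residue label paired with x, y, z coordinates) and the function name are assumptions, the 4.5 Å cut-off is the value quoted in the text, and the minimum of five C-alpha atoms and the phosphorus-based treatment of nucleic acids are not checked here.

```python
from math import dist

CUTOFF = 4.5  # Angstrom, as quoted in the text


def interface_residues(atoms_a, atoms_b, cutoff=CUTOFF):
    """Return the residues of partner A that belong to the A/B interface.

    Each atom is given as (residue_label, (x, y, z)). A whole residue is
    counted as soon as one of its atoms lies within `cutoff` of any atom
    of partner B. The search is a plain O(N*M) loop over atom pairs.
    """
    found = set()
    for res_id, xyz in atoms_a:
        if res_id in found:
            continue  # residue already accepted via another atom
        if any(dist(xyz, other) <= cutoff for _, other in atoms_b):
            found.add(res_id)
    return found


if __name__ == "__main__":
    chain_a = [("A1", (0.0, 0.0, 0.0)), ("A1", (1.0, 0.0, 0.0)), ("A2", (10.0, 0.0, 0.0))]
    chain_b = [("B1", (4.0, 0.0, 0.0)), ("B2", (20.0, 0.0, 0.0))]
    print(interface_residues(chain_a, chain_b))
    print(interface_residues(chain_b, chain_a))
```

Calling the function once in each direction gives the two halves of the interface. For real structures a spatial index (for example a grid or k-d tree) would replace the double loop.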
What are biounits?
The primary coordinate file deposited in the PDB generally contains one asymmetric unit.
The asymmetric unit is the smallest portion of a crystal structure to which crystallographic
symmetry can be applied to generate one unit cell. The biological molecule (biounit) is believed
to be the functional unit of the protein. Frequently, those units can be assumed or calculated
when additional information is available. The biological units of many proteins are deposited
in a separate section of the PDB database and can be used for interface calculations.
In which way are redundant interfaces excluded in the download section?
Redundancy is excluded in two different ways: by structure and by sequence. The
sequence clustering is based on the CD-HIT program. The structural clustering is defined by
the protein families and superfamilies of the SCOP classification. The database classifies
proteins by domain architecture.
Which settings in the download section are best for my own research?
The selection of the datasets depends on the type of interactions (protein-protein or protein-
nucleic acid) and the level of diversity that is desired. A maximal sequence identity of 50%
results in higher diversity than a setting of 95%. The default settings include interfaces of
domain-domain interactions as well as interfaces between interacting chains. All interfaces of
chains that were already treated by the SCOP domain interfaces are excluded by default.
This procedure results in a high number of interfaces that are still diverse enough for
statistical analysis.
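The effect of the identity threshold can be seen in a toy greedy filter. This is a sketch of the general idea only, not the CD-HIT algorithm: the similarity measure is Python's `difflib.SequenceMatcher` ratio, used here as a loudly labelled stand-in for a real alignment-based sequence identity, and the function names and example sequences are invented for illustration.

```python
from difflib import SequenceMatcher


def toy_identity(a: str, b: str) -> float:
    """Stand-in similarity in [0, 1]; NOT the identity measure used by CD-HIT."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def pick_representatives(sequences, max_identity: float):
    """Greedily keep sequences that are at most `max_identity` similar to those kept so far.

    Sequences are visited longest first, so each kept sequence acts as the
    representative of everything later discarded as too similar to it.
    """
    representatives = []
    for seq in sorted(sequences, key=len, reverse=True):
        if all(toy_identity(seq, rep) <= max_identity for rep in representatives):
            representatives.append(seq)
    return representatives


if __name__ == "__main__":
    seqs = ["MKTAYIAKQR", "MKTAYIAKQK", "GGGGSSSSPP"]
    print(len(pick_representatives(seqs, 0.95)))
    print(len(pick_representatives(seqs, 0.50)))
```

With a 95% limit both near-identical sequences survive, while a 50% limit keeps only one of them: the lower threshold yields a smaller but more diverse set, which is the trade-off described above.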
What is meant by 'show conservation' in Jmol?
The conservation of protein sequences is defined by the mutation rates at each amino acid
position. For JAIL, this information was retrieved from ConSurf. ConSurf is a derived
database merging structural and sequence information.
Database scheme
Search for:
PDB-Id (e.g. 1aay)
SCOP-Id (e.g. d1az0a_)
EC-Number (e.g. 3.1.1)
Accession number (e.g. P03697)
Protein name (e.g. capsid protein)
Search in the following interface types:
Domain/Domain (SCOP)
Chain/Chain
Protein/Nucleic
Biounit/Biounit
None
Full-text search (by keyword)
SCOP text search (by keyword)
Consider only interfaces having the following location: intra, inter, or don't care
MMDB
Experimentally resolved structures of proteins, RNA and DNA, derived from the Protein Data
Bank (PDB), with value-added features such as explicit chemical graphs, computationally
identified 3D domains (compact substructures) that are used to identify similar 3D structures,
as well as links to literature, similar sequences, information about chemicals bound to the
structures, and more. These connections make it possible, for example, to find 3D structures
for homologs of a protein sequence of interest, then interactively view the sequence-structure
relationships, active sites, bound chemicals, journal articles and more.
Three-dimensional structures are now known within many protein families, and it is
quite likely, in searching a sequence database, that one will encounter a homolog
with known structure. The goal of Entrez's 3D-structure database is to make this
information, and the functional annotation it can provide, easily accessible to
molecular biologists. To this end, Entrez's search engine provides three powerful
features. (i) Sequence and structure neighbors: one may select all sequences similar
to one of interest, for example, and link to any known 3D structures. (ii) Links
between databases: one may search by term matching in MEDLINE, for example, and
link to 3D structures reported in these articles. (iii) Sequence and structure
visualization: identifying a homolog with known structure, one may view molecular-
graphic and alignment displays to infer approximate 3D structure. In this article we
focus on two features of Entrez's Molecular Modeling Database (MMDB) not
described previously: links from individual biopolymer chains within 3D structures
to a systematic taxonomy of organisms represented in molecular databases, and
links from individual chains (and compact 3D domains within them) to structure
neighbors, other chains (and 3D domains) with similar 3D structure. MMDB may be
accessed at http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Structure
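For scripted access, a term search against the structure database can be expressed as a URL. The sketch below is an assumption layered on the text, which only gives the interactive query address: it uses NCBI's E-utilities `esearch` endpoint with the `db`, `term` and `retmax` parameters, and it only builds the request URL without sending it. The function name is hypothetical.

```python
from urllib.parse import urlencode

# NCBI E-utilities search endpoint (not mentioned in the notes themselves).
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def structure_search_url(term: str, retmax: int = 20) -> str:
    """Build an esearch URL that queries the Entrez 'structure' database for `term`."""
    query = {"db": "structure", "term": term, "retmax": retmax}
    return f"{EUTILS_ESEARCH}?{urlencode(query)}"


if __name__ == "__main__":
    print(structure_search_url("1GOT"))
```

The returned address can be opened in a browser or fetched with any HTTP client; the response lists the identifiers of matching structure records, which can then be followed to sequence and structure neighbors as described above.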
SUPERFAMILY is a database of structural and functional annotation for all
proteins and genomes.[1][2][3][4][5]
The SUPERFAMILY annotation is based on a collection of hidden Markov models which
represent structural protein domains at the SCOP superfamily level.[6] A superfamily groups
together domains which have an evolutionary relationship. The annotation is produced by
scanning protein sequences from completely sequenced genomes against the hidden Markov
models.
For each protein you can:
Submit sequences for SCOP classification
View domain organisation, sequence alignments and protein sequence details
For each genome you can:
Examine superfamily assignments, phylogenetic trees, domain organisation lists and
networks
Check for over- and under-represented superfamilies within a genome
For each superfamily you can:
Inspect SCOP classification, functional annotation, Gene Ontology annotation, InterPro
abstract and genome assignments
Explore taxonomic distribution of a superfamily across the tree of life
All annotation, models and the database dump are freely available for download to everyone.
Purpose
SUPERFAMILY classifies amino acid sequences into known structural domains, especially
into SCOP superfamilies. The superfamilies are groups of proteins which have structural
evidence to support a common evolutionary ancestor but may not have detectable
sequence homology.
Major Features
Sequence search: Submit your protein or DNA sequence for SCOP superfamily and family
level classification.
Keyword search: Search for superfamily, family or species names, plus
sequence, SCOP, PDB or hidden Markov model IDs.
Domain assignments: Domain assignments, alignments and architectures for completely
sequenced eukaryotic and prokaryotic organisms, plus sequence collections.
Comparative genomics tools: Browse unusual (over- and under-represented) superfamilies and
families, adjacent domain pair lists and graphs, unique domain pairs, domain
combinations, domain architecture co-occurrence networks and domain
distribution across taxonomic kingdoms for each organism.
Genome statistics: For each genome: number of sequences, number of sequences with
assignment, percentage of sequences with assignment, percentage total
sequence coverage, number of domains assigned, number of superfamilies
assigned, number of families assigned, average superfamily size,
percentage produced by duplication, average sequence length, average
length matched, number of domain pairs and number of unique domain
architectures.
Gene Ontology: Domain-centric Gene Ontology (GO), automatically annotated, by Hai Fang.
Phenotype Ontology: Domain-centric phenotype/anatomy ontology, including Disease
Ontology, Human Phenotype, Mouse Phenotype, Worm Phenotype, Yeast
Phenotype, Fly Phenotype, Fly Anatomy, Zebrafish Anatomy, Xenopus
Anatomy and Arabidopsis Plant.
Superfamily annotation: InterPro abstracts for 1052 superfamilies and Gene Ontology (GO)
annotation for 763 superfamilies.
Functional annotation: Functional annotation of SCOP 1.73 superfamilies, by Christine Vogel.
Phylogenetic trees: Trees are generated using heuristic parsimony methods and are based on
protein domain architecture data for all genomes in SUPERFAMILY.
Genome combinations or specific clades can be displayed as individual
trees.
Similar domain architectures: Find the 10 domain architectures which are most similar to a
domain architecture of interest.
Hidden Markov models: Produce SCOP domain assignments for your sequences using the
SUPERFAMILY models. HMM visualisation by Martin Madera,
e.g. model 0045110.
Profile comparison: Find remote domain matches when the HMM search fails to find a
significant match. Profile comparison (PRC), for aligning and scoring two
profile hidden Markov models, by Martin Madera.
Web services: Distributed Annotation Server and linking to SUPERFAMILY.
Downloads: Sequences, assignments, models, MySQL database and scripts, updated
weekly.
The SUPERFAMILY database provides protein domain assignments, at the SCOP superfamily
level, for the predicted protein sequences in over 400 completed genomes. A superfamily groups
together domains of different families which have a common evolutionary ancestor, based on
structural, functional and sequence data. SUPERFAMILY domain assignments are generated
using an expert-curated set of profile hidden Markov models. All models and structural
assignments are available for browsing and download from http://supfam.org. The web interface
includes services such as domain architectures and alignment details for all protein assignments,
searchable domain combinations, domain occurrence network visualization, detection of over- or
under-represented superfamilies for a given genome by comparison with other genomes,
assignment of manually submitted sequences and keyword searches. In this update we describe
the SUPERFAMILY database and outline two major developments: (i) incorporation of family
level assignments and (ii) a superfamily-level functional annotation. The SUPERFAMILY
database can be used for general protein evolution and superfamily-specific studies, genomic
annotation, and structural genomics target suggestion and assessment.
DBTSS Database of transcriptional start sites
DOGS Database of genome sizes
EID The exon-intron database – Harvard
Exon-Intron Exon-Intron database – Singapore
EPD Eukaryotic promoter database
FlyTrap HTML based gene expression database
GDB The genome database
GenLink Resources for human genetic and telomere research
GeneKnockouts Gene knockout information
GENOTK Human cDNA database
GEO Gene expression omnibus NCBI
GOLD Information on genome projects around the world
GSDB The Genome Sequence DataBase
HGI TIGR human gene index
HTGS High-throughput genomic sequence at NCBI
IMAGE The largest collection of DNA sequences clones
IMGT The international ImMunoGeneTics information system
IPCN Index to Plant Chromosome Numbers database
LocusLink Single query interface to sequence and genetic loci
TelDB The telomere database
MitoDat Mitochondrial nuclear genes
Mouse EST NIA mouse cDNA project
MPSS Searchable databases of several species
NDB Nucleic acid database
NEDO Human cDNA sequence database
NPD Nuclear protein database
Oomycetes DB Oomycetes database at Virginia Bioinformatics Institute
PLACE Database of plant cis-acting regulatory DNA elements
RDP Ribosomal database project
RDB Receptor database at NIHS Japan
Refseq The NCBI reference sequence project
RHdb Radiation hybrid physical map of chromosomes
SHIGAN SHared Information of GENetic resources Japan
SpliceDB Canonical and non-canonical splice site sequences
STACK Consensus human EST database
TAED The adaptive evolution database
TIGR Curated databases of microbes plants and humans
TRANSFAC The Transcription Factor Database
TRRD Transcription Regulatory region database
UniGene Cluster of sequences for unique genes at NCBI
UniSTS Nonredundant collection of STS
Protein Databases
Protein Sequence Databases – Protein Structure Databases – Protein Domains, Motifs
and Signatures – Others
Protein Sequence Databases
Antibodies Sequence and Structure
BRENDA Enzyme database
CD Antigens Database of CD antigens
dbCFC Cytokine family database
Histones Histone sequence database
HPRD Human protein reference database
InterPro Integrated documentation resources for protein families
iProClass An integrated protein classification database
KIND A non-redundant protein sequence database
MHCPEP Database of MHC binding peptides
MIPS Munich information centre for protein sequences
PIR Annotated and non-redundant protein sequence database
PIR-ALN Curated database of protein sequence alignments
PIR-NREF PIR nonredundant reference protein database
PMD Protein mutant database
PRF Protein research foundation Japan
ProClass Non-redundant protein database
ProtoMap Hierarchical classification of swissprot proteins
REBASE Restriction enzyme database
RefSeq Reference sequence database at NCBI
SwissProt Curated protein sequence database
SPTR Comprehensive protein sequence database
Transfac Transcription factor database
TrEMBL Annotated translations of EMBL nucleotide sequences
Tumor gene database Genes with cancer-causing mutations
WD repeats WD-repeat family of proteins
Protein Structure Databases
Cath Protein structure classification
HIV Protease HIV protease database 3D structure
PDB 3-D macromolecular structure data
PSI Protein structure initiative
S2F Structure to function project
Scop Structural Classification of Proteins
Protein Domains, Motifs & Signatures
BLOCKS Multiple aligned segments of conserved protein regions
CDD Conserved domain database and search service
DOMO Homologous protein domain families
Pfam Database of protein domains and HMMs
ProDom Protein domain database
Prints Protein motif fingerprint database
Prosite Database of protein families and domains
SMART Simple modular architecture research tool
TIGRFAM Protein families based on HMMs
Others
Phospho Site Database of phosphorylation sites
PROW Protein reviews on the web
Protein Lounge Complete systems biology
Other Databases
Carbohydrate Databases
Carb DB Carbohydrate Sequence and Structure Database
GlycoWord Glycoscience related information
SPECARB Raman Spectra of carbohydrates
Other Databases
AlzGene Alzheimer's disease
Polygenic pathways Alzheimer's disease, Bipolar disorder or Schizophrenia
Model Organism Databases and Resources
Arabidopsis – Bacteria – Sea Bass – Cat – Cattle – Chicken – Cotton –
CyanoBacteria – Daphnia – Deer – Dictyostelium – Dog – Frog – Fruit Fly – Fungus –
Goat – Horse – Medaka Fish – Maize – Malaria – Mosquito – Mouse – Pig – Plants –
Protozoa – Puffer Fish – Rat – Rice – Rickettsia – Salmon – Sheep – Soy –
Sorghum – Tetraodon – Tilapia – Turkey – Viruses – Worm – Yeast – Zebra Fish
General Information
GMOD Generic Model Organism Database
Model Organisms The WWW virtual library of model organisms
Arabidopsis thaliana
ABRC Arabidopsis biological resource center
AGI Arabidopsis genome initiative
AREX Arabidopsis gene expression database
Arabinet Arabidopsis information on the www
AtGDB An Arabidopsis thaliana plant genome database
AtGI TIGR Arabidopsis thaliana gene index
ATGC Genome sequencing at ATGC
ATIDB Arabidopsis insertion database
CSHL Arabidopsis genome analysis at Cold Spring
ESSA Arabidopsis thaliana project at MIPS
Genoscope AGI in France
Kazusa Arabidopsis thaliana genome info Japan
MPSS Massively parallel signature sequencing
NASC Nottingham Arabidopsis stock center
Stanford Sequencing of the Arabidopsis genome at Stanford
TAIR Arabidopsis information resource
TIGR TIGR Arabidopsis genome annotation database
Wustl Arabidopsis genome at Washington university
Trees A forest tree genome database
Bacterial genomes
B subtilis Bacillus subtilis database
Chlamydomonas Chlamydomonas genetics center
E coli Ecoli genome project
MGD Microbial germ plasm database
Microbial Microbial Genome Gateway
Microbial Microbial genomes
Micado Genetics maps of B subtilis and E coli
MycDB An integrated Mycobacterial database
Neisseria Neisseria meningitidis genome
Neurospora Neurospora crassa database
OralGen Oral pathogen database
Salmonella Salmonella information
STDGen Sexually transmitted disease database
Bass
Bass Sea Bass Mapping project
Cat (Felis catus)
Cat ArkDB Cat mapping database
Cattle (Bos taurus)
ARK Farm animals
BoLA Bovine MHC information
Bovin Bovine genome database
BovMap Mapping the bovine genome
CaDBase Genetic diversity in cattle
ComRad Comparative radiation hybrid mapping
Cow ArkDB Bovine ArkDB
GemQual Genetics of meat quality
Chicken (Gallus gallus)
Chicken Poultry gene mapping project
ChickMap Chicken genome project
Chicken ArkDB Chicken database
ChickEST Chick EST database
Poultry Poultry genome project
Cotton
Cotton Cotton data collection site
Cyano Bacteria (Blue green algae)
Cyano Bacteria Anabaena genome
Daphnia (Crustacea)
Daphnia pulex Daphnia genomics consortium
Deer
Deer ArkDB Deer mapping database
Dictyostelium discoideum
Dicty_cDB Dictyostelium discoideum cDNA project
DGP Dictyostelium discoideum genome project
Dictybase Online informatics resources for Dictyostelium
Dog (Canis familiaris)
Dog Dog genome project
Dog genome project
Frog (Xenopus)
Xenbase A Xenopus web resource
Xenopus Xenopus tropicalis genome
Fruit fly (Drosophila melanogaster)
ENSEMBL Drosophila Genome Browser at ENSEMBL
Fruitfly Drosophila genome project at Berkeley
FlyBase A Database of the Drosophila Genome
FlyMove A Drosophila multimedia database
FlyView A Drosophila image database
Fungus
Aspergillus Aspergillus Genomics
Candida Candida albicans information page
FungalWeb Fungi database
FGSC Fungal genetic stocks center
Goat (Capra hircus)
Goat GoatMap mapping the caprine genome
Horse (Equus caballus)
Horse ArkDB Horse mapping database
Medaka Fish
Medaka Medaka fish home page
Maize
Maize Maize genome database
Malaria (Plasmodium spp)
Malaria Malaria genetics and genomics
PlasmoDB Plasmodium falciparum genome database
Parasites Parasite databases of clustered ESTs
Parasite Genome Parasite genome databases
Mosquito
Mosquito Mosquito genome web server
Mouse (Mus musculus)
ENSEMBL Mouse genome server at ENSEMBL
Jackson Lab Mouse Resources
MRC Mouse genome center at MRC UK
MGI Mouse genome informatics at Jackson Labs
MGD Mouse genome database
MGS Mouse genome sequencing at NIH
MIT Genetic and physical maps of the mouse genome
Mouse SNP Mouse SNP database
NCI Mouse repository
NIH NIH mouse initiative
ORNL Mutant mouse database
RIKEN Mouse resources
Rodentia The whole mouse catalog
Pig (Sus scrofa)
INCO Pig trait gene mapping
Pig Pig EST database
Pig Pig gene mapping project
PiGBase Pig genome mapping
Pig ArkDB Pig Ark DB
Plants
PlantGDB Resources for plant comparative genomics
Protozoa
Protozoa Protozoan genomes
Pufferfish
Fugu Puffer fish project UK site
Fugu Fugu genome project Singapore
Fugu Puffer fish project USA
Rat (Rattus norvegicus)
MIT Genetic maps of the Rat genome
NIH Rat genomics and genetics
Rat RatMap
RGD Rat genome database
Rice (Oryza sativa)
MPSS Massively parallel signature sequencing
Rice-research Rice genome sequence database
Rice Rice genome project
Rickettsia
RicBase Rickettsia genome database
Salmon
Salmon ArkDB Salmon mapping database
Sheep (Ovis aries)
Sheep Sheep gene mapping
SheepBase Sheep gene mapping
Sheep ArkDB Sheep mapping database
Soy
Soy Soybeans database
Sorghum
Sorghum Sorghum Genomics
Tetraodon
Tetraodon Tetraodon nigroviridis genome
Tetraodon Tetraodon nigroviridis genome at Whitehead
Tilapia
HCGS Tilapia genome
Tilapia ArkDB Tilapia mapping database
Turkey
Turkey ArkDB Turkey mapping database
Viruses
HIV HIV sequence database
Herpes Human herpes virus 5 database
Worm (Caenorhabditis elegans)
C elegans C elegans genome sequencing project
NemBase Resource for nematode sequence and functional data
WormAtlas Anatomy of C elegans
WormBase The Genome and biology of C elegans
ACEDB A C elegans database
WWW Server C elegans web server
Yeast
SCPD The promoter database of Saccharomyces cerevisiae
SGD Saccharomyces genome database
S pombe Schizosaccharomyces pombe genome project
TRIPLES Functional analysis of Yeast genome at Yale
Yeast Intron database Spliceosomal introns of the yeast
Zebra fish (Danio rerio)
ZFIN Zebrafish information network
ZGR Zebrafish genome resources
ZIS Zebrafish information server
Zebrafish Zebrafish webserver
DOMAIN DATABASE
Domains can be thought of as distinct functional and/or structural units of a
protein. These two classifications coincide rather often, and what is
found as an independently folding unit of a polypeptide chain often also carries a specific
function. Domains are often identified as recurring (sequence or structure) units,
which may exist in various contexts. In molecular evolution such domains may have been utilized as building blocks and may have been recombined in different
arrangements to modulate protein function. We can define conserved domains as
recurring units in molecular evolution, the extents of which can be determined by
sequence and structure analysis. Conserved domains contain conserved sequence patterns or motifs, which allow for their detection in polypeptide sequences.
The goal of the NCBI conserved domain curation project is to provide database users with
insights into how patterns of residue conservation and divergence in a family relate to functional
properties, and to provide useful links to more detailed information that may help to understand
those sequence/structure/function relationships. To do this, CDD curators include the following
types of information in order to supplement and enrich the traditional multiple sequence
alignments that form the foundation of domain models: 3-dimensional structures and conserved core motifs, conserved features/sites, phylogenetic organization, and links to electronic literature
resources.
CDD
Conserved Domain Database (CDD)
CDD is a protein annotation resource that consists of a collection of well-annotated multiple sequence alignment models for ancient domains and full-length proteins. These are available as position-specific score matrices (PSSMs) for fast identification of conserved domains in protein sequences via RPS-BLAST. CDD content includes NCBI manually curated domains, which use 3D-structure information to explicitly define domain boundaries and provide insights into sequence/structure/function relationships, as well as domain models imported from a number of external source databases (Pfam, SMART, COG, PRK, TIGRFAM).
CD-Search & Batch CD-Search
CD-Search is NCBI's interface for searching the Conserved Domain Database with protein or nucleotide query sequences. It uses RPS-BLAST, a variant of PSI-BLAST, to quickly scan a set of pre-calculated position-specific scoring matrices (PSSMs) with a protein query. The results of CD-Search are presented as an annotation of protein domains on the user query sequence and can be visualized as domain multiple sequence alignments with embedded user queries. High-confidence associations between a query sequence and conserved domains are shown as specific hits. The CD-Search Help provides additional details, including information about running CD-Search locally.
Batch CD-Search serves as both a web application and a script interface for a conserved domain search on multiple protein sequences, accepting up to 100000 proteins in a single job. It enables you to view a graphical display of the concise or full search result for any individual protein from your input list, or to download the results for the complete set of proteins. The Batch CD-Search Help provides additional details.
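The idea of scanning a query with a position-specific scoring matrix can be illustrated with a toy example. The sketch below is not RPS-BLAST: it assumes a made-up three-column matrix (TOY_PSSM), an assumed default score for unlisted residues, and a simple ungapped sliding window. All names are hypothetical.

```python
# Toy PSSM: one dict of residue scores per model position (made-up values).
TOY_PSSM = [
    {"G": 5, "A": 1},
    {"K": 4, "R": 3},
    {"S": 4, "T": 4},
]
DEFAULT_SCORE = -2  # assumed score for residues not listed at a position


def best_ungapped_hit(sequence, pssm=TOY_PSSM):
    """Slide the PSSM along the sequence and return (best_score, start_index).

    Returns None when the sequence is shorter than the matrix.
    """
    width = len(pssm)
    best = None
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        # Sum the position-specific score of each residue in the window.
        score = sum(col.get(res, DEFAULT_SCORE) for col, res in zip(pssm, window))
        if best is None or score > best[0]:
            best = (score, start)
    return best
```

Because each column carries its own scores, the same residue can be rewarded at one position and penalised at another, which is what distinguishes a PSSM from a single substitution matrix.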
CDART: Domain Architectures
Conserved Domain Architecture Retrieval Tool (CDART) performs similarity searches of the Entrez Protein database based on domain architecture, defined as the sequential order of conserved domains in protein queries. CDART finds protein similarities across significant evolutionary distances using sensitive domain profiles rather than direct sequence similarity. Proteins similar to the query are grouped and scored by architecture. You can search CDART directly with a query protein sequence or, if a sequence of interest is already in the Entrez Protein database, simply retrieve the record, open its Links menu and select Domain Relatives to see the precalculated CDART results. Relying on domain profiles allows CDART to be fast and, because it relies on annotated functional domains, informative.
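Grouping proteins by architecture, in the sense of the sequential order of their domains, can be sketched as follows. This is only an illustration of the concept, not how CDART is implemented; the protein IDs and the function name are hypothetical.

```python
from collections import defaultdict


def group_by_architecture(proteins):
    """Map each architecture (ordered tuple of domain names) to protein IDs."""
    groups = defaultdict(list)
    for protein_id, domains in proteins.items():
        # The tuple preserves domain order, so reordered domains form a new group.
        groups[tuple(domains)].append(protein_id)
    return dict(groups)


# Hypothetical input: protein ID -> domains listed from N- to C-terminus.
EXAMPLE = {
    "protA": ["SH3", "SH2", "Pkinase"],
    "protB": ["SH3", "SH2", "Pkinase"],
    "protC": ["SH2", "SH3", "Pkinase"],
}
```

Here protA and protB share one architecture, while protC has the same domains in a different order and therefore falls into a separate group.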
CDTree
CDTree is a helper application for your web browser that allows you to interactively view and examine conserved domain hierarchies curated at NCBI. CDTree works with Cn3D as its alignment viewer/editor; it is used in the CDD curation process and is both a classification and research tool for functional annotation and the study of protein and protein domain families.
Content
CDD content includes NCBI manually curated domain models and domain models imported from
a number of external source databases (Pfam, SMART, COG, PRK, TIGRFAMs). What is unique
about NCBI-curated domains is that they use 3D-structure information to explicitly define domain
boundaries, align blocks, amend alignment details, and provide insights into
sequence/structure/function relationships. Manually curated models are organized hierarchically if
they describe domain families that are clearly related by common descent. To provide a
non-redundant view of the data, CDD clusters similar domain models from various sources into
superfamilies.
Searching the database
The collection is also part of NCBI's Entrez query and retrieval system, crosslinked to numerous other resources. CDD provides annotation of domain footprints and conserved functional sites on protein sequences. Precalculated domain annotation can be retrieved for protein sequences tracked in NCBI's Entrez system, and CDD's collection of models can be queried with novel protein sequences via the CD-Search service or via Batch CD-Search, which allows the computation and download of annotation for large sets of protein queries. CDD also contains data from additional research projects, such as KOGs (a eukaryotic counterpart to COGs) and the Library of Ancient Domains (LOAD). Accessions that start with cl are for superfamily cluster records and can contain domain models from one or more source databases. When searching CDD, it is possible to limit search results to domains from any given source database by using the Database Search Field.
Phylogenetic organization: Based on evidence from sequence comparison, NCBI Conserved Domain Curators attempt to organize related domain models into phylogenetic family hierarchies.
Links to electronic literature resources: NCBI curated domains also provide links to citations in PubMed and NCBI Bookshelf that discuss the domain. These references are selected by curators and, whenever possible, include articles that provide evidence for the biological function of the domain and/or discuss the evolution and classification of a domain family.
PFAM
Pfam 27.0 (March 2013, 14831 families)
Proteins are generally composed of one or more functional regions, commonly termed domains. The presence of different domains in varying combinations in different proteins gives rise to the diverse repertoire of proteins found in nature. Identifying the domains present in a protein can provide insights into the function of that protein.
The Pfam database is a large collection of protein domain families. Each family is represented by multiple sequence alignments and hidden Markov models (HMMs).
There are two levels of quality to Pfam families: Pfam-A and Pfam-B. Pfam-A entries are derived from the underlying sequence database, known as Pfamseq, which is built from the most recent release of UniProtKB at a given time-point. Each Pfam-A family consists of a curated seed alignment containing a small set of representative members of the family, profile hidden Markov models (profile HMMs) built from the seed alignment, and an automatically generated full alignment, which contains all detectable protein sequences belonging to the family, as defined by profile HMM searches of primary sequence databases.
Pfam-B families are un-annotated and of lower quality, as they are generated automatically from the non-redundant clusters of the latest ADDA release. Although of lower quality, Pfam-B families can be useful for identifying functionally conserved regions when no Pfam-A entries are found.
Pfam entries are classified in one of four ways:
Family: A collection of related protein regions
Domain: A structural unit
Repeat: A short unit which is unstable in isolation but forms a stable structure when multiple copies are present
Motif: A short unit found outside globular domains
Related Pfam entries are grouped together into clans; the relationship may be defined by similarity of sequence, structure or profile-HMM.
Pfam also generates higher-level groupings of related families, known
as clans. A clan is a collection of Pfam-A entries which are related by
similarity of sequence, structure or profile-HMM.
You can find data in Pfam in various ways:
Sequence search: Analyze your protein sequence for Pfam matches
View a Pfam family: View Pfam family annotation and alignments
View a clan: See groups of related families
View a sequence: Look at the domain organisation of a protein sequence
View a structure: Find the domains on a PDB structure
Keyword search: Query Pfam by keywords
Jump to: Enter any type of accession or ID to jump to the page for a Pfam family or clan, UniProt sequence, PDB structure, etc.
Browse Pfam: The website lists families, clans or proteomes by their initial letter (or number). It also provides a list of Pfam families which are new to this release and a list of the twenty largest families in terms of number of sequences.
Glossary of terms used in Pfam
These are some of the commonly used terms in the Pfam website
Alignment coordinates
HMMER3 reports two sets of domain coordinates for each profile HMM match. The envelope coordinates delineate the region on the sequence where the match has been probabilistically determined to lie, whereas the alignment coordinates delineate the region over which HMMER is confident that the alignment of the sequence to the profile HMM is correct. Our full alignments contain the envelope coordinates from HMMER3.
Architecture
The collection of domains that are present on a protein
Clan
A collection of related Pfam entries The relationship may be defined bysimilarity of sequence structure or profile-HMM
Domain
A structural unit
Domain score
The score of a single domain aligned to an HMM. Note that, for HMMER2, if
there was more than one domain, the sequence score was the sum of all the domain scores for that Pfam entry. This is not quite true for HMMER3.
DUF
Domain of unknown function
Envelope coordinates
See Alignment coordinates
Family
A collection of related protein regions
Full alignment
An alignment of the set of related sequences which score higher than the manually set threshold values for the HMMs of a particular Pfam entry.
Gathering threshold (GA)
Also called the gathering cut-off, this value is the search threshold used to build the full alignment. The gathering threshold is assigned by a curator when the family is built. The GA is the minimum score a sequence must attain in order to belong to the full alignment of a Pfam entry. For each Pfam HMM we have two GA cutoff values: a sequence cutoff and a domain cutoff.
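How the two GA cutoffs might be applied to a single sequence can be sketched as below. The rule used here is an assumption made for illustration: the sequence score must reach the sequence cutoff, and each domain is then kept only if its own score reaches the domain cutoff. The function name and the scores are hypothetical.

```python
def domains_passing_ga(sequence_score, domain_scores, ga_sequence, ga_domain):
    """Return the domain scores that qualify under both GA cutoffs.

    Assumed rule: the sequence must first reach the sequence cutoff;
    each domain must then reach the domain cutoff on its own.
    """
    if sequence_score < ga_sequence:
        return []  # the whole sequence is rejected
    return [score for score in domain_scores if score >= ga_domain]
```

For example, with a sequence cutoff of 25.0 and a domain cutoff of 20.0, a sequence scoring 30.0 with domains scoring 22.0 and 8.0 would contribute only the first domain.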
HMMER
The suite of programs that Pfam uses to build and search HMMs. Since Pfam release 24.0 we have used HMMER version 3 to make Pfam. See the HMMER site for more information.
Hidden Markov model (HMM)
An HMM is a probabilistic model. In Pfam we use HMMs to transform the information contained within a multiple sequence alignment into a position-specific scoring system. We search our HMMs against the UniProt protein database to find homologous sequences.
HMMER3
The suite of programs that Pfam uses to build and search HMMs. See the HMMER site for more information.
iPfam
A resource that describes domain-domain interactions that are observed in PDB entries. Where two or more Pfam domains occur in a single structure, it analyses them to see if they are close enough to form an interaction. If they are close enough, it calculates the bonds forming the interaction.
Metaseq
A collection of sequences derived from various metagenomics datasets
Motif
A short unit found outside globular domains
Noise cutoff (NC)
The bit score of the highest-scoring match not in the full alignment.
Pfam-A
An HMM-based, hand-curated Pfam entry which is built using a small number of
representative sequences. We manually set a threshold value for each HMM and search our models against the UniProt database. All of the sequences which score above the threshold for a Pfam entry are included in the entry's full alignment.
Pfam-B
An automatically generated alignment which is formed by taking a cluster of sequences from the ADDA database and removing Pfam-A residues from them. Since Pfam-B families are automatically generated, we recommend that you verify that the sequences in a Pfam-B family are related using other methods, such as BLAST. For Pfam 24.0 we have made HMMs for the first (and therefore largest) 20000 Pfam-B families. Users can search their sequences against the Pfam-B HMMs, in addition to the Pfam-A HMMs, when performing both single-sequence searches and batch searches on the website.
Posterior probability
HMMER3 reports a posterior probability for each residue that matches a match or insert state in the profile HMM. A high posterior probability shows that the alignment of the amino acid to the match/insert state is likely to be correct, whereas a low posterior probability indicates that there is alignment uncertainty. This is indicated on a scale with * being 10, the highest certainty, down to 1 being complete uncertainty. Within Pfam we display this information as a heat map view, where green residues indicate high posterior probability and red ones indicate a lower posterior probability.
Repeat
A short unit which is unstable in isolation but forms a stable structure when
multiple copies are present
Seed alignment
An alignment of a set of representative sequences for a Pfam entry. We use this alignment to construct the HMMs for the Pfam entry.
Sequence score
The total score of a sequence aligned to an HMM. If there is more than one domain, the sequence score is the sum of all the domain scores for that Pfam entry. If there is only a single domain, the sequence and the domain scores for the protein will be identical. We use the sequence score to determine whether a sequence belongs to the full alignment of a particular Pfam entry.
Trusted cutoff (TC)
The bit score of the lowest-scoring match in the full alignment.
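The relationship between the gathering threshold, the trusted cutoff and the noise cutoff can be sketched with a short function. It assumes that membership of the full alignment is decided by a single threshold on one score per match, which simplifies the two-cutoff scheme described under Gathering threshold; the function name and the scores are hypothetical.

```python
def trusted_and_noise_cutoffs(scores, gathering_threshold):
    """Return (TC, NC) for a list of bit scores and a gathering threshold.

    TC is the lowest score at or above the threshold (lowest match kept);
    NC is the highest score below it (best match left out).
    Either value is None when the corresponding group is empty.
    """
    included = [s for s in scores if s >= gathering_threshold]
    excluded = [s for s in scores if s < gathering_threshold]
    tc = min(included) if included else None
    nc = max(excluded) if excluded else None
    return tc, nc
```

With scores 55.2, 31.0, 27.4, 19.8 and 12.1 and a threshold of 25.0, the trusted cutoff is 27.4 and the noise cutoff is 19.8, so the gathering threshold always lies between NC and TC.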
SMART
SMART (a Simple Modular Architecture Research Tool) allows the identification and annotation of genetically mobile domains and the analysis of domain architectures. More than 500 domain families found in signalling, extracellular and chromatin-associated proteins are detectable. These domains are extensively annotated with respect to phyletic distributions, functional class, tertiary structures and functionally important residues. Each domain found in a non-redundant protein database, as well as search parameters and taxonomic information, are stored in a relational database system. User interfaces to this database allow searches for proteins containing specific combinations of domains in defined taxa.
Features
You can use SMART in two different modes: normal or genomic. The main difference is in the underlying
protein database used. In Normal SMART, the database contains Swiss-Prot, SP-TrEMBL and stable
Ensembl proteomes. In Genomic SMART, only the proteomes of completely sequenced genomes are used:
Ensembl for metazoans and Swiss-Prot for the rest.
The protein database in Normal SMART has significant redundancy, even though identical proteins are
removed. If you use SMART to explore domain architectures, or want to find exact domain counts in various
genomes, consider switching to Genomic mode. The numbers in the domain annotation pages will be more
accurate, and there will not be many protein fragments corresponding to the same gene in the architecture
query results. Remember, though, that you are exploring a limited set of genomes.
Different color schemes are used to easily identify the mode you are in.
Alignment
Representation of a prediction of the amino acids in tertiary structures of homologues that overlay in three dimensions. Alignments held by SMART are mostly based on published observations (see domain annotations for details), but are updated and edited manually.
Alignment block
Ungapped alignments that usually represent a single secondary structure
Bits scores
Alignment scores are reported by HMMer and BLAST as bits scores. The likelihood that the query sequence is a bona fide homologue of the database sequence is compared to the likelihood that the sequence was instead generated by a random model. Taking the logarithm (to base 2) of this likelihood ratio gives the bits score.
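The definition above amounts to a one-line calculation. The sketch below only restates it in code; the function name and the example likelihoods are hypothetical.

```python
import math


def bits_score(likelihood_homologue, likelihood_random):
    """Base-2 logarithm of the likelihood ratio (homologue model vs random model)."""
    return math.log2(likelihood_homologue / likelihood_random)
```

A sequence that is eight times more likely under the homologue model than under the random model scores 3 bits, and each additional bit doubles that ratio. A negative score means the random model explains the sequence better.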
BLAST Basic local alignment search tool
An excellent database searching tool developed at the National Center for Biotechnology Information (NCBI) ([1] [2] [3] [4]). SMART uses NCBI-BLAST for detection of outlier homologues and homologues of known structure. WU-BLAST is used for nrdb searches with user-supplied sequences.
Cellular Role
Chromatin-associated: Chromatin is the tangled fibrous complex of DNA and protein within a eukaryotic nucleus.
Interaction (with the environment): Molecules that sense cellular environmental change, such as osmolarity, light flux, acidity, ion concentration, etc.
Metabolic: Enzymes that catalyze reactions in living cells that transform organic molecules.
Replication: The process of making an identical copy of a section of duplex (double-stranded) DNA, using existing DNA as a template for the synthesis of new DNA strands.
Signalling: Proteins that participate in a pathway that is initiated by a molecular stimulus and terminated by a cellular response.
Transport: Transporters (permeases) are proteins that straddle the cell membrane and carry specific nutrients, ions, etc. across the membrane.
Translation: The process in which the genetic code carried by messenger RNA directs the synthesis of proteins from amino acids.
Transcription: The synthesis of an RNA copy from a sequence of DNA (a gene); the first step in gene expression.
Coiled coils
Intimately associated bundles of long alpha-helices ([1] [2] [3]). Coiled coils are detected in SMART using the method of Lupas et al ([4] COILS home). Coiled coil predictions are indicated on the second line in SMART's graphical output.
Domain
Conserved structural entities with distinctive secondary structure content and a hydrophobic core. In small disulphide-rich and Zn2+-binding or Ca2+-binding domains, the hydrophobic core may be
provided by cystines and metal ions, respectively. Homologous domains with common functions
usually show sequence similarities.
Domain composition
Proteins with the same domain composition have at least one copy of each of the domains of the query.
Domain organisation
Proteins having all the domains as the query in the same order (Additional domains are allowed)
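The difference between domain composition and domain organisation can be made concrete with two small checks. This is a sketch of the two definitions only, not of how SMART implements its queries; the function names and domain lists are hypothetical.

```python
def same_composition(query, target):
    """True if the target has at least one copy of every domain in the query."""
    return set(query) <= set(target)


def same_organisation(query, target):
    """True if all query domains occur in the target in the same order.

    Additional domains in the target are allowed (subsequence test).
    """
    remaining = iter(target)
    # Each membership test consumes the iterator up to the matching domain,
    # so later query domains can only match further along the target.
    return all(domain in remaining for domain in query)
```

For a query of SH3 followed by SH2, a protein with SH2, PH, SH3 has the same composition but not the same organisation, whereas a protein with SH3, PH, SH2 satisfies both.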
Entrez
A WWW-based system that allows easy retrieval of sequence, structure, molecular biology and literature data. SMART's domain annotation pages contain links to the Entrez system, thereby providing extensive literature, structure and sequence information.
E-value
This represents the number of sequences with a score greater than or equal to X expected absolutely by chance. The E-value connects the score (X) of an alignment between a user-supplied sequence and a database sequence, generated by any algorithm, with how many alignments with similar or greater scores would be expected from a search of a random sequence database of equivalent size. Since version 2.0, E-values are calculated using Hidden Markov Models, leading to more accurate estimates than before.
Extracellular Domains
Domain families that are most prevalent in proteins outside of the cytoplasm and the nucleus
Gap
A position in an alignment that represents a deletion within one sequence relative to another. Gap penalties are requirements for alignment algorithms in order to reduce excessively gapped regions. Gaps in alignments represent insertions that usually occur in protruding loops or beta-bulges within protein structures.
Genomic database
Protein database used in SMART's Genomic mode. It contains data from completely sequenced genomes only. Ensembl data is used for Metazoan genomes and Swiss-Prot for others. A complete list of genomes in the database is available.
HMM Hidden Markov model
HMMs are statistical models of the sequence consensus of a homologous family (see the documentation). A particular class of HMMs has been shown to be equivalent to generalised profiles (8867839).
Applications of HMMs to sequence analysis are provided by HMMer and SAM.
HMM consensus
The HMM consensus is a one-line summary of the corresponding HMM. The amino acid shown for the consensus is the highest-probability amino acid at that position according to the HMM. Capital letters mean highly conserved residues (probability > 0.5 for protein models) (modified from the HMMer User's Guide).
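Deriving a consensus line from per-position probabilities can be sketched as follows. The input format (one dict of residue probabilities per position) and the use of lower case for less conserved positions are assumptions made for this illustration; only the "capital letter when probability > 0.5" rule comes from the text.

```python
def hmm_consensus(position_probs, threshold=0.5):
    """Build a one-line consensus from per-position residue probabilities."""
    chars = []
    for probs in position_probs:
        # Pick the most probable residue at this position.
        residue, p = max(probs.items(), key=lambda item: item[1])
        # Upper case marks a highly conserved position; lower case is assumed otherwise.
        chars.append(residue.upper() if p > threshold else residue.lower())
    return "".join(chars)
```

A position where glycine has probability 0.9 is written G, while a position whose best residue only reaches 0.4 is written in lower case.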
HMMer
The HMMer package ([1] [2]) provides multiple alignment and database searching capabilities. There are several programs in the package (see the documentation), including one (hmmfs) that searches databases for non-overlapping LOCAL similarities (i.e. that match across at least part of the HMM) and another (hmmls) that searches databases for non-overlapping GLOBAL similarities (i.e. that match across the full HMM). These correspond approximately to profile-based searches using negative and positive profiles, respectively (see WiseTools). Database searches using hmmls or hmmfs provide alignment scores as bits scores.
Homology
Evolutionary descent from a common ancestor due to gene duplication
Intracellular Domains
Domain families that are most prevalent in proteins within the cytoplasm
Localisation
Numbers of domains that are thought from SwissProt annotations to be present in different cellular compartments (cytoplasm, extracellular space, nucleus and membrane-associated) are shown in annotation pages.
Motif
Sequence motifs are short conserved regions of polypeptides. Sets of sequence motifs need not necessarily represent homologues.
NRDB non-redundant database
A database that contains no identical pairs of sequences. It can contain multiple sequences originating from the same gene (fragments, alternative splicing products). SMART in Standard mode uses such a database.
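Removing identical sequences while keeping fragments can be sketched in a few lines. This illustrates the definition only; the record format of (identifier, sequence) pairs and the function name are hypothetical.

```python
def make_non_redundant(records):
    """Keep the first record for each distinct sequence string."""
    seen = set()
    unique = []
    for identifier, sequence in records:
        if sequence not in seen:  # only exact duplicates are dropped
            seen.add(sequence)
            unique.append((identifier, sequence))
    return unique
```

Because only exact matches are removed, a fragment such as MK survives alongside the longer MKV, which is why such a database can still hold several entries from the same gene.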
ORF
Open reading frame
Outlier homologues
These are often difficult to detect using HMM methodology. A complementary approach to their
detection is to query a database of sequences taken from multiple sequence alignments using BLAST. Selecting this option will also activate searches against sequence databases derived from proteins of known structure. A simple BLAST search of the PDB is performed, together with a search of RPS_Blast profiles derived from SCOP. These profiles were kindly provided by Steffen Schmidt (see Schmidt et al, J Chem Inf Comput Sci 2002 (42) 405-7).
P-value
This represents the probability that, given a database of a particular size, random sequences score higher than a value X. P-values are generated by the BLAST algorithm that has been integrated into SMART.
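E-values and P-values describe the same chance expectation in two ways. The sketch below uses the standard Poisson relation from BLAST statistics, P = 1 - exp(-E), which these notes do not state and which is added here as background. The function name is hypothetical.

```python
import math


def p_value_from_e_value(e_value):
    """Probability of at least one chance hit, given E expected chance hits.

    Uses P = 1 - exp(-E); expm1 keeps precision for very small E.
    """
    return -math.expm1(-e_value)
```

For small values the two are nearly equal (E = 0.001 gives P of about 0.001), while E = 1 corresponds to P of about 0.63.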
PDB protein data bank
PDB is an archive of experimentally determined three-dimensional structures (Brookhaven Nat Labs or EBI). Domain families represented in SMART and in the PDB are annotated as being of known structure; links are provided in SMART to the PDB via PDBsum and MMDB. PDBsum links can be used to access a variety of sequence-based and structure-based tools, whereas MMDB provides access to literature information and structure similarities.
PFAM
Pfam is a database of protein domain families represented as (i) multiple alignments and (ii) HMM-profiles ([1] [2]). Pfam WWW servers allow comparison of user-supplied sequences with the Pfam database (Sanger Center and Washington Univ). SMART contains a facility to search the Pfam collection using HMMer.
Prokaryotic domains
SMART now also searches for domains found in two-component regulatory systems. These can be found mainly in prokaryotes, but a few were also found in eukaryotes such as yeast and plants.
Profile
A profile is a table of position-specific scores and gap penalties, representing a homologous family, that may be used to search sequence databases (Ref [1] [2] [3]). In CLUSTAL-W-derived profiles, those sequences that are more distantly related are assigned higher weights ([4] [5] [6]). Issues in profile-based database searching are discussed in Bork & Gibson (1996) [7].
ProfileScan
An excellent WWW server that allows a user to compare a protein or DNA sequence against a database of profiles (located at the ISREC).
PROSITE
This is a dictionary of protein sites and motif patterns. Some SMART domain annotations contain links to PROSITE.
Schnipsel database domain sequence database
Schnipsel is a German word meaning snippet or fragment. The schnipsel database consists of the sequences of all domains found with SMART in NRDB. Outliers of a family often cannot be detected by a profile, yet are detectable by pairwise similarity to one or more established members of a sequence family. So searching against the schnipsel database gives complementary information to the profile searches.
Searched domains
In the first version of SMART only eukaryotic signalling domains could be searched. In 1998 we extended this set with prokaryotic signalling and extracellular domains. On the input page you can choose whether you want to search only cytoplasmic (prokaryotic + eukaryotic), extracellular or all domains.
Secondary Literature
The secondary literature is derived by the following procedure. For each of the hand-selected papers referenced by a domain, 100 neighbouring papers are retrieved using Medline. If one of these neighbouring papers is referenced from more than two original papers, it is included in the secondary literature list.
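The selection rule above can be sketched in a few lines of Python. The input format (a mapping from each hand-selected paper to its retrieved neighbours) and the function name are assumptions made for this example, and "more than two original papers" is read as three or more.

```python
from collections import Counter


def secondary_literature(neighbours_by_paper, min_sources=3):
    """Return neighbour papers linked to at least `min_sources` original papers.

    `neighbours_by_paper` maps each hand-selected paper ID to the list of
    neighbouring paper IDs retrieved for it (hypothetical input format).
    """
    counts = Counter()
    for neighbours in neighbours_by_paper.values():
        counts.update(set(neighbours))  # count each original paper only once
    return sorted(p for p, n in counts.items() if n >= min_sources)
```

Retrieving the neighbours themselves (the Medline step) is outside the scope of this sketch.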
Seed Alignment
Alignment that contains only one of each pair of homologues that are represented in a CLUSTALW-derived phylogenetic tree linked by a branch of length less than a distance of 0.2 (see the related article).
SEG
A program of Wootton & Federhen [1] that detects regions of the query sequence that have low compositional complexity [2].
Sequence ID or ACC
Sequence identifiers or accession codes may be entered via the SMART homepage to initiate a query. You can use either Uniprot or Ensembl sequence identifiers.
Signalling Domains
The original set of domains used in SMART was collected as those that satisfied one or both of two criteria:
1. cytoplasmic domains that possess kinase, phosphatase, ubiquitin ligase or phospholipase enzymatic activities, or those that stimulate GTPase activation or guanine nucleotide exchange;
2. cytoplasmic domains that occur in at least two proteins with different domain organisations, of which one also contains a domain that satisfies criterion 1.
These domains mostly mediate or regulate the transduction of an extracellular signal towards the nucleus, resulting in the initiation of a cellular response. More recently, prokaryotic two-component signalling domains have been added to the SMART set.
SignalP
This program predicts the presence and location of signal peptide cleavage sites in amino acid sequences (SignalP home page).
SMART: Simple Modular Architecture Research Tool
Our main goal in providing this tool is to allow automatic identification and annotation of domains in user-supplied protein sequences.
Species
Numbers of domains present in a variety of selected taxa (animals, archaea, bacteria, fungi, plants and protozoa) are shown in annotation pages.
SwissProt
The SwissProt database is an extensively annotated and non-redundant collection of protein sequences. SwissProt annotations have been mined for SMART-derived annotations of alignments.
TMHMM2
This program predicts the location and topology of transmembrane helices (TMHMM2)
Thresholds
For each of the domains found by SMART, a combination of thresholds is used to distinguish between true and false hits. The different thresholds are described in the SMART paper.
WiseTools
A package that is based on database searches using profiles. Profiles may be generated using PairWise and then compared with sequence databases using SearchWise. Scores are generated for alignments that match the whole of the profile (using a positive profile) or else that match at least part of the profile (using a negative profile). Only the top-scoring optimal alignment of each
sequence is reported; hence SMART relies on re-iterating the search for new repeats until none are reported that score above threshold. Score thresholds have been set manually at values considered to represent a score just above the top-scoring true negative. Additional thresholds have been estimated for domains that are repeated in single polypeptides. A more recent package allows comparison of DNA sequences at the level of their conceptual translations, regardless of sequence error and introns (see Wise2).
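The re-iteration described above (one best hit per search, repeated until nothing scores above threshold) can be sketched as a loop that masks each reported hit before searching again. This is only an illustration: `search_best_hit` is a hypothetical stand-in for a real profile search, and masking with "X" is an assumption about how already-found repeats are excluded.

```python
def find_repeats(sequence, search_best_hit, threshold):
    """Collect repeats by re-running a search that reports only its best hit.

    `search_best_hit(seq)` is a hypothetical callable standing in for a
    profile search; it returns (start, end, score) for its best hit, or None.
    """
    masked = list(sequence)
    hits = []
    while True:
        hit = search_best_hit("".join(masked))
        if hit is None or hit[2] <= threshold:
            break  # nothing left that scores above threshold
        start, end, _score = hit
        hits.append(hit)
        masked[start:end] = "X" * (end - start)  # mask so the next pass finds a new repeat
    return sorted(hits)
```

Any callable with the same return shape can be plugged in, for example a toy motif finder when experimenting.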
What's new
Changes from version 6.0 to 7.0
Full text search engine
Perform a full text search of SMART and Pfam domain annotations, plus the complete protein descriptions for Uniprot and Ensembl proteins.
metaSMART
Explore and compare domain architectures in various publicly available metagenomics datasets.
iTOL export and visualization
Domain architecture analysis results can be exported and visualized in interactive Tree Of Life. The new option can be found in the protein list function select list.
User interface cleanup
Various small changes to the UI resulting in faster and easier navigation.
Changes from version 5.1 to 6.0
Metabolic pathways information
SMART domains and proteins available in the genomic mode now have basic metabolic pathways information. It is generated by mapping our genomic mode protein database to the KEGG orthologous groups.
Updated genomic mode protein database
The greatly expanded genomic mode database now includes sequences from 630 completely sequenced species.
Changes from version 5.0 to 5.1
SMART webservice
You can access SMART using our webservice. Check the WSDL file for details. The webservice is still under active development and only works for sequences and IDs which are in our database (complete Uniref100 and stable Ensembl genomes).
SMART DAS server
The SMART DAS server is available at the URL http://smart.embl.de/smart/das. It provides all the protein features from our database (SMART domains, signal peptides, transmembrane regions and coiled coils) for all Uniref100 and Ensembl proteins.
If you need help with these services, or have questions or feedback, please contact us.
Changes from version 4.1 to 5.0
New protein database in Normal mode
SMART now uses Uniprot as the main source of protein sequences. All Ensembl proteomes (except pre-releases) are also included. To lower redundancy in the database, the following procedure is used:
o only one copy of 100% identical proteins is kept (different IDs are still available)
o each species' proteins are separated
o CD-HIT clustering with a 96% identity cutoff is performed on each species separately
o the longest member of each protein cluster is used as the representative
o only representative cluster members and single proteins (i.e. proteins which are not members of any clusters) are used in all domain architecture queries and for domain counts in the annotation pages
Even though the number of proteins in the database has almost doubled (the current version has around 2.9 million proteins), the redundancy should be minimal.
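Two of the steps in the procedure above, collapsing 100% identical sequences while keeping their IDs and choosing the longest member of each cluster as representative, can be sketched in Python. The input formats and function names are hypothetical, and the clustering itself (done with CD-HIT in the text) is assumed to have been run separately, with its clusters passed in.

```python
from collections import defaultdict


def collapse_identical(proteins):
    """Group protein IDs by identical sequence, so one copy can be kept.

    `proteins` is a list of (protein_id, sequence) pairs (hypothetical format).
    Returns a mapping from sequence to every ID that shares it.
    """
    ids_by_sequence = defaultdict(list)
    for protein_id, sequence in proteins:
        ids_by_sequence[sequence].append(protein_id)
    return ids_by_sequence


def pick_representatives(clusters):
    """Return the ID of the longest member of each cluster.

    `clusters` maps a cluster name to a list of (protein_id, sequence) pairs,
    for example as parsed from the output of a clustering tool.
    """
    return {
        name: max(members, key=lambda member: len(member[1]))[0]
        for name, members in clusters.items()
    }
```

Proteins that fall into no cluster would simply be kept as they are, as the last list item states.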
Domain architecture invention dating
As a further step from the single domain towards the understanding of multi-domain proteins, SMART now predicts the taxonomic class where the concept of a protein, that is, its domain architecture, was invented. The domain architecture is defined as the linear order of all SMART domains in the protein sequence. To derive the point of its invention, all proteins with the same domain architecture are mapped onto NCBI's taxonomy. The last common ancestor of all organisms containing at least one protein with the domain architecture is defined as the point of its origin.
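The last-common-ancestor step can be illustrated with a short Python sketch. It assumes each organism is given as a root-to-leaf list of taxon names (an input format chosen for this example); how SMART itself stores and traverses the NCBI taxonomy is not described in the text.

```python
def last_common_ancestor(lineages):
    """Return the deepest taxon shared by all lineages, or None if there is none.

    Each lineage is a root-to-leaf list of taxon names (hypothetical format).
    """
    if not lineages:
        return None
    ancestor = None
    for level in zip(*lineages):  # walk down all lineages in parallel
        if len(set(level)) != 1:
            break  # lineages diverge at this depth
        ancestor = level[0]
    return ancestor
```

For a domain architecture found in a chordate, an arthropod and a fungus, the function returns the taxon at which their lineages last agree.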
Changes from version 4.0 to 4.1
Two modes of operation: Normal and Genomic
For more details visit the change mode page.
Intrinsic protein disorder prediction
You can now include DisEMBL prediction of protein disorder in your search parameters. The results are displayed as blue regions in protein schematics. DisEMBL's HOTLOOPS and REM465 methods are currently used. Visit the DisEMBL page for more information on the method.
Catalytic activity check
SMART now includes data on specific requirements for catalytic activity for some of our domains (50 catalytic domains at the moment; full list here). If the required amino acids are not present in the predicted domain, it will be marked as 'Inactive'. The domain annotation page will show you the details of which amino acids are missing and links to the relevant literature (Pubmed link).
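A check of this kind can be sketched as follows. The representation of the requirements (a 1-based position within the domain mapped to the set of permitted amino acids) is an assumption for illustration only; real requirements would presumably be defined relative to the domain alignment.

```python
def missing_catalytic_residues(domain_sequence, required):
    """List (position, observed residue) for every unmet requirement.

    `required` maps a 1-based position within the domain to the set of
    amino acids allowed there (hypothetical format). A position beyond the
    end of the domain is reported with "-" as the observed residue.
    """
    missing = []
    for position, allowed in sorted(required.items()):
        if position <= len(domain_sequence):
            residue = domain_sequence[position - 1]
        else:
            residue = "-"
        if residue not in allowed:
            missing.append((position, residue))
    return missing


def is_active(domain_sequence, required):
    """A domain is treated as active only if no required residue is missing."""
    return not missing_catalytic_residues(domain_sequence, required)
```

The list returned by `missing_catalytic_residues` corresponds to the "which amino acids are missing" detail mentioned above.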
Taxonomic trees
Architecture query results are now displayed as simple taxonomic trees. In addition to individual proteins, you can select any taxonomic node (or multiple nodes) and display all the proteins in those nodes. The Evolution section of domain annotation pages is now also represented as a taxonomic tree. The basic tree contains only several hand-picked representative species, with a link to the full tree.
User interface redesigned
SMART has been completely rewritten and all pages conform to the XHTML 1.0 Strict and CSS Level 2 standards. We recommend a modern standards-compliant browser for the best experience. Mozilla Firefox and Opera are our favorites.
Changes from version 3.5 to 4.0
Intron positions are shown in protein schematics
For proteins that match any of the Ensembl predictions, SMART will show intron positions as vertical coloured lines in graphical representations (see example). This information is retrieved from a pre-calculated mapping of Ensembl gene structures to protein sequences.
The vertical line at the end of the protein is not an actual intron, but a mark to show that intron mapping was performed. If that is the only line, there are no introns annotated. If there is no line at all, there is no data available in Ensembl for that particular sequence.
You can switch off intron display on your SMART preferences page.
Alternative splicing information
Since SMART now incorporates Ensembl genomes, the Additional information page shows a list of alternative splices of the gene encoding the analyzed protein (if there are any). It is possible either to display SMART protein annotation for any of the alternative splices or to get a graphical multiple sequence alignment of all of them.
Orthology information
There are 2 separate sets of orthologs for each Ensembl protein: 1:1 reciprocal best matches in other genomes, and orthologous groups with reciprocal best hits from all genomes analyzed (i.e. each of these proteins has exactly one ortholog in all 6 genomes). This data is displayed on the Additional information page.
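The idea of a 1:1 reciprocal best match between two genomes can be sketched in Python. The inputs (each protein's single best-scoring match in the other genome) are hypothetical and would in practice be parsed from similarity search output; this is an illustration of the concept rather than SMART's pipeline.

```python
def reciprocal_best_hits(best_a_to_b, best_b_to_a):
    """Return sorted (a, b) pairs where each protein is the other's best match.

    Both arguments map a query protein ID to the ID of its best-scoring
    match in the other genome (hypothetical input format).
    """
    return sorted(
        (a, b) for a, b in best_a_to_b.items() if best_b_to_a.get(b) == a
    )
```

An orthologous group across several genomes would then require such a reciprocal pair between every pair of genomes involved.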
Graphical multiple sequence alignments
Orthologous groups and different alternative splices can be displayed as graphical multiple sequence alignments. Proteins are aligned using ClustalW. Domains, intrinsic features and introns are mapped onto the alignment, with their positions adjusted according to gaps (black boxes).
Changes from version 3.4 to 3.5
Features of all Ensembl genomes are stored in SMART
You can use standard Ensembl protein identifiers (for example ENSP00000264122) and sequences in all queries.
Batch access
Quickly display results for hundreds of sequences. Currently limited to 500 sequences per access, 2000 per day.
Smarter handling of identical sequences
If there are multiple IDs associated with a particular sequence, an extra table will be displayed showing all of them.
Changes from version 3.3 to 3.4
Search structure-based profiles using RPS-Blast
Clicking on the 'search schnipsel and structures' checkbox will now also initiate a search of profiles based on scop domain families using RPS-Blast. These profiles were kindly provided by Steffen Schmidt (see Schmidt et al., J Chem Inf Comput Sci 2002 (42) 405-7).
Improved Architecture analysis
In addition to standard Domain selection querying, it is now possible to do queries based on GO (Gene Ontology; click here for more info) terms associated with domains. In the first step, you get a list of domains matching the GO terms entered. After selecting the domains of interest from the list, proteins containing those domains are displayed. Use the Taxonomic selection box to limit the results based on taxonomic ranges.
Pfam domains are stored in the database
The SMART database now contains precomputed results for all Pfam domains. To use Pfam domains in the architecture queries, prepend the domain name with Pfam: (for example TyrKc AND Pfam:Fz AND TRANS).
Try our SMART Toolbar for the Mozilla web browser. Click here for more info.
Changes from version 3.2 to 3.3
Fantastic new protein picture generator
Proteins are now displayed as dynamically generated PNG images. This means that you can download the entire protein representation as a single image. You have our permission to do this and use these diagrams in any way that you like; do acknowledge us though. All domain bubbles have been script-generated using The Gimp and its Perl-Fu extension. The script is GPL'd and is available for download.
Fancy colour alignments
All alignments we generate are coloured in using Leo Goodstadt's excellent CHROMA program. CHROMA is available from here, and it's great. You may experience some problems if you're using a klunky browser. This will be fixed when you change your browser.
Excellent transmembrane domains
These are now calculated using the fine TMHMM2 program (the main web site is here), kindly provided by Anders Krogh and co-workers. You can read about the method here.
Selective SMART
We now store intrinsic features such as transmembrane domains and signal peptides in our database. This means that these can be queried for (e.g. SIGNAL AND TyrKc). The feature names are: signal peptide, SIGNAL; transmembrane domain, TRANS; coiled coil, COIL.
Technical changes
SMART code has been modified to run under the Apache mod_perl module. SMART is now using Apache::DBI for persistent database connections. The database engine has been updated to PostgreSQL 7.1. These changes resulted in significant speed improvements.
Bug fixes as usual
The show_many_proteins script now uses the POST method, so there is no longer a limit on the number of proteins you can display. Fixed a display problem with proteins having thousands of different representations (if you try to display those, make sure you have a good browser (like Mozilla or Opera :-)) and a bunch of RAM).
Changes from version 3.1 to 3.2
Start page
There is now an indicator of the current database status in the header. If the database is down or is being updated, you'll know immediately.
Literature
Changed literature identifiers to PMID. New secondary literature generator that parses all neighbouring papers, not just the first 100.
Numerous small bug fixes and improvements.
Changes from version 3.0 to 3.1
Startup page
The start page now includes selective SMART and allows searching for keywords in the annotation of domains.
Schnipsel Blast
The results of a schnipsel Blast search are included in the bubble diagram. Additionally, PDB is searched.
Taxonomic breakdown
When selecting multiple proteins, e.g. via selective SMART or the annotation pages, an overview of the taxonomy of all species (tax break) is offered.
Links
Links don't contain version numbers. This allows stable links from external sources.
Selective SMART
Allows searching for multiple copies of domains and is case insensitive. You can now search for e.g. Sh3 AND sH3 AND sh3.
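One way to read the behaviour just described is that repeating a term asks for that many copies of the domain, and that names are compared without regard to case. The Python sketch below implements that reading for AND-only queries; it is a simplified illustration with hypothetical names, not SMART's actual query parser.

```python
from collections import Counter


def matches_query(query, domains):
    """Check an AND-only query against a protein's list of domain names.

    Repeating a term asks for that many copies; matching ignores case.
    The " AND " separator itself is expected in upper case here.
    """
    wanted = Counter(term.strip().lower() for term in query.split(" AND "))
    present = Counter(name.lower() for name in domains)
    return all(present[name] >= count for name, count in wanted.items())
```

Under this reading, the query `Sh3 AND sH3 AND sh3` matches only proteins carrying at least three SH3 domains.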
Annotation
You can align your query sequence to the SMART alignment using hmmalign.
Update of underlying database
SMART now uses PostgreSQL 6.5.2.
Changes from version 2.0 to 3.0
Digest output
SMART now only produces a single diagram representing a best interpretation of all the annotation that has been performed. A comprehensive summary of the results is also provided in table format.
Selective SMART
Selective SMART allows you to look for proteins with combinations of specific domains in different species or taxonomic ranges.
Alert SMART
The SMART database gets updated about once a week. If you are interested in specific domains, or combinations of them, in specific taxonomic ranges, you can use the SMART alerting service. This provides the identities of newly deposited proteins that match your query.
Domain queries
You can ask for proteins having the same domain order/composition as your query protein.
SMARTed Genomes
You can get the result of SMARTing the genomes of Caenorhabditis elegans and Saccharomyces cerevisiae via the annotation pages.
Faster PFAM searches
The PFAM searches now run on a PVM cluster.
Changes from version 1.03 to 2.0
Domain coverage
The original set of signalling domains has now been extended to include extracellular domains.
Default search method is now HMMer
The previously used underlying methodology of SMART (i.e. SWise) tended to over-extend gaps, leading to problems in defining domain borders. Although a SWise search of the SMART database remains possible (using the 'Wise searching SMART database' option available from the Home Page), the default searching method is now HMMer2. This now allows improved statistical estimates (Expectation- or E-values) of the significance of a domain hit.
Complete rewrite of the annotation pages
As with the previous version (1.03), annotation pages provide information concerning domain functions and include hyperlinks to PubMed. However, we now offer automatically derived data showing the taxonomic range and the predicted cellular localisation of proteins containing the domain in question.
Literature database
In addition to the manually derived literature sources given in version 1.03, we now provide automatically derived secondary literature from the annotation pages. These are extracted from PubMed and are cross-linked to additional abstracts.
The CATH domain structure database: new protocols and classification levels give a more comprehensive resource for exploring evolution
Lesley H. Greene, Tony E. Lewis, Sarah Addou, Alison Cuff, Tim Dallman, Mark Dibley, Oliver Redfern, Frances Pearl, Rekha Nambudiry, Adam Reid, Ian Sillitoe, Corin Yeats, Janet M. Thornton and Christine A. Orengo
WHAT'S NEW
We report the latest release (version 3.0) of the CATH protein domain database (http://www.cathdb.info). There has been a 20% increase in the number of structural domains classified in CATH, up to 86 151 domains. Release 3.0 comprises 1110 fold groups and 2147 homologous superfamilies. To cope with the increases in diverse structural homologues being determined by the structural genomics initiatives, more sensitive methods have been developed for identifying boundaries in multi-domain proteins and for recognising homologues. The CATH classification update is now being driven by an integrated pipeline that links these automated procedures with validation steps that have been made easier by the provision of information-rich web pages summarising comparison scores and relevant links to external sites for each domain being classified. An analysis of the population of domains in the CATH hierarchy and several domain characteristics are presented for version 3.0. We also report an update of the CATH Dictionary of homologous structures (CATH-DHS), which now contains multiple structural alignments, consensus information and functional annotations for 1459 well-populated superfamilies in CATH. CATH is directly linked to the Gene3D database, which is a projection of CATH structural data onto ∼2 million sequences in completed genomes and UniProt.
GENERAL INTRODUCTION
The number of new structures being deposited in the Protein Data Bank (PDB) continues to grow at a considerable rate. In addition, structures being targeted by worldwide structural genomics initiatives are more likely to be novel or only very remotely related to domains previously classified. Only 2% of structures currently solved by conventional crystallography or NMR are likely to adopt novel folds (see Figures 1 and 2). A higher proportion of new folds is expected to be solved by structural genomics. Although the influx of more diverse structures and subsequent analysis will inform our understanding of how domains evolve, it has resulted in increasing lags between the numbers of structures being deposited and classified. In response to this situation, we have significantly improved our automated and manual protocols for domain boundary assignment and homologue recognition.
Figure 1. Annual decrease in the percentage of new structures classified in CATH which are observed to possess a novel fold. The raw data for years 1972–2005 was fit to a single exponential equation by nonlinear regression using Sigma Plot (SPSS Version
Figure 2. Annual proportion of protein structures deposited in the PDB which are classified in CATH, rejected or pending classification. The colour scheme reflects different categories of PDB chains. Black: not accepted by the CATH criteria. Red: unprocessed chains
Significant changes have been implemented in the CATH classification protocol to achieve a more highly automated system. A seamless flow of structures between the constituent programs has been achieved by building a pipeline which integrates web services for each major comparison stage in the classification (see Figure 3). Secondly, completely automatic decisions are now being made for new protein chains with close relatives already assigned in the CATH database. There are two situations that preclude the CATH update process from being fully automated. We rely on expert manual curation for particularly challenging protein domain boundary assignments (DomChop stage) and also for classifications of remote folds and homologues (HomCheck stage). These two manual stages will remain an integral part of the system (Figure 3).
Figure 3. Flow diagram of the CATH classification pipeline. This schematic illustrates the processes involved in classifying newly determined structures in CATH. The CATH update protocol workflow from new chain to assigned domain is split into two main processes
In this paper we report our ongoing development of the automated procedures. These critical new features should better enable CATH to keep pace with the PDB (3) and facilitate its development. Key statistics on the domain structure populations and characteristics are also presented.
A REVISED CATH CLASSIFICATION HIERARCHY: CATHSOLID
In order to provide information on the sequence diversity between superfamily members, we have introduced additional levels into the CATH hierarchy. The CATH hierarchical classification scheme now consists of nine levels. Class is derived from secondary structure content, and Architecture describes the gross orientation of secondary structures, independent of connectivity. The Topology level clusters structures into fold groups according to their topological connections and numbers of secondary structures. The Homologous superfamilies cluster proteins with highly similar structures, sequences and/or functions (4,5). The new extension of the CATH classification system now includes five 'SOLID' sequence levels: S, O, L and I further divide domains within the H-level using multi-linkage clustering based on similarities in sequence identity (35, 60, 95 and 100%) (see Table 1). The D-level acts as a counter within the I-level and is appended to the classification hierarchy to ensure that every domain in CATH has a unique CATHsolid identification code (see Table 1). Specific details on the nature of the SOLID-levels can be found in the 'General Information' section of the CATH website, http://www.cathdb.info. CATH only includes experimentally determined protein structures with a 4 Å resolution or better, 40 residues in length or longer, and having 70% or more side chains resolved.
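The numeric thresholds in this section can be collected in a short Python sketch. The constant and function names are hypothetical, and the sketch assumes a resolution value is available for the structure (which would not be the case for NMR entries); it only restates the stated cut-offs and is not code from CATH.

```python
# Sequence identity cut-offs (per cent) for the four clustered SOLID levels,
# as listed in the text; the D-level is a counter, not a clustering threshold.
SOLID_IDENTITY_CUTOFFS = {"S": 35, "O": 60, "L": 95, "I": 100}


def meets_cath_criteria(resolution_angstrom, length, fraction_side_chains_resolved):
    """Apply the three inclusion criteria quoted above.

    `resolution_angstrom`: resolution in angstroms (lower is better).
    `length`: number of residues.
    `fraction_side_chains_resolved`: value between 0 and 1.
    """
    return (
        resolution_angstrom <= 4.0
        and length >= 40
        and fraction_side_chains_resolved >= 0.70
    )
```

A 2.5 Å structure of 120 residues with 95% of side chains resolved passes, while a 4.5 Å structure does not.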
Table 1. CATH version 3.0 statistics
DOMAIN BOUNDARY ASSIGNMENTS
We have further improved our automated domain boundary prediction method, CATHEDRAL (4). This is used to search a newly determined multi-domain structure against a library of representative structures from different fold groups in the CATH database to recognise constituent domains. CATHEDRAL performs an initial rapid secondary structure comparison between structures, using graph theory to identify putative fold matches, which are then more carefully aligned using a slower, more accurate dynamic programming method. A new scoring scheme has been implemented which combines information on the class of the domains being compared, their sizes, similarity in the structural environments and the number of equivalent residues. A support vector machine is used to combine the different scores and select the best fold match for each putative domain in a new multi-domain protein.
Benchmarking against a set of 964 'difficult' multi-domain chains, whose 1593 constituent domains were remotely related to folds in CATH (<35% sequence identity) and originated from 245 distinct fold groups and 462 superfamilies, showed that 90% of domains within these chains could be assigned to the correct fold group, and for 78% of them the domain boundaries were within ±15 residues of boundaries assigned by careful manual validation. Larger variations in domain boundaries are often due to the fact that in many families significant structural variation can occur during evolution, so that distant relatives vary considerably in size. If no close relative has been classified in CATH, it is likely that the only CATHEDRAL match will be to a relative with significant structural embellishments, thus making it harder to determine the correct boundaries.
Since domain boundary assignment of remote homologues is one of the most time-consuming stages in the classification, we combine multiple sources of information for each new structure on a web page to guide manual curation. Pages display scores from a range of algorithms, which include structure-based methods, CATHEDRAL (4), SSAP (6,7), DETECTIVE (8), PUU (9), DOMAK (10); sequence-based methods such as hidden Markov models (HMMs) (11); and relevant literature. These pages are now viewable for information on putative boundaries for new multi-domain structures currently
being classified in CATH (e.g. http://www.cathdb.info/cgi-bin/cath/Chain.pl?chain_id=2g3aA).
For protein chains which are closely related to chains that are already chopped in CATH, an automated protocol has been developed (ChopClose). ChopClose identifies any previously chopped chains that have sufficiently high sequence identity and overlap with the query chain. Using SSAP (6), the query is aligned against each of these chains in turn, and in each case the domain boundaries are inherited across the alignment. The process of inheriting the boundaries often requires some adjustments to be made to account for insertions, deletions or unresolved residues. If the inheritance from one of the chains meets various criteria (SSAP score ≥80, sequence identity >80%, RMSD ≤6.0 Å, longest end extension ≤10 residues, etc.), then the resulting boundaries are used to chop the chain automatically (AutoChop). For cases where ChopClose's best result does not meet all the criteria for automatic chopping, it is provided as support information for a manual domain boundary assignment.
Refer to Figure 3 for the location of AutoChop/ChopClose within the CATH update protocol.
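The decision between automatic chopping and manual review can be sketched as a simple predicate over the four thresholds listed above. The function and parameter names are hypothetical, and because the text ends its list with "etc.", the real protocol applies further criteria that are not modelled here.

```python
def can_autochop(ssap_score, sequence_identity, rmsd, longest_end_extension):
    """Apply the four listed thresholds for inheriting domain boundaries.

    `sequence_identity` is a percentage, `rmsd` is in angstroms and
    `longest_end_extension` is a residue count. Further criteria mentioned
    only as "etc." in the text are not covered.
    """
    return (
        ssap_score >= 80
        and sequence_identity > 80
        and rmsd <= 6.0
        and longest_end_extension <= 10
    )
```

A result that fails any one threshold would be passed on as supporting information for manual boundary assignment instead.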
NEW HOMOLOGUE RECOGNITION METHODS
We have assessed a number of HMM-based protocols for improving homologue recognition. A new protocol (Samosa), exploiting models built using multiple structure alignments to improve accuracy, gives some improvements in sensitivity (4–5%). However, a protocol exploiting an 8-fold expanded HMM library based on sequence relatives of structural domains gives an increase of nearly 10% in sensitivity (12). In addition, HMM–HMM-based approaches have been implemented using the PRC protocol of Madera and co-workers (http://supfam.org/PRC). These allow recognition of extremely remote homologues, some of which are not easily detected by the structure comparison methods (discussed further below). HMM-based database scans developed for the CATH classification protocol are collectively referred to as HMMscan below.
For some very remotely related homologues, confidence in an assignment can be improved by combining information from multiple prediction methods. We have investigated the benefits of using machine learning methods to do this automatically. A neural network was trained using a dataset of 14 000 diverse homologues (<35% sequence identity) and 14 000 non-homologous pairs, with data from different homologue comparison methods, including structure comparison (CATHEDRAL, SSAP), sequence comparison (HMM–HMM) and information on functional similarity. The latter was obtained by comparing EC classification codes between close relatives of the distant homologues and using a semantic similarity scoring
scheme for comparing GO terms based on a method developed by Lord et al. (13). On a separate validation set of 14 000 homologous pairs and 14 000 non-homologous pairs, 97% of the homologues can be recognised at an error rate of <4%.
NEW UPDATE PROTOCOL
Automatic methods
Previously, CATH data was generated using a group of independent programs and flat files. Over the past two years we have developed an update protocol for CATH that is driven by a suite of programs with a central library and a PostgreSQL database system. A classification pipeline has been established which links, in a completely automated fashion, the different programs that analyse the sequences and structures of both protein chains and domains. The CATH update protocol can essentially be divided into two parts: domain boundary assignment and domain homology classification (see Figure 3). The aim of the protocol is to minimise manual assignment and provide as much support as possible when manual validation is necessary.
Processing of both parts of the classification protocol is similar, requiring related meta-data and the triggering of the same automated algorithms. Methods include pairwise sequence similarity comparisons and scans by other homologue detection or fold recognition algorithms, such as HMMscan and CATHEDRAL, that provide data for either manual or automated assignment. Many of the automated steps in the protocol have been established as web services, and the pipeline integrates automated steps together with 'holding stages' in which domains are held prior to processing and await the completion of manual validation of predictions (see below).
Web pages to support manual validation
For each manual stage in the classification (domain boundary assignment, DomChop, and homologue recognition, HomCheck) we have developed a suite of web pages bringing together all available meta-data from prediction algorithms (e.g. DBS, CATHEDRAL, HMMscan for DomChop; CATHEDRAL, HMMscan for HomCheck), and information from the literature and from other family classifications with relevant data (e.g. Pfam). For each protein or domain shown on the pages, information on the statistical significance of matches is presented. The web pages will shortly be made viewable and will provide interim data on protein chains and domains not fully classified in CATH, for biologists interested in any entries pending classification.
OVERVIEW OF THE CURRENT RELEASE (VERSION 3.0)
Assigning domain boundaries and relationships between protein structures is computationally challenging. Since the last CATH release (version 2.6), the number of domains in the CATH database has increased by 20% in version 3.0 and now totals 86 151. This is a more than 10-fold increase in the number of domains classified in CATH since its creation. Improvements in automation, and also in the web-based resources used to aid manual validation, have allowed us to increase the proportion of hard-to-classify structures processed in CATH, and this is reflected in a significant increase in the proportion of new folds in the database (now more than 1000). The detailed breakdown of numbers of domains in the nine CATH levels is given in Table 1. We conducted an analysis of the domains in version 3.0 and have derived statistics for several fundamental features.
Percentage of new topologies
An analysis of the percentage of new folds arising from the early 1970s to the present is shown in Figure 1. The number of new folds has been decreasing over time with respect to the number of new structures being deposited, and it can be seen that currently approximately 2% of new structures classified in CATH are observed to be novel folds. For comparison, the number of domain structures solved over time is also graphically represented in Figure 1.
Number of domains within a protein chain
Integral to the construction of the CATH database is designating domain boundaries. We conducted an analysis of the number of chains versus the number of domains in a chain. It is interesting to note that 64% of all protein structures currently solved and classified in CATH are single-domain chains (data not shown). The next most prevalent are two-domain chains (27%), and following this we find that the number of chains containing three or more domains rapidly decreases. The average size of the single-domain chains is 159 residues.
The CATH Dictionary of Homologous Superfamilies (CATH-DHS)
The CATH-DHS has also been recently updated. Data on structural similarity and superfamily variability is presented as a significant update to the Dictionary of Homologous Superfamilies (DHS) web resource (14). The DHS also provides functional annotations of domains within each H-level (superfamily) in CATH v2.5.1.
For each superfamily, pair-wise structural similarity scores between relatives, measured by SSAP, are presented. The DHS now contains 3307 multiple structural alignments for 1459 superfamilies. For each superfamily, multiple alignments are generated for all the relatives and also for subgroups of structurally similar relatives and sequence-similar relatives. Alignments are performed using the residue-based CORA algorithm (15) and presented both as CORAPLOTS (14) and in the form of a 2DSEC diagram (16), alongside co-ordinate data of the superposed structures in PDB format. Sequence representations of the alignments are available to download in FASTA format. In the CORAPLOT images of the multiple alignment, residues in each domain are coloured according to ligand binding and residue type. EquivSEC plots are also shown that describe the variability in orientation and packing between equivalent secondary structures (16).
To identify sequence relatives for CATH superfamilies, sequences from UniProt (17) were scanned against HMMs of all CATH domains (12). Homologous sequences were identified as those hits with an E-value < 0.01 and a 60% residue overlap with the CATH domain. This protocol recognised over one million domain sequences in UniProt, which could be integrated in the CATH-DHS. The harvested sequences in each superfamily were compared against other relatives by BLAST (18) to determine the pair-wise sequence identity and then clustered at appropriate levels of sequence identity (35% and 95%) using multi-linkage clustering. Information and links to other functional databases, ENZYME (19), GO (Gene Ontology Consortium 2000), KEGG (20), COG (21) and SWISSPROT (22), are also included by BLASTing the sequences from each superfamily against sequences provided by these resources. Only hits with 95% sequence identity and an 80% residue overlap were used to annotate sequences.
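The hit-selection rule described above can be sketched in a few lines of Python. This is only an illustration, not the CATH pipeline: the `HmmHit` record, its field names and the example hits are hypothetical, and the thresholds are taken to be an E-value below 0.01 and coverage of at least 60% of the CATH domain.

```python
from dataclasses import dataclass


@dataclass
class HmmHit:
    """Hypothetical record for one HMM match (not a real CATH data structure)."""
    sequence_id: str
    e_value: float
    aligned_residues: int  # residues of the CATH domain covered by the match
    domain_length: int     # full length of the CATH domain


def is_homologue(hit, max_e_value=0.01, min_overlap=0.60):
    """Accept a hit only if it is both significant and covers enough of the domain."""
    overlap = hit.aligned_residues / hit.domain_length
    return hit.e_value < max_e_value and overlap >= min_overlap


hits = [
    HmmHit("seqA", 1e-5, 90, 120),   # significant, 75% overlap
    HmmHit("seqB", 1e-5, 50, 120),   # significant, but only about 42% overlap
    HmmHit("seqC", 0.5, 110, 120),   # good overlap, but E-value too high
]
accepted = [h.sequence_id for h in hits if is_homologue(h)]
print(accepted)
```

Both conditions must hold at once, so a long but insignificant match and a significant but short match are rejected alike.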
Recent analysis of structural and functional divergence in highly populated CATH superfamilies (>5 structural relatives with <35% sequence identity) has been undertaken using data from the DHS. The 2DSEC algorithm was used to analyse multiple structural alignments of families and identify highly conserved structural cores and secondary structure embellishments or decorations to the common core. In some large superfamilies extensive embellishments were observed outside the core, and although these secondary structure insertions were frequently discontinuous in the protein chain, they were often co-located in 3D space (16). In many cases manual inspection revealed that the embellishments had aggregated to form a larger structural feature that was modifying the active site of the domain or creating new surfaces for domain or protein interactions. Data collected in the DHS clearly show a relationship between structural divergence within a superfamily, sequence divergence of this superfamily amongst predicted domains in the genomes, and the number of distinct functional groups that can be identified for the superfamily (see Figure 4).
Figure 4. Relationship between sequence variability, structural variability and functional diversity in CATH superfamilies. Structural variation in a CATH superfamily is measured by the number of diverse structural subgroups (SSAP score < 80 between groups).
LATERAL LINKS ACROSS THE CATH HIERARCHY TO CAPTURE EVOLUTIONARY DIVERGENCE AND EXPLORE THE STRUCTURAL CONTINUUM
Our analysis of structural divergence in CATH superfamilies (16) has revealed families where significant changes in the structures had occurred. In some cases 5-fold differences in the sizes of domains were identified, and sometimes it was apparent that the 'folds' of these very diverse relatives had effectively changed. Therefore, in these superfamilies more than one fold group can be identified, effectively breaking the hierarchical nature of the CATH classification, which implies that each relative within a CATH homologous superfamily should belong to the same CAT fold group.
In addition, an 'all versus all' HMM-HMM scan between all superfamily representatives revealed several cases of extremely remote homologues which had been classified into separate superfamilies and yet match with significant E-values. Structure comparison had failed to detect the relationship between these superfamilies because the structural divergence of the relatives was so extreme, sometimes constituting a change in architecture as well as fold group. In these cases homology was only suggested by the HMM-based scans and then manually validated by considering functional information and detailed evidence from the literature. In order to capture information on these distant homologies, links have been created between the superfamilies, both on our web pages and in the CATH database. The data can now be found as a link from the CATH homepage (http://www.cathdb.info).
In the near future we also plan to provide web pages presenting cases of significant structural overlaps between superfamilies or fold groups. For these cases we are not currently able to find any additional evidence to support a distant evolutionary relationship, and these examples highlight the recurrence of large structural motifs between some folds and the existence of a structural continuum in some regions of fold space.
FEATURES
The CATH database can be accessed at http://www.cathdb.info. The web interface may be browsed or, alternatively, searched with PDB codes or CATH domain identifiers. There is also a facility for keyword searches. With the version 3.0 release we now make the raw and processed data files available, which include, for example, CATH domain PDB files, sequences and dssp files, and they can be accessed through the CATH database main page. The Gene3D resource can be accessed through the CATH database or directly at http://www.cathdb.info/Gene3D. The DHS can be accessed through the CATH database or directly at http://www.cathdb.info/bsm/dhs.
Structural similarity is assessed using an automatic method (SSAP) (3,4), which scores 100 for identical proteins and generally returns scores above 80 for homologous proteins. More distantly related folds generally give scores above 70 (Topology or fold level), though in the absence of any sequence or functional similarity this may simply represent examples of convergent evolution, reinforcing the hypothesis that there exists a limited number of folds in nature (5,6).
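As a rough illustration of how these SSAP score bands might be read, the sketch below maps a score to a label. The function name and the labels are hypothetical, and the cut-offs are the rules of thumb quoted above rather than strict boundaries; a score on its own does not prove homology.

```python
def interpret_ssap(score):
    """Map an SSAP score (0-100) to the rough interpretation given in the text."""
    if not 0 <= score <= 100:
        raise ValueError("SSAP scores range from 0 to 100")
    if score == 100:
        return "identical"
    if score > 80:
        return "likely homologous"
    if score > 70:
        return "similar fold (possibly analogous)"
    return "no clear structural relationship"


for s in (100, 85, 75, 50):
    print(s, interpret_ssap(s))
```

In practice the structural score is combined with sequence identity and functional evidence before a homologous relationship is assigned.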
Abstract
We report the latest release (version 1.4) of the CATH protein domains database (http://www.biochem.ucl.ac.uk/bsm/cath). This is a hierarchical classification of 13 359 protein domain structures into evolutionary families and structural groupings. We currently identify 827 homologous families in which the proteins have both structural similarity and sequence and/or functional similarity. These can be further clustered into 593 fold groups and 32 distinct architectures. Using our structural classification and associated data on protein functions stored in the database (EC identifiers, SWISS-PROT keywords and information from the Enzyme database and literature), we have been able to analyse the correlation between 3D structure and function. More than 96% of folds in the PDB are associated with a single homologous family. However, within the superfolds three or more different functions are observed. Considering enzyme functions, more than 95% of clearly homologous families exhibit either single or closely related functions, as demonstrated by the EC identifiers of their relatives. Our analysis supports the view that determining structures, for example as part of a 'structural genomics' initiative, will make a major contribution to interpreting genome data.
Introduction FOR CATH
The CATH classification of protein domain structures was established in 1993 (1) as a hierarchical clustering of protein domain structures into evolutionary families and structural groupings, depending on sequence and structure similarity. There are four major levels, corresponding to protein class, architecture, topology or fold, and homologous family (Fig. 1). Since 1995, information about these structural groups and protein families has been accessible over the Web (http://www.biochem.ucl.ac.uk/bsm/cath), together with summary information about each individual protein structure (PDBsum) (2).
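A CATH identifier encodes these four levels as four dot-separated numbers, so it can be unpacked level by level. The sketch below is an illustration only: the function and dictionary names are hypothetical, the class numbering (1 to 4) follows CATH's usual convention and is not stated in these notes, and the example is the subtilisin family identifier from these notes written in dotted form (3.40.50.200).

```python
# Assumed class numbering, following the usual CATH convention.
CLASS_NAMES = {
    1: "mainly-alpha",
    2: "mainly-beta",
    3: "alpha-beta",
    4: "few secondary structures",
}


def parse_cath_id(cath_id):
    """Split a dotted CATH id into its C, A, T and H levels."""
    parts = cath_id.split(".")
    if len(parts) != 4 or not all(p.isdigit() for p in parts):
        raise ValueError("expected four dot-separated numbers, e.g. 3.40.50.200")
    c, a, t, h = parts
    return {
        "class": c,
        "class_name": CLASS_NAMES.get(int(c), "unknown"),
        "architecture": f"{c}.{a}",
        "topology": f"{c}.{a}.{t}",
        "homologous_superfamily": f"{c}.{a}.{t}.{h}",
    }


subtilisin = parse_cath_id("3.40.50.200")
print(subtilisin)
```

Each level's identifier is a prefix of the next, which is what makes the classification a strict hierarchy.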
CATH consists of both phylogenetic and phenetic descriptors for protein domain relationships. At the lowest levels in the hierarchy, proteins are grouped into evolutionary families (homologous families) for having either significant sequence similarity (≥35% identity) or high structural similarity and some sequence similarity (≥20% identity).
The CATH database of protein structures contains approximately 18 000 domains organized according to their (C)lass, (A)rchitecture, (T)opology and (H)omologous superfamily. Relationships between evolutionarily related structures (homologues) within the database have been used to test the sensitivity of various sequence search methods in order to identify relatives in GenBank and other sequence databases. Subsequent application of the most sensitive and efficient algorithms, gapped BLAST and the profile-based method Position Specific Iterated Basic Local Alignment Tool (PSI-BLAST), could be used to assign structural data to between 22% and 36% of microbial genomes in order to improve functional annotation and enhance understanding of biological mechanism. However, on a cautionary note, an analysis of functional conservation within fold groups and homologous superfamilies in the CATH database revealed that whilst function was conserved in nearly 55% of enzyme families, function had diverged considerably in some highly populated families. In these families functional properties should be inherited far more cautiously, and the probable effects of substitutions in key functional residues must be carefully assessed.
Figure 1. Schematic representation of the (C)lass, (A)rchitecture and (T)opology/fold levels in the CATH database.
Figure 2. Snapshot of a web page showing data available in the CATH dictionary of homologous superfamilies for the subtilisin family (CATH id 3.40.50.200). Tables display the PDB codes for non-identical relatives in the family, together with EC identifier codes and information about the enzyme reactions. The multiple structural alignment shown has been coloured according to secondary structure assignments (red for helix, blue for strands).
The Architecture level in CATH groups proteins whose folds have similar 3D arrangements of secondary structures (e.g. barrel, sandwich or propeller), regardless of their connectivity, whilst the top level, Class, simply reflects the proportion of α-helix or β-strand secondary structures. Three major classes are recognised, mainly-α, mainly-β and α-β, since analysis revealed considerable overlap between the α+β and alternating α/β classes originally described by Levitt and Chothia (7).
Before classification, multidomain proteins are first separated into their constituent folds using a consensus method which seeks agreement between three independent algorithms (8). Whilst the protocol for updating CATH is largely automatic (9), several stages require manual validation, in particular establishing domain boundaries in proteins for which no consensus could be reached and checking the relationships of very distant homologues and proteins having borderline fold similarity. Although there are plans to assign the more regular architectures automatically, all architecture groupings are currently assigned manually.
A homologous family Dictionary is now available within CATH, which contains functional data, where available, for each protein within a homologous family. This includes EC identifiers, SWISS-PROT keywords and information from the Enzyme database or the literature (Fig. 2). Multiple structure-based alignments are also available, coloured according to secondary structure assignments or residue properties, and there are schematic plots showing domain representations annotated by protein-ligand interactions (DOMPLOTS) (A.E. Todd, C.A. Orengo and J.M. Thornton, submitted to Protein Engng). The topology of each domain is illustrated by schematic TOPS diagrams (http://www3.ebi.ac.uk/tops; 10).
Figure 3. CATH wheel plot showing the population of homologous families in different fold groups, architectures and classes. The wheel is coloured according to protein class (red, mainly-α; green, mainly-β; yellow, αβ; blue, few secondary structures). The size of the outer wheel represents the number of homologous families in CATH, whilst each band in the outer wheel corresponds to a single fold family. The size of each fold 'band' therefore reflects the number of homologous families having that fold. It can be seen that most fold families contain a single homologous family. The superfold families are shown as paler bands containing many homologous families. The inner wheel shows the population of homologous families in the different architectures.
We have also recently set up a Web Server (11) which enables the user to scan the CATH database with a newly determined protein structure and identify possible fold similarities or evolutionary relationships. There are also plans to incorporate sequence searches (using BLAST or PSI-BLAST) (12) to identify a probable fold for a new sequence.
The latest release of CATH (version 1.4, April 1998) contains 9342 protein chains from the PDB (13), which divide into 13 359 domain folds. Currently 32 different architectures are recognised. Since the last release three new architectures have been described, including the five-bladed α-β propeller. Grouping proteins on the basis of sequence, structure and functional similarity gives 827 evolutionary homologous families (H-level), whilst recognising more distant structural similarity with no accompanying sequence or function similarity gives rise to 593 different fold groups (T-level).
The population of the different levels in the CATH hierarchy is illustrated by the CATH wheel shown in Figure 3. It can be seen that several highly populated fold families, which we describe as superfolds (6) as they support a diverse range of sequences and more than three different functions, still account for nearly 30% of non-homologous structures.
Implications for Structural Genomics
As the sequence databases grow rapidly, the need to interpret these sequences and assign functions to specific genes becomes increasingly important. Many techniques exist for matching protein sequences and thereby inheriting functional information. However, for very distant homologues there is often no detectable sequence similarity despite conservation of 3D structure and function. For these cases evolutionary relationships, and thereby functions, can only be assigned by comparing the structures. Therefore a number of structural genomics initiatives are being proposed (14) which aim to identify all the folds in nature, with the ultimate goal of being able to predict the function of a new protein from its known or probable structure. The important questions to ask are how many more folds we need to determine before we have the complete set, and how confident we can be in assigning function between proteins having similar structures.
In the current genomes, on average only 30-46% of sequences can be assigned to a structural family by recognising sequence similarity to a protein of known structure (15,16). With only ~600 unique structures currently in the PDB, compared with ~20 000 sequence families, it is clear that we still need to determine many more structures if we are to understand biology at the molecular level. However, analysis of recently deposited structural data is very revealing. Figure 4a illustrates the distribution of 2159 new structural domains classified in the 10 months from June 1997 to March 1998. A large proportion of these (79%) were clearly homologous (≥30% identity) to proteins of known structure.
Of the remaining 443 structures (Fig. 4b), corresponding to 'new' sequences, we found that only 8% were novel folds, the remainder resembling a previously determined structure. Many of these, 199 (45%), could be identified as clear homologues by having significant structure and sequence similarity (SSAP ≥80 and ≥20% sequence identity). A further 169 (38%) were probable homologues: although the sequence identity was below 20%, they had functional similarity and/or gave significant scores using sequence search methods designed to detect very distant homologues (PSI-BLAST) (12). There remained a further 40 (9%) proteins which were analogous, i.e. they had the same fold as a previous entry, but neither the sequence nor the function gave definite evidence of a common ancestor.
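The percentages quoted for this breakdown can be checked with a little arithmetic. The sketch below uses only the counts given in the text (2159 new domains, 443 with new sequences, 199 clear homologues, 169 probable homologues, 40 analogues). The number of novel folds is not stated as a count, so it is derived here by subtraction on the assumption that the four categories are exhaustive; the variable names are illustrative.

```python
# Counts quoted in the text for June 1997 to March 1998.
total_domains = 2159
new_sequences = 443
counts = {"clear homologue": 199, "probable homologue": 169, "analogue": 40}

# Assumption: the categories are exhaustive, so novel folds are the remainder.
counts["novel fold"] = new_sequences - sum(counts.values())


def percent(part, whole):
    """Percentage rounded to the nearest whole number, as in the text."""
    return round(100 * part / whole)


shares = {label: percent(n, new_sequences) for label, n in counts.items()}
homologous_share = percent(
    counts["clear homologue"] + counts["probable homologue"], new_sequences
)
known_sequence_share = percent(total_domains - new_sequences, total_domains)

print(shares)
print(homologous_share, known_sequence_share)
```

Under that assumption the remainder is 35 structures, which rounds to the quoted 8% of novel folds, and the clear and probable homologues together account for about 83% of the structures with new sequences.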
At the homologous superfamily level in CATH, a more detailed analysis of enzyme functions showed that the majority of homologous enzyme families in CATH (>90%) contained proteins for which the first three EC identifiers were the same. Considering those families where homologues have significant sequence identity (≥20%) after structural alignment, 95% were found to have a single EC identifier, whilst for families where proteins have more than 30% sequence similarity we observed that 98% had a single EC code.
Although assigning function on the basis of homology is common practice, it is clear that some caution should be exercised, particularly where there is little or no sequence similarity. There are also some clear examples where homologues with significant sequence similarity perform different functions. The role of 'gene recruitment' is especially clear in the eye lens proteins, which function as enzymes in other cellular environments but which are used as structural proteins in this context (17). The extent of such 'gene recruitment' and context-sensitive function is really not known at this time. For enzymes it is clear that catalytic function can change and evolve, usually to act on a different but related substrate. Similarly, within the lipocalin family (CATH id 2.40.130.10) several proteins are found with very similar structures which bind different fatty acids in the same region at the base of the β-barrel (e.g. retinol, bilin, biotin).
Nearly half of the homologous families where two or more different EC numbers were observed belong to the superfolds. This suggests that if a new protein is assigned to a superfold family, more caution should be used when inheriting functional information, as there appears to be greater tolerance to changes in sequence, and ultimately function, for these families. However, it is interesting to note that many of these were TIM barrel or Rossmann folds. These are superfolds in which the substrate or ligand commonly binds in the same place: in the base of the β-barrel for the TIMs and at the crossover of the polypeptide chain for the doubly wound Rossmann structures.
Assignment of Function Through Structure
One of the reasons for determining structures is to derive more information to facilitate the assignment of function. From our analysis of proteins in CATH we suggest that structural data can help to assign function in several ways.
i. The structural data allow recognition of more distant homologues compared with sequence data; in our analysis 83% of structures with novel sequences could be assigned as homologues in this way (note that such assignment of function is again subject to the caveats imposed by 'gene recruitment' discussed above).
ii. The structural data allow detailed inspection of the functional site, to suggest if and how the function may have evolved. For example, if an enzyme has evolved to act on a different substrate, the binding site may reveal or at least suggest possible changes in the substrate.
iii. For the superfolds, similarity of structure does not necessarily mean similarity of function. However, the active site/binding sites are often conserved, e.g. in the TIM barrel or Rossmann fold structures the ligand always binds at the same end of the barrel or sheet.
iv. Some methods have already been developed, and will increasingly be the focus of attention over the next few years, which aim to predict function ab initio from structure. For example, enzymes can often be identified by the presence of a major cleft, which also locates the active site (18). Similarly, critical surface patches which are used for molecular recognition in binding other proteins or ligands may be identified using knowledge-based approaches (19,20).
In summary, extrapolating the data from Figure 4 to a new genome, we can expect that of the 54-70% of sequences which currently have no obvious sequence matches in the PDB, we will find nearly 80-90% to be homologous to a known family using the structural data alone. For the singlet folds this will almost certainly reveal some clues to the function. For the superfolds, some folds will reveal information on the functional class (e.g. enzyme for TIM barrels) or the location of the active site, if not the specific function. Only 10-20% will be expected to be 'novel' folds. For these, the ab initio methods referred to above may provide some clues to guide experiments. Therefore it is clear that determining structures, as part of a 'structural genomics' initiative for example, will make a major contribution to interpreting genome data.
Jail
Just another interface library
Interfaces of macromolecules are a valuable basis to analyse the process of molecular recognition. JAIL classifies not only the interfaces between domain architectures but also those between protein chains and those between proteins and nucleic acids.
Of course not
Gt-alpha/Gi-alpha chimera (PDB-ID 1GOT); interfaces of 1GOT
Interacting proteins are difficult to crystallize and are rarely present within the Protein Data Bank. Nevertheless, it is essential to analyse the interacting parts of the proteins to understand the process of protein-protein docking.
To overcome this problem we have built up the JAIL database. Since interacting domains exhibit structural features similar to those of interacting proteins, all known interfaces between interacting domains of the SCOP database were extracted and classified in JAIL.
Only some of the known protein structures are included in SCOP; in particular, new PDB entries are not yet annotated. To overcome this problem, all interfaces between protein chains were additionally calculated and included in the database. This type of interface also comprises the interacting parts of the assumed biological units. The last important type of interface provided here is composed of the interacting parts between proteins and nucleic acids. Overall, the data set consists of about 180 000 interfaces.
JAIL is a convenient tool for browsing the interface library and analysing single interfaces. However, more general questions require large-scale analysis. For this purpose a detailed form enables the compiling of comprehensive non-redundant data sets for download.
How is an interface defined?
A complete residue is part of an interface if at least one atom of the amino acid is located within 4.5 Ångström of any atom of the interacting domain or chain. One part of an interface must consist of at least 5 C-alpha atoms in the case of protein chains. In the case of nucleic acids, the relevant atoms are calculated on the basis of the phosphorus atoms of the RNA/DNA backbone.
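The distance rule above can be sketched as a brute-force search over atom pairs. This is only an illustration, not JAIL's own code: the `Atom` record, the function names and the toy coordinates are hypothetical, the cutoff is taken as 4.5 Å, and one C-alpha atom per residue is assumed when applying the size rule.

```python
import math
from collections import namedtuple

# Hypothetical minimal atom record: chain id, residue number, atom name, coordinates in Angstrom.
Atom = namedtuple("Atom", "chain residue name x y z")


def interface_residues(atoms_a, atoms_b, cutoff=4.5):
    """Residues of side A with at least one atom within `cutoff` of any atom of side B."""
    found = set()
    for a in atoms_a:
        if (a.chain, a.residue) in found:
            continue  # the whole residue already counts as interface
        for b in atoms_b:
            if math.dist((a.x, a.y, a.z), (b.x, b.y, b.z)) <= cutoff:
                found.add((a.chain, a.residue))
                break
    return found


def meets_size_rule(residues, min_ca_atoms=5):
    """Size rule for protein chains, assuming one C-alpha atom per residue."""
    return len(residues) >= min_ca_atoms


chain_a = [
    Atom("A", 1, "CA", 0.0, 0.0, 0.0),
    Atom("A", 1, "CB", 1.5, 0.0, 0.0),
    Atom("A", 2, "CA", 10.0, 0.0, 0.0),
]
chain_b = [Atom("B", 7, "CA", 4.0, 0.0, 0.0)]

side_a = interface_residues(chain_a, chain_b)
side_b = interface_residues(chain_b, chain_a)
print(side_a, side_b, meets_size_rule(side_a))
```

A production implementation would normally use a spatial index (for example a k-d tree) instead of comparing every atom pair, but the selection criterion is the same.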
What are biounits?
The primary coordinate file deposited in the PDB generally contains one asymmetric unit. The asymmetric unit is the smallest portion of a crystal structure to which crystallographic symmetry can be applied to generate one cell. The biological molecule (biounit) is believed to be the functional unit of the protein. Frequently those units can be assumed or calculated when additional information is available. The biological units of many proteins are deposited in a separate section of the PDB database and can be used for interface calculations.
In which way are redundant interfaces excluded in the download section?
Redundancy is excluded in two different ways: by structure and by sequence. The sequence clustering is based on the CD-HIT program. The structural clustering is defined by the protein families and superfamilies of the SCOP classification. The database classifies proteins by domain architecture.
Which settings in the download section are best for my own research?
The selection of the datasets depends on the type of interactions (protein-protein or protein-nucleic acid) and the level of diversity that is desired. A maximum sequence identity of 50% results in a higher diversity than a setting of 95%. The default settings include interfaces of domain-domain interactions as well as interfaces between interacting chains. All interfaces of chains that are already covered by the SCOP domain interfaces are excluded by default. This procedure results in a high number of interfaces that are still diverse enough for statistical analysis.
What is meant by 'show conservation' in Jmol?
The conservation of protein sequences is defined by the mutation rates at each amino acid position. For JAIL this information was retrieved from ConSurf. ConSurf is a derived database merging structural and sequence information.
Database scheme
Search for:
PDB-Id, e.g. 1aay
SCOP-Id, e.g. d1az0a_
EC-Number, e.g. 3.1.1
Accession number, e.g. P03697
Protein name, e.g. capsid protein
Search in the following interface types: Domain/Domain (SCOP), Chain/Chain, Protein/Nucleic, Biounit/Biounit, None
Full-text search and SCOP text search: by keyword
Consider only interfaces having the following location: intra, inter, don't care
MMDB
Experimentally resolved structures of proteins, RNA and DNA, derived from the Protein Data Bank (PDB), with value-added features such as explicit chemical graphs, computationally identified 3D domains (compact substructures) that are used to identify similar 3D structures, as well as links to literature, similar sequences, information about chemicals bound to the structures, and more. These connections make it possible, for example, to find 3D structures for homologs of a protein sequence of interest, then interactively view the sequence-structure relationships, active sites, bound chemicals, journal articles, and more.
Three-dimensional structures are now known within many protein families, and it is quite likely, in searching a sequence database, that one will encounter a homolog with known structure. The goal of Entrez's 3D-structure database is to make this information, and the functional annotation it can provide, easily accessible to molecular biologists. To this end Entrez's search engine provides three powerful features: (i) Sequence and structure neighbors: one may select all sequences similar to one of interest, for example, and link to any known 3D structures. (ii) Links between databases: one may search by term matching in MEDLINE, for example, and link to 3D structures reported in these articles. (iii) Sequence and structure visualization: identifying a homolog with known structure, one may view molecular-graphic and alignment displays to infer approximate 3D structure. In this article we focus on two features of Entrez's Molecular Modeling Database (MMDB) not described previously: links from individual biopolymer chains within 3D structures to a systematic taxonomy of organisms represented in molecular databases, and links from individual chains (and compact 3D domains within them) to structure neighbors, other chains (and 3D domains) with similar 3D structure. MMDB may be accessed at http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Structure
SUPERFAMILY is a database of structural and functional annotation for all proteins and genomes.[1][2][3][4][5] The SUPERFAMILY annotation is based on a collection of hidden Markov models which represent structural protein domains at the SCOP superfamily level.[6] A superfamily groups together domains which have an evolutionary relationship. The annotation is produced by scanning protein sequences from completely sequenced genomes against the hidden Markov models.
For each protein you can
Submit sequences for SCOP classification
View domain organisation sequence alignments and protein sequence details
For each genome you can
Examine superfamily assignments phylogenetic trees domain organisation lists and
networks
Check for over- and under-represented superfamilies within a genome
For each superfamily you can
Inspect SCOP classification, functional annotation, Gene Ontology annotation, InterPro abstract and genome assignments
Explore taxonomic distribution of a superfamily across the tree of life
All annotation models and the database dump are freely available for download to everyone
Purpose
SUPERFAMILY classifies amino acid sequences into known structural domains especially
into SCOP superfamilies The superfamilies are groups of proteins which have structural
evidence to support a common evolutionary ancestor but may not have detectable
sequence homology
Major Features
Sequence search: Submit your protein or DNA sequence for SCOP superfamily and family level classification.
Keyword search: Search for superfamily, family or species names, plus sequence, SCOP, PDB or hidden Markov model IDs.
Domain assignments: Domain assignments, alignments and architectures for completely sequenced eukaryotic and prokaryotic organisms, plus sequence collections.
Comparative genomics tools: Browse unusual (over- and under-represented) superfamilies and families, adjacent domain pair lists and graphs, unique domain pairs, domain combinations, domain architecture co-occurrence networks and domain distribution across taxonomic kingdoms for each organism.
Genome statistics: For each genome: number of sequences, number of sequences with assignment, percentage of sequences with assignment, percentage total sequence coverage, number of domains assigned, number of superfamilies assigned, number of families assigned, average superfamily size, percentage produced by duplication, average sequence length, average length matched, number of domain pairs and number of unique domain architectures.
Gene Ontology: Domain-centric Gene Ontology (GO), automatically annotated by Hai Fang.
Phenotype Ontology: Domain-centric phenotype/anatomy ontology, including Disease Ontology, Human Phenotype, Mouse Phenotype, Worm Phenotype, Yeast Phenotype, Fly Phenotype, Fly Anatomy, Zebrafish Anatomy, Xenopus Anatomy and Arabidopsis Plant.
Superfamily annotation: InterPro abstracts for 1052 superfamilies and Gene Ontology (GO) annotation for 763 superfamilies.
Functional annotation: Functional annotation of SCOP 1.73 superfamilies by Christine Vogel.
Phylogenetic trees: Trees are generated using heuristic parsimony methods and are based on protein domain architecture data for all genomes in SUPERFAMILY. Genome combinations or specific clades can be displayed as individual trees.
Similar domain architectures: Find the 10 domain architectures which are most similar to a domain architecture of interest.
Hidden Markov models: Produce SCOP domain assignments for your sequences using the SUPERFAMILY models. HMM visualisation by Martin Madera, e.g. model 0045110.
Profile comparison: Find remote domain matches when the HMM search fails to find a significant match. Profile comparison (PRC), for aligning and scoring two profile hidden Markov models, by Martin Madera.
7272019 Essential Info Notes-1
httpslidepdfcomreaderfullessential-info-notes-1 5757
Web services Distributed Annotation Server and linking to SUPERFAMILY
Downloads Sequences assignments models MySQL database and scripts - updatedweekly
The SUPERFAMILY database provides protein domain assignments, at the SCOP superfamily
level, for the predicted protein sequences in over 400 completed genomes. A superfamily groups
together domains of different families which have a common evolutionary ancestor, based on
structural, functional and sequence data. SUPERFAMILY domain assignments are generated
using an expert curated set of profile hidden Markov models. All models and structural
assignments are available for browsing and download from http://supfam.org. The web interface
includes services such as domain architectures and alignment details for all protein assignments,
searchable domain combinations, domain occurrence network visualization, detection of over- or
under-represented superfamilies for a given genome by comparison with other genomes,
assignment of manually submitted sequences and keyword searches. In this update we describe
the SUPERFAMILY database and outline two major developments: (i) incorporation of family
level assignments and (ii) a superfamily-level functional annotation. The SUPERFAMILY
database can be used for general protein evolution and superfamily-specific studies, genomic
annotation, and structural genomics target suggestion and assessment.
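The idea of flagging over- or under-represented superfamilies "by comparison with other genomes" can be illustrated with a very simple statistic. The sketch below is not SUPERFAMILY's actual method, which the text does not describe. It assumes a plain z-score of raw domain counts against the other genomes, ignores genome size, and uses made-up genome names, superfamily names and counts.

```python
from statistics import mean, pstdev
from typing import Dict

# genome -> superfamily -> number of assigned domains (toy data structure)
Counts = Dict[str, Dict[str, int]]


def representation_zscores(counts: Counts, target: str) -> Dict[str, float]:
    """Z-score of each superfamily count in `target` against all other genomes."""
    others = [genome for genome in counts if genome != target]
    superfamilies = set()
    for per_genome in counts.values():
        superfamilies.update(per_genome)

    zscores = {}
    for superfamily in sorted(superfamilies):
        background = [counts[g].get(superfamily, 0) for g in others]
        spread = pstdev(background)
        if spread == 0:
            continue  # no variation among the other genomes: z-score undefined
        observed = counts[target].get(superfamily, 0)
        zscores[superfamily] = (observed - mean(background)) / spread
    return zscores


# Hypothetical counts, for illustration only.
toy_counts = {
    "genome_X": {"sf_kinase": 30, "sf_globin": 2},
    "genome_A": {"sf_kinase": 10, "sf_globin": 10},
    "genome_B": {"sf_kinase": 12, "sf_globin": 8},
    "genome_C": {"sf_kinase": 8, "sf_globin": 12},
}
z = representation_zscores(toy_counts, "genome_X")
over = [sf for sf, score in z.items() if score > 2]    # over-represented
under = [sf for sf, score in z.items() if score < -2]  # under-represented
```

The cutoff of 2 standard deviations is an arbitrary choice for the example; a real analysis would also normalise for genome size.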
Protein Sequence Databases - Protein Structure Databases - Protein Domains, Motifs
and Signatures - Others
Protein Sequence Databases
Antibodies Sequence and Structure
BRENDA Enzyme database
CD Antigens Database of CD antigens
dbCFC Cytokine family database
Histones Histone sequence database
HPRD Human protein reference database
InterPro Integrated documentation resources for protein families
iProClass An integrated protein classification database
KIND A non-redundant protein sequence database
MHCPEP Database of MHC binding peptides
MIPS Munich information centre for protein sequences
PIR Annotated and non-redundant protein sequence database
PIR-ALN Curated database of protein sequence alignments
PIR-NREF PIR non-redundant reference protein database
PMD Protein mutant database
PRF Protein research foundation Japan
ProClass Non-redundant protein database
ProtoMap Hierarchical classification of Swiss-Prot proteins
REBASE Restriction enzyme database
RefSeq Reference sequence database at NCBI
SwissProt Curated protein sequence database
SPTR Comprehensive protein sequence database
Transfac Transcription factor database
TrEMBL Annotated translations of EMBL nucleotide sequences
Tumor gene database Genes with cancer-causing mutations
WD repeats WD-repeat family of proteins
Protein Structure Databases
Cath Protein structure classification
HIV Protease HIV protease database 3D structure
PDB 3-D macromolecular structure data
PSI Protein structure initiative
S2F Structure to function project
Scop Structural Classification of Proteins
Protein Domains, Motifs & Signatures
BLOCKS Multiple aligned segments of conserved protein regions
CDD Conserved domain database and search service
DOMO Homologous protein domain families
Pfam Database of protein domains and HMMs
ProDom Protein domain database
Prints Protein motif fingerprint database
Prosite Database of protein families and domains
SMART Simple modular architecture research tool
TIGRFAM Protein families based on HMMs
Others
Phospho Site Database of phosphorylation sites
PROW Protein reviews on the web
Protein Lounge Complete systems biology
Other Databases
Carbohydrate Databases
Carb DB Carbohydrate Sequence and Structure Database
GlycoWord Glycoscience related information
SPECARB Raman Spectra of carbohydrates
Other Databases
AlzGene Alzheimer's disease
Polygenic pathways Alzheimer's disease, Bipolar disorder or Schizophrenia
Model Organism Databases and Resources
Arabidopsis - Bacteria - Sea Bass - Cat - Cattle - Chicken - Cotton - Cyanobacteria -
Daphnia - Deer - Dictyostelium - Dog - Frog - Fruit Fly - Fungus - Goat - Horse -
Medaka Fish - Maize - Malaria - Mosquito - Mouse - Pig - Plants - Protozoa - Puffer Fish -
Rat - Rice - Rickettsia - Salmon - Sheep - Soy - Sorghum - Tetraodon - Tilapia - Turkey -
Viruses - Worm - Yeast - Zebra Fish
General Information
GMOD Generic Model Organism Database
Model Organisms The WWW virtual library of model organisms
Arabidopsis thaliana
ABRC Arabidopsis biological resource center
AGI Arabidopsis genome initiative
AREX Arabidopsis gene expression database
Arabinet Arabidopsis information on the www
AtGDB An Arabidopsis thaliana plant genome database
AtGI TIGR Arabidopsis thaliana gene index
ATGC Genome sequencing at ATGC
ATIDB Arabidopsis insertion database
CSHL Arabidopsis genome analysis at Cold Spring
ESSA Arabidopsis thaliana project at MIPS
Genoscope AGI in France
Kazusa Arabidopsis thaliana genome info Japan
MPSS Massively parallel signature sequencing
NASC Nottingham Arabidopsis stock center
Stanford Sequencing of the Arabidopsis genome at Stanford
TAIR Arabidopsis information resource
TIGR TIGR Arabidopsis genome annotation database
Wustl Arabidopsis genome at Washington university
Trees A forest tree genome database
Bacterial genomes
B. subtilis Bacillus subtilis database
Chlamydomonas Chlamydomonas genetics center
E. coli E. coli genome project
MGD Microbial germ plasm database
Microbial Microbial Genome Gateway
Microbial Microbial genomes
Micado Genetic maps of B. subtilis and E. coli
MycDB An integrated Mycobacterial database
Neisseria Neisseria meningitidis genome
Neurospora Neurospora crassa database
OralGen Oral pathogen database
Salmonella Salmonella information
STDGen Sexually transmitted disease database
Bass
Bass Sea Bass Mapping project
Cat (Felis catus)
Cat ArkDB Cat mapping database
Cattle (Bos taurus)
ARK Farm animals
BoLA Bovine MHC information
Bovin Bovine genome database
BovMap Mapping the bovine genome
CaDBase Genetic diversity in cattle
ComRad Comparative radiation hybrid mapping
Cow ArkDB Bovine ArkDB
GemQual Genetics of meat quality
Chicken (Gallus gallus)
Chicken Poultry gene mapping project
ChickMap Chicken genome project
Chicken ArkDB Chicken database
ChickEST Chick EST database
Poultry Poultry genome project
Cotton
Cotton Cotton data collection site
Cyano Bacteria (Blue green algae)
Cyano Bacteria Anabaena genome
Daphnia (Crustacea)
Daphnia pulex Daphnia genomics consortium
Deer
Deer ArkDB Deer mapping database
Dictyostelium discoideum
Dicty_cDB Dictyostelium discoideum cDNA project
DGP Dictyostelium discoideum genome project
Dictybase Online informatics resources for Dictyostelium
Dog (Canis familiaris)
Dog Dog genome project
Dog genome project
Frog (Xenopus)
Xenbase A Xenopus web resource
Xenopus Xenopus tropicalis genome
Fruit fly (Drosophila melanogaster)
ENSEMBL Drosophila Genome Browser at ENSEMBL
Fruitfly Drosophila genome project at Berkeley
FlyBase A Database of the Drosophila Genome
FlyMove A Drosophila multimedia database
FlyView A Drosophila image database
Fungus
Aspergillus Aspergillus Genomics
Candida Candida albicans information page
FungalWeb Fungi database
FGSC Fungal genetic stocks center
Goat (Capra hircus)
Goat GoatMap mapping the caprine genome
Horse (Equus caballus)
Horse ArkDB Horse mapping database
Medaka Fish
Medaka Medaka fish home page
Maize
Maize Maize genome database
Malaria (Plasmodium spp)
Malaria Malaria genetics and genomics
PlasmoDB Plasmodium falciparum genome database
Parasites Parasite databases of clustered ESTs
Parasite Genome Parasite genome databases
Mosquito
Mosquito Mosquito genome web server
Mouse (Mus musculus)
ENSEMBL Mouse genome server at ENSEMBL
Jackson Lab Mouse Resources
MRC Mouse genome center at MRC UK
MGI Mouse genome informatics at Jackson Labs
MGD Mouse genome database
MGS Mouse genome sequencing at NIH
MIT Genetic and physical maps of the mouse genome
Mouse SNP Mouse SNP database
NCI Mouse repository
NIH NIH mouse initiative
ORNL Mutant mouse database
RIKEN Mouse resources
Rodentia The whole mouse catalog
Pig (Sus scrofa)
INCO Pig trait gene mapping
Pig Pig EST database
Pig Pig gene mapping project
PiGBase Pig genome mapping
Pig ArkDB Pig Ark DB
Plants
PlantGDB Resources for plant comparative genomics
Protozoa
Protozoa Protozoan genomes
Pufferfish
Fugu Puffer fish project UK site
Fugu Fugu genome project Singapore
Fugu Puffer fish project USA
Rat (Rattus norvegicus)
MIT Genetic maps of the Rat genome
NIH Rat genomics and genetics
Rat RatMap
RGD Rat genome database
Rice (Oryza sativa)
MPSS Massively parallel signature sequencing
Rice-research Rice genome sequence database
Rice Rice genome project
Rickettsia
RicBase Rickettsia genome database
Salmon
Salmon ArkDB Salmon mapping database
Sheep (Ovis aries)
Sheep Sheep gene mapping
SheepBase Sheep gene mapping
Sheep ArkDB Sheep mapping database
Soy
Soy Soybeans database
Sorghum
Sorghum Sorghum Genomics
Tetraodon
Tetraodon Tetraodon nigroviridis genome
Tetraodon Tetraodon nigroviridis genome at Whitehead
Tilapia
HCGS Tilapia genome
Tilapia ArkDB Tilapia mapping database
Turkey
Turkey ArkDB Turkey mapping database
Viruses
HIV HIV sequence database
Herpes Human herpes virus 5 database
Worm (Caenorhabditis elegans)
C. elegans C. elegans genome sequencing project
NemBase Resource for nematode sequence and functional data
WormAtlas Anatomy of C. elegans
WormBase The genome and biology of C. elegans
ACEDB A C. elegans database
WWW Server C. elegans web server
Yeast
SCPD The promoter database of Saccharomyces cerevisiae
SGD Saccharomyces genome database
S. pombe Schizosaccharomyces pombe genome project
TRIPLES Functional analysis of Yeast genome at Yale
Yeast Intron database Spliceosomal introns of the yeast
Zebra fish (Danio rerio)
ZFIN Zebrafish information network
ZGR Zebrafish genome resources
ZIS Zebrafish information server
Zebrafish Zebrafish webserver
DOMAIN DATABASE
Domains can be thought of as distinct functional and/or structural units of a
protein. These two classifications coincide rather often, as a matter of fact, and what is
found as an independently folding unit of a polypeptide chain often also carries a specific
function. Domains are often identified as recurring (sequence or structure) units,
which may exist in various contexts. In molecular evolution such domains may have
been utilized as building blocks, and may have been recombined in different
arrangements to modulate protein function. We can define conserved domains as
recurring units in molecular evolution, the extents of which can be determined by
sequence and structure analysis. Conserved domains contain conserved
sequence patterns or motifs, which allow for their detection in polypeptide sequences.
The goal of the NCBI conserved domain curation project is to provide database users with
insights into how patterns of residue conservation and divergence in a family relate to functional
properties, and to provide useful links to more detailed information that may help to understand
those sequence/structure/function relationships. To do this, CDD curators include the following
types of information in order to supplement and enrich the traditional multiple sequence
alignments that form the foundation of domain models: 3-dimensional structures and conserved
core motifs, conserved features/sites, phylogenetic organization, and links to electronic
literature resources.
CDD
Conserved Domain Database (CDD)
CDD is a protein annotation resource that consists of a collection of well-annotated multiple
sequence alignment models for ancient domains and full-length proteins. These are available as
position-specific score matrices (PSSMs) for fast identification of conserved domains in protein
sequences via RPS-BLAST. CDD content includes NCBI manually curated domains, which use
3D-structure information to explicitly define domain boundaries and provide insights into
sequence/structure/function relationships, as well as domain models imported from a number
of external source databases (Pfam, SMART, COG, PRK, TIGRFAM).
CD-Search & Batch CD-Search
CD-Search is NCBI's interface to searching the Conserved Domain Database with protein or
nucleotide query sequences. It uses RPS-BLAST, a variant of PSI-BLAST, to quickly scan a set of
pre-calculated position-specific scoring matrices (PSSMs) with a protein query. The results of
CD-Search are presented as an annotation of protein domains on the user query sequence, and
can be visualized as domain multiple sequence alignments with embedded user queries.
High-confidence associations between a query sequence and conserved domains are shown as
specific hits. The CD-Search Help provides additional details, including information about
running CD-Search locally.
Batch CD-Search serves as both a web application and a script interface for a conserved domain
search on multiple protein sequences, accepting up to 100000 proteins in a single job. It enables
you to view a graphical display of the concise or full search result for any individual protein
from your input list, or to download the results for the complete set of proteins. The Batch
CD-Search Help provides additional details.
CDART: Domain Architectures
Conserved Domain Architecture Retrieval Tool (CDART) performs similarity searches of the
Entrez Protein database based on domain architecture, defined as the sequential order of
conserved domains in protein queries. CDART finds protein similarities across significant
evolutionary distances using sensitive domain profiles rather than direct sequence similarity.
Proteins similar to the query are grouped and scored by architecture. You can search CDART
directly with a query protein sequence or, if a sequence of interest is already in the Entrez
Protein database, simply retrieve the record, open its Links menu and select Domain Relatives
to see the precalculated CDART results. Relying on domain profiles allows CDART to be fast
and, because it relies on annotated functional domains, informative.
CDTree
CDTree is a helper application for your web browser that allows you to interactively view and
examine conserved domain hierarchies curated at NCBI. CDTree works with Cn3D as its
alignment viewer/editor; it is used in the CDD curation process and is both a classification and
research tool for functional annotation and the study of protein and protein domain families.
Content
CDD content includes NCBI manually curated domain models and domain models imported from
a number of external source databases (Pfam, SMART, COG, PRK, TIGRFAMs). What is unique
about NCBI-curated domains is that they use 3D-structure information to explicitly define domain
boundaries, align blocks, amend alignment details, and provide insights into
sequence/structure/function relationships. Manually curated models are organized hierarchically if
they describe domain families that are clearly related by common descent. To provide a
non-redundant view of the data, CDD clusters similar domain models from various sources into
superfamilies.
Searching the database
The collection is also part of NCBI's Entrez query and retrieval system, crosslinked to numerous
other resources. CDD provides annotation of domain footprints and conserved functional
sites on protein sequences. Precalculated domain annotation can be retrieved for protein
sequences tracked in NCBI's Entrez system, and CDD's collection of models can be
queried with novel protein sequences via the CD-Search service, or via the Batch CD-Search
service, which allows the computation and download of annotation for large sets of protein
queries. CDD also contains data from additional research projects, such as KOGs (a
eukaryotic counterpart to COGs) and the Library of Ancient Domains (LOAD).
Accessions that start with cl are for superfamily cluster records and can contain
domain models from one or more source databases.
When searching CDD, it is possible to limit search results to domains from any
given source database by using the Database Search Field.
Phylogenetic organization: Based on evidence from sequence comparison, NCBI
Conserved Domain Curators attempt to organize related domain models into
phylogenetic family hierarchies.
Links to electronic literature resources: NCBI curated domains also provide links to
citations in PubMed and NCBI Bookshelf that discuss the domain. These references
are selected by curators and, whenever possible, include articles that provide
evidence for the biological function of the domain and/or discuss the evolution and
classification of a domain family.
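The accession convention mentioned above (records starting with cl are superfamily clusters) and the idea of limiting hits to one source database can be sketched in a few lines. This is only an illustration. The cl rule comes from the text, but the prefix table for the source databases is an assumption made for the example, and all accessions shown are made up.

```python
from typing import Iterable, List

# From the text: accessions starting with "cl" are superfamily cluster records.
CLUSTER_PREFIX = "cl"

# ASSUMPTION: illustrative accession prefixes for some source databases.
# Check real CDD records before relying on this table.
SOURCE_PREFIXES = {
    "Pfam": "pfam",
    "SMART": "smart",
    "COG": "COG",
    "PRK": "PRK",
    "TIGRFAM": "TIGR",
}


def is_superfamily_cluster(accession: str) -> bool:
    """True if the accession denotes a superfamily cluster record."""
    return accession.startswith(CLUSTER_PREFIX)


def limit_to_source(accessions: Iterable[str], source: str) -> List[str]:
    """Keep only accessions whose prefix matches the chosen source database."""
    prefix = SOURCE_PREFIXES[source]
    return [acc for acc in accessions if acc.startswith(prefix)]


# Made-up accessions, for illustration only.
hits = ["cl00001", "pfam00001", "smart00002", "COG0003", "cl00004"]
clusters = [acc for acc in hits if is_superfamily_cluster(acc)]
pfam_only = limit_to_source(hits, "Pfam")
```

In the real service this filtering is done by the Database Search Field; the helper here only mimics the effect on a list of accessions already in hand.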
PFAM
Pfam 27.0 (Mar 2013, 14831 families)
Proteins are generally comprised of one or more functional regions, commonly termed domains.
The presence of different domains in varying combinations in different proteins gives rise to
the diverse repertoire of proteins found in nature. Identifying the domains present in a protein
can provide insights into the function of that protein.
The Pfam database is a large collection of protein domain families. Each family is represented
by multiple sequence alignments and hidden Markov models (HMMs).
There are two levels of quality to Pfam families: Pfam-A and Pfam-B. Pfam-A entries are derived
from the underlying sequence database, known as Pfamseq, which is built from the most recent
release of UniProtKB at a given time-point. Each Pfam-A family consists of a curated seed
alignment containing a small set of representative members of the family, profile hidden Markov
models (profile HMMs) built from the seed alignment, and an automatically generated full
alignment, which contains all detectable protein sequences belonging to the family, as defined
by profile HMM searches of primary sequence databases.
Pfam-B families are un-annotated and of lower quality, as they are generated automatically from
the non-redundant clusters of the latest ADDA release. Although of lower quality, Pfam-B
families can be useful for identifying functionally conserved regions when no Pfam-A entries
are found.
Pfam entries are classified in one of four ways:
Family: A collection of related protein regions
Domain: A structural unit
Repeat: A short unit which is unstable in isolation but forms a stable structure
when multiple copies are present
Motif: A short unit found outside globular domains
Related Pfam entries are grouped together into clans; the relationship may be
defined by similarity of sequence, structure or profile-HMM.
Pfam also generates higher-level groupings of related families, known
as clans. A clan is a collection of Pfam-A entries which are related by
similarity of sequence, structure or profile-HMM.
You can find data in Pfam in various ways:
Analyze your protein sequence for Pfam matches
View Pfam family annotation and alignments
See groups of related families
Look at the domain organisation of a protein sequence
Find the domains on a PDB structure
Query Pfam by keywords
Enter any type of accession or ID to jump to the page for a Pfam family or clan,
UniProt sequence, PDB structure, etc.
Browse Pfam
You can browse lists of families, clans or proteomes which begin with a chosen letter
(or number). You can also see a list of Pfam families which are new to this release, or the
list of the twenty largest families in terms of number of sequences.
Glossary of terms used in Pfam
These are some of the commonly used terms in the Pfam website
Alignment coordinates
HMMER3 reports two sets of domain coordinates for each profile HMM match. The envelope
coordinates delineate the region on the sequence where the match has been probabilistically
determined to lie, whereas the alignment coordinates delineate the region over which HMMER
is confident that the alignment of the sequence to the profile HMM is correct. Our full
alignments contain the envelope coordinates from HMMER3.
Architecture
The collection of domains that are present on a protein
Clan
A collection of related Pfam entries. The relationship may be defined by similarity of
sequence, structure or profile-HMM.
Domain
A structural unit
Domain score
The score of a single domain aligned to an HMM. Note that, for HMMER2, if
there was more than one domain, the sequence score was the sum of all the
domain scores for that Pfam entry. This is not quite true for HMMER3.
DUF
Domain of unknown function
Envelope coordinates
See Alignment coordinates
Family
A collection of related protein regions
Full alignment
An alignment of the set of related sequences which score higher than the manually set
threshold values for the HMMs of a particular Pfam entry.
Gathering threshold (GA)
Also called the gathering cut-off, this value is the search threshold used to build the full
alignment. The gathering threshold is assigned by a curator when the family is built. The GA
is the minimum score a sequence must attain in order to belong to the full alignment of a Pfam
entry. For each Pfam HMM we have two GA cutoff values: a sequence cutoff and a domain cutoff.
HMMER
The suite of programs that Pfam uses to build and search HMMs. Since Pfam release 24.0 we
have used HMMER version 3 to make Pfam. See the HMMER site for more information.
Hidden Markov model (HMM)
An HMM is a probabilistic model. In Pfam we use HMMs to transform the information
contained within a multiple sequence alignment into a position-specific scoring system. We
search our HMMs against the UniProt protein database to find homologous sequences.
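To make "transforming a multiple sequence alignment into a position-specific scoring system" concrete, here is a minimal sketch that builds a log-odds position-specific scoring matrix from a toy alignment. It is a deliberately simplified stand-in, not a profile HMM and not what HMMER does: there are no insert or delete states, the four-letter alphabet, the uniform background frequencies and the pseudocount of 1 are all assumptions for the example.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence

Matrix = List[Dict[str, float]]  # one {residue: log-odds score} dict per column


def build_pssm(alignment: Sequence[str],
               background: Dict[str, float],
               pseudocount: float = 1.0) -> Matrix:
    """Log2-odds score per residue and column; gap characters are ignored."""
    alphabet = sorted(background)
    matrix = []
    for column in zip(*alignment):  # assumes all rows have equal length
        counts = Counter(res for res in column if res in background)
        total = sum(counts.values()) + pseudocount * len(alphabet)
        matrix.append({
            res: math.log2(((counts[res] + pseudocount) / total) / background[res])
            for res in alphabet
        })
    return matrix


def score_sequence(matrix: Matrix, sequence: str) -> float:
    """Sum of per-position scores for an ungapped sequence of matching length."""
    return sum(column[res] for column, res in zip(matrix, sequence))


# Toy alignment over a made-up four-letter alphabet with uniform background.
toy_background = {"A": 0.25, "C": 0.25, "D": 0.25, "E": 0.25}
toy_alignment = ["ACD", "ACE", "ACD", "AAD"]
pssm = build_pssm(toy_alignment, toy_background)
family_like = score_sequence(pssm, "ACD")
unrelated = score_sequence(pssm, "EEA")
```

A sequence resembling the alignment columns scores positive, an unrelated one scores negative; a real profile HMM adds position-specific insertion and deletion probabilities on top of this idea.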
HMMER3
The suite of programs that Pfam uses to build and search HMMs. See the HMMER site for more
information.
iPfam
A resource that describes domain-domain interactions that are observed in PDB entries. Where
two or more Pfam domains occur in a single structure, it analyses them to see if they are close
enough to form an interaction. If they are close enough, it calculates the bonds forming the
interaction.
Metaseq
A collection of sequences derived from various metagenomics datasets
Motif
A short unit found outside globular domains
Noise cutoff (NC)
The bit scores of the highest scoring match not in the full alignment
Pfam-A
An HMM-based, hand-curated Pfam entry which is built using a small number of
representative sequences. We manually set a threshold value for each HMM and search our
models against the UniProt database. All of the sequences which score above the threshold
for a Pfam entry are included in the entry's full alignment.
Pfam-B
An automatically generated alignment which is formed by taking a cluster of sequences from
the ADDA database and removing Pfam-A residues from them. Since Pfam-B families are
automatically generated, we recommend that you verify that the sequences in a Pfam-B family
are related using other methods, such as BLAST. For Pfam 24.0 we have made HMMs for the
first (and therefore largest) 20000 Pfam-B families. Users can search their sequences against
the Pfam-B HMMs, in addition to the Pfam-A HMMs, when performing both single-sequence
searches and batch searches on the website.
Posterior probability
HMMER3 reports a posterior probability for each residue that matches a match or insert state
in the profile HMM. A high posterior probability shows that the alignment of the amino acid to
the match/insert state is likely to be correct, whereas a low posterior probability indicates that
there is alignment uncertainty. This is indicated on a scale, with 10 being the highest certainty
down to 1 being complete uncertainty. Within Pfam we display this information as a heat map
view, where green residues indicate high posterior probability and red ones indicate a lower
posterior probability.
Repeat
A short unit which is unstable in isolation but forms a stable structure when
multiple copies are present
Seed alignment
An alignment of a set of representative sequences for a Pfam entry. We use this alignment to
construct the HMMs for the Pfam entry.
Sequence score
The total score of a sequence aligned to an HMM. If there is more than one domain, the
sequence score is the sum of all the domain scores for that Pfam entry. If there is only a single
domain, the sequence and the domain scores for the protein will be identical. We use the
sequence score to determine whether a sequence belongs to the full alignment of a particular
Pfam entry.
Trusted cutoff (TC)
The bit scores of the lowest scoring match in the full alignment
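The score-related glossary terms above (domain score, sequence score, gathering threshold, trusted cutoff and noise cutoff) fit together as sketched below. This is a simplified reading of the definitions, not Pfam's pipeline. It assumes the sequence score is the plain sum of domain scores (which the glossary notes is not quite true for HMMER3), that a sequence joins the full alignment when it reaches the sequence cutoff and has at least one domain reaching the domain cutoff, and that TC and NC are taken at sequence level only. The record type, names and scores are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class SequenceHit:
    """One sequence scored against a single Pfam HMM (hypothetical record)."""
    name: str
    domain_scores: List[float]  # bit score of each domain match

    @property
    def sequence_score(self) -> float:
        # Glossary definition: the sum of all domain scores for the entry.
        return sum(self.domain_scores)


def split_by_gathering_threshold(
    hits: List[SequenceHit], seq_ga: float, dom_ga: float
) -> Tuple[List[SequenceHit], List[SequenceHit]]:
    """Return (members of the full alignment, rejected sequences)."""
    members, rejected = [], []
    for hit in hits:
        passes = (hit.sequence_score >= seq_ga
                  and any(score >= dom_ga for score in hit.domain_scores))
        (members if passes else rejected).append(hit)
    return members, rejected


def trusted_and_noise_cutoffs(
    members: List[SequenceHit], rejected: List[SequenceHit]
) -> Tuple[Optional[float], Optional[float]]:
    """TC: lowest score inside the full alignment; NC: highest score outside it."""
    tc = min((h.sequence_score for h in members), default=None)
    nc = max((h.sequence_score for h in rejected), default=None)
    return tc, nc


# Made-up scores, for illustration only.
hits = [
    SequenceHit("seqA", [30.0, 25.5]),
    SequenceHit("seqB", [27.0]),
    SequenceHit("seqC", [12.0, 9.0]),
    SequenceHit("seqD", [18.5]),
]
members, rejected = split_by_gathering_threshold(hits, seq_ga=25.0, dom_ga=20.0)
tc, nc = trusted_and_noise_cutoffs(members, rejected)
```

By construction the gathering threshold lies between the two derived values (NC below GA, TC at or above it), which is why TC and NC describe how cleanly a family separates from background.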
SMART (a Simple Modular Architecture Research Tool) allows the identification and annotation of
genetically mobile domains and the analysis of domain architectures. More than 500 domain families
found in signalling, extracellular and chromatin-associated proteins are detectable. These domains are
extensively annotated with respect to phyletic distributions, functional class, tertiary structures and
functionally important residues. Each domain found in a non-redundant protein database, as well as
search parameters and taxonomic information, are stored in a relational database system. User
interfaces to this database allow searches for proteins containing specific combinations of domains in
defined taxa.
Features
You can use SMART in two different modes: normal or genomic. The main difference is in the underlying
protein database used. In Normal SMART, the database contains Swiss-Prot, SP-TrEMBL and stable
Ensembl proteomes. In Genomic SMART, only the proteomes of completely sequenced genomes are used:
Ensembl for metazoans and Swiss-Prot for the rest.
The protein database in Normal SMART has significant redundancy, even though identical proteins are
removed. If you use SMART to explore domain architectures, or want to find exact domain counts in various
genomes, consider switching to Genomic mode. The numbers in the domain annotation pages will be more
accurate, and there will not be many protein fragments corresponding to the same gene in the architecture
query results. Remember, though, that you are then exploring a limited set of genomes.
Different color schemes are used to easily identify the mode you are in.
Normal mode Genomic mode
Alignment
Representation of a prediction of the amino acids in tertiary structures of homologues that overlay
in three dimensions. Alignments held by SMART are mostly based on published observations (see
domain annotations for details), but are updated and edited manually.
Alignment block
Ungapped alignments that usually represent a single secondary structure
Bits scores
Alignment scores are reported by HMMer and BLAST as bits scores. The likelihood that the query
sequence is a bona fide homologue of the database sequence is compared to the likelihood that
the sequence was instead generated by a random model. Taking the logarithm (to base 2) of this
likelihood ratio gives the bits score.
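The definition above amounts to a one-line formula, score = log2(P(sequence | homologue model) / P(sequence | random model)). A minimal sketch, with a hypothetical function name and made-up likelihood values:

```python
import math


def bits_score(p_homologue_model: float, p_random_model: float) -> float:
    """Base-2 logarithm of the likelihood ratio between the two models."""
    if p_homologue_model <= 0 or p_random_model <= 0:
        raise ValueError("likelihoods must be positive")
    return math.log2(p_homologue_model / p_random_model)


# Made-up likelihoods: the homologue model explains the sequence
# 2**20 times better than the random model.
example = bits_score(2.0 ** -10, 2.0 ** -30)
```

Each additional bit therefore means the homology model explains the sequence twice as well relative to the random model; a score of 0 means the two models explain it equally well.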
BLAST: Basic local alignment search tool
An excellent database searching tool developed at the National Center for Biotechnology
Information (NCBI) ([1] [2] [3] [4]). SMART uses NCBI-BLAST for detection of outlier homologues
and homologues of known structure. WU-BLAST is used for nrdb searches with user-supplied sequences.
Cellular Role
Chromatin-associated: Chromatin is the tangled fibrous complex of DNA and protein within a
eukaryotic nucleus.
Interaction (with the environment): Molecules that sense cellular environmental change, such as
osmolarity, light flux, acidity, ion concentration, etc.
Metabolic: Enzymes that catalyze reactions in living cells that transform organic molecules.
Replication: The process of making an identical copy of a section of duplex (double-stranded) DNA,
using existing DNA as a template for the synthesis of new DNA strands.
Signalling: Proteins that participate in a pathway that is initiated by a molecular stimulus and
terminated by a cellular response.
Transport: Transporters (permeases) are proteins that straddle the cell membrane and carry specific
nutrients, ions, etc. across the membrane.
Translation: The process in which the genetic code carried by messenger RNA directs the synthesis
of proteins from amino acids.
Transcription: The synthesis of an RNA copy from a sequence of DNA (a gene); the first step in gene
expression.
Coiled coils
Intimately-associated bundles of long alpha-helices ([1] [2] [3]). Coiled coils are detected in
SMART using the method of Lupas et al. ([4], COILS home). Coiled coil predictions are indicated
on the second line in SMART's graphical output.
Domain
Conserved structural entities with distinctive secondary structure content and a hydrophobic core.
In small disulphide-rich and Zn2+-binding or Ca2+-binding domains, the hydrophobic core may be
provided by cystines and metal ions, respectively. Homologous domains with common functions
usually show sequence similarities.
Domain composition
Proteins with the same domain composition have at least one copy of each of the domains of the query.
Domain organisation
Proteins having all the domains of the query, in the same order. (Additional domains are allowed.)
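The difference between the two query types can be illustrated with a short sketch. It is not SMART's implementation: proteins are represented here as plain lists of domain names, and "same order, additional domains allowed" is read as a subsequence match, which is an assumption.

```python
def same_composition(query, candidate):
    """True if the candidate has at least one copy of each query domain."""
    return set(query) <= set(candidate)

def same_organisation(query, candidate):
    """True if the query domains occur in the candidate in the same order.

    Extra domains in the candidate are allowed (subsequence match).
    """
    remaining = iter(candidate)
    # 'domain in remaining' consumes the iterator up to the first match,
    # so later query domains must appear further along the candidate.
    return all(domain in remaining for domain in query)

query = ["SH3", "SH2", "TyrKc"]
print(same_composition(query, ["SH2", "SH3", "TyrKc"]))         # True
print(same_organisation(query, ["SH2", "SH3", "TyrKc"]))        # False
print(same_organisation(query, ["SH3", "PH", "SH2", "TyrKc"]))  # True
```

A protein can therefore share the query's domain composition without sharing its domain organisation, but not the other way round.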
Entrez
A WWW-based system that allows easy retrieval of sequence, structure, molecular biology and literature data (Entrez). SMART's domain annotation pages contain links to the Entrez system, thereby providing extensive literature, structure and sequence information.
E-value
This represents the number of sequences with a score greater than or equal to X expected absolutely by chance. The E-value connects the score (X) of an alignment between a user-supplied sequence and a database sequence, generated by any algorithm, with how many alignments with similar or greater scores would be expected from a search of a random sequence database of equivalent size. Since version 2.0, E-values are calculated using Hidden Markov Models, leading to more accurate estimates than before.
Extracellular Domains
Domain families that are most prevalent in proteins outside of the cytoplasm and the nucleus
Gap
A position in an alignment that represents a deletion within one sequence relative to another. Gap penalties are requirements for alignment algorithms in order to reduce excessively-gapped regions. Gaps in alignments represent insertions that usually occur in protruding loops or beta-bulges within protein structures.
Genomic database
Protein database used in SMART's Genomic mode. It contains data from completely sequenced genomes only. Ensembl data is used for Metazoan genomes and Swiss-Prot for others. A complete list of genomes in the database is available.
HMM Hidden Markov model
HMMs are statistical models of the sequence consensus of a homologous family (see the documentation). A particular class of HMMs has been shown to be equivalent to generalised profiles (8867839). Applications of HMMs to sequence analysis are nicely provided by HMMer and SAM.
HMM consensus
The HMM consensus is a one-line summary of the corresponding HMM. The amino acid shown for the consensus is the highest-probability amino acid at that position according to the HMM. Capital letters mean highly conserved residues (probability > 0.5 for protein models) (modified from the HMMer User's Guide).
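As a rough illustration of how such a consensus line could be derived, the sketch below takes one dictionary of emission probabilities per match position. The input layout and the function name are assumptions made for the example, and showing less conserved positions in lower case is assumed here rather than stated in the definition above.

```python
def hmm_consensus(match_emissions, threshold=0.5):
    """Build a one-line consensus from per-position emission probabilities.

    match_emissions: list of dicts mapping amino acid letter -> probability
    (a hypothetical input format, one dict per model position).
    """
    letters = []
    for column in match_emissions:
        residue, probability = max(column.items(), key=lambda item: item[1])
        # Upper case marks highly conserved positions (probability > threshold).
        if probability > threshold:
            letters.append(residue.upper())
        else:
            letters.append(residue.lower())
    return "".join(letters)

columns = [
    {"A": 0.9, "G": 0.1},
    {"L": 0.4, "V": 0.35, "I": 0.25},
    {"K": 0.5, "R": 0.3, "E": 0.2},
]
print(hmm_consensus(columns))  # Alk
```

Note that a probability of exactly 0.5 does not count as highly conserved, because the rule is a strict "greater than".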
HMMer
The HMMer package ([1] [2]) provides multiple alignment and database searching capabilities. There are several programs in the package (see the documentation), including one (hmmfs) that searches databases for non-overlapping LOCAL similarities (i.e. that match across at least part of the HMM) and another (hmmls) that searches databases for non-overlapping GLOBAL similarities (i.e. that match across the full HMM). These correspond approximately to profile-based searches using negative and positive profiles, respectively (see WiseTools). Database searches using hmmls or hmmfs provide alignment scores as bits scores.
Homology
Evolutionary descent from a common ancestor due to gene duplication
Intracellular Domains
Domain families that are most prevalent in proteins within the cytoplasm
Localisation
Numbers of domains that are thought, from SwissProt annotations, to be present in different cellular compartments (cytoplasm, extracellular space, nucleus and membrane-associated) are shown in annotation pages.
Motif
Sequence motifs are short conserved regions of polypeptides. Sets of sequence motifs need not necessarily represent homologues.
NRDB non-redundant database
A database that contains no identical pairs of sequences. It can contain multiple sequences originating from the same gene (fragments, alternative splicing products). SMART in Standard mode uses such a database.
ORF
Open reading frame
Outlier homologues
These are often difficult to detect using HMM methodology. A complementary approach to their detection is to query a database of sequences taken from multiple sequence alignments using BLAST. Selecting this option will also activate searches against sequence databases derived from proteins of known structure. A simple BLAST search of the PDB is performed, together with a search of RPS-Blast profiles derived from SCOP. These profiles were kindly provided by Steffen Schmidt (see Schmidt et al., J Chem Inf Comput Sci 2002 (42) 405-7).
P-value
This represents the probability that, given a database of a particular size, random sequences score higher than a value X. P-values are generated by the BLAST algorithm that has been integrated into SMART.
PDB protein data bank
PDB is an archive of experimentally-determined three-dimensional structures (Brookhaven Nat. Labs or EBI). Domain families represented in SMART and in the PDB are annotated as being of known structure; links are provided in SMART to the PDB via PDBsum and MMDB. PDBsum links can be used to access a variety of sequence-based and structure-based tools, whereas MMDB provides access to literature information and structure similarities.
PFAM
Pfam is a database of protein domain families represented as (i) multiple alignments and (ii) HMM-profiles ([1] [2]). Pfam WWW servers allow comparison of user-supplied sequences with the Pfam database (Sanger Center and Washington Univ.). SMART contains a facility to search the Pfam collection using HMMer.
Prokaryotic domains
SMART now also searches for domains found in two-component regulatory systems. These can be found mainly in prokaryotes, but a few were also found in eukaryotes such as yeast and plants.
Profile
A profile is a table of position-specific scores and gap penalties, representing a homologous family, that may be used to search sequence databases (Ref. [1] [2] [3]). In CLUSTAL-W-derived profiles, those sequences that are more distantly related are assigned higher weights ([4] [5] [6]). Issues in profile-based database searching are discussed in Bork & Gibson (1996) [7].
ProfileScan
An excellent WWW server that allows a user to compare a protein or DNA sequence against a database of profiles (located at the ISREC).
PROSITE
This is a dictionary of protein sites and motif patterns. Some SMART domain annotations contain links to PROSITE.
Schnipsel database domain sequence database
Schnipsel is a German word meaning snippet or fragment. The schnipsel database consists of the sequences of all domains found with SMART in NRDB. Outliers of a family often cannot be detected by a profile, yet are detectable by pairwise similarity to one or more established members of a sequence family. So searching against the schnipsel database gives complementary information to the profile searches.
Searched domains
In the first version of SMART only eukaryotic signalling domains could be searched. In 1998 we extended this set with prokaryotic signalling and extracellular domains. On the input page you can choose whether you want to search only cytoplasmic (prokaryotic + eukaryotic), extracellular or all domains.
Secondary Literature
The secondary literature is derived by the following procedure: for each of the hand-selected papers referenced by a domain, 100 neighbouring papers are retrieved using Medline. If one of these neighbouring papers is referenced from more than two original papers, it is included in the secondary literature list.
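The selection rule can be sketched as a simple counting step. This is only an illustration under stated assumptions: the neighbour lists are taken as already retrieved (no Medline access is shown), and the function name, parameter names and paper identifiers are hypothetical.

```python
from collections import Counter

def secondary_literature(neighbours_by_paper, max_neighbours=100, min_sources=3):
    """Select neighbouring papers shared by more than two original papers.

    neighbours_by_paper: dict mapping each hand-selected paper to its list of
    neighbouring papers (hypothetical, pre-fetched input).
    """
    source_counts = Counter()
    for neighbours in neighbours_by_paper.values():
        # Count each neighbour at most once per original paper.
        source_counts.update(set(neighbours[:max_neighbours]))
    return sorted(p for p, n in source_counts.items() if n >= min_sources)

neighbours = {
    "paper1": ["n1", "n2"],
    "paper2": ["n1", "n2"],
    "paper3": ["n1", "n3"],
}
print(secondary_literature(neighbours))  # ['n1']
```

Here "more than two" is expressed as a minimum of three source papers, so "n2", which is shared by only two papers, is left out.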
Seed Alignment
Alignment that contains only one of each pair of homologues that are represented in a CLUSTALW-derived phylogenetic tree linked by a branch of length less than a distance of 0.2 (see the related article).
SEG
A program of Wootton & Federhen [1] that detects regions of the query sequence that have low compositional complexity [2].
Sequence ID or ACC
Sequence identifiers or accession codes may be entered via the SMART homepage to initiate a query. You can use either Uniprot or Ensembl sequence identifiers.
Signalling Domains
The original set of domains used in SMART were collected as those that satisfied one or both of two criteria:
1 cytoplasmic domains that possess kinase, phosphatase, ubiquitin ligase or phospholipase enzymatic activities, or those that stimulate GTPase-activation or guanine nucleotide exchange
2 cytoplasmic domains that occur in at least two proteins with different domain organisations, of which one also contains a domain that satisfies criterion 1
These domains mostly mediate or regulate the transduction of an extracellular signal towards the nucleus, resulting in the initiation of a cellular response. More recently, prokaryotic two-component signalling domains have been added to the SMART set.
SignalP
This program predicts the presence and location of signal peptide cleavage sites in amino acid sequences (SignalP home page).
SMART Simple Modular Architecture Research Tool
Our main goal in providing this tool is to allow automatic identification and annotation of domains in user-supplied protein sequences.
Species
Numbers of domains present in a variety of selected taxa (animal, archaea, bacteria, fungi, plants and protozoa) are shown in annotation pages.
SwissProt
The SwissProt database is an extensively annotated and non-redundant collection of protein sequences. SwissProt annotations have been mined for SMART-derived annotations of alignments.
TMHMM2
This program predicts the location and topology of transmembrane helices (TMHMM2)
Thresholds
For each of the domains found by SMART, a combination of thresholds is used to distinguish between true and false hits. The different thresholds are described in the SMART paper.
WiseTools
A package that is based on database searches using profiles. Profiles may be generated using PairWise and then compared with sequence databases using SearchWise. Scores are generated for alignments that match the whole of the profile (using a positive profile) or else that match at least part of the profile (using a negative profile). Only the top-scoring optimal alignment of each sequence is reported; hence SMART relies on re-iterating the search for new repeats until none are reported that score above threshold. Score thresholds have been set manually that are considered to represent a score just above the top-scoring true negative. Additional thresholds have been estimated for domains that are repeated in single polypeptides. A more recent package allows comparison of DNA sequences at the level of their conceptual translations, regardless of sequence error and introns (see Wise2).
What's new
Changes from version 6.0 to 7.0
Full text search engine
Perform a full text search of SMART and Pfam domain annotations, plus the complete protein descriptions for Uniprot and Ensembl proteins.
metaSMART
Explore and compare domain architectures in various publicly available metagenomics datasets.
iTOL export and visualization
Domain architecture analysis results can be exported and visualized in interactive Tree Of Life. The new option can be found in the protein list function select list.
User interface cleanup
Various small changes to the UI resulting in faster and easier navigation.
Changes from version 5.1 to 6.0
Metabolic pathways information
SMART domains and proteins available in the genomic mode now have basic metabolic pathways information. It is generated by mapping our genomic mode protein database to the KEGG orthologous groups.
Updated genomic mode protein database
The greatly expanded genomic mode database now includes sequences from 630 completely sequenced species.
Changes from version 5.0 to 5.1
SMART webservice
You can access SMART using our webservice. Check the WSDL file for details. The webservice is still under active development and only works for sequences and IDs which are in our database (complete Uniref100 and stable Ensembl genomes).
SMART DAS server
The SMART DAS server is available at the URL http://smart.embl.de/smart/das. It provides all the protein features from our database (SMART domains, signal peptides, transmembrane regions and coiled coils) for all Uniref100 and Ensembl proteins.
If you need help with these services or have questions/feedback, please contact us.
Changes from version 4.1 to 5.0
New protein database in Normal mode
SMART now uses Uniprot as the main source of protein sequences. All Ensembl proteomes (except pre-releases) are also included. To lower redundancy in the database, the following procedure is used:
o only one copy of 100% identical proteins is kept (different IDs are still available)
o each species' proteins are separated
o CD-HIT clustering with a 96% identity cutoff is performed on each species separately
o the longest member of each protein cluster is used as the representative
o only representative cluster members and single proteins (i.e. proteins which are not members of any clusters) are used in all domain architecture queries and for domain counts in the annotation pages
Even though the number of proteins in the database is almost doubled (current version has around 2.9 million proteins), the redundancy should be minimal.
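Two steps of the redundancy procedure above, collapsing identical sequences and choosing the longest cluster member, can be sketched as follows. This is an illustration only: the clustering itself is taken as a given input (CD-HIT is not called), and the function names and data layout are assumptions.

```python
def collapse_identical(proteins):
    """Keep one copy of each sequence while remembering every ID that shared it.

    proteins: iterable of (protein_id, sequence) pairs (hypothetical layout).
    """
    ids_by_sequence = {}
    for protein_id, sequence in proteins:
        ids_by_sequence.setdefault(sequence, []).append(protein_id)
    return ids_by_sequence

def pick_representatives(clusters, sequences):
    """Return the longest member of each cluster.

    clusters: list of lists of protein IDs, e.g. parsed from a clustering run;
    a single protein is simply a cluster of size one.
    sequences: dict mapping protein ID -> sequence.
    """
    return [max(cluster, key=lambda pid: len(sequences[pid])) for cluster in clusters]

print(collapse_identical([("A", "MKV"), ("B", "MKV"), ("C", "MKVL")]))
# {'MKV': ['A', 'B'], 'MKVL': ['C']}
print(pick_representatives([["A", "C"], ["D"]],
                           {"A": "MKV", "C": "MKVL", "D": "MST"}))
# ['C', 'D']
```

Keeping the list of IDs per sequence mirrors the note that different IDs remain available after identical copies are merged.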
Domain architecture invention dating
As a further step from the single domain to the understanding of multi-domain proteins, SMART now predicts the taxonomic class where the concept of a protein, that is, its domain architecture, was invented. The domain architecture is defined as the linear order of all SMART domains in the protein sequence. To derive the point of its invention, all proteins with the same domain architecture are mapped onto NCBI's taxonomy. The last common ancestor of all organisms containing at least one protein with the domain architecture is defined as the point of its origin.
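The last-common-ancestor step can be sketched with lineages written as root-to-leaf lists of taxon names. The lineages below are simplified, hand-written examples rather than data pulled from the NCBI taxonomy, and the function name is an assumption.

```python
def last_common_ancestor(lineages):
    """Return the deepest taxon shared by all root-to-leaf lineages."""
    if not lineages:
        raise ValueError("at least one lineage is required")
    ancestor = None
    # Walk down from the root while every lineage agrees on the taxon.
    for taxa in zip(*lineages):
        if all(taxon == taxa[0] for taxon in taxa):
            ancestor = taxa[0]
        else:
            break
    return ancestor

lineages = [
    ["cellular organisms", "Eukaryota", "Metazoa", "Chordata", "Homo sapiens"],
    ["cellular organisms", "Eukaryota", "Metazoa", "Arthropoda", "Drosophila melanogaster"],
    ["cellular organisms", "Eukaryota", "Fungi", "Saccharomyces cerevisiae"],
]
print(last_common_ancestor(lineages))  # Eukaryota
```

In this example an architecture found in human, fly and yeast would be dated to Eukaryota; dropping the yeast lineage would move the point of origin down to Metazoa.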
Changes from version 4.0 to 4.1
Two modes of operation Normal and Genomic
For more details visit the change mode page
Intrinsic protein disorder prediction
You can now include DisEMBL prediction of protein disorder in your search parameters. The results are displayed as blue regions in protein schematics. DisEMBL's HOTLOOPS and REM465 methods are currently used. Visit the DisEMBL page for more information on the method.
Catalytic activity check
SMART now includes data on specific requirements for catalytic activity for some of our domains (50 catalytic domains at the moment; full list here). If the required amino acids are not present in the predicted domain, it will be marked as Inactive. The domain annotation page will show you the details on which amino acids are missing and the links to relevant literature (Pubmed link).
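A check of this kind could look like the sketch below. It is not SMART's code: the required positions and residues are invented for the example, and positions are assumed to be 1-based coordinates within the predicted domain.

```python
def missing_catalytic_residues(domain_sequence, required):
    """List required positions where the domain lacks an allowed residue.

    required: dict mapping 1-based domain position -> set of allowed residues
    (hypothetical requirements, for illustration only).
    """
    missing = []
    for position, allowed in sorted(required.items()):
        if position <= len(domain_sequence):
            residue = domain_sequence[position - 1]
        else:
            residue = None  # the predicted domain is too short to cover it
        if residue not in allowed:
            missing.append((position, residue))
    return missing

requirements = {3: {"D"}, 4: {"H"}}
print(missing_catalytic_residues("MKDHG", requirements))  # []
print(missing_catalytic_residues("MKAHG", requirements))  # [(3, 'A')]
```

A domain with a non-empty result would be the one marked as Inactive, and the returned list corresponds to the missing amino acids shown on the annotation page.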
Taxonomic trees
Architecture query results are now displayed as simple taxonomic trees. In addition to individual proteins, you can select any taxonomic node (or multiple nodes) and display all the proteins in those nodes. The Evolution section of domain annotation pages is now also represented as a taxonomic tree. The basic tree contains only several hand-picked representative species, with a link to the full tree.
User interface redesigned
SMART has been completely rewritten and all pages conform to the XHTML 1.0 Strict and CSS Level 2 standards. We recommend a modern standards-compliant browser for the best experience; Mozilla Firefox and Opera are our favorites.
Changes from version 3.5 to 4.0
Intron positions are shown in protein schematics
For proteins that match any of the Ensembl predictions, SMART will show intron positions as vertical coloured lines in graphical representations (see example). This information is retrieved from a pre-calculated mapping of Ensembl gene structures to protein sequences.
The vertical line at the end of the protein is not an actual intron, but a mark to show that intron mapping was performed. If that is the only line, there are no introns annotated. If there is no line at all, there is no data available in Ensembl for that particular sequence.
You can switch off intron display on your SMART preferences page
Alternative splicing information
Since SMART now incorporates Ensembl genomes, the Additional information page shows a list of alternative splices of the gene encoding the analyzed protein (if there are any). It is possible to either display SMART protein annotation for any of the alternative splices or get a graphical multiple sequence alignment of all of them.
Orthology information
There are 2 separate sets of orthologs for each Ensembl protein: 1:1 reciprocal best matches in other genomes, and orthologous groups with reciprocal best hits from all genomes analyzed (i.e. each of these proteins has exactly one ortholog in all 6 genomes). This data is displayed on the Additional information page.
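The idea of a 1:1 reciprocal best match can be sketched as follows. The best-hit tables are assumed to have been computed beforehand (for instance from pairwise searches between two genomes); the function name and identifiers are hypothetical.

```python
def reciprocal_best_hits(best_a_to_b, best_b_to_a):
    """Return pairs (a, b) where a's best hit is b and b's best hit is a.

    best_a_to_b / best_b_to_a: dicts mapping each protein to its single
    best-scoring match in the other genome (hypothetical, pre-computed).
    """
    return sorted(
        (a, b) for a, b in best_a_to_b.items() if best_b_to_a.get(b) == a
    )

genome_a_best = {"a1": "b1", "a2": "b2", "a3": "b2"}
genome_b_best = {"b1": "a1", "b2": "a3"}
print(reciprocal_best_hits(genome_a_best, genome_b_best))
# [('a1', 'b1'), ('a3', 'b2')]
```

Protein "a2" is dropped because its best hit "b2" points back to "a3" instead, so the match is not reciprocal.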
Graphical multiple sequence alignments
Orthologous groups and different alternative splices can be displayed as graphical multiple sequence alignments. Proteins are aligned using ClustalW. Domains, intrinsic features and introns are mapped onto the alignment, with their positions adjusted according to gaps (black boxes).
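Adjusting feature positions for gaps amounts to translating coordinates in the ungapped protein into alignment columns. The sketch below shows one way to do this; it assumes 1-based inclusive coordinates and "-" as the gap character, and it is not taken from SMART.

```python
def residue_columns(aligned_sequence, gap="-"):
    """Return the 1-based alignment column of every residue, in order."""
    return [
        column
        for column, symbol in enumerate(aligned_sequence, start=1)
        if symbol != gap
    ]

def map_feature(aligned_sequence, start, end):
    """Map a feature given in ungapped coordinates onto alignment columns."""
    columns = residue_columns(aligned_sequence)
    return columns[start - 1], columns[end - 1]

# Residues 3-5 of MKVLAG sit in columns 5-7 once two gaps are inserted.
print(map_feature("MK--VLA-G", 3, 5))  # (5, 7)
```

A feature that spans a gapped region simply becomes wider in alignment coordinates, which is why domains can appear stretched in the graphical alignment.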
Changes from version 3.4 to 3.5
Features of all Ensembl genomes are stored in SMART
You can use standard Ensembl protein identifiers (for example ENSP00000264122) and sequences in all queries.
Batch access
Quickly display results for hundreds of sequences. Currently limited to 500 sequences per access, 2000 per day.
Smarter handling of identical sequences
If there are multiple IDs associated with a particular sequence, an extra table will be displayed showing all of them.
Changes from version 3.3 to 3.4
Search structure based profiles using RPS-Blast
Clicking on the search schnipsel and structures checkbox will now also initiate a search of profiles based on SCOP domain families using RPS-Blast. These profiles were kindly provided by Steffen Schmidt (see Schmidt et al., J Chem Inf Comput Sci 2002 (42) 405-7).
Improved Architecture analysis
In addition to standard Domain selection querying, it is now possible to do queries based on GO (Gene Ontology; click here for more info) terms associated with domains. In the first step you get a list of domains matching the GO terms entered. After selecting the domains of interest from the list, proteins containing those domains are displayed. Use the Taxonomic selection box to limit the results based on taxonomic ranges.
Pfam domains are stored in the database
The SMART database now contains precomputed results for all Pfam domains. To use Pfam domains in the architecture queries, prepend the domain name with Pfam: (for example TyrKc AND Pfam:Fz AND TRANS).
Try our SMART Toolbar for the Mozilla web browser (click here for more info).
Changes from version 3.2 to 3.3
Fantastic new protein picture generator
Proteins are now displayed as dynamically generated PNG images. This means that you can download the entire protein representation as a single image. You have our permission to do this and use these diagrams in any way that you like; do acknowledge us, though. All domain bubbles have been script-generated using The Gimp and its Perl-Fu extension. The script is GPL'd and is available for download.
Fancy colour alignments
All alignments we generate are coloured in using Leo Goodstadt's excellent CHROMA program. CHROMA is available from here, and it's great. You may experience some problems if you're using a klunky browser. This will be fixed when you change your browser.
Excellent transmembrane domains
These are now calculated using the fine TMHMM2 program (the main web site is here), kindly provided by Anders Krogh and co-workers. You can read about the method here.
Selective SMART
We now store intrinsic features such as transmembrane domains and signal peptides in our database. This means that these can be queried for (e.g. SIGNAL AND TyrKc). The feature names are: signal peptide, SIGNAL; transmembrane domain, TRANS; coiled coil, COIL.
Technical changes
SMART code has been modified to run under the Apache mod_perl module. SMART is now using Apache::DBI for persistent database connections. The database engine has been updated to PostgreSQL 7.1. These changes resulted in significant speed improvements.
Bug fixes as usual
The show_many_proteins script now uses the POST method, so there is no longer a limit on the number of proteins you can display. Fixed a display problem with proteins having thousands of different representations (if you try to display those, make sure you have a good browser (like Mozilla or Opera :-)) and a bunch of RAM).
Changes from version 3.1 to 3.2
Start page
There is an indicator of the current database status in the header now. If the database is down or is being updated, you'll know immediately.
Literature
Changed literature identifiers to PMID. New secondary literature generator that parses all neighbouring papers, not just the first 100. Numerous small bug fixes and improvements.
Changes from version 3.0 to 3.1
Startup page
The start page now includes selective SMART and allows searching for keywords in the annotation of domains.
Schnipsel Blast
The results of a schnipsel Blast search are included in the bubble diagram. Additionally, the PDB is searched.
Taxonomic breakdown
When selecting multiple proteins, e.g. via selective SMART or the annotation pages, an overview of the taxonomy of all species (tax break) is offered.
Links
Links don't contain version numbers. This allows stable links from external sources.
Selective SMART
Allows searching for multiple copies of domains and is case insensitive. You can now search for e.g. Sh3 AND sH3 AND sh3.
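The behaviour described for selective SMART queries (case-insensitive names, repeated names meaning multiple copies) can be illustrated with a small matcher. This is a sketch that supports only the AND operator and is not SMART's actual query parser; the function name and the list-of-names protein representation are assumptions.

```python
import re
from collections import Counter

def matches_query(query, protein_features):
    """Check an AND-only query against a protein's domains and features.

    Names are compared case-insensitively, and a name repeated n times in
    the query requires at least n copies in the protein.
    """
    terms = [t.strip().lower() for t in re.split(r"\s+AND\s+", query.strip())]
    required = Counter(terms)
    present = Counter(name.lower() for name in protein_features)
    return all(present[name] >= count for name, count in required.items())

print(matches_query("Sh3 AND sH3 AND sh3", ["SH3", "SH2", "SH3", "SH3"]))  # True
print(matches_query("Sh3 AND sH3 AND sh3", ["SH3", "SH2", "SH3"]))         # False
print(matches_query("SIGNAL AND TyrKc", ["SIGNAL", "TyrKc", "TRANS"]))     # True
```

Counting names rather than testing simple membership is what lets a query ask for three copies of the same domain.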
Annotation
You can align your query sequence to the SMART alignment using hmmalign.
Update of underlying database
SMART now uses PostgreSQL 6.5.2.
Changes from version 2.0 to 3.0
Digest output
SMART now only produces a single diagram representing a best interpretation of all the annotation that has been performed. A comprehensive summary of the results is also provided in table format.
selective SMART
Selective SMART allows you to look for proteins with combinations of specific domains in different species or taxonomic ranges.
alert SMART
The SMART database gets updated about once a week. If you are interested in specific domains or combinations of them in specific taxonomic ranges, you can use the SMART alerting service. This provides the identities of newly-deposited proteins that match your query.
Domain queries
You can ask for proteins having the same domain order/composition as your query protein.
SMARTed Genomes
You can get the result of SMARTing the genomes of Caenorhabditis elegans and Saccharomyces cerevisiae via the annotation pages.
Faster PFAM searches
The PFAM searches now run on a PVM cluster.
Changes from version 1.03 to 2.0
Domain coverage
The original set of signalling domains has now been extended to include extracellular domains.
Default search method is now HMMer
The previously-used underlying methodology of SMART (i.e. SWise) tended to over-extend gaps, leading to problems in defining domain borders. Although a SWise search of the SMART database remains possible (using the Wise searching SMART database option available from the Home Page), the default searching method is now HMMer2. This now allows improved statistical estimates (Expectation- or E-values) of the significance of a domain hit.
Complete rewrite of the annotation pages
As with the previous version (1.03), annotation pages provide information concerning domain functions and include hyperlinks to PubMed. However, we now offer automatically-derived data showing the taxonomic range and the predicted cellular localisation of proteins containing the domain in question.
Literature database
In addition to the manually derived literature sources given in version 1.03, we now provide automatically-derived secondary literature from the annotation pages. These are extracted from PubMed and are cross-linked to additional abstracts.
The CATH domain structure database: new protocols and classification levels give a more comprehensive resource for exploring evolution
Lesley H Greene, Tony E Lewis, Sarah Addou, Alison Cuff, Tim Dallman, Mark Dibley, Oliver Redfern, Frances Pearl, Rekha Nambudiry, Adam Reid, Ian Sillitoe, Corin Yeats, Janet M Thornton and Christine A Orengo
WHAT'S NEW
We report the latest release (version 3.0) of the CATH protein domain database (http://www.cathdb.info). There has been a 20% increase in the number of structural domains classified in CATH, up to 86 151 domains. Release 3.0 comprises 1110 fold groups and 2147 homologous superfamilies. To cope with the increases in diverse structural homologues being determined by the structural genomics initiatives, more sensitive methods have been developed for identifying boundaries in multi-domain proteins and for recognising homologues. The CATH classification update is now being driven by an integrated pipeline that links these automated procedures with validation steps that have been made easier by the provision of information-rich web pages summarising comparison scores and relevant links to external sites for each domain being classified. An analysis of the population of domains in the CATH hierarchy and several domain characteristics are presented for version 3.0. We also report an update of the CATH Dictionary of homologous structures (CATH-DHS), which now contains multiple structural alignments, consensus information and functional annotations for 1459 well populated superfamilies in CATH. CATH is directly linked to the Gene3D database, which is a projection of CATH structural data onto approximately 2 million sequences in completed genomes and UniProt.
GENERAL INTRODUCTION
The numbers of new structures being deposited in the Protein Data Bank (PDB) continue to grow at a considerable rate. In addition, structures being targeted by worldwide structural genomics initiatives are more likely to be novel or only very remotely related to domains previously classified. Only 2% of structures currently solved by conventional crystallography or NMR are likely to adopt novel folds (see Figures 1 and 2). A higher proportion of new folds are expected to be solved by structural genomics structures. Although the influx of more diverse structures and subsequent analysis will inform our understanding of how domains evolve, it has resulted in increasing lags between the numbers of structures being deposited and classified. In response to this situation, we have significantly improved our automated and manual protocols for domain boundary assignment and homologue recognition.
Figure 1. Annual decrease in the percentage of new structures classified in CATH which are observed to possess a novel fold. The raw data for years 1972–2005 was fit to a single exponential equation by nonlinear regression using Sigma Plot (SPSS Version
Figure 2. Annual proportion of protein structures deposited in the PDB which are classified in CATH, rejected or pending classification. The colour scheme reflects different categories of PDB chains. Black: not accepted by the CATH criteria. Red: unprocessed chains
Significant changes have been implemented in the CATH classification protocol to achieve a more highly automated system. A seamless flow of structures between the constituent programs has been achieved by building a pipeline which integrates web services for each major comparison stage in the classification (see Figure 3). Secondly, completely automatic decisions are now being made for new protein chains with close relatives already assigned in the CATH database. There are two situations that preclude the CATH update process from being fully automated. We rely on expert manual curation for particularly challenging protein domain boundary assignments (DomChop stage) and also for classifications of remote folds and homologues (HomCheck stage). These two manual stages will remain an integral part of the system (Figure 3).
Figure 3. Flow diagram of the CATH classification pipeline. This schematic illustrates the processes involved in classifying newly determined structures in CATH. The CATH update protocol workflow from new chain to assigned domain is split into two main processes
In this paper we report our ongoing development of the automated procedures. These critical new features should better enable CATH to keep pace with the PDB (3) and facilitate its development. Key statistics on the domain structure populations and characteristics are also presented.
A REVISED CATH CLASSIFICATION HIERARCHY: CATHSOLID
In order to provide information on the sequence diversity between superfamily members, we have introduced additional levels into the CATH hierarchy. The CATH hierarchical classification scheme now consists of nine levels. Class is derived from secondary structure content, and Architecture describes the gross orientation of secondary structures independent of connectivity. The Topology level clusters structures into fold groups according to their topological connections and numbers of secondary structures. The Homologous superfamilies cluster proteins with highly similar structures, sequences and/or functions (4,5). The new extension of the CATH classification system now includes five 'SOLID' sequence levels. S, O, L and I further divide domains within the H-level using multi-linkage clustering based on similarities in sequence identity (35, 60, 95 and 100%) (see Table 1). The D-level acts as a counter within the I-level and is appended to the classification hierarchy to ensure that every domain in CATH has a unique CATHsolid identification code (see Table 1). Specific details on the nature of the SOLID-levels can be found in the 'General Information' section of the CATH website, http://www.cathdb.info. CATH only includes experimentally determined protein structures with a 4 Å resolution or better, 40 residues in length or longer, and having 70% or more side chains resolved.
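The three inclusion criteria stated at the end of the paragraph above can be expressed as a simple filter. This is a sketch only: the function and parameter names are assumptions, and how CATH treats structures without a resolution value (such as NMR models) is not covered by the text and therefore not modelled here.

```python
def meets_cath_inclusion_criteria(resolution_angstrom, n_residues,
                                  fraction_side_chains_resolved):
    """Apply the three stated thresholds to one structure.

    '4 A resolution or better' means a numerically smaller or equal value.
    """
    return (
        resolution_angstrom <= 4.0
        and n_residues >= 40
        and fraction_side_chains_resolved >= 0.70
    )

print(meets_cath_inclusion_criteria(2.1, 150, 0.95))  # True
print(meets_cath_inclusion_criteria(4.5, 150, 0.95))  # False
print(meets_cath_inclusion_criteria(2.1, 35, 0.95))   # False
```

All three conditions must hold at once, so a high-resolution structure is still excluded if it is shorter than 40 residues.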
Table 1. CATH version 3.0 statistics
DOMAIN BOUNDARY ASSIGNMENTS
We have further improved our automated domain boundary prediction method, CATHEDRAL (4). This is used to search a newly determined multi-domain structure against a library of representative structures from different fold groups in the CATH database to recognise constituent domains. CATHEDRAL performs an initial rapid secondary structure comparison between structures, using graph theory to identify putative fold matches, which are then more carefully aligned using a slower, more accurate dynamic programming method. A new scoring scheme has been implemented which combines information on the class of the domains being compared, their sizes, similarity in the structural environments and the number of equivalent residues. A support vector machine is used to combine the different scores and select the best fold match for each putative domain in a new multi-domain protein.
Benchmarking against a set of 964 'difficult' multi-domain chains, whose 1593 constituent domains were remotely related to folds in CATH (<35% sequence identity) and originated from 245 distinct fold groups and 462 superfamilies, showed that 90% of domains within these chains could be assigned to the correct fold group, and for 78% of them the domain boundaries were within ±15 residues of boundaries assigned by careful manual validation. Larger variations in domain boundaries are often due to the fact that in many families significant structural variation can occur during evolution, so that distant relatives vary considerably in size. If no close relative has been classified in CATH, it is likely that the only CATHEDRAL match will be to a relative with significant structural embellishments, thus making it harder to determine the correct boundaries.
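The ±15 residue agreement criterion used in the benchmark can be sketched as a comparison between predicted and manually curated boundaries. The representation of a domain as a single (start, end) pair, and the requirement that domains are listed in the same order, are simplifying assumptions made for this example.

```python
def boundaries_agree(predicted, curated, tolerance=15):
    """Check that every predicted boundary lies within the tolerance.

    predicted / curated: lists of (start, end) residue numbers, one pair per
    domain, given in the same order (hypothetical input format).
    """
    if len(predicted) != len(curated):
        return False
    return all(
        abs(p_start - c_start) <= tolerance and abs(p_end - c_end) <= tolerance
        for (p_start, p_end), (c_start, c_end) in zip(predicted, curated)
    )

curated = [(1, 120), (121, 260)]
print(boundaries_agree([(1, 110), (111, 262)], curated))  # True
print(boundaries_agree([(1, 100), (101, 260)], curated))  # False
```

In the second call the first domain ends 20 residues early, which exceeds the tolerance, so the chain would not count towards the 78% reported above.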
Since domain boundary assignment of remote homologues is one of the most time-consuming stages in the classification, we combine multiple information for each new structure on a web page to guide manual curation. Pages display scores from a range of algorithms, which include structure-based methods, CATHEDRAL (4), SSAP (6,7), DETECTIVE (8), PUU (9) and DOMAK (10); sequence-based methods such as hidden Markov Models (HMMs) (11); and relevant literature. These pages are now viewable for information on putative boundaries for new multi-domain structures currently being classified in CATH (e.g. http://www.cathdb.info/cgi-bin/cath/Chain.pl?chain_id=2g3aA).
For protein chains which are closely related to chains that are already chopped in CATH, an automated protocol has been developed (ChopClose). ChopClose identifies any previously chopped chains that have sufficiently high sequence identity and overlap with the query chain. Using SSAP (6), the query is aligned against each of these chains in turn, and in each case the domain boundaries are inherited across the alignment. The process of inheriting the boundaries often requires some adjustments to be made to account for insertions, deletions or unresolved residues. If the inheritance from one of the chains meets various criteria (SSAP score ≥80, sequence identity >80%, RMSD ≤6.0 Å, longest end extension ≤10 residues, etc.), then the resulting boundaries are used to chop the chain automatically (AutoChop). For cases where ChopClose's best result does not meet all the criteria for automatic chopping, it is provided as support information for a manual domain boundary assignment.
Refer to Figure 3 for the location of AutoChop/ChopClose within the CATH update protocol.
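The AutoChop decision described above amounts to a conjunction of threshold tests. The Python sketch below covers only the four criteria that the text names; the "etc." in the source hides further criteria that are not reproduced here. The RMSD limit is read as 6.0 Å, and the class and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class InheritedChopping:
    """Result of inheriting domain boundaries from one chopped chain."""
    ssap_score: float             # 0-100
    sequence_identity: float      # percent
    rmsd: float                   # in Angstrom
    longest_end_extension: int    # residues


def passes_autochop(result: InheritedChopping) -> bool:
    """Apply the four criteria listed in the text (others are omitted)."""
    return (
        result.ssap_score >= 80
        and result.sequence_identity > 80
        and result.rmsd <= 6.0
        and result.longest_end_extension <= 10
    )


good = InheritedChopping(ssap_score=92.1, sequence_identity=95.0,
                         rmsd=1.2, longest_end_extension=3)
weak = InheritedChopping(ssap_score=85.0, sequence_identity=78.0,
                         rmsd=2.5, longest_end_extension=4)
print(passes_autochop(good))  # True: boundaries can be used automatically
print(passes_autochop(weak))  # False: passed on to manual assignment
```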
NEW HOMOLOGUE RECOGNITION METHODS
We have assessed a number of HMM-based protocols for improving homologue recognition. A new protocol (Samosa), exploiting models built using multiple structure alignments to improve accuracy, gives some improvements in sensitivity (4–5%). However, a protocol exploiting an 8-fold expanded HMM library, based on sequence relatives of structural domains, gives an increase of nearly 10% in sensitivity (12). In addition, HMM–HMM-based approaches have been implemented using the PRC protocol of Madera and co-workers (http://supfam.org/PRC). These allow recognition of extremely remote homologues, some of which are not easily detected by the structure comparison methods (discussed further below). HMM-based database scans developed for the CATH classification protocol are collectively referred to as HMMscan below.
For some very remotely related homologues, confidence in an assignment can be improved by combining information from multiple prediction methods. We have investigated the benefits of using machine learning methods to do this automatically. A neural network was trained using a dataset of 14 000 diverse homologues (<35% sequence identity) and 14 000 non-homologous pairs, with data from different homologue comparison methods including structure comparison (CATHEDRAL, SSAP), sequence comparison (HMM–HMM) and information on functional similarity. The latter was obtained by comparing EC classification codes between close relatives of the distant homologues and using a semantic similarity scoring scheme for comparing GO terms, based on a method developed by Lord et al. (13). On a separate validation set of 14 000 homologous pairs and 14 000 non-homologous pairs, 97% of the homologues can be recognised at an error rate of <4%.
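A figure of this kind (sensitivity at a bounded error rate) can be computed from classifier scores on a labelled validation set. The Python sketch below is not the authors' evaluation code. It assumes that "error rate" means the fraction of accepted pairs that are non-homologous, which the text does not define, and the scores are made up for illustration.

```python
def sensitivity_at_error(scores_pos, scores_neg, max_error=0.04):
    """Highest fraction of true homologues recovered at any score
    threshold whose error rate stays below `max_error`.

    Assumption: error rate = false positives / all accepted pairs.
    """
    best = 0.0
    for threshold in sorted(set(scores_pos) | set(scores_neg)):
        true_pos = sum(s >= threshold for s in scores_pos)
        false_pos = sum(s >= threshold for s in scores_neg)
        accepted = true_pos + false_pos
        if accepted == 0:
            continue
        if false_pos / accepted < max_error:
            best = max(best, true_pos / len(scores_pos))
    return best


# Made-up network outputs for homologous and non-homologous pairs.
homologous = [0.9, 0.8, 0.7, 0.2]
non_homologous = [0.6, 0.3, 0.1, 0.05]
print(sensitivity_at_error(homologous, non_homologous))  # 0.75
```

With these toy scores, any threshold low enough to recover the fourth homologue also admits two non-homologous pairs, so the best sensitivity under the error limit is 3 out of 4.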
NEW UPDATE PROTOCOL
Automatic methods
Previously, CATH data was generated using a group of independent
programs and flat files. Over the past two years we have developed an
update protocol for CATH that is driven by a suite of programs with a
central library and a PostgreSQL database system. A classification
pipeline has been established which links, in a completely automated
fashion, the different programs that analyse the sequences and structures
of both protein chains and domains. The CATH update protocol can
essentially be divided into two parts: domain boundary assignment and
domain homology classification (see Figure 3). The aim of the protocol is
to minimise manual assignment and provide as much support as
possible when manual validation is necessary.
Processing in both parts of the classification protocol is similar, requiring related meta-data and the triggering of the same automated algorithms. Methods include pairwise sequence similarity comparisons and scans by other homologue detection or fold recognition algorithms, such as HMMscan and CATHEDRAL, that provide data for either manual or automated assignment. Many of the automated steps in the protocol have been established as a web service, and the pipeline integrates automated steps together with 'holding stages' in which domains are held prior to processing and await the completion of manual validation of predictions (see below).
Web pages to support manual validation
For each manual stage in the classification (domain boundary assignment, DomChop, and homologue recognition, HomCheck) we have developed a suite of web pages bringing together all available meta-data from prediction algorithms (e.g. DBS, CATHEDRAL, HMMscan for DomChop; CATHEDRAL, HMMscan for HomCheck) and information from the literature and from other family classifications with relevant data (e.g. Pfam). For each protein or domain shown on the pages, information on the statistical significance of matches is presented. The web pages will shortly be made viewable and will provide interim data on protein chains and domains not fully classified in CATH for biologists interested in any entries pending classification.
OVERVIEW OF THE CURRENT RELEASE (VERSION 3.0)
Assigning domain boundaries and relationships between protein structures is computationally challenging. Since the last CATH release (version 2.6), the number of domains in the CATH database has increased by 20% in version 3.0 and now totals 86 151. This is a more than 10-fold increase in the number of domains classified in CATH since its creation. Improvements in automation, and also in the web-based resources used to aid manual validation, have allowed us to increase the proportion of hard-to-classify structures processed in CATH, and this is reflected in a significant increase in the proportion of new folds in the database, now more than 1000. The detailed breakdown of numbers of domains in the nine CATH levels is given in Table 1. We conducted an analysis of the domains in version 3.0 and have derived statistics for several fundamental features.
Percentage of new topologies
An analysis of the percentage of new folds arising from the early 1970s to the present is shown in Figure 1. The number of new folds has been decreasing over time with respect to the number of new structures being deposited, and it can be seen that currently approximately 2% of new structures classified in CATH are observed to be novel folds. For comparison, the number of domain structures solved over time is also graphically represented in Figure 1.
Number of domains within a protein chain
Integral to the construction of the CATH database is designating domain boundaries. We conducted an analysis of the number of chains versus the number of domains in a chain. It is interesting to note that 64% of all protein structures currently solved and classified in CATH are single-domain chains (data not shown). The next most prevalent are two-domain chains (27%), and following this we find that the number of chains containing three or more domains rapidly decreases. The average size of the single-domain chains is 159 residues.
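A tally of this kind is straightforward once each chain has been chopped into domains. The Python sketch below shows the counting step only; the chain identifiers and domain counts are invented for illustration and are not CATH data.

```python
from collections import Counter


def domain_count_distribution(domains_per_chain):
    """Map number-of-domains -> percentage of chains with that many domains."""
    counts = Counter(domains_per_chain.values())
    total = sum(counts.values())
    return {n: 100 * c / total for n, c in sorted(counts.items())}


# Hypothetical chains with the number of domains assigned to each.
chains = {"1abcA": 1, "1abcB": 1, "2xyzA": 2, "3pqrA": 1}
print(domain_count_distribution(chains))  # {1: 75.0, 2: 25.0}
```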
The CATH Dictionary of Homologous Superfamilies (CATH-DHS)
The CATH-DHS has also been recently updated. Data on structural similarity and superfamily variability is presented as a significant update to the Dictionary of Homologous Superfamilies (DHS) web resource (14). The DHS also provides functional annotations of domains within each H-level (superfamily) in CATH v2.5.1.
For each superfamily, pair-wise structural similarity scores between relatives, measured by SSAP, are presented. The DHS now contains 3307 multiple structural alignments for 1459 superfamilies. For each superfamily, multiple alignments are generated for all the relatives and also for subgroups of structurally similar relatives and sequence-similar relatives. Alignments are performed using the residue-based CORA algorithm (15) and presented both as CORAPLOTS (14) and in the form of a 2DSEC diagram (16), alongside co-ordinate data of the superposed structures in PDB format. Sequence representations of the alignments are available to download in FASTA format. In the CORAPLOT images of the multiple alignment, residues in each domain are coloured according to ligand binding and residue type. EquivSEC plots are also shown that describe the variability in orientation and packing between equivalent secondary structures (16).
To identify sequence relatives for CATH superfamilies, sequences from UniProt (17) were scanned against HMMs of all CATH domains (12). Homologous sequences were identified as those hits with an E-value <0.01 and a 60% residue overlap with the CATH domain. This protocol recognised over one million domain sequences in UniProt which could be integrated in the CATH-DHS. The harvested sequences in each superfamily were compared against other relatives by BLAST (18) to determine the pair-wise sequence identity and then clustered at appropriate levels of sequence identity (35% and 95%) using multi-linkage clustering. Information and links to other functional databases [ENZYME (19), GO (Gene Ontology Consortium, 2000), KEGG (20), COG (21), SWISSPROT (22)] are also included by BLASTing the sequences from each superfamily against sequences provided by these resources. Only hits with 95% sequence identity and an 80% residue overlap were used to annotate sequences.
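The hit-selection step can be pictured as a simple filter over HMM scan results. The Python sketch below is an illustration, not the pipeline's code. It reads the garbled thresholds as an E-value below 0.01 and an overlap of at least 60% of the CATH domain; the `Hit` record and the identifiers in the example are hypothetical.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    sequence_id: str
    e_value: float
    overlap: float  # fraction of the CATH domain covered by the match


def select_homologues(hits, max_e_value=0.01, min_overlap=0.60):
    """Keep the hits that satisfy both the E-value and overlap criteria."""
    return [
        hit.sequence_id
        for hit in hits
        if hit.e_value < max_e_value and hit.overlap >= min_overlap
    ]


hits = [
    Hit("seq_a", 1e-30, 0.95),  # strong, near full-length match
    Hit("seq_b", 1e-5, 0.40),   # significant but covers too little
    Hit("seq_c", 0.5, 0.90),    # long but not significant
]
print(select_homologues(hits))  # ['seq_a']
```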
Recent analysis of structural and functional divergence in highly populated CATH superfamilies (>5 structural relatives with <35% sequence identity) has been undertaken using data from the DHS. The 2DSEC algorithm was used to analyse multiple structural alignments of families and identify highly conserved structural cores and secondary structure embellishments, or decorations, to the common core. In some large superfamilies, extensive embellishments were observed outside the core, and although these secondary structure insertions were frequently discontinuous in the protein chain, they were often co-located in 3D space (16). In many cases, manual inspection revealed that the embellishments had aggregated to form a larger structural feature that was modifying the active site of the domain or creating new surfaces for domain or protein interactions. Data collected in the DHS clearly show a relationship between structural divergence within a superfamily, sequence divergence of this superfamily amongst predicted domains in the genomes, and the number of distinct functional groups that can be identified for the superfamily (see Figure 4).
Figure 4. Relationship between sequence variability, structural variability and functional diversity in CATH superfamilies. Structural variation in a CATH superfamily as measured by the number of diverse structural subgroups (SSAP score <80 between groups).
LATERAL LINKS ACROSS THE CATH HIERARCHY TO CAPTURE EVOLUTIONARY DIVERGENCE AND EXPLORE THE STRUCTURAL CONTINUUM
Our analysis of structural divergence in CATH superfamilies (16) has revealed families where significant changes in the structures had occurred. In some cases 5-fold differences in the sizes of domains were identified, and sometimes it was apparent that the 'folds' of these very diverse relatives had effectively changed. Therefore, in these superfamilies more than one fold group can be identified, effectively breaking the hierarchical nature of the CATH classification, which implies that each relative within a CATH homologous superfamily should belong to the same CAT fold group.
In addition, an 'all versus all' HMM–HMM scan between all superfamily representatives revealed several cases of extremely remote homologues which had been classified into separate superfamilies and yet match with significant E-values. Structure comparison had failed to detect the relationship between these superfamilies because the structural divergence of the relatives was so extreme, sometimes constituting a change in architecture as well as fold group. In these cases homology was only suggested by the HMM-based scans and then manually validated by considering functional information and detailed evidence from literature. In order to capture information on these distant homologies, links have been created between the superfamilies, both on our web pages and in the CATH database. The data can now be found as a link from the CATH homepage (http://www.cathdb.info).
In the near future we also plan to provide web pages presenting cases of significant structural overlaps between superfamilies or fold groups. For these cases we are not currently able to find any additional evidence to support a distant evolutionary relationship, and these examples highlight the recurrence of large structural motifs between some folds and the existence of a structural continuum in some regions of fold space.
FEATURES
The CATH database can be accessed at http://www.cathdb.info. The web interface may be browsed or, alternatively, searched with PDB codes or CATH domain identifiers. There is also a facility for keyword searches. With the version 3.0 release we now make the raw and processed data files available, which include, for example, CATH domain PDB files, sequences and DSSP files; they can be accessed through the CATH database main page. The Gene3D resource can be accessed through the CATH database or directly at http://www.cathdb.info/Gene3D. The DHS can be accessed through the CATH database or directly at http://www.cathdb.info/bsm/dhs.
Structural similarity is assessed using an automatic method (SSAP) (3,4), which scores 100 for
identical proteins and generally returns scores above 80 for homologous proteins. More distantly
related folds generally give scores above 70 (Topology or fold level), though in the absence of any
sequence or functional similarity this may simply represent examples of convergent evolution,
reinforcing the hypothesis that there exists a limited number of folds in nature (5,6).
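These score ranges can be written down as a rough lookup. The Python sketch below encodes only the rules of thumb quoted above ("generally" above 80, "generally" above 70); it is not part of SSAP, the function name and labels are hypothetical, and a fold-level score on its own does not establish common ancestry.

```python
def interpret_ssap(score):
    """Translate an SSAP score (0-100) into the rough category
    suggested by the thresholds quoted in the text."""
    if not 0 <= score <= 100:
        raise ValueError("SSAP scores lie between 0 and 100")
    if score == 100:
        return "identical"
    if score > 80:
        return "likely homologous"
    if score > 70:
        return "similar fold (possibly convergent)"
    return "no clear structural similarity"


for value in (100, 86.5, 74.0, 52.3):
    print(value, interpret_ssap(value))
```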
Abstract
We report the latest release (version 1.4) of the CATH protein domains database
(http://www.biochem.ucl.ac.uk/bsm/cath). This is a hierarchical classification of 13 359 protein
domain structures into evolutionary families and structural groupings. We currently identify 827
homologous families in which the proteins have both structural similarity and sequence and/or
functional similarity. These can be further clustered into 593 fold groups and 32 distinct architectures.
Using our structural classification and associated data on protein functions stored in the database (EC
identifiers, SWISS-PROT keywords and information from the Enzyme database and literature), we
have been able to analyse the correlation between the 3D structure and function. More than 96% of
folds in the PDB are associated with a single homologous family. However, within the superfolds
three or more different functions are observed. Considering enzyme functions, more than 95% of
clearly homologous families exhibit either single or closely related functions, as demonstrated by the
EC identifiers of their relatives. Our analysis supports the view that determining structures, for
example as part of a 'structural genomics' initiative, will make a major contribution to interpreting
genome data.
Introduction FOR CATH
The CATH classification of protein domain structures was established in 1993 (1) as a
hierarchical clustering of protein domain structures into evolutionary families and structural
groupings, depending on sequence and structure similarity. There are four major levels,
corresponding to protein class, architecture, topology or fold, and homologous family (Fig. 1).
Since 1995, information about these structural groups and protein families has been accessible
over the Web (http://www.biochem.ucl.ac.uk/bsm/cath), together with summary information
about each individual protein structure (PDBsum) (2).
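The four levels map naturally onto a four-part identifier. As a minimal Python sketch, assuming CATH ids are written as four dot-separated numbers (for example 3.40.50.200, which is how the subtilisin family id quoted in these notes reads once the lost dots are restored), an id can be split into its levels and two domains compared at the fold level. The type and function names are hypothetical.

```python
from typing import NamedTuple


class CathId(NamedTuple):
    cls: int                # (C)lass
    architecture: int       # (A)rchitecture
    topology: int           # (T)opology or fold
    homologous_family: int  # (H)omologous family


def parse_cath_id(text: str) -> CathId:
    """Split an id such as '3.40.50.200' into its four levels."""
    parts = text.split(".")
    if len(parts) != 4:
        raise ValueError(f"expected four dot-separated numbers: {text!r}")
    return CathId(*(int(part) for part in parts))


def same_fold(a: CathId, b: CathId) -> bool:
    """Two domains share a fold if they agree at the C, A and T levels."""
    return a[:3] == b[:3]


subtilisin = parse_cath_id("3.40.50.200")
other = parse_cath_id("3.40.50.300")  # illustrative id in the same fold
print(subtilisin)
print(same_fold(subtilisin, other))   # True: same C.A.T, different H
```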
CATH consists of both phylogenetic and phenetic descriptors for protein domain relationships.
At the lowest levels in the hierarchy, proteins are grouped into evolutionary families
(homologous families) for having either significant sequence similarity (≥35% identity) or high
structural similarity and some sequence similarity (≥20% identity).
The CATH database of protein structures contains approximately 18 000 domains
organized according to their (C)lass, (A)rchitecture, (T)opology and (H)omologous
superfamily. Relationships between evolutionarily related structures (homologues) within the
database have been used to test the sensitivity of various sequence search methods in order to
identify relatives in GenBank and other sequence databases. Subsequent application of the most
sensitive and efficient algorithms, gapped BLAST and the profile-based method Position-Specific
Iterated Basic Local Alignment Search Tool (PSI-BLAST), could be used to assign structural data to
between 22% and 36% of microbial genomes in order to improve functional annotation and
enhance understanding of biological mechanism. However, on a cautionary note, an analysis of
functional conservation within fold groups and homologous superfamilies in the CATH database
revealed that whilst function was conserved in nearly 55% of enzyme families, function had
diverged considerably in some highly populated families. In these families, functional properties
should be inherited far more cautiously, and the probable effects of substitutions in key functional
residues must be carefully assessed.
Figure 1
Schematic representation of the (C)lass, (A)rchitecture and (T)opology/fold levels in the CATH
database.
Figure 2
Snapshot of a web page showing data available in the CATH dictionary of homologous superfamilies
for the subtilisin family (CATH id 3.40.50.200). Tables display the PDB codes for non-identical
relatives in the family, together with EC identifier codes and information about the enzyme reactions.
The multiple structural alignment shown has been coloured according to secondary structure
assignments (red for helix, blue for strands).
The Architecture level in CATH groups proteins whose folds have similar 3D arrangements of
secondary structures (e.g. barrel, sandwich or propellor), regardless of their connectivity, whilst the
top level, Class, simply reflects the proportion of α-helix or β-strand secondary structures. Three
major classes are recognised, mainly-α, mainly-β and α−β, since analysis revealed considerable
overlap between the α+β and alternating α/β classes originally described by Levitt and Chothia (7).
Before classification, multidomain proteins are first separated into their constituent folds using a
consensus method which seeks agreement between three independent algorithms (8). Whilst the
protocol for updating CATH is largely automatic (9), several stages require manual validation, in
particular establishing domain boundaries in proteins for which no consensus could be reached and
checking the relationships of very distant homologues and proteins having borderline fold similarity.
Although there are plans to assign the more regular architectures automatically, all architecture
groupings are currently assigned manually.
A homologous family Dictionary is now available within CATH, which contains functional data,
where available, for each protein within a homologous family. This includes EC identifiers, SWISS-PROT
keywords and information from the Enzyme database or the literature (Fig. 2). Multiple
structure-based alignments are also available, coloured according to secondary structure assignments
or residue properties, and there are schematic plots showing domain representations annotated by
protein-ligand interactions (DOMPLOTS) (A.E.Todd, C.A.Orengo and J.M.Thornton, submitted
to Protein Engng). The topology of each domain is illustrated by schematic TOPS diagrams
(http://www3.ebi.ac.uk/tops; 10).
Figure 3
CATH wheel plot showing the population of homologous families in different fold groups,
architectures and classes. The wheel is coloured according to protein class (red, mainly-α; green,
mainly-β; yellow, αβ; blue, few secondary structures). The size of the outer wheel represents the
number of homologous families in CATH, whilst each band in the outer wheel corresponds to a single
fold family. The size of each fold 'band' therefore reflects the number of homologous families having
that fold. It can be seen that most fold families contain a single homologous family. The superfold
families are shown as paler bands containing many homologous families. The inner wheel shows the
population of homologous families in the different architectures.
We have also recently set up a Web Server (11) which enables the user to scan the CATH database
with a newly determined protein structure and identify possible fold similarities or evolutionary
relationships. There are also plans to incorporate sequence searches (using BLAST or PSI-BLAST)
(12) to identify a probable fold for a new sequence.
The latest release of CATH (version 1.4, April 1998) contains 9342 protein chains from the PDB (13),
which divide into 13 359 domain folds. Currently 32 different architectures are recognised. Since the
last release, three new architectures have been described, including the five-bladed α-β propellor.
Grouping proteins on the basis of sequence, structure and functional similarity gives 827 evolutionary
homologous families (H-level), whilst recognising more distant structural similarity with no
accompanying sequence or function similarity gives rise to 593 different fold groups (T-level).
The population of the different levels in the CATH hierarchy is illustrated by the CATH wheel shown
in Figure 3. It can be seen that several highly populated fold families, which we describe as superfolds
(6) as they support a diverse range of sequences and more than three different functions, still account
for nearly 30% of non-homologous structures.
Implications for Structural Genomics
As the sequence databases grow rapidly, the need to interpret these sequences and assign functions to
specific genes becomes increasingly important. Many techniques exist for matching protein sequences
and thereby inheriting functional information. However, for very distant homologues there is often no
detectable sequence similarity, despite conservation of 3D structure and function. For these cases,
evolutionary relationships, and thereby functions, can only be assigned by comparing the structures.
Therefore, a number of structural genomics initiatives are being proposed (14) which aim to identify
all the folds in nature, with the ultimate goal of being able to predict the function of a new protein from
its known or probable structure. The important questions to ask are: how many more folds do we need
to determine before we have the complete set, and how confident can we be in assigning function
between proteins having similar structures?
In the current genomes, on average only 30–46% of sequences can be assigned to a structural family
by recognising sequence similarity to a protein of known structure (15,16). With only ∼600 unique
structures currently in the PDB, compared with ∼20 000 sequence families, it is clear that we still
need to determine many more structures if we are to understand biology at the molecular level.
However, analysis of recently deposited structural data is very revealing. Figure 4a illustrates the
distribution of 2159 new structural domains classified in the 10 months from June 1997 to March
1998. A large proportion of these (79%) were clearly homologous (≥30% identity) to proteins of
known structure.
Of the remaining 443 structures (Fig. 4b), corresponding to 'new' sequences, we found only 8% were
novel folds, the remainder resembling a previously determined structure. Many of these, 199 (45%),
could be identified as clear homologues by having significant structure and sequence similarity (SSAP
≥80 and ≥20% sequence identity). A further 169 (38%) were probable homologues since, although the
sequence identity was below 20%, they had functional similarity and/or gave significant scores using
sequence search methods designed to detect very distant homologues (PSI-BLAST) (12). There
remained a further 40 (9%) proteins which were analogous, i.e. they had the same fold as a previous
entry but neither the sequence nor the function gave definite evidence of a common ancestor.
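The percentages quoted for the 443 structures can be checked with a few lines of arithmetic. In the Python sketch below the three category counts come from the text; the count of novel folds is not stated and is derived here as the remainder, which is an assumption.

```python
total = 443  # structures with 'new' sequences (Fig. 4b)

counts = {
    "clear homologues": 199,
    "probable homologues": 169,
    "analogues": 40,
}
# Not given in the text: taken to be whatever is left over.
counts["novel folds"] = total - sum(counts.values())

percentages = {name: round(100 * n / total) for name, n in counts.items()}

for name, n in counts.items():
    print(f"{name:20s} {n:4d}  {percentages[name]:3d}%")
```

The remainder is 35 structures, about 8% of 443, and the clear and probable homologues together make up roughly 83%. This is consistent with the statement that only a small fraction of the new sequences adopted novel folds.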
At the homologous superfamily level in CATH, a more detailed analysis of enzyme functions showed
that the majority of homologous enzyme families in CATH (>90%) contained proteins for which the
first three EC identifiers were the same. Considering those families where homologues have
significant sequence identity (≥20%) after structural alignment, 95% were found to have a single EC
identifier, whilst for families where proteins have more than 30% sequence similarity we observed
that 98% had a single EC code.
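Comparing the first three EC identifiers is a simple prefix test on the four-field EC number. The Python sketch below illustrates the comparison only; the function names are hypothetical, and the EC numbers in the example are used purely as sample input rather than as data from the analysis.

```python
def ec_prefix(ec_number, depth=3):
    """Return the first `depth` fields of an EC number such as '3.4.21.62'."""
    return tuple(ec_number.split(".")[:depth])


def share_first_three(ec_numbers):
    """True if all EC numbers in a family agree on their first three fields."""
    return len({ec_prefix(ec) for ec in ec_numbers}) == 1


family_a = ["3.4.21.62", "3.4.21.4"]  # same class, subclass and sub-subclass
family_b = ["3.4.21.62", "3.1.1.3"]   # diverges at the second field
print(share_first_three(family_a))  # True
print(share_first_three(family_b))  # False
```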
Although assigning function on the basis of homology is common practice, it is clear that some
caution should be exercised, particularly where there is little or no sequence similarity. There are also
some clear examples where homologues with significant sequence similarity perform different
functions. The role of 'gene recruitment' is especially clear in the eye lens proteins, which function as
enzymes in other cellular environments but which are used as structural proteins in this context (17).
The extent of such 'gene recruitment' and context-sensitive function is really not known at this time.
For enzymes, it is clear that catalytic function can change and evolve, usually to act on a different but
related substrate. Similarly, within the lipocalin family (CATH id 24013010), several proteins are
found with very similar structures which bind different fatty acids in the same region at the base of
the β-barrel (e.g. retinol, bilin, biotin).
Nearly half of the homologous families where two or more different EC numbers were observed
belong to the superfolds. This suggests that if a new protein is assigned to a superfold family, more
caution should be used when inheriting functional information, as there appears to be greater tolerance
to changes in sequence, and ultimately function, for these families. However, it is interesting to note
that many of these were TIM barrel or Rossmann folds. These are superfolds in which the substrate or
ligand commonly binds in the same place: in the base of the β-barrel for the TIMs and at the
crossover of the polypeptide chain for the doubly wound Rossmann structures.
Assignment of Function Through Structure
One of the reasons for determining structures is to derive more information to facilitate the assignment
of function. From our analysis of proteins in CATH, we suggest that structural data can help to assign
function in several ways:
i. The structural data allow recognition of more distant homologues compared with
sequence data; in our analysis, 83% of structures with novel sequences could be assigned as
homologues in this way (note that such assignment of function is again subject to the caveats
imposed by 'gene recruitment' discussed above).
ii. The structural data allow detailed inspection of the functional site, to suggest if and
how the function may have evolved. For example, if an enzyme has evolved to act on a
different substrate, the binding site may reveal, or at least suggest, possible changes in the
substrate.
iii. For the superfolds, similarity of structure does not necessarily mean similarity of
function. However, the active site/binding sites are often conserved, e.g. in the TIM barrel or
Rossmann fold structures the ligand always binds at the same end of the barrel or sheet.
iv. Some methods have already been developed, and will increasingly be the focus of
attention over the next few years, which aim to predict function ab initio from structure. For
example, enzymes can often be identified by the presence of a major cleft, which also locates
the active site (18). Similarly, critical surface patches which are used for molecular
recognition in binding other proteins or ligands may be identified using knowledge-based
approaches (19,20).
In summary, extrapolating the data from Figure 4 to a new genome, we can expect that of the 54–70%
of sequences which currently have no obvious sequence matches in the PDB, we will find nearly
80–90% to be homologous to a known family using the structural data alone. For the singlet folds this
will almost certainly reveal some clues to the function. For the superfolds, some folds will reveal
information on the functional class (e.g. enzyme for TIM barrels) or the location of the active site, if
not the specific function. Only 10–20% will be expected to be 'novel' folds. For these, the ab
initio methods referred to above may provide some clues to guide experiments. Therefore, it is clear
that determining structures, as part of a 'structural genomics' initiative for example, will make a
major contribution to interpreting genome data.
JAIL
Just Another Interface Library
Interfaces of macromolecules are a valuable basis for analysing the process of molecular recognition. JAIL classifies not only the interfaces between domain architectures but also those between protein chains and those between proteins and nucleic acids.
Figure: Gt-alpha/Gi-alpha chimera (PDB-ID 1GOT) and the interfaces of 1GOT.
Interacting proteins are difficult to crystallize and are rarely present within the Protein Data Bank.
Nevertheless, it is essential to analyse the interacting parts of the proteins to understand the
process of protein-protein docking.
To overcome this problem we have built up the JAIL database. Since interacting domains
exhibit structural features similar to those of proteins, all known interfaces between interacting
domains of the SCOP database were extracted and classified in JAIL.
Only a part of all protein structures is included in SCOP. In particular, new PDB entries are
not yet annotated. To overcome this problem, all interfaces between protein
chains were additionally calculated and included in the database. This type of interface also comprises
the interacting parts of the assumed biological units. The last important type of interface
provided here is composed of the interacting parts between proteins and nucleic acids.
Overall the data set consists of about 180 000 interfaces.
JAIL is a convenient tool to browse through the interface library and to analyze single
interfaces. However, more general questions require large-scale analysis. For this purpose,
a detailed form enables the compiling of comprehensive non-redundant data sets for
download.
How is an interface defined?
A complete residue is part of an interface if at least one atom of the amino acid is located
within 4.5 Ångström of any atom of the interacting domain or chain. One part of
an interface must consist of at least 5 C-alpha atoms in the case of protein chains. In the
case of nucleic acids, the relevant atoms are calculated on the basis of the phosphorus atoms
of the RNA/DNA backbone.
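The distance criterion above can be sketched in a few lines of Python. This is an illustration of the definition, not JAIL's implementation. It reads the garbled cutoff "45 Angstroem" as 4.5 Å, treats "within" as less than or equal to, counts one C-alpha atom per residue, and applies the minimum size to both sides of the interface; all of these are assumptions. Coordinates and residue names are invented.

```python
import math

CUTOFF = 4.5      # Angstrom; assumed reading of the garbled value in the text
MIN_RESIDUES = 5  # at least 5 C-alpha atoms, i.e. 5 residues per side


def interface_residues(side_a, side_b, cutoff=CUTOFF):
    """Residues of side_a with any atom within `cutoff` of any atom of side_b.

    Each side maps a residue id to a list of (x, y, z) atom coordinates.
    """
    atoms_b = [atom for atoms in side_b.values() for atom in atoms]
    return {
        residue
        for residue, atoms in side_a.items()
        if any(math.dist(a, b) <= cutoff for a in atoms for b in atoms_b)
    }


def is_interface(side_a, side_b, min_residues=MIN_RESIDUES):
    """Assumption: both halves must reach the minimum size."""
    part_a = interface_residues(side_a, side_b)
    part_b = interface_residues(side_b, side_a)
    return len(part_a) >= min_residues and len(part_b) >= min_residues


# Toy example: two parallel rows of single-atom residues, 3 Angstrom apart.
chain_a = {f"A{i}": [(float(i), 0.0, 0.0)] for i in range(5)}
chain_b = {f"B{i}": [(float(i), 3.0, 0.0)] for i in range(5)}
print(sorted(interface_residues(chain_a, chain_b)))
print(is_interface(chain_a, chain_b))
```

The brute-force all-against-all distance check is fine for a sketch; a production tool working on whole PDB entries would more likely use a spatial index to avoid the quadratic cost.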
What are biounits?
The primary coordinate file deposited in the PDB generally contains one asymmetric unit.
The asymmetric unit is the smallest portion of a crystal structure to which crystallographic
symmetry can be applied to generate one unit cell. The biological molecule (biounit) is believed
to be the functional unit of the protein. Frequently those units can be assumed or calculated
when additional information is available. The biological units of many proteins are deposited
in a separate section of the PDB database and can be used for interface calculations.
In which way are redundant interfaces excluded in the download section?
Redundancy is excluded in two different ways: by structure and by sequence. The
sequence-based clustering relies on the CD-HIT program. The structural clustering is defined by
the protein families and superfamilies of the SCOP classification. The database classifies
proteins by domain architecture.
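Sequence-based redundancy removal of this kind is commonly done by greedy incremental clustering, which is the general idea behind CD-HIT. The sketch below shows only that idea. The identity measure is a deliberately crude stand-in (ungapped, left-anchored), and the function names and toy sequences are hypothetical.

```python
def toy_identity(a, b):
    """Fraction of identical positions over the shorter sequence, no gaps.
    A stand-in for a real alignment-based sequence identity."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / n


def greedy_cluster(seqs, threshold):
    """Process sequences longest first; a sequence joins the first
    representative it matches at >= threshold, otherwise it becomes
    the representative of a new cluster."""
    clusters = {}  # representative -> list of members
    for s in sorted(seqs, key=len, reverse=True):
        for rep in clusters:
            if toy_identity(s, rep) >= threshold:
                clusters[rep].append(s)
                break
        else:
            clusters[s] = [s]
    return clusters


seqs = ["MKTAYIAKQR", "MKTAYIAKQL", "MKTAHIGKQL", "GSSPWLVNDE"]
print(len(greedy_cluster(seqs, 0.95)))  # number of representatives at 95%
print(len(greedy_cluster(seqs, 0.50)))  # number of representatives at 50%
```

A lower threshold merges more sequences into each cluster, so fewer and more dissimilar representatives remain.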
Which settings in the download section are best for my own research?
The selection of the data sets depends on the type of interaction (protein-protein or
protein-nucleic acid) and the level of diversity that is desired. A maximum sequence identity
of 50% results in higher diversity than a setting of 95%. The default settings include interfaces of
domain-domain interactions as well as interfaces between interacting chains. All interfaces of
chains that are already covered by the SCOP domain interfaces are excluded by default.
This procedure results in a large number of interfaces that are still diverse enough for
statistical analysis.
What is meant by "show conservation" in Jmol?
The conservation of protein sequences is defined by the mutation rates at each amino acid
position. For JAIL this information was retrieved from ConSurf. ConSurf is a derived
database merging structural and sequence information.
Database scheme
Search for:
PDB-Id, e.g. 1aay
SCOP-Id, e.g. d1az0a_
EC-Number, e.g. 311
Accession number, e.g. P03697
Protein name, e.g. capsid protein
Search in the following interface types:
Domain/Domain (SCOP)
Chain/Chain
Protein/Nucleic
Biounit/Biounit
None
Fulltext search: by keyword
SCOP text search: by keyword
Consider only interfaces having the following location:
intra, inter, don't care
MMDB
Experimentally resolved structures of proteins, RNA and DNA, derived from the Protein Data Bank (PDB), with value-added features such as explicit chemical graphs, computationally
identified 3D domains (compact substructures) that are used to identify similar 3D structures, as well as links to literature, similar sequences, information about chemicals bound to the structures, and more. These connections make it possible, for example, to find 3D structures for homologs of a protein sequence of interest, then interactively view the sequence-structure relationships, active sites, bound chemicals, journal articles, and more.
Three-dimensional structures are now known within many protein families, and it is
quite likely, in searching a sequence database, that one will encounter a homolog
with known structure. The goal of Entrez's 3D-structure database is to make this
information, and the functional annotation it can provide, easily accessible to molecular biologists. To this end, Entrez's search engine provides three powerful
features:
(i) Sequence and structure neighbors: one may select all sequences similar
to one of interest, for example, and link to any known 3D structures.
(ii) Links between databases: one may search by term matching in MEDLINE, for example, and
link to 3D structures reported in these articles.
(iii) Sequence and structure visualization: identifying a homolog with known structure, one may view
molecular-graphic and alignment displays to infer approximate 3D structure.
In this article we focus on two features of Entrez's Molecular Modeling Database (MMDB) not
described previously: links from individual biopolymer chains within 3D structures to a systematic taxonomy of organisms represented in molecular databases, and
links from individual chains (and compact 3D domains within them) to structure
neighbors, other chains (and 3D domains) with similar 3D structure. MMDB may be
accessed at http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Structure
SUPERFAMILY is a database of structural and functional annotation for all
proteins and genomes.[1][2][3][4][5]
The SUPERFAMILY annotation is based on a collection of hidden Markov models, which
represent structural protein domains at the SCOP superfamily level.[6] A superfamily groups
together domains which have an evolutionary relationship. The annotation is produced by
scanning protein sequences from completely sequenced genomes against the hidden Markov
models.
For each protein you can:
Submit sequences for SCOP classification
View domain organisation, sequence alignments and protein sequence details
For each genome you can:
Examine superfamily assignments, phylogenetic trees, domain organisation lists and
networks
Check for over- and under-represented superfamilies within a genome
For each superfamily you can:
Inspect SCOP classification, functional annotation, Gene Ontology annotation, InterPro
abstract and genome assignments
Explore taxonomic distribution of a superfamily across the tree of life
All annotation, models and the database dump are freely available for download to everyone.
Purpose
SUPERFAMILY classifies amino acid sequences into known structural domains, especially
into SCOP superfamilies. The superfamilies are groups of proteins which have structural
evidence to support a common evolutionary ancestor but may not have detectable
sequence homology.
Major Features
Sequence search: Submit your protein or DNA sequence for SCOP superfamily and family level classification.
Keyword search: Search for superfamily, family or species names, plus sequence, SCOP, PDB or hidden Markov model IDs.
Domain assignments: Domain assignments, alignments and architectures for completely sequenced eukaryotic and prokaryotic organisms, plus sequence collections.
Comparative genomics tools: Browse unusual (over- and under-represented) superfamilies and families, adjacent domain pair lists and graphs, unique domain pairs, domain combinations, domain architecture co-occurrence networks and domain distribution across taxonomic kingdoms for each organism.
Genome statistics: For each genome: number of sequences, number of sequences with assignment, percentage of sequences with assignment, percentage total sequence coverage, number of domains assigned, number of superfamilies assigned, number of families assigned, average superfamily size, percentage produced by duplication, average sequence length, average length matched, number of domain pairs and number of unique domain architectures.
Gene Ontology: Domain-centric Gene Ontology (GO), automatically annotated (by Hai Fang).
Phenotype Ontology: Domain-centric phenotype/anatomy ontology including Disease Ontology, Human Phenotype, Mouse Phenotype, Worm Phenotype, Yeast Phenotype, Fly Phenotype, Fly Anatomy, Zebrafish Anatomy, Xenopus Anatomy, Arabidopsis Plant.
Superfamily annotation: InterPro abstracts for 1052 superfamilies and Gene Ontology (GO) annotation for 763 superfamilies.
Functional annotation: Functional annotation of SCOP 1.73 superfamilies (by Christine Vogel).
Phylogenetic trees: Trees are generated using heuristic parsimony methods and are based on protein domain architecture data for all genomes in SUPERFAMILY. Genome combinations or specific clades can be displayed as individual trees.
Similar domain architectures: Find the 10 domain architectures which are most similar to a domain architecture of interest.
Hidden Markov models: Produce SCOP domain assignments for your sequences using the SUPERFAMILY models. HMM visualisation by Martin Madera, e.g. model 0045110.
Profile comparison: Find remote domain matches when the HMM search fails to find a significant match. Profile comparison (PRC), for aligning and scoring two profile hidden Markov models, by Martin Madera.
Web services: Distributed Annotation Server and linking to SUPERFAMILY.
Downloads: Sequences, assignments, models, MySQL database and scripts, updated weekly.
The SUPERFAMILY database provides protein domain assignments, at the SCOP superfamily
level, for the predicted protein sequences in over 400 completed genomes. A superfamily groups
together domains of different families which have a common evolutionary ancestor based on
structural, functional and sequence data. SUPERFAMILY domain assignments are generated
using an expert-curated set of profile hidden Markov models. All models and structural
assignments are available for browsing and download from http://supfam.org. The web interface
includes services such as domain architectures and alignment details for all protein assignments,
searchable domain combinations, domain occurrence network visualization, detection of over- or
under-represented superfamilies for a given genome by comparison with other genomes,
assignment of manually submitted sequences and keyword searches. In this update we describe
the SUPERFAMILY database and outline two major developments: (i) incorporation of family
level assignments and (ii) a superfamily-level functional annotation. The SUPERFAMILY
database can be used for general protein evolution and superfamily-specific studies, genomic
annotation, and structural genomics target suggestion and assessment.
ProDom Protein domain database
Prints Protein motif fingerprint database
Prosite Database of protein families and domains
SMART Simple modular architecture research tool
TIGRFAM Protein families based on HMMs
Others
Phospho Site Database of phosphorylation sites
PROW Protein reviews on the web
Protein Lounge Complete systems biology
Other Databases
Carbohydrate Databases
Carb DB Carbohydrate Sequence and Structure Database
GlycoWord Glycoscience related information
SPECARB Raman Spectra of carbohydrates
Other Databases
AlzGene Alzheimer's disease
Polygenic pathways Alzheimer's disease, Bipolar disorder or Schizophrenia
Model Organism Databases and Resources
Arabidopsis – Bacteria – Sea Bass – Cat – Cattle – Chicken – Cotton – Cyanobacteria – Daphnia – Deer – Dictyostelium – Dog – Frog – Fruit Fly – Fungus –
Goat – Horse – Medaka Fish – Maize – Malaria – Mosquito – Mouse – Pig – Plants –
Protozoa – Puffer Fish – Rat – Rice – Rickettsia – Salmon – Sheep – Soy –
Sorghum – Tetraodon – Tilapia – Turkey – Viruses – Worm – Yeast – Zebra Fish
General Information
GMOD Generic Model Organism Database
Model Organisms The WWW virtual library of model organisms
Arabidopsis thaliana
ABRC Arabidopsis biological resource center
AGI Arabidopsis genome initiative
AREX Arabidopsis gene expression database
Arabinet Arabidopsis information on the www
AtGDB An Arabidopsis thaliana plant genome database
AtGI TIGR Arabidopsis thaliana gene index
ATGC Genome sequencing at ATGC
ATIDB Arabidopsis insertion database
CSHL Arabidopsis genome analysis at Cold Spring
ESSA Arabidopsis thaliana project at MIPS
Genoscope AGI in France
Kazusa Arabidopsis thaliana genome info Japan
MPSS Massively parallel signature sequencing
NASC Nottingham Arabidopsis stock center
Stanford Sequencing of the Arabidopsis genome at Stanford
TAIR Arabidopsis information resource
TIGR TIGR Arabidopsis genome annotation database
Wustl Arabidopsis genome at Washington university
Trees A forest tree genome database
Bacterial genomes
B. subtilis Bacillus subtilis database
Chlamydomonas Chlamydomonas genetics center
E. coli E. coli genome project
MGD Microbial germ plasm database
Microbial Microbial Genome Gateway
Microbial Microbial genomes
Micado Genetic maps of B. subtilis and E. coli
MycDB An integrated Mycobacterial database
Neisseria Neisseria meningitidis genome
Neurospora Neurospora crassa database
OralGen Oral pathogen database
Salmonella Salmonella information
STDGen Sexually transmitted disease database
Bass
Bass Sea Bass Mapping project
Cat (Felis catus)
Cat ArkDB Cat mapping database
Cattle (Bos taurus)
ARK Farm animals
BoLA Bovine MHC information
Bovin Bovine genome database
BovMap Mapping the bovine genome
CaDBase Genetic diversity in cattle
ComRad Comparative radiation hybrid mapping
Cow ArkDB Bovine ArkDB
GemQual Genetics of meat quality
Chicken (Gallus gallus)
Chicken Poultry gene mapping project
ChickMap Chicken genome project
Chicken ArkDB Chicken database
ChickEST Chick EST database
Poultry Poultry genome project
Cotton
Cotton Cotton data collection site
Cyano Bacteria (Blue green algae)
Cyano Bacteria Anabaena genome
Daphnia (Crustacea)
Daphnia pulex Daphnia genomics consortium
Deer
Deer ArkDB Deer mapping database
Dictyostelium discoideum
Dicty_cDB Dictyostelium discoideum cDNA project
DGP Dictyostelium discoideum genome project
Dictybase Online informatics resources for Dictyostelium
Dog (Canis familiaris)
Dog Dog genome project
Dog genome project
Frog (Xenopus)
Xenbase A Xenopus web resource
Xenopus Xenopus tropicalis genome
Fruit fly (Drosophila melanogaster)
ENSEMBL Drosophila Genome Browser at ENSEMBL
Fruitfly Drosophila genome project at Berkeley
FlyBase A Database of the Drosophila Genome
FlyMove A Drosophila multimedia database
FlyView A Drosophila image database
Fungus
Aspergillus Aspergillus Genomics
Candida Candida albicans information page
FungalWeb Fungi database
FGSC Fungal genetic stocks center
Goat (Capra hircus)
Goat GoatMap mapping the caprine genome
Horse (Equus caballus)
Horse ArkDB Horse mapping database
Medaka Fish
Medaka Medaka fish home page
Maize
Maize Maize genome database
Malaria (Plasmodium spp)
Malaria Malaria genetics and genomics
PlasmoDB Plasmodium falciparum genome database
Parasites Parasite databases of clustered ESTs
Parasite Genome Parasite genome databases
Mosquito
Mosquito Mosquito genome web server
Mouse (Mus musculus)
ENSEMBL Mouse genome server at ENSEMBL
Jackson Lab Mouse Resources
MRC Mouse genome center at MRC UK
MGI Mouse genome informatics at Jackson Labs
MGD Mouse genome database
MGS Mouse genome sequencing at NIH
MIT Genetic and physical maps of the mouse genome
Mouse SNP Mouse SNP database
NCI Mouse repository
NIH NIH mouse initiative
ORNL Mutant mouse database
RIKEN Mouse resources
Rodentia The whole mouse catalog
Pig (Sus scrofa)
INCO Pig trait gene mapping
Pig Pig EST database
Pig Pig gene mapping project
PiGBase Pig genome mapping
Pig ArkDB Pig Ark DB
Plants
PlantGDB Resources for plant comparative genomics
Protozoa
Protozoa Protozoan genomes
Pufferfish
Fugu Puffer fish project UK site
Fugu Fugu genome project Singapore
Fugu Puffer fish project USA
Rat (Rattus norvegicus)
MIT Genetic maps of the Rat genome
NIH Rat genomics and genetics
Rat RatMap
RGD Rat genome database
Rice (Oryza sativa)
MPSS Massively parallel signature sequencing
Rice-research Rice genome sequence database
Rice Rice genome project
Rickettsia
RicBase Rickettsia genome database
Salmon
Salmon ArkDB Salmon mapping database
Sheep (Ovis aries)
Sheep Sheep gene mapping
SheepBase Sheep gene mapping
Sheep ArkDB Sheep mapping database
Soy
Soy Soybeans database
Sorghum
Sorghum Sorghum Genomics
Tetraodon
Tetraodon Tetraodon nigroviridis genome
Tetraodon Tetraodon nigroviridis genome at Whitehead
Tilapia
HCGS Tilapia genome
Tilapia ArkDB Tilapia mapping database
Turkey
Turkey ArkDB Turkey mapping database
Viruses
HIV HIV sequence database
Herpes Human herpes virus 5 database
Worm (Caenorhabditis elegans)
C. elegans C. elegans genome sequencing project
NemBase Resource for nematode sequence and functional data
WormAtlas Anatomy of C. elegans
WormBase The genome and biology of C. elegans
ACEDB A C. elegans database
WWW Server C. elegans web server
Yeast
SCPD The promoter database of Saccharomyces cerevisiae
SGD Saccharomyces genome database
S. pombe Schizosaccharomyces pombe genome project
TRIPLES Functional analysis of Yeast genome at Yale
Yeast Intron database Spliceosomal introns of the yeast
Zebra fish (Danio rerio)
ZFIN Zebrafish information network
ZGR Zebrafish genome resources
ZIS Zebrafish information server
Zebrafish Zebrafish webserver
DOMAIN DATABASE
Domains can be thought of as distinct functional and/or structural units of a
protein. These two classifications coincide rather often, as a matter of fact, and what is
found as an independently folding unit of a polypeptide chain often also carries a specific
function. Domains are often identified as recurring (sequence or structure) units,
which may exist in various contexts. In molecular evolution such domains may have been utilized as building blocks, and may have been recombined in different
arrangements to modulate protein function. We can define conserved domains as
recurring units in molecular evolution, the extents of which can be determined by
sequence and structure analysis. Conserved domains contain conserved sequence patterns or motifs, which allow for their detection in polypeptide sequences.
The goal of the NCBI conserved domain curation project is to provide database users with
insights into how patterns of residue conservation and divergence in a family relate to functional
properties, and to provide useful links to more detailed information that may help to understand
those sequence/structure/function relationships. To do this, CDD curators include the following
types of information in order to supplement and enrich the traditional multiple sequence
alignments that form the foundation of domain models: 3-dimensional structures and conserved core motifs, conserved features/sites, phylogenetic organization, and links to electronic literature
resources.
CDD
Conserved Domain Database (CDD)
CDD is a protein annotation resource that consists of a collection of well-annotated multiple sequence alignment models for ancient domains and full-length proteins. These are available as position-specific score matrices (PSSMs) for fast
identification of conserved domains in protein sequences via RPS-BLAST. CDD content includes NCBI manually curated domains, which use 3D-structure information to explicitly define domain boundaries and provide insights into sequence/structure/function relationships, as well as domain models imported from a number of external source databases (Pfam, SMART, COG, PRK, TIGRFAM).
CD-Search & Batch CD-Search
CD-Search is NCBI's interface for searching the Conserved
Domain Database with protein or nucleotide query sequences. It uses RPS-BLAST, a variant of PSI-BLAST, to
quickly scan a set of pre-calculated position-specific scoring matrices (PSSMs) with a protein query. The results of CD-Search are presented as an annotation of protein domains on the user query sequence and can be visualized as domain multiple sequence alignments with embedded user queries. High-confidence associations between a query sequence and conserved domains are shown as specific hits. The CD-Search Help provides additional details, including
information about running CD-Search locally.
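To make the idea of scoring a query against a position-specific scoring matrix concrete, here is a minimal sketch. It is not RPS-BLAST, which is a heuristic search program. It only slides a toy PSSM along a query without gaps and reports the best-scoring window. The matrix values, the default score for unlisted residues and the function names are assumptions made for illustration.

```python
# Toy PSSM: one dict of residue -> score per model position (hypothetical values)
PSSM = [
    {"G": 5, "A": 1},
    {"K": 4, "R": 2},
    {"S": 3, "T": 3},
]
DEFAULT = -2  # assumed score for residues not listed at a position


def window_score(pssm, segment):
    """Sum the position-specific scores of one ungapped window."""
    return sum(col.get(res, DEFAULT) for col, res in zip(pssm, segment))


def best_hit(pssm, query):
    """Slide the PSSM along the query and return (start, score) of the
    best-scoring window, or None if the query is shorter than the model."""
    width = len(pssm)
    if len(query) < width:
        return None
    scored = [(window_score(pssm, query[i:i + width]), i)
              for i in range(len(query) - width + 1)]
    score, start = max(scored, key=lambda t: t[0])
    return start, score


print(best_hit(PSSM, "MAGKSLL"))
```

Unlike a plain substitution matrix, the score for a residue depends on the position in the model, which is what lets a PSSM capture the conservation pattern of a domain family.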
Batch CD-Search serves as both a web application and a script interface for a conserved domain search on multiple protein sequences, accepting up to 100000 proteins in a single job. It enables you to view a graphical display of the concise or full search result for any individual
protein from your input list, or to download the results for the complete set of proteins. The Batch CD-Search Help provides additional details.
CDART: Domain Architectures
The Conserved Domain Architecture Retrieval Tool (CDART) performs similarity searches of the Entrez Protein database based on domain architecture, defined as the sequential order of conserved domains in protein
queries. CDART finds protein similarities across significant evolutionary distances using sensitive domain profiles rather than direct sequence similarity. Proteins similar to
the query are grouped and scored by architecture. You can search CDART directly with a query protein sequence or, if a sequence of interest is already in the Entrez Protein database, simply retrieve the record, open its Links
menu and select Domain Relatives to see the precalculated CDART results. Relying on domain profiles allows CDART to be fast and, because it relies on annotated functional domains, informative.
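The notion of a domain architecture as the sequential order of domains can be sketched in a few lines. The example below only groups proteins with identical architectures. CDART's actual similarity scoring is not described in these notes and is not reproduced here. The protein ids, domain positions and function names are hypothetical.

```python
from collections import defaultdict


def architecture(domain_hits):
    """Domain architecture = domain names in sequence order.
    domain_hits: list of (start_position, domain_name) for one protein."""
    return tuple(name for _, name in sorted(domain_hits))


def group_by_architecture(proteins):
    """Map each architecture to the list of proteins that share it."""
    groups = defaultdict(list)
    for pid, hits in proteins.items():
        groups[architecture(hits)].append(pid)
    return dict(groups)


proteins = {  # hypothetical domain annotations
    "P1": [(120, "Kinase"), (10, "SH3"), (60, "SH2")],
    "P2": [(5, "SH3"), (70, "SH2"), (150, "Kinase")],
    "P3": [(8, "SH2"), (90, "Kinase")],
}
groups = group_by_architecture(proteins)
for arch, members in groups.items():
    print(" - ".join(arch), members)
```

Because only domain order is compared, P1 and P2 fall into the same group even if their sequences have diverged, which is the point of searching by architecture rather than by direct sequence similarity.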
CDTree
CDTree is a helper application for your web browser that allows you to interactively view and examine conserved domain hierarchies curated at NCBI. CDTree works with Cn3D as its alignment viewer/editor. It is used in the CDD curation process and is both a classification and a research tool for functional annotation and the study of
protein and protein domain families.
Content
CDD content includes NCBI manually curated domain models and domain models imported from
a number of external source databases (Pfam, SMART, COG, PRK, TIGRFAMs). What is unique
about NCBI-curated domains is that they use 3D-structure information to explicitly define domain
boundaries, align blocks, amend alignment details, and provide insights into
sequence/structure/function relationships. Manually curated models are organized hierarchically if
they describe domain families that are clearly related by common descent. To provide a
non-redundant view of the data, CDD clusters similar domain models from various sources into
superfamilies.
Searching the database
The collection is also part of NCBI's Entrez query and retrieval system, crosslinked to numerous other resources. CDD provides annotation of domain footprints and conserved functional sites on protein sequences. Precalculated domain annotation can be retrieved for protein sequences tracked in NCBI's Entrez system, and CDD's collection of models can be queried with novel protein sequences via the CD-Search service, or via Batch CD-Search, which allows the computation and download of
annotation for large sets of protein queries. CDD also contains data from additional research projects, such as KOGs (a
eukaryotic counterpart to COGs) and the Library of Ancient Domains (LOAD). Accessions that start with "cl" are for superfamily cluster records and can contain domain models from one or more source databases. When searching CDD, it is possible to limit search results to domains from any given source database by using the Database Search Field.
Phylogenetic organization: Based on evidence from sequence comparison, NCBI Conserved Domain curators attempt to organize related domain models into
phylogenetic family hierarchies.
Links to electronic literature resources: NCBI-curated domains also provide links to citations in PubMed and NCBI Bookshelf that discuss the domain. These references are selected by curators and, whenever possible, include articles that provide evidence for the biological function of the domain and/or discuss the evolution and classification of a domain family.
PFAM
Pfam 27.0 (March 2013, 14831 families)
Proteins are generally composed of one or more functional regions, commonly termed domains. The presence of different domains in varying combinations in different proteins gives rise to the diverse repertoire of proteins found in nature. Identifying the domains present in a protein can provide insights into the function of that protein.
The Pfam database is a large collection of protein domain families. Each family is represented by multiple sequence alignments and hidden Markov models (HMMs).
There are two levels of quality to Pfam families: Pfam-A and Pfam-B. Pfam-A entries are derived from the underlying sequence database, known as Pfamseq, which is built from the most recent release of UniProtKB at a given time-point. Each Pfam-A family consists of a curated seed alignment containing a small set of representative members of the family, profile hidden Markov models (profile HMMs) built from the seed alignment, and an automatically generated full alignment, which contains all detectable protein sequences belonging to the family, as defined by profile HMM searches of primary sequence databases.
Pfam-B families are un-annotated and of lower quality, as they are generated automatically from the non-redundant clusters of the latest ADDA release. Although of lower quality, Pfam-B families can be useful for identifying functionally conserved regions when no Pfam-A entries are found.
Pfam entries are classified in one of four ways:
Family: A collection of related protein regions
Domain: A structural unit
Repeat: A short unit which is unstable in isolation but forms a stable structure when multiple copies are present
Motif: A short unit found outside globular domains
Pfam also generates higher-level groupings of related families, known
as clans. A clan is a collection of Pfam-A entries which are related by
similarity of sequence, structure or profile-HMM.
You can find data in Pfam in various ways:
Sequence search: Analyze your protein sequence for Pfam matches
View a Pfam family: View Pfam family annotation and alignments
View a clan: See groups of related families
View a sequence: Look at the domain organisation of a protein sequence
View a structure: Find the domains on a PDB structure
Keyword search: Query Pfam by keywords
Jump to: Enter any type of accession or ID to jump to the page for a Pfam family or clan, UniProt sequence, PDB structure, etc.
Pfam can also be browsed through lists of families, clans or proteomes which begin with a chosen letter (or number), through the list of Pfam families which are new to this release, or through the list of the twenty largest families in terms of number of sequences.
Glossary of terms used in Pfam
These are some of the commonly used terms in the Pfam website
Alignment coordinates
HMMER3 reports two sets of domain coordinates for each profile HMM match. The envelope coordinates delineate the region on the sequence where the match has been probabilistically determined to lie, whereas the alignment coordinates delineate the region over which HMMER is confident that the alignment of the sequence to the profile HMM is correct. Our full alignments contain the envelope coordinates from HMMER3.
Architecture
The collection of domains that are present on a protein
Clan
A collection of related Pfam entries. The relationship may be defined by similarity of sequence, structure or profile-HMM.
Domain
A structural unit
Domain score
The score of a single domain aligned to an HMM. Note that, for HMMER2, if
there was more than one domain, the sequence score was the sum of all the domain scores for that Pfam entry. This is not quite true for HMMER3.
DUF
Domain of unknown function
Envelope coordinates
See Alignment coordinates
Family
A collection of related protein regions
Full alignment
An alignment of the set of related sequences which score higher than the manually set threshold values for the HMMs of a particular Pfam entry
Gathering threshold (GA)
Also called the gathering cut-off, this value is the search threshold used to build the full alignment. The gathering threshold is assigned by a curator when the family is built. The GA is the minimum score a sequence must attain in order to belong to the full alignment of a Pfam entry. For each Pfam HMM we have two GA cutoff values, a sequence cutoff and a domain cutoff.
HMMER
The suite of programs that Pfam uses to build and search HMMs. Since Pfam release 24.0 we have used HMMER version 3 to make Pfam. See the HMMER site for more information.
Hidden Markov model (HMM)
A HMM is a probabilistic model. In Pfam we use HMMs to transform the information contained within a multiple sequence alignment into a position-specific scoring system. We search our HMMs against the UniProt protein database to find homologous sequences.
HMMER3
The suite of programs that Pfam uses to build and search HMMs. See the HMMER site for more information.
iPfam
A resource that describes domain-domain interactions that are observed in PDB entries. Where two or more Pfam domains occur in a single structure, it analyses them to see if they are close enough to form an interaction. If they are close enough, it calculates the bonds forming the interaction.
Metaseq
A collection of sequences derived from various metagenomics datasets
Motif
A short unit found outside globular domains
Noise cutoff (NC)
The bit scores of the highest scoring match not in the full alignment
Pfam-A
A HMM-based, hand-curated Pfam entry which is built using a small number of
representative sequences. We manually set a threshold value for each HMM and search our models against the UniProt database. All of the sequences which score above the threshold for a Pfam entry are included in the entry's full alignment.
Pfam-B
An automatically generated alignment which is formed by taking a cluster of sequences from the ADDA database and removing Pfam-A residues from them. Since Pfam-B families are automatically generated, we recommend that you verify that the sequences in a Pfam-B family are related using other methods, such as BLAST. For Pfam 24.0 we have made HMMs for the first (and therefore largest) 20000 Pfam-B families. Users can search their sequences against the Pfam-B HMMs, in addition to the Pfam-A HMMs, when performing both single-sequence searches and batch searches on the website.
Posterior probability
HMMER3 reports a posterior probability for each residue that matches a match or insert state in the profile HMM. A high posterior probability shows that the alignment of the amino acid to the match/insert state is likely to be correct, whereas a low posterior probability indicates that there is alignment uncertainty. This is indicated on a scale with "*" being 10, the highest certainty, down to 1 being complete uncertainty. Within Pfam we display this information as a heat map view, where green residues indicate high posterior probability and red ones indicate a lower posterior probability.
Repeat
A short unit which is unstable in isolation but forms a stable structure when
multiple copies are present
Seed alignment
An alignment of a set of representative sequences for a Pfam entry We usethis alignment to construct the HMMs for the Pfam entry
Sequence score
The total score of a sequence aligned to a HMM If there is more than onedomain the sequence score is the sum of all the domain scores for that Pfamentry If there is only a single domain the sequence and the domains scorefor the protein will be identical We use the sequence score to determinewhether a sequence belongs to the full alignment of a particular Pfam entry
Trusted cutoff (TC)
The bit score of the lowest-scoring match in the full alignment.
SMART (a Simple Modular Architecture Research Tool) allows the identification and annotation of genetically mobile domains and the analysis of domain architectures. More than 500 domain families found in signalling, extracellular and chromatin-associated proteins are detectable. These domains are extensively annotated with respect to phyletic distributions, functional class, tertiary structures and functionally important residues. Each domain found in a non-redundant protein database, as well as search parameters and taxonomic information, are stored in a relational database system. User interfaces to this database allow searches for proteins containing specific combinations of domains in defined taxa.
Features
You can use SMART in two different modes: normal or genomic. The main difference is in the underlying protein database used. In Normal SMART, the database contains Swiss-Prot, SP-TrEMBL and stable Ensembl proteomes. In Genomic SMART, only the proteomes of completely sequenced genomes are used: Ensembl for metazoans and Swiss-Prot for the rest.
The protein database in Normal SMART has significant redundancy, even though identical proteins are removed. If you use SMART to explore domain architectures, or want to find exact domain counts in various genomes, consider switching to Genomic mode. The numbers in the domain annotation pages will be more accurate, and there will not be many protein fragments corresponding to the same gene in the architecture query results. We should remember, though, that we are exploring a limited set of genomes.
Different color schemes are used to easily identify the mode we are in
Normal mode Genomic mode
Alignment
Representation of a prediction of the amino acids in tertiary structures of homologues that overlay in three dimensions. Alignments held by SMART are mostly based on published observations (see domain annotations for details), but are updated and edited manually.
Alignment block
Ungapped alignments that usually represent a single secondary structure
Bits scores
Alignment scores are reported by HMMer and BLAST as bits scores. The likelihood that the query sequence is a bona fide homologue of the database sequence is compared to the likelihood that the sequence was instead generated by a random model. Taking the logarithm (to base 2) of this likelihood ratio gives the bits score.
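As a small worked example of this definition, the sketch below computes a bits score from two likelihoods. The function name and the example likelihood values are illustrative assumptions; real tools work with log-probabilities internally rather than raw likelihoods.

```python
import math


def bits_score(likelihood_homologue, likelihood_random):
    """Base-2 logarithm of the likelihood ratio (homologue model vs random model)."""
    return math.log2(likelihood_homologue / likelihood_random)


# If the homology model explains the sequence 1024 times better than
# the random model, the score is log2(1024) = 10 bits.
print(bits_score(2 ** -10, 2 ** -20))
```

Each additional bit therefore corresponds to a doubling of the likelihood ratio in favour of homology.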
BLAST Basic local alignment search tool
An excellent database searching tool developed at the National Center for Biotechnology Information (NCBI) ([1] [2] [3] [4]). SMART uses NCBI-BLAST for detection of outlier homologues and homologues of known structure. WU-BLAST is used for nrdb searches with user-supplied sequences.
Cellular Role
Chromatin-associated: Chromatin is the tangled fibrous complex of DNA and protein within a eukaryotic nucleus.
Interaction (with the environment): Molecules that sense cellular environmental change, such as osmolarity, light flux, acidity, ion concentration, etc.
Metabolic: Enzymes that catalyze reactions in living cells that transform organic molecules.
Replication: The process of making an identical copy of a section of duplex (double-stranded) DNA, using existing DNA as a template for the synthesis of new DNA strands.
Signalling: Proteins that participate in a pathway that is initiated by a molecular stimulus and terminated by a cellular response.
Transport: Transporters (permeases) are proteins that straddle the cell membrane and carry specific nutrients, ions, etc. across the membrane.
Translation: The process in which the genetic code carried by messenger RNA directs the synthesis of proteins from amino acids.
Transcription: The synthesis of an RNA copy from a sequence of DNA (a gene); the first step in gene expression.
Coiled coils
Intimately-associated bundles of long alpha-helices ([1] [2] [3]). Coiled coils are detected in SMART using the method of Lupas et al. ([4], COILS home). Coiled coil predictions are indicated on the second line in SMART's graphical output.
Domain
Conserved structural entities with distinctive secondary structure content and a hydrophobic core. In small disulphide-rich and Zn2+-binding or Ca2+-binding domains, the hydrophobic core may be provided by cystines and metal ions, respectively. Homologous domains with common functions usually show sequence similarities.
Domain composition
Proteins with the same domain composition have at least one copy of each of the domains of the query.
Domain organisation
Proteins having all the domains of the query in the same order. (Additional domains are allowed.)
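The difference between the two definitions can be sketched as follows, treating each protein as an ordered list of domain names. The function names and example architectures are illustrative assumptions and do not reflect how SMART implements these queries.

```python
def same_composition(query, protein):
    """True if the protein has at least one copy of every domain in the query."""
    return set(query) <= set(protein)


def same_organisation(query, protein):
    """True if the query domains occur in the protein in the same order.

    Additional domains may be interspersed, so this is a subsequence test.
    """
    remaining = iter(protein)
    # 'domain in remaining' advances the iterator until it finds the domain.
    return all(domain in remaining for domain in query)


query = ["SH3", "SH2", "TyrKc"]
shuffled = ["SH2", "SH3", "TyrKc"]                # same domains, different order
extended = ["PH", "SH3", "SH2", "C1", "TyrKc"]    # same order, extra domains

print(same_composition(query, shuffled), same_organisation(query, shuffled))
print(same_composition(query, extended), same_organisation(query, extended))
```

Every protein with the same organisation also has the same composition, but not the other way round.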
Entrez
A WWW-based system that allows easy retrieval of sequence, structure, molecular biology and literature data (Entrez). SMART's domain annotation pages contain links to the Entrez system, thereby providing extensive literature, structure and sequence information.
E-value
This represents the number of sequences with a score greater than or equal to X expected absolutely by chance. The E-value connects the score (X) of an alignment between a user-supplied sequence and a database sequence, generated by any algorithm, with how many alignments with similar or greater scores would be expected from a search of a random sequence database of equivalent size. Since version 2.0, E-values are calculated using Hidden Markov Models, leading to more accurate estimates than before.
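One common way to think about this quantity is sketched below: if a single random sequence reaches score X with some small probability, the expected number of such chance hits grows linearly with database size. The per-sequence probability, the database size and the function names are assumptions for illustration; the Poisson conversion from an E-value to the probability of at least one chance hit is the standard relation used by BLAST-style statistics, not something stated in these notes.

```python
import math


def e_value(per_sequence_probability, database_size):
    """Expected number of random sequences scoring at least X in the database."""
    return per_sequence_probability * database_size


def p_value_from_e(e):
    """Probability of at least one chance hit, assuming Poisson-distributed hits."""
    return -math.expm1(-e)   # 1 - exp(-e), computed accurately for small e


e = e_value(1e-6, 2_000_000)   # hypothetical numbers
print(e, p_value_from_e(e))
```

For small values the two numbers are nearly equal; they diverge once the E-value approaches or exceeds 1.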
Extracellular Domains
Domain families that are most prevalent in proteins outside of the cytoplasm and the nucleus
Gap
A position in an alignment that represents a deletion within one sequence relative to another. Gap penalties are requirements for alignment algorithms in order to reduce excessively-gapped regions. Gaps in alignments represent insertions that usually occur in protruding loops or beta-bulges within protein structures.
Genomic database
Protein database used in SMART's Genomic mode. It contains data from completely sequenced genomes only. Ensembl data is used for Metazoan genomes and Swiss-Prot for others. A complete list of genomes in the database is available.
HMM Hidden Markov model
HMMs are statistical models of the sequence consensus of a homologous family (see the docu). A particular class of HMMs has been shown to be equivalent to generalised profiles (8867839). Applications of HMMs to sequence analysis are nicely provided by HMMer and SAM.
HMM consensus
The HMM consensus is a one-line summary of the corresponding HMM. The amino acid shown for the consensus is the highest probability amino acid at that position according to the HMM. Capital letters mean highly conserved residues (probability > 0.5 for protein models) (modified from the HMMer User's Guide).
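The rule just described can be sketched in a few lines. The input format (one dictionary of emission probabilities per match position) and the function name are assumptions for illustration; HMMer stores its models differently.

```python
def hmm_consensus(match_emissions, threshold=0.5):
    """Build a one-line consensus from per-position emission probabilities.

    match_emissions: list of dicts mapping amino acid letter -> probability.
    The most probable residue is shown at each position, in upper case
    if its probability exceeds the threshold and in lower case otherwise.
    """
    letters = []
    for probabilities in match_emissions:
        residue, p = max(probabilities.items(), key=lambda item: item[1])
        letters.append(residue.upper() if p > threshold else residue.lower())
    return "".join(letters)


# Hypothetical three-position model.
model = [
    {"A": 0.70, "G": 0.30},              # strongly conserved alanine
    {"L": 0.40, "I": 0.35, "V": 0.25},   # weakly conserved hydrophobic
    {"K": 0.90, "R": 0.10},              # strongly conserved lysine
]
print(hmm_consensus(model))   # AlK
```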
HMMer
The HMMer package ([1] [2]) provides multiple alignment and database searching capabilities. There are several programs in the package (see the docu), including one (hmmfs) that searches databases for non-overlapping LOCAL similarities (i.e. that match across at least part of the HMM) and another (hmmls) that searches databases for non-overlapping GLOBAL similarities (i.e. that match across the full HMM). These correspond approximately to profile-based searches using negative and positive profiles, respectively (see WiseTools). Database searches using hmmls or hmmfs provide alignment scores as bits scores.
Homology
Evolutionary descent from a common ancestor due to gene duplication
Intracellular Domains
Domain families that are most prevalent in proteins within the cytoplasm
Localisation
Numbers of domains that are thought, from SwissProt annotations, to be present in different cellular compartments (cytoplasm, extracellular space, nucleus and membrane-associated) are shown in annotation pages.
Motif
Sequence motifs are short conserved regions of polypeptides. Sets of sequence motifs need not necessarily represent homologues.
NRDB non-redundant database
A database that contains no identical pairs of sequences. It can contain multiple sequences originating from the same gene (fragments, alternative splicing products). SMART in Standard mode uses such a database.
ORF
Open reading frame
Outlier homologues
These are often difficult to detect using HMM methodology. A complementary approach to their detection is to query a database of sequences taken from multiple sequence alignments using BLAST.
Selecting this option will also activate searches against sequence databases derived from proteins of known structure. A simple BLAST search of the PDB is performed, together with a search of RPS_Blast profiles derived from SCOP. These profiles were kindly provided by Steffen Schmidt (see Schmidt et al., J Chem Inf Comput Sci 2002 (42) 405-7).
P-value
This represents the probability that, given a database of a particular size, random sequences score higher than a value X. P-values are generated by the BLAST algorithm that has been integrated into SMART.
PDB protein data bank
PDB is an archive of experimentally-determined three-dimensional structures (Brookhaven Nat. Labs or EBI). Domain families represented in SMART and in the PDB are annotated as being of known structure; links are provided in SMART to the PDB via PDBsum and MMDB. PDBsum links can be used to access a variety of sequence-based and structure-based tools, whereas MMDB provides access to literature information and structure similarities.
PFAM
Pfam is a database of protein domain families represented as (i) multiple alignments and (ii) HMM-profiles ([1] [2]). Pfam WWW servers allow comparison of user-supplied sequences with the Pfam database (Sanger Center and Washington Univ.).
SMART contains a facility to search the Pfam collection using HMMer.
Prokaryotic domains
SMART now also searches for domains found in two-component regulatory systems. These can be found mainly in prokaryotes, but a few were also found in eukaryotes such as yeast and plants.
Profile
A profile is a table of position-specific scores and gap penalties, representing a homologous family, that may be used to search sequence databases (Ref [1] [2] [3]).
In CLUSTAL-W-derived profiles, those sequences that are more distantly related are assigned higher weights ([4] [5] [6]). Issues in profile-based database searching are discussed in Bork & Gibson (1996) [7].
ProfileScan
An excellent WWW server that allows a user to compare a protein or DNA sequence against a database of profiles (located at the ISREC).
PROSITE
This is a dictionary of protein sites and motif patterns. Some SMART domain annotations contain links to PROSITE.
Schnipsel database domain sequence database
Schnipsel is a German word meaning snippet or fragment. The schnipsel database consists of the sequences of all domains found with SMART in NRDB. Outliers of a family often cannot be detected by a profile, yet are detectable by pairwise similarity to one or more established members of a sequence family. So searching against the schnipsel database gives complementary information to the profile searches.
Searched domains
In the first version of SMART, only eukaryotic signalling domains could be searched. In 1998 we extended this set with prokaryotic signalling and extracellular domains. On the input page you can choose whether you want to search only cytoplasmic (prokaryotic + eukaryotic), extracellular, or all domains.
Secondary Literature
The secondary literature is derived by the following procedure. For each of the hand-selected papers referenced by a domain, 100 neighbouring papers are retrieved using Medline. If one of these neighbouring papers is referenced from more than two original papers, it is included in the secondary literature list.
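The procedure above amounts to counting how many hand-selected papers share each neighbouring paper. A minimal sketch follows; it assumes the neighbour lists have already been retrieved, and the function name, parameter names and identifiers are hypothetical.

```python
from collections import Counter


def secondary_literature(neighbours_by_paper, max_neighbours=100, min_sources=3):
    """Select neighbouring papers shared by more than two original papers.

    neighbours_by_paper: dict mapping each hand-selected paper to its
    ordered list of neighbouring papers (already retrieved elsewhere).
    """
    sources = Counter()
    for neighbours in neighbours_by_paper.values():
        # Count each neighbour at most once per original paper.
        for neighbour in set(neighbours[:max_neighbours]):
            sources[neighbour] += 1
    return sorted(n for n, count in sources.items() if count >= min_sources)


# Hypothetical identifiers: "n1" is a neighbour of three original papers.
example = {
    "paperA": ["n1", "n2"],
    "paperB": ["n1", "n2"],
    "paperC": ["n1", "n3"],
}
print(secondary_literature(example))   # ['n1']
```

"More than two" original papers is expressed here as `min_sources=3`.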
Seed Alignment
Alignment that contains only one of each pair of homologues that are represented in a CLUSTALW-derived phylogenetic tree linked by a branch of length less than a distance of 0.2 (see the related article).
SEG
A program of Wootton & Federhen [1] that detects regions of the query sequence that have low compositional complexity [2].
Sequence ID or ACC
Sequence identifiers or accession codes may be entered via the SMART homepage to initiate a query. You can use either Uniprot or Ensembl sequence identifiers.
Signalling Domains
The original set of domains used in SMART were collected as those that satisfied one or both of two criteria:
1. cytoplasmic domains that possess kinase, phosphatase, ubiquitin ligase or phospholipase enzymatic activities, or those that stimulate GTPase-activation or guanine nucleotide exchange
2. cytoplasmic domains that occur in at least two proteins with different domain organisations, of which one also contains a domain that satisfies criterion 1
These domains mostly mediate or regulate the transduction of an extracellular signal towards the nucleus, resulting in the initiation of a cellular response. More recently, prokaryotic two-component signalling domains have been added to the SMART set.
SignalP
This program predicts the presence and location of signal peptide cleavage sites in amino acid sequences (SignalP home page).
SMART Simple Modular Architecture Research Tool
Our main goal in providing this tool is to allow automatic identification and annotation of domains in user-supplied protein sequences.
Species
Numbers of domains present in a variety of selected taxa (animal, archaea, bacteria, fungi, plants and protozoa) are shown in annotation pages.
SwissProt
The SwissProt database is an extensively annotated and non-redundant collection of protein sequences. SwissProt annotations have been mined for SMART-derived annotations of alignments.
TMHMM2
This program predicts the location and topology of transmembrane helices (TMHMM2)
Thresholds
For each of the domains found by SMART, a combination of thresholds is used to distinguish between true and false hits. The different thresholds are described in the SMART paper.
WiseTools
A package that is based on database searches using profiles. Profiles may be generated using PairWise and then compared with sequence databases using SearchWise. Scores are generated for alignments that match the whole of the profile (using a positive profile) or else that match at least part of the profile (using a negative profile). Only the top-scoring optimal alignment of each
sequence is reported; hence SMART relies on re-iterating the search for new repeats until none are reported that score above threshold. Score thresholds have been set manually that are considered to represent a score just above the top-scoring true negative. Additional thresholds have been estimated for domains that are repeated in single polypeptides. A more recent package allows comparison of DNA sequences at the level of their conceptual translations, regardless of sequence error and introns (see Wise2).
What's new
Changes from version 6.0 to 7.0
Full text search engine
Perform a full text search of SMART and Pfam domain annotations, plus the complete protein descriptions for Uniprot and Ensembl proteins.
metaSMART
Explore and compare domain architectures in various publicly available metagenomics datasets.
iTOL export and visualization
Domain architecture analysis results can be exported and visualized in interactive Tree Of Life. The new option can be found in the protein list function select list.
User interface cleanup
Various small changes to the UI, resulting in faster and easier navigation.
Changes from version 5.1 to 6.0
Metabolic pathways information
SMART domains and proteins available in the genomic mode now have basic metabolic pathways information. It is generated by mapping our genomic mode protein database to the KEGG orthologous groups.
Updated genomic mode protein database
The greatly expanded genomic mode database now includes sequences from 630 completely sequenced species.
Changes from version 5.0 to 5.1
SMART webservice
You can access SMART using our webservice. Check the WSDL file for details. The webservice is still under active development and only works for sequences and IDs which are in our database (complete Uniref100 and stable Ensembl genomes).
SMART DAS server
The SMART DAS server is available at the URL http://smart.embl.de/smart/das. It provides all the protein features from our database (SMART domains, signal peptides, transmembrane regions and coiled coils) for all Uniref100 and Ensembl proteins.
If you need help with these services, or have questions or feedback, please contact us.
Changes from version 4.1 to 5.0
New protein database in Normal mode
SMART now uses Uniprot as the main source of protein sequences. All Ensembl proteomes (except pre-releases) are also included. To lower redundancy in the database, the following procedure is used:
o only one copy of 100% identical proteins is kept (different IDs are still available)
o each species' proteins are separated
o CD-HIT clustering with a 96% identity cutoff is performed on each species separately
o the longest member of each protein cluster is used as the representative
o only representative cluster members and single proteins (i.e. proteins which are not members of any clusters) are used in all domain architecture queries and for domain counts in the annotation pages
Even though the number of proteins in the database has almost doubled (the current version has around 29 million proteins), the redundancy should be minimal.
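Two of the steps in this procedure, collapsing identical sequences and choosing the longest member of each cluster, can be sketched as follows. The clustering itself is assumed to have been done already by an external tool such as CD-HIT; the function names, identifiers and data layout are illustrative assumptions.

```python
def collapse_identical(proteins):
    """Group protein IDs by sequence so each distinct sequence is kept once.

    proteins: dict mapping protein ID -> sequence.
    Returns a dict mapping sequence -> list of all IDs sharing it, so
    the alternative IDs remain available.
    """
    ids_by_sequence = {}
    for protein_id, sequence in proteins.items():
        ids_by_sequence.setdefault(sequence, []).append(protein_id)
    return ids_by_sequence


def cluster_representatives(clusters, lengths):
    """Pick the longest member of each cluster as its representative.

    clusters: list of lists of protein IDs (singletons are their own cluster).
    lengths: dict mapping protein ID -> sequence length.
    """
    return [max(cluster, key=lambda pid: lengths[pid]) for cluster in clusters]


proteins = {"P1": "MKT", "P2": "MKT", "P3": "MKTAYI"}
print(collapse_identical(proteins))
print(cluster_representatives([["P1", "P3"], ["P9"]], {"P1": 3, "P3": 6, "P9": 120}))
```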
Domain architecture invention dating
As a further step from the single domain to the understanding of multi-domain proteins, SMART now predicts the taxonomic class where the concept of a protein, that is its domain architecture, was invented. The domain architecture is defined as the linear order of all SMART domains in the protein sequence. To derive the point of its invention, all proteins with the same domain architecture are mapped onto NCBI's taxonomy. The last common ancestor of all organisms containing at least one protein with the domain architecture is defined as the point of its origin.
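The last-common-ancestor step can be sketched as finding the deepest taxon shared by all lineages. The lineages below are abbreviated, hand-written examples rather than real NCBI taxonomy records, and the function name is an assumption.

```python
def last_common_ancestor(lineages):
    """Return the deepest taxon shared by all lineages, or None if none is shared.

    lineages: list of lists of taxon names, each ordered from the root
    of the taxonomy down to the organism.
    """
    ancestor = None
    for taxa_at_depth in zip(*lineages):
        if len(set(taxa_at_depth)) != 1:
            break   # the lineages diverge at this depth
        ancestor = taxa_at_depth[0]
    return ancestor


# Organisms containing a protein with the same (hypothetical) architecture.
lineages = [
    ["cellular organisms", "Eukaryota", "Metazoa", "Chordata", "Homo sapiens"],
    ["cellular organisms", "Eukaryota", "Metazoa", "Arthropoda", "Drosophila melanogaster"],
    ["cellular organisms", "Eukaryota", "Fungi", "Saccharomyces cerevisiae"],
]
print(last_common_ancestor(lineages))   # Eukaryota
```

Adding a bacterial lineage to the list would move the predicted point of origin up to "cellular organisms".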
Changes from version 4.0 to 4.1
Two modes of operation: Normal and Genomic
For more details, visit the change mode page.
Intrinsic protein disorder prediction
You can now include DisEMBL prediction of protein disorder in your search parameters. The results are displayed as blue regions in protein schematics. DisEMBL's HOTLOOPS and REM465 methods are currently used. Visit the DisEMBL page for more information on the method.
Catalytic activity check
SMART now includes data on specific requirements for catalytic activity for some of our domains (50 catalytic domains at the moment, full list here). If the required amino acids are not present in the predicted domain, it will be marked as Inactive. The domain annotation page will show you the details on which amino acids are missing and the links to relevant literature (Pubmed link).
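A check of this kind could look roughly like the sketch below. How SMART actually encodes the required residues is not described here, so the representation (1-based positions within the predicted domain mapped to allowed residues), the function name and the example data are all assumptions.

```python
def missing_catalytic_residues(domain_sequence, required):
    """List required positions whose residue is absent or not allowed.

    domain_sequence: amino acid sequence of the predicted domain.
    required: dict mapping 1-based position -> string of allowed residues.
    Returns a list of (position, observed residue) pairs; '-' marks a
    position beyond the end of the sequence.
    """
    missing = []
    for position, allowed in sorted(required.items()):
        if position <= len(domain_sequence):
            observed = domain_sequence[position - 1]
        else:
            observed = "-"
        if observed not in allowed:
            missing.append((position, observed))
    return missing


# Hypothetical requirement: lysine at position 2, aspartate at position 4.
problems = missing_catalytic_residues("AKDE", {2: "K", 4: "D"})
print("Inactive" if problems else "Active", problems)
```

An empty result means all required residues are present, so the domain would not be flagged.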
Taxonomic trees
Architecture query results are now displayed as simple taxonomic trees. In addition to individual proteins, you can select any taxonomic node (or multiple nodes) and display all the proteins in those nodes.
The Evolution section of domain annotation pages is now also represented as a taxonomic tree. The basic tree contains only several hand-picked representative species, with a link to the full tree.
User interface redesigned
SMART has been completely rewritten, and all pages conform to the XHTML 1.0 Strict and CSS Level 2 standards. We recommend a modern standards-compliant browser for the best experience. Mozilla Firefox and Opera are our favorites.
Changes from version 3.5 to 4.0
Intron positions are shown in protein schematics
For proteins that match any of the Ensembl predictions, SMART will show intron positions as vertical coloured lines in graphical representations (see example). This information is retrieved from a pre-calculated mapping of Ensembl gene structures to protein sequences.
The vertical line at the end of the protein is not an actual intron, but a mark to show that intron mapping was performed. If that is the only line, there are no introns annotated. If there is no line at all, there is no data available in Ensembl for that particular sequence.
You can switch off intron display on your SMART preferences page.
Alternative splicing information
Since SMART now incorporates Ensembl genomes, the Additional information page shows a list of alternative splices of the gene encoding the analyzed protein (if there are any). It is possible to either display SMART protein annotation for any of the alternative splices, or get a graphical multiple sequence alignment of all of them.
Orthology information
There are 2 separate sets of orthologs for each Ensembl protein: 1:1 reciprocal best matches in other genomes, and orthologous groups with reciprocal best hits from all genomes analyzed (i.e. each of these proteins has exactly one ortholog in all 6 genomes).
This data is displayed on the Additional information page.
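The idea of a 1:1 reciprocal best match between two genomes can be sketched as follows. The best-hit tables are assumed to come from a prior all-against-all similarity search; the function name and identifiers are hypothetical.

```python
def reciprocal_best_hits(best_a_to_b, best_b_to_a):
    """Return pairs (a, b) where a's best hit is b and b's best hit is a.

    best_a_to_b: dict mapping each protein in genome A to its best hit in B.
    best_b_to_a: dict mapping each protein in genome B to its best hit in A.
    """
    return sorted(
        (a, b) for a, b in best_a_to_b.items() if best_b_to_a.get(b) == a
    )


# Hypothetical best-hit tables for two genomes.
human_to_mouse = {"h1": "m1", "h2": "m2", "h3": "m2"}
mouse_to_human = {"m1": "h1", "m2": "h3"}
print(reciprocal_best_hits(human_to_mouse, mouse_to_human))
# [('h1', 'm1'), ('h3', 'm2')]
```

Here h2 is excluded because its best hit, m2, prefers h3 in the reverse direction.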
Graphical multiple sequence alignments
Orthologous groups and different alternative splices can be displayed as graphical multiple sequence alignments. Proteins are aligned using ClustalW. Domains, intrinsic features and introns are mapped onto the alignment, with their positions adjusted according to gaps (black boxes).
Changes from version 3.4 to 3.5
Features of all Ensembl genomes are stored in SMART
You can use standard Ensembl protein identifiers (for example ENSP00000264122) and sequences in all queries.
Batch access
Quickly display results for hundreds of sequences. Currently limited to 500 sequences per access, 2000 per day.
Smarter handling of identical sequences
If there are multiple IDs associated with a particular sequence, an extra table will be displayed showing all of them.
Changes from version 3.3 to 3.4
Search structure based profiles using RPS-Blast
Clicking on the search schnipsel and structures checkbox will now also initiate a search of profiles based on SCOP domain families using RPS-Blast. These profiles were kindly provided by Steffen Schmidt (see Schmidt et al., J Chem Inf Comput Sci 2002 (42) 405-7).
Improved Architecture analysis
In addition to standard Domain selection querying, it is now possible to do queries based on GO (Gene Ontology, click here for more info) terms associated with domains. In the first step you get a list of domains matching the GO terms entered. After selecting the domains of interest from the list, proteins containing those domains are displayed. Use the Taxonomic selection box to limit the results based on taxonomic ranges.
Pfam domains are stored in the database
The SMART database now contains precomputed results for all Pfam domains. To use Pfam domains in the architecture queries, prepend the domain name with Pfam: (for example TyrKc AND Pfam:Fz AND TRANS).
Try our SMART Toolbar for the Mozilla web browser. Click here for more info.
Changes from version 3.2 to 3.3
Fantastic new protein picture generator
Proteins are now displayed as dynamically generated PNG images. This means that you can download the entire protein representation as a single image. You have our permission to do this and use these diagrams in any way that you like; do acknowledge us, though. All domain bubbles have been script-generated using The Gimp and its Perl-Fu extension. The script is GPL'd and is available for download.
Fancy colour alignments
All alignments we generate are coloured in using Leo Goodstadt's excellent CHROMA program. CHROMA is available from here, and it's great. You may experience some problems if you're using a klunky browser. This will be fixed when you change your browser.
Excellent transmembrane domains
These are now calculated using the fine TMHMM2 program (the main web site is here), kindly provided by Anders Krogh and co-workers. You can read about the method here.
Selective SMART
We now store intrinsic features such as transmembrane domains and signal peptides in our database. This means that these can be queried for (e.g. SIGNAL AND TyrKc). The feature names are: signal peptide, SIGNAL; transmembrane domain, TRANS; coiled coil, COIL.
Technical changes
SMART code has been modified to run under the Apache mod_perl module. SMART is now using ApacheDBI for persistent database connections. The database engine has been updated to PostgreSQL 7.1. These changes resulted in significant speed improvements.
Bug fixes, as usual
The show_many_proteins script now uses the POST method, so there is no longer a limit on the number of proteins you can display. Fixed a display problem with proteins having thousands of different representations (if you try to display those, make sure you have a good browser, like Mozilla or Opera, and a bunch of RAM).
Changes from version 3.1 to 3.2
Start page
There is an indicator of the current database status in the header now. If the database is down or is being updated, you'll know immediately.
Literature
Changed literature identifiers to PMID. New secondary literature generator that parses all neighbouring papers, not just the first 100.
Numerous small bug fixes and improvements.
Changes from version 3.0 to 3.1
Startup page
The start page now includes selective SMART and allows searching for keywords in the annotation of domains.
Schnipsel Blast
The results of a schnipsel Blast search are included in the bubble diagram. Additionally, the PDB is searched.
Taxonomic breakdown
When selecting multiple proteins, e.g. via selective SMART or the annotation pages, an overview of the taxonomy of all species (tax break) is offered.
Links
Links don't contain version numbers. This allows stable links from external sources.
Selective SMART
Allows searching for multiple copies of domains and is case insensitive. You can now search for e.g. Sh3 AND sH3 AND sh3.
Annotation
You can align your query sequence to the SMART alignment using hmmalign.
Update of underlying database
SMART now uses PostgreSQL 6.5.2.
Changes from version 2.0 to 3.0
Digest output
SMART now only produces a single diagram representing a best interpretation of all the annotation that has been performed. A comprehensive summary of the results is also provided in table format.
Selective SMART
Selective SMART allows you to look for proteins with combinations of specific domains in different species or taxonomic ranges.
Alert SMART
The SMART database gets updated about once a week. If you are interested in specific domains, or combinations of them, in specific taxonomic ranges, you can use the SMART alerting service. This provides the identities of newly-deposited proteins that match your query.
Domain queries
You can ask for proteins having the same domain order or composition as your query protein.
SMARTed Genomes
You can get the result of SMARTing the genomes of Caenorhabditis elegans and Saccharomyces cerevisiae via the annotation pages.
Faster PFAM searches
The PFAM searches now run on a PVM cluster.
Changes from version 1.03 to 2.0
Domain coverage
The original set of signalling domains has now been extended to include extracellular domains.
Default search method is now HMMer
The previously-used underlying methodology of SMART (i.e. SWise) tended to over-extend gaps, leading to problems in defining domain borders. Although a SWise search of the SMART database remains possible (using the Wise searching SMART database option available from the Home Page), the default searching method is now HMMer2. This now allows improved statistical estimates (Expectation- or E-values) of the significance of a domain hit.
Complete rewrite of the annotation pages
As with the previous version (1.03), annotation pages provide information concerning domain functions and include hyperlinks to PubMed. However, we now offer automatically-derived data showing the taxonomic range and the predicted cellular localisation of proteins containing the domain in question.
Literature database
In addition to the manually derived literature sources given in version 1.03, we now provide automatically-derived secondary literature from the annotation pages. These are extracted from PubMed and are cross-linked to additional abstracts.
The CATH domain structure database: new protocols and classification levels give a more comprehensive resource for exploring evolution
Lesley H Greene, Tony E Lewis, Sarah Addou, Alison Cuff, Tim Dallman, Mark Dibley, Oliver Redfern, Frances Pearl, Rekha Nambudiry, Adam Reid, Ian Sillitoe, Corin Yeats, Janet M Thornton and Christine A Orengo
WHAT'S NEW
We report the latest release (version 3.0) of the CATH protein domain database (http://www.cathdb.info). There has been a 20% increase in the number of structural domains classified in CATH, up to 86 151 domains. Release 3.0 comprises 1110 fold groups and 2147 homologous superfamilies. To cope with the increases in diverse structural homologues being determined by the structural genomics initiatives, more sensitive methods have been developed for identifying boundaries in multi-domain proteins and for recognising homologues. The CATH classification update is now being driven by an integrated pipeline that links these automated procedures with validation steps that have been made easier by the provision of information-rich web pages summarising comparison scores and relevant links to external sites for each domain being classified. An analysis of the population of domains in the CATH hierarchy and several domain characteristics are presented for version 3.0. We also report an update of the CATH Dictionary of homologous structures (CATH-DHS), which now contains multiple structural alignments, consensus information and functional annotations for 1459 well populated superfamilies in CATH. CATH is directly linked to the Gene3D database, which is a projection of CATH structural data onto ∼2 million sequences in completed genomes and UniProt.
GENERAL INTRODUCTION
The number of new structures being deposited in the Protein Data Bank (PDB) continues to grow at a considerable rate. In addition, structures being targeted by worldwide structural genomics initiatives are more likely to be novel or only very remotely related to domains previously classified. Only 2% of structures currently solved by conventional crystallography or NMR are likely to adopt novel folds (see Figures 1 and 2). A higher proportion of new folds are expected to be solved by structural genomics structures. Although the influx of more diverse structures and subsequent analysis will inform our understanding of how domains evolve, it has resulted in increasing lags between the numbers of structures being deposited and classified. In response to this situation, we have significantly improved our automated and manual protocols for domain boundary assignment and homologue recognition.
Figure 1: Annual decrease in the percentage of new structures classified in CATH which are observed to possess a novel fold. The raw data for years 1972–2005 was fit to a single exponential equation by nonlinear regression using Sigma Plot (SPSS Version
Figure 2: Annual proportion of protein structures deposited in the PDB which are classified in CATH, rejected or pending classification. The colour scheme reflects different categories of PDB chains. Black: not accepted by the CATH criteria. Red: unprocessed chains
Significant changes have been implemented in the CATH classification protocol to achieve a more highly automated system. A seamless flow of structures between the constituent programs has been achieved by building a pipeline which integrates web services for each major comparison stage in the classification (see Figure 3). Secondly, completely automatic decisions are now being made for new protein chains with close relatives already assigned in the CATH database. There are two situations that preclude the CATH update process from being fully automated. We rely on expert manual curation for particularly challenging protein domain boundary assignments (DomChop stage) and also for classifications of remote folds and homologues (HomCheck stage). These two manual stages will remain an integral part of the system (Figure 3).
Figure 3. Flow diagram of the CATH classification pipeline. This schematic illustrates the processes involved in classifying newly determined structures in CATH. The CATH update protocol workflow from new chain to assigned domain is split into two main processes
In this paper we report our ongoing development of the automated procedures. These critical new features should better enable CATH to keep pace with the PDB (3)
and facilitate its development. Key statistics on the domain structure populations and characteristics are also presented.
A REVISED CATH CLASSIFICATION HIERARCHY: CATHSOLID
In order to provide information on the sequence diversity between superfamily members, we have introduced additional levels into the CATH hierarchy. The
CATH hierarchical classification scheme now consists of nine levels. Class is derived from secondary structure content, and Architecture describes the gross
orientation of secondary structures, independent of connectivity. The Topology level clusters structures into fold groups according to their topological connections and numbers of secondary structures. The Homologous superfamilies cluster proteins with highly similar structures, sequences and/or functions (4,5). The new extension of the CATH classification system now includes five 'SOLID' sequence levels. S, O, L and I further divide domains within the H-level using multi-linkage clustering based on similarities in sequence identity (35, 60, 95 and 100%) (see Table 1). The D-level acts as a counter within the I-level and is appended to the classification hierarchy to ensure that every domain in CATH has a unique CATHsolid identification code (see Table 1). Specific details on the nature of the SOLID-levels can be found in the 'General Information' section of the CATH website, http://www.cathdb.info. CATH only includes experimentally determined protein structures with a 4 Å resolution or better, 40 residues in length or longer, and having 70% or more side chains resolved.
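The nine-level identifier and the inclusion criteria described above can be sketched in a few lines of Python. This is only an illustration: the function names, the dot-separated code format and the example code are assumptions, not part of the CATH software, and the 70% side-chain criterion is passed in as a fraction.

```python
# Hypothetical helpers; names and the dot-separated format are assumptions.
CATHSOLID_LEVELS = ("C", "A", "T", "H", "S", "O", "L", "I", "D")


def parse_cathsolid(code: str) -> dict:
    """Split a nine-field code into a mapping from level letter to number."""
    fields = code.split(".")
    if len(fields) != len(CATHSOLID_LEVELS):
        raise ValueError(f"expected 9 dot-separated fields, got {len(fields)}")
    return {level: int(value) for level, value in zip(CATHSOLID_LEVELS, fields)}


def superfamily_code(code: str) -> str:
    """Return the C.A.T.H prefix, shared by all domains in one superfamily."""
    parsed = parse_cathsolid(code)
    return ".".join(str(parsed[level]) for level in "CATH")


def meets_inclusion_criteria(resolution_angstrom: float,
                             n_residues: int,
                             fraction_side_chains_resolved: float) -> bool:
    """Apply the three criteria quoted in the text: 4 Å or better,
    at least 40 residues, at least 70% of side chains resolved."""
    return (resolution_angstrom <= 4.0
            and n_residues >= 40
            and fraction_side_chains_resolved >= 0.70)


if __name__ == "__main__":
    example = "3.40.50.200.1.2.1.1.3"  # made-up code, for illustration only
    print(parse_cathsolid(example))
    print(superfamily_code(example))
    print(meets_inclusion_criteria(2.1, 159, 0.95))
```

Because the D-level is a counter within the I-level, two domains with identical sequences would differ only in the last field under this scheme.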
Table 1. CATH version 3.0 statistics
DOMAIN BOUNDARY ASSIGNMENTS
We have further improved our automated domain boundary prediction method, CATHEDRAL (4). This is used to search a newly determined multi-domain structure against a library of representative structures from different fold groups in the CATH database to recognise constituent domains. CATHEDRAL performs an initial rapid
secondary structure comparison between structures using graph theory to identify putative fold matches, which are then more carefully aligned using a slower, more accurate dynamic programming method. A new scoring scheme has been implemented which combines information on the class of the domains being compared, their sizes, similarity in the structural environments and the number of equivalent residues. A support vector machine is used to combine the different scores and select the best fold match for each putative domain in a new multi-domain protein.
Benchmarking against a set of 964 'difficult' multi-domain chains, whose 1593 constituent domains were remotely related to folds in CATH (<35% sequence identity) and originated from 245 distinct fold groups and 462 superfamilies, showed that 90% of domains within these chains could be assigned to the correct fold group, and for 78% of them the domain boundaries were within ±15 residues of boundaries assigned by careful manual validation. Larger variations in domain boundaries are often due to the fact that in many families significant structural variation can occur during evolution, so that distant relatives vary considerably in size. If no close relative has been classified in CATH, it is likely that the only CATHEDRAL match will be to a relative with significant structural embellishments, thus making it harder to determine the correct boundaries.
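The ±15 residue agreement test used in the benchmark can be illustrated with a small Python sketch. The representation of a chopping as a list of boundary positions, and the function name, are assumptions made for illustration; the actual benchmark scoring may treat discontinuous domains differently.

```python
def boundaries_agree(predicted, reference, tolerance=15):
    """Return True if both choppings have the same number of boundaries and
    every predicted boundary lies within `tolerance` residues of the
    corresponding reference boundary (hypothetical scoring rule)."""
    if len(predicted) != len(reference):
        return False
    pairs = zip(sorted(predicted), sorted(reference))
    return all(abs(p - r) <= tolerance for p, r in pairs)


if __name__ == "__main__":
    # A three-domain chain: predicted cuts at residues 120 and 250,
    # manually validated cuts at residues 110 and 262.
    print(boundaries_agree([120, 250], [110, 262]))
    # A two-domain chain where the predicted cut is 20 residues off.
    print(boundaries_agree([120], [140]))
```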
Since domain boundary assignment of remote homologues is one of the most time-consuming stages in the classification, we combine multiple sources of information for each new structure on a web page to guide manual curation. Pages display scores from a range of algorithms, which include structure-based methods [CATHEDRAL (4), SSAP (6,7), DETECTIVE (8), PUU (9), DOMAK (10)], sequence-based methods such as hidden Markov models (HMMs) (11), and relevant literature. These pages are now viewable for information on putative boundaries for new multi-domain structures currently
being classified in CATH (e.g. http://www.cathdb.info/cgi-bin/cath/Chain.pl?chain_id=2g3aA).
For protein chains which are closely related to chains that are already chopped in CATH, an automated protocol has been developed (ChopClose). ChopClose identifies
any previously chopped chains that have sufficiently high sequence identity and overlap with the query chain. Using SSAP (6), the query is aligned against each of these chains in turn, and in each case the domain boundaries are inherited across the alignment. The process of inheriting the boundaries often requires some adjustments to be made to account for insertions, deletions or unresolved residues. If the inheritance from one of the chains meets various criteria (SSAP score ≥80, sequence identity >80%, RMSD ≤6.0 Å, longest end extension ≤10 residues, etc.), then the resulting boundaries are used to chop the chain automatically (AutoChop). For cases where ChopClose's best result does not meet all the criteria for automatic chopping, it is provided as support information for a manual domain boundary assignment.
Refer to Figure 3 for the location of AutoChop/ChopClose within the CATH update protocol.
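A minimal Python sketch of the AutoChop decision follows. It models only the four criteria named in the text (SSAP score at least 80, sequence identity above 80%, RMSD at most 6.0 Å, longest end extension at most 10 residues); the text ends the list with "etc.", so further unnamed criteria exist that are not represented here. The class and function names are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class InheritedChopping:
    """Result of inheriting boundaries from one previously chopped chain
    (hypothetical record; field names are illustrative)."""
    ssap_score: float
    sequence_identity_percent: float
    rmsd_angstrom: float
    longest_end_extension: int  # residues


def passes_autochop(result: InheritedChopping) -> bool:
    """Check the four thresholds quoted in the text. Other criteria covered
    by the source's 'etc.' are deliberately not modelled."""
    return (result.ssap_score >= 80
            and result.sequence_identity_percent > 80
            and result.rmsd_angstrom <= 6.0
            and result.longest_end_extension <= 10)


if __name__ == "__main__":
    close_relative = InheritedChopping(92.5, 96.0, 1.8, 3)
    borderline = InheritedChopping(84.0, 80.0, 2.5, 4)  # identity not above 80%
    for candidate in (close_relative, borderline):
        route = "AutoChop" if passes_autochop(candidate) else "manual DomChop support"
        print(candidate, "->", route)
```

A result that fails any threshold is not discarded in this sketch; as in the text, it would be passed on as supporting information for manual curation.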
NEW HOMOLOGUE RECOGNITION METHODS
We have assessed a number of HMM-based protocols for improving homologue recognition. A new protocol (Samosa), exploiting models built using multiple structure alignments to improve accuracy, gives some improvements in sensitivity (4–5%). However, a protocol exploiting an 8-fold expanded HMM library based on sequence relatives of structural domains gives an increase of nearly 10% in sensitivity (12). In addition, HMM–HMM-based approaches have been implemented using the PRC protocol of Madera and co-workers (http://supfam.org/PRC). These allow recognition of extremely remote homologues, some of which are not easily detected by the structure comparison methods (discussed further below). HMM-based database scans developed for the CATH classification protocol are collectively referred to as HMMscan below.
For some very remotely related homologues, confidence in an assignment can be improved by combining information from multiple prediction methods. We have
investigated the benefits of using machine learning methods to do this automatically. A neural network was trained using a dataset of 14 000 diverse homologues (<35% sequence identity) and 14 000 non-homologous pairs, with data from different homologue comparison methods including structure comparison (CATHEDRAL, SSAP), sequence comparison (HMM–HMM) and information on functional similarity. The latter was obtained by comparing EC classification codes between close relatives of the distant homologues and using a semantic similarity scoring
scheme for comparing GO terms based on a method developed by Lord et al. (13). On a separate validation set of 14 000 homologous pairs and 14 000 non-homologous pairs, 97% of the homologues can be recognised at an error rate of <4%.
NEW UPDATE PROTOCOL
Automatic methods
Previously, CATH data was generated using a group of independent
programs and flat files. Over the past two years we have developed an
update protocol for CATH that is driven by a suite of programs with a
central library and a PostgreSQL database system. A classification
pipeline has been established which links, in a completely automated
fashion, the different programs that analyse the sequences and structures
of both protein chains and domains. The CATH update protocol can
essentially be divided into two parts: domain boundary assignment and
domain homology classification (see Figure 3). The aim of the protocol is
to minimise manual assignment and provide as much support as
possible when manual validation is necessary.
Processing of both parts of the classification protocol is similar, requiring related meta-data and the triggering of the same automated algorithms. Methods include pairwise sequence similarity comparisons and scans by other homologue detection or
fold recognition algorithms, such as HMMscan and CATHEDRAL, that provide data for either manual or automated assignment. Many of the automated steps in the protocol have been established as a web service, and the pipeline integrates the automated steps together with 'holding stages' in which domains are held prior to processing and await the completion of manual validation of predictions (see below).
Web pages to support manual validation
For each manual stage in the classification (domain boundary assignment, DomChop, and homologue recognition, HomCheck) we have developed a suite of web pages bringing together all available meta-data from prediction algorithms (e.g. DBS,
CATHEDRAL, HMMscan for DomChop; CATHEDRAL, HMMscan for HomCheck) and information from the literature and from other family classifications with relevant data (e.g. Pfam). For each protein or domain shown on the pages, information on the statistical significance of matches is presented. The web pages will shortly be made viewable and will provide interim data on protein chains and domains not fully classified in CATH for biologists interested in any entries pending classification.
OVERVIEW OF THE CURRENT RELEASE (VERSION 3.0)
Assigning domain boundaries and relationships between protein structures is
computationally challenging. Since the last CATH release (version 2.6), the number of domains in the CATH database has increased by 20% in version 3.0 and now totals 86 151. This is a more than 10-fold increase in the number of domains classified in CATH since its creation. Improvements in automation, and also in the web-based resources used to aid manual validation, have allowed us to increase the proportion of hard-to-classify structures processed in CATH, and this is reflected in a significant increase in the proportion of new folds in the database, now more than 1000. The detailed breakdown of numbers of domains in the nine CATH levels is given in Table 1. We conducted an analysis of the domains in version 3.0 and have derived statistics for several fundamental features.
Percentage of new topologies
An analysis of the percentage of new folds arising from the early 1970s to the present is shown in Figure 1. The number of new folds has been decreasing over time with respect to the number of new structures being deposited, and it can be seen that currently approximately 2% of new structures classified in CATH are observed to be novel folds. For comparison, the number of domain structures solved over time is also graphically represented in Figure 1.
Number of domains within a protein chain
Integral to the construction of the CATH database is designating domain boundaries. We conducted an analysis of the number of chains versus number of domains in a chain. It is interesting to note that 64% of all protein structures currently solved and classified in CATH are single-domain chains (data not shown). The next most prevalent are two-domain chains (27%), and following this we find that the number of chains containing three or more domains rapidly decreases. The average size of the single-domain chains is 159 residues.
The CATH Dictionary of Homologous Superfamilies (CATH-DHS)
The CATH-DHS has also been recently updated. Data on structural similarity and superfamily variability is presented as a significant update to the Dictionary of Homologous Superfamilies (DHS) web resource (14). The DHS also provides functional annotations of domains within each H-level (superfamily) in CATH v2.5.1.
For each superfamily, pair-wise structural similarity scores between relatives, measured by SSAP, are presented. The DHS now contains 3307 multiple structural alignments for 1459 superfamilies. For each superfamily, multiple alignments are generated for all the relatives and also for subgroups of structurally similar relatives and sequence-similar relatives. Alignments are performed using the residue-based CORA algorithm (15) and presented both as CORAPLOTS (14) and in the form of a 2DSEC diagram (16), alongside co-ordinate data of the superposed structures in PDB format. Sequence representations of the alignments are available to download in FASTA format. In the CORAPLOT images of the multiple alignment, residues in each domain are coloured according to ligand binding and residue type. EquivSEC plots are also shown that describe the variability in orientation and packing between equivalent secondary structures (16).
To identify sequence relatives for CATH superfamilies, sequences from UniProt (17) were scanned against HMMs of all CATH domains (12). Homologous sequences were
identified as those hits with an E-value <0.01 and a 60% residue overlap with the CATH domain. This protocol recognised over one million domain sequences in UniProt which could be integrated in the CATH-DHS. The harvested sequences in each superfamily were compared against other relatives by BLAST (18) to determine the pair-wise sequence identity and then clustered at appropriate levels of sequence identity (35% and 95%) using multi-linkage clustering. Information and links to other functional databases [ENZYME (19), GO (Gene Ontology Consortium 2000), KEGG (20), COG (21), SWISS-PROT (22)] are also included by BLASTing the sequences from each superfamily against sequences provided by these resources. Only hits with 95% sequence identity and an 80% residue overlap were used to annotate sequences.
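The hit-acceptance rule described above (E-value below 0.01 and a 60% residue overlap with the CATH domain) can be sketched as a simple filter. This is an illustrative sketch only: the function name is hypothetical, and computing overlap as aligned residues divided by domain length is an assumption, since the text does not define how overlap was measured.

```python
def is_homologous_hit(e_value: float,
                      aligned_residues: int,
                      domain_length: int,
                      max_e_value: float = 0.01,
                      min_overlap: float = 0.60) -> bool:
    """Accept an HMM hit if its E-value is below the cutoff and the match
    covers at least `min_overlap` of the CATH domain (assumed definition)."""
    if domain_length <= 0:
        raise ValueError("domain_length must be positive")
    overlap = aligned_residues / domain_length
    return e_value < max_e_value and overlap >= min_overlap


if __name__ == "__main__":
    hits = [
        ("seqA", 1e-5, 90, 120),   # strong E-value, 75% overlap
        ("seqB", 0.05, 110, 120),  # E-value too weak
        ("seqC", 1e-5, 60, 120),   # only 50% overlap
    ]
    accepted = [name for name, e, aligned, length in hits
                if is_homologous_hit(e, aligned, length)]
    print(accepted)
```

The stricter annotation rule in the same paragraph (95% identity, 80% overlap) could reuse the overlap test with `min_overlap=0.80` alongside a separate identity check.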
Recent analysis of structural and functional divergence in highly populated CATH superfamilies (>5 structural relatives with <35% sequence identity) has been undertaken using data from the DHS. The 2DSEC algorithm was used to analyse multiple structural alignments of families and identify highly conserved structural cores and secondary structure embellishments or decorations to the common core. In some large superfamilies, extensive embellishments were observed outside the core, and although these secondary structure insertions were frequently discontinuous in the protein chain, they were often co-located in 3D space (16). In many cases manual inspection revealed that the embellishment had aggregated to form a larger structural feature that was modifying the active site of the domain or creating new surfaces for domain or protein interactions. Data collected in the DHS clearly shows a relationship between structural divergence within a superfamily, sequence divergence of this superfamily amongst predicted domains in the genomes, and the number of distinct functional groups that can be identified for the superfamily (see Figure 4).
Figure 4. Relationship between sequence variability, structural variability and functional diversity in CATH superfamilies. Structural variation in a CATH superfamily as measured by the number of diverse structural subgroups (SSAP score <80 between groups)
LATERAL LINKS ACROSS THE CATH HIERARCHY TO CAPTURE EVOLUTIONARY DIVERGENCE AND EXPLORE THE STRUCTURAL CONTINUUM
Our analysis of structural divergence in CATH superfamilies (16) has revealed families where significant changes in the structures had occurred. In some cases 5-fold differences in the sizes of domains were identified, and sometimes it was apparent that the 'folds' of these very diverse relatives had effectively changed. Therefore, in these superfamilies more than one fold group can be identified, effectively breaking the hierarchical nature of the CATH classification, which implies that each relative within a CATH homologous superfamily should belong to the same CAT fold group.
In addition, an 'all versus all' HMM–HMM scan between all superfamily
representatives revealed several cases of extremely remote homologues which had been classified into separate superfamilies and yet match with significant E-values. Structure comparison had failed to detect the relationship between these superfamilies because the structural divergence of the relatives was so extreme, sometimes constituting a change in architecture as well as fold group. In these cases homology was only suggested by the HMM-based scans and then manually validated by considering functional information and detailed evidence from literature. In order to capture information on these distant homologies, links have been created between the superfamilies, both on our web pages and in the CATH database. The data can
now be found as a link from the CATH homepage (http://www.cathdb.info).
In the near future we also plan to provide web pages presenting cases of significant structural overlaps between superfamilies or fold groups. For these cases we are not currently able to find any additional evidence to support a distant evolutionary relationship, and these examples highlight the recurrence of large structural motifs between some folds and the existence of a structural continuum in some regions of fold space.
FEATURES
The CATH database can be accessed at http://www.cathdb.info. The web interface
may be browsed or alternatively searched with PDB codes or CATH domain identifiers. There is also a facility for keyword searches. With the version 3.0 release we now make the raw and processed data files available, which include, for example, CATH domain PDB files, sequences and DSSP files, and they can be accessed through the CATH database main page. The Gene3D resource can be accessed through the CATH database or directly at http://www.cathdb.info/Gene3D. The DHS can be accessed through the CATH database or directly at http://www.cathdb.info/bsm/dhs.
Structural similarity is assessed using an automatic method (SSAP) (3,4), which scores 100 for
identical proteins and generally returns scores above 80 for homologous proteins. More distantly
related folds generally give scores above 70 (Topology or fold level), though in the absence of any
sequence or functional similarity this may simply represent examples of convergent evolution, reinforcing the hypothesis that there exists a limited number of folds in nature (5,6).
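As a rough guide to reading SSAP scores, the thresholds quoted above can be turned into a small lookup. The sketch below is hypothetical: the function name and labels are illustrative, and the text says the thresholds hold only "generally", so the returned label is a hint rather than a classification.

```python
def interpret_ssap(score: float) -> str:
    """Map an SSAP score (0 to 100) onto the rough bands quoted in the text."""
    if not 0 <= score <= 100:
        raise ValueError("SSAP scores lie between 0 and 100")
    if score == 100:
        return "identical structures"
    if score > 80:
        return "typical of homologous proteins"
    if score > 70:
        return "typical of a shared fold (Topology level)"
    return "no clear structural relationship"


if __name__ == "__main__":
    for value in (100, 86.4, 73.0, 55.2):
        print(value, "->", interpret_ssap(value))
```

A score in the 70 to 80 band is not by itself evidence of common ancestry; as the text notes, without sequence or functional similarity it may reflect convergent evolution.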
Abstract
We report the latest release (version 1.4) of the CATH protein domains database
(http://www.biochem.ucl.ac.uk/bsm/cath). This is a hierarchical classification of 13 359 protein domain structures into evolutionary families and structural groupings. We currently identify 827
homologous families in which the proteins have both structural similarity and sequence and/or
functional similarity. These can be further clustered into 593 fold groups and 32 distinct architectures.
Using our structural classification and associated data on protein functions stored in the database (EC
identifiers, SWISS-PROT keywords and information from the Enzyme database and literature), we have been able to analyse the correlation between the 3D structure and function. More than 96% of
folds in the PDB are associated with a single homologous family. However, within the superfolds,
three or more different functions are observed. Considering enzyme functions, more than 95% of clearly homologous families exhibit either single or closely related functions, as demonstrated by the
EC identifiers of their relatives. Our analysis supports the view that determining structures, for
example as part of a 'structural genomics' initiative, will make a major contribution to interpreting genome data.
Introduction FOR CATH
The CATH classification of protein domain structures was established in 1993 (1) as a
hierarchical clustering of protein domain structures into evolutionary families and structural
groupings, depending on sequence and structure similarity. There are four major levels,
corresponding to protein class, architecture, topology or fold, and homologous family (Fig. 1).
Since 1995, information about these structural groups and protein families has been accessible
over the Web (http://www.biochem.ucl.ac.uk/bsm/cath), together with summary information
about each individual protein structure (PDBsum) (2).
CATH consists of both phylogenetic and phenetic descriptors for protein domain relationships.
At the lowest levels in the hierarchy, proteins are grouped into evolutionary families
(Homologous families) for having either significant sequence similarity (≥35% identity) or high
structural similarity and some sequence similarity (≥20% identity).
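The grouping rule just stated can be written as a short predicate. In this sketch the two identity cutoffs (35% and 20%) come from the text, while the use of an SSAP score of at least 80 to stand for "high structural similarity" is an assumption borrowed from the clear-homologue criterion used elsewhere in these notes; the function name is hypothetical.

```python
def same_homologous_family(sequence_identity_percent: float,
                           ssap_score: float,
                           high_structure_cutoff: float = 80.0) -> bool:
    """Return True if two domains would be grouped together under the rule:
    identity >= 35%, or high structural similarity plus identity >= 20%.
    The SSAP cutoff for 'high' similarity is an assumed value."""
    if sequence_identity_percent >= 35:
        return True
    return (ssap_score >= high_structure_cutoff
            and sequence_identity_percent >= 20)


if __name__ == "__main__":
    print(same_homologous_family(42.0, 75.0))  # sequence alone is enough
    print(same_homologous_family(24.0, 85.0))  # structure plus some sequence
    print(same_homologous_family(24.0, 72.0))  # structure not similar enough
    print(same_homologous_family(12.0, 90.0))  # too little sequence identity
```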
The CATH database of protein structures contains approximately 18 000 domains
organized according to their (C)lass, (A)rchitecture, (T)opology and (H)omologous
superfamily. Relationships between evolutionarily related structures (homologues) within the
database have been used to test the sensitivity of various sequence search methods in order to
identify relatives in GenBank and other sequence databases. Subsequent application of the most
sensitive and efficient algorithms, gapped BLAST and the profile-based method Position-Specific
Iterated Basic Local Alignment Search Tool (PSI-BLAST), could be used to assign structural data to
between 22% and 36% of microbial genomes in order to improve functional annotation and
enhance understanding of biological mechanism. However, on a cautionary note, an analysis of
functional conservation within fold groups and homologous superfamilies in the CATH database
revealed that, whilst function was conserved in nearly 55% of enzyme families, function had
diverged considerably in some highly populated families. In these families, functional properties
should be inherited far more cautiously, and the probable effects of substitutions in key functional
residues must be carefully assessed.
Figure 1
Schematic representation of the (C)lass, (A)rchitecture and (T)opology/fold levels in the CATH
database.
Figure 2
Snapshot of a web page showing data available in the CATH dictionary of homologous superfamilies
for the subtilisin family (CATH id 3.40.50.200). Tables display the PDB codes for non-identical relatives in the family, together with EC identifier codes and information about the enzyme reactions.
The multiple structural alignment shown has been coloured according to secondary structure
assignments (red for helix, blue for strands).
The Architecture level in CATH groups proteins whose folds have similar 3D arrangements of secondary structures (e.g. barrel, sandwich or propellor), regardless of their connectivity, whilst the
top level, Class, simply reflects the proportion of α-helix or β-strand secondary structures. Three
major classes are recognised (mainly-α, mainly-β and α−β), since analysis revealed considerable overlap between the α+β and alternating α/β classes originally described by Levitt and Chothia (7).
Before classification, multidomain proteins are first separated into their constituent folds using a
consensus method which seeks agreement between three independent algorithms (8). Whilst the protocol for updating CATH is largely automatic (9), several stages require manual validation, in
particular establishing domain boundaries in proteins for which no consensus could be reached and checking the relationships of very distant homologues and proteins having borderline fold similarity.
Although there are plans to assign the more regular architectures automatically, all architecture
groupings are currently assigned manually.
A homologous family Dictionary is now available within CATH which contains functional data, where available, for each protein within a homologous family. This includes EC identifiers, SWISS-PROT keywords and information from the Enzyme database or the literature (Fig. 2). Multiple
structure-based alignments are also available, coloured according to secondary structure assignments
or residue properties, and there are schematic plots showing domain representations annotated by protein-ligand interactions (DOMPLOTS) (A.E. Todd, C.A. Orengo and J.M. Thornton, submitted
to Protein Engng.). The topology of each domain is illustrated by schematic TOPS diagrams
(http://www3.ebi.ac.uk/tops; 10).
Figure 3
CATH wheel plot showing the population of homologous families in different fold groups, architectures and classes. The wheel is coloured according to protein class (red, mainly-α; green,
mainly-β; yellow, αβ; blue, few secondary structures). The size of the outer wheel represents the
number of homologous families in CATH, whilst each band in the outer wheel corresponds to a single
fold family. The size of each fold 'band' therefore reflects the number of homologous families having
that fold. It can be seen that most fold families contain a single homologous family. The superfold families are shown as paler bands containing many homologous families. The inner wheel shows the
population of homologous families in the different architectures.
We have also recently set up a Web Server (11) which enables the user to scan the CATH database with a newly determined protein structure and identify possible fold similarities or evolutionary relationships. There are also plans to incorporate sequence searches (using BLAST or PSI-BLAST)
(12) to identify a probable fold for a new sequence.
The latest release of CATH (version 1.4, April 1998) contains 9342 protein chains from the PDB (13),
which divide into 13 359 domain folds. Currently 32 different architectures are recognised. Since the
last release, three new architectures have been described, including the five-bladed α-β propellor. Grouping proteins on the basis of sequence, structure and functional similarity gives 827 evolutionary
homologous families (H-level), whilst recognising more distant structural similarity with no
accompanying sequence or function similarity gives rise to 593 different fold groups (T-level).
The population of the different levels in the CATH hierarchy is illustrated by the CATH wheel shown
in Figure 3. It can be seen that several highly populated fold families, which we describe as superfolds
(6) as they support a diverse range of sequences and more than three different functions, still account
for nearly 30% of non-homologous structures.
Implications for Structural Genomics
As the sequence databases grow rapidly, the need to interpret these sequences and assign functions to
specific genes becomes increasingly important. Many techniques exist for matching protein sequences and thereby inheriting functional information. However, for very distant homologues there is often no
detectable sequence similarity, despite conservation of 3D structure and function. For these cases,
evolutionary relationships, and thereby functions, can only be assigned by comparing the structures.
Therefore, a number of structural genomics initiatives are being proposed (14) which aim to identify
all the folds in nature, with the ultimate goal of being able to predict the function of a new protein from
its known or probable structure. The important questions to ask are: how many more folds do we need
to determine before we have the complete set, and how confident can we be in assigning function between proteins having similar structures?
In the current genomes, on average only 30–46% of sequences can be assigned to a structural family
by recognising sequence similarity to a protein of known structure (15,16). With only ~600 unique
structures currently in the PDB, compared with ~20 000 sequence families, it is clear that we still
need to determine many more structures if we are to understand biology at the molecular level.
However, analysis of recently deposited structural data is very revealing. Figure 4a illustrates the
distribution of 2159 new structural domains classified in the 10 months from June 1997 to March
1998. A large proportion of these (79%) were clearly homologous (≥30% identity) to proteins of known structure.
Of the remaining 443 structures (Fig. 4b), corresponding to 'new' sequences, we found only 8% were
novel folds, the remainder resembling a previously determined structure. Many of these, 199 (45%),
could be identified as clear homologues by having significant structure and sequence similarity (SSAP
≥80 and ≥20% sequence identity). A further 169 (38%) were probable homologues as, although the sequence identity was below 20%, they had functional similarity and/or gave significant scores using
sequence search methods designed to detect very distant homologues (PSI-BLAST) (12). There
remained a further 40 (9%) proteins which were analogous, i.e. they had the same fold as a previous
entry but neither the sequence nor the function gave definite evidence of a common ancestor.
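The percentages quoted for the 443 structures can be checked with a few lines of arithmetic. In the sketch below the three counts (199, 169 and 40) are taken from the text; the novel-fold count of 35 is not stated in the source and is derived here on the assumption that the four categories are exhaustive. The function name is hypothetical.

```python
def percentage_breakdown(total: int, counts: dict) -> dict:
    """Add a 'novel folds' remainder category and return rounded percentages."""
    remainder = total - sum(counts.values())
    if remainder < 0:
        raise ValueError("category counts exceed the total")
    full = dict(counts)
    full["novel folds"] = remainder  # derived, not stated in the source
    return {name: round(100 * n / total) for name, n in full.items()}


if __name__ == "__main__":
    counts = {"clear homologues": 199,
              "probable homologues": 169,
              "analogues": 40}
    print(percentage_breakdown(443, counts))
    # Share of 'new' sequences assignable as homologues from structure.
    print(round(100 * (199 + 169) / 443))
```

Under these assumptions the remainder is 35 structures, which rounds to the 8% novel folds reported, and the clear plus probable homologues together account for about 83% of the set.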
At the homologous superfamily level in CATH, a more detailed analysis of enzyme functions showed
that the majority of homologous enzyme families in CATH (>90%) contained proteins for which the
first three EC identifiers were the same. Considering those families where homologues have
significant sequence identity (≥20%) after structural alignment, 95% were found to have a single EC
identifier, whilst for families where proteins have more than 30% sequence similarity we observed
that 98% had a single EC code.
Although assigning function on the basis of homology is common practice, it is clear that some
caution should be exercised, particularly where there is little or no sequence similarity. There are also
some clear examples where homologues with significant sequence similarity perform different
functions. The role of 'gene recruitment' is especially clear in the eye lens proteins, which function as
enzymes in other cellular environments but which are used as structural proteins in this context (17).
The extent of such 'gene recruitment' and context-sensitive function is really not known at this time.
For enzymes, it is clear that catalytic function can change and evolve, usually to act on a different but
related substrate. Similarly, within the lipocalin family (CATH id 24013010), several proteins are found with very similar structures which bind different fatty acids in the same region at the base of
the β-barrel (e.g. retinol, bilin, biotin).
Nearly half of the homologous families where two or more different EC numbers were observed
belong to the superfolds. This suggests that if a new protein is assigned to a superfold family, more
caution should be used when inheriting functional information, as there appears to be greater tolerance to changes in sequence, and ultimately function, for these families. However, it is interesting to note
that many of these were TIM barrel or Rossmann folds. These are superfolds in which the substrate or
ligand commonly binds in the same place: in the base of the β-barrel for the TIMs, and at the
crossover of the polypeptide chain for the doubly wound Rossmann structures.
Assignment of Function Through Structure
One of the reasons for determining structures is to derive more information to facilitate the assignment of function. From our analysis of proteins in CATH, we suggest that structural data can help to assign
function in several ways:
i. The structural data allow recognition of more distant homologues compared with sequence data. In our analysis, 83% of structures with novel sequences could be assigned as homologues in this way (note that such assignment of function is again subject to the caveats
imposed by 'gene recruitment' discussed above).
ii. The structural data allow detailed inspection of the functional site, to suggest if and
how the function may have evolved. For example, if an enzyme has evolved to act on a
different substrate, the binding site may reveal, or at least suggest, possible changes in the
substrate.
iii. For the superfolds, similarity of structure does not necessarily mean similarity of
function. However, the active site/binding sites are often conserved, e.g. in the TIM barrel or
Rossmann fold structures the ligand always binds at the same end of the barrel or sheet.
iv. Some methods have already been developed, and will increasingly be the focus of attention over the next few years, which aim to predict function ab initio from structure. For
example, enzymes can often be identified by the presence of a major cleft, which also locates
the active site (18). Similarly, critical surface patches which are used for molecular
recognition in binding other proteins or ligands may be identified using knowledge-based approaches (19,20).
In summary extrapolating the data from Figure 4 to a new genome we can expect that of the 54 ndash 70
of sequences which currently have no obvious sequence matches in the PDB we will find nearly 80 ndash 90 to be homologous to a known family using the structural data alone For the singlet folds this
will almost certainly reveal some clues to the function. For the superfolds, some folds will reveal
information on the functional class (e.g. enzyme for TIM barrels) or the location of the active site, if
not the specific function. Only 10–20% will be expected to be 'novel' folds. For these, the ab
initio methods referred to above may provide some clues to guide experiments. Therefore, it is clear
that determining structures, as part of a 'structural genomics' initiative for example, will make a
major contribution to interpreting genome data.
JAIL
Just another interface library
Interfaces of macromolecules are a valuable basis for analysing the process of molecular recognition. JAIL classifies not only the interfaces between domain architectures but also those between protein chains and those between proteins and nucleic acids.
Of course not
Gt-alphaGi-alpha chimera (PDB-ID 1GOT) Interfaces of 1GOT
Interacting proteins are difficult to crystallize and are rarely present in the Protein Data Bank.
Nevertheless, it is essential to analyse the interacting parts of proteins to understand the
process of protein-protein docking.
To overcome this problem, we have built the JAIL database. Since interacting domains
exhibit structural features similar to those of proteins, all known interfaces between interacting
domains of the SCOP database were extracted and classified in JAIL.
Only a part of all protein structures is included in SCOP; in particular, new PDB entries are
not yet annotated. To address this, all interfaces between protein
chains were additionally calculated and included in the database. This type of interface also comprises
the interacting parts of the assumed biological units. The last important type of interface
provided here is composed of the interacting parts between proteins and nucleic acids.
Overall, the data set consists of about 180,000 interfaces.
JAIL is a convenient tool for browsing the interface library and analysing single
interfaces. However, more general questions require large-scale analysis. For this purpose,
a detailed form enables the compilation of comprehensive, non-redundant data sets for
download.
How is an interface defined?
A complete residue is part of an interface if at least one atom of the amino acid is located
within 4.5 Ångström of any atom of the interacting domain or chain. One part of
an interface must consist of at least 5 C-alpha atoms in the case of protein chains. In the case of nucleic acids, the relevant atoms are calculated on the basis of the phosphorus atoms
of the RNA/DNA backbone.
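The distance criterion above can be sketched in a few lines of Python. This is only an illustration, not JAIL's actual implementation: the function names and the dictionary layout (residue id mapped to a list of atom coordinates) are hypothetical, "within 4.5 Ångström" is read as "less than or equal to", and the minimum of 5 C-alpha atoms is interpreted as at least 5 interface residues on each side.

```python
from math import dist  # Euclidean distance, Python 3.8+

CUTOFF = 4.5       # Angstrom, taken from the definition above
MIN_RESIDUES = 5   # one C-alpha per residue, so 5 C-alpha atoms = 5 residues


def interface_residues(part_a, part_b, cutoff=CUTOFF):
    """Return the residue ids of part_a that belong to the interface.

    part_a, part_b: dict mapping a residue id to a list of (x, y, z)
    atom coordinates (hypothetical layout, chosen for illustration).
    A residue counts as a whole if any one of its atoms lies within
    `cutoff` of any atom of the partner.
    """
    partner_atoms = [xyz for atoms in part_b.values() for xyz in atoms]
    return {
        res_id
        for res_id, atoms in part_a.items()
        if any(dist(a, b) <= cutoff for a in atoms for b in partner_atoms)
    }


def is_valid_interface(part_a, part_b, min_residues=MIN_RESIDUES):
    """Assumed reading: each side must contribute at least 5 residues."""
    side_a = interface_residues(part_a, part_b)
    side_b = interface_residues(part_b, part_a)
    return len(side_a) >= min_residues and len(side_b) >= min_residues
```

The all-against-all distance loop is quadratic in the number of atoms; for whole-database calculations a spatial index (a grid or k-d tree) would normally replace it.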
What are biounits?
The primary coordinate file deposited in the PDB generally contains one asymmetric unit.
The asymmetric unit is the smallest portion of a crystal structure to which crystallographic
symmetry can be applied to generate one unit cell. The biological molecule (biounit) is believed
to be the functional unit of the protein. Frequently, those units can be assumed or calculated when additional information is available. The biological units of many proteins are deposited
in a separate section of the PDB database and can be used for interface calculations.
How are redundant interfaces excluded in the download section?
Redundancy is excluded in two different ways: by structure and by sequence. The
sequence clustering is based on the CD-HIT program. The structural clustering is defined by
the protein families and superfamilies of the SCOP classification. The database classifies
proteins by domain architecture.
Which settings in the download section are best for my own research?
The selection of the datasets depends on the type of interactions (protein-protein or protein-nucleic
acid) and the level of diversity that is desired. A maximum sequence identity of 50%
results in higher diversity than a setting of 95%. The default settings include interfaces of
domain-domain interactions as well as interfaces between interacting chains. All interfaces of
chains that are already covered by the SCOP domain interfaces are excluded by default.
This procedure results in a high number of interfaces that are still diverse enough for
statistical analysis.
What is meant by 'show conservation' in Jmol?
The conservation of protein sequences is defined by the mutation rates at each amino acid
position. For JAIL, this information was retrieved from ConSurf. ConSurf is a derived
database merging structural and sequence information.
Database scheme
Search for:
PDB-Id, e.g. 1aay
SCOP-Id, e.g. d1az0a_
EC-Number, e.g. 311
Accession number, e.g. P03697
Protein name, e.g. capsid protein
Search in the following interface types:
Domain/Domain (SCOP)
Chain/Chain
Protein/Nucleic
Biounit/Biounit
None
Full-text search: by keyword
SCOP text search: by keyword
Consider only interfaces having the following location:
intra, inter, don't care
MMDB
Experimentally resolved structures of proteins, RNA and DNA, derived from the Protein Data Bank (PDB), with value-added features such as explicit chemical graphs, computationally
identified 3D domains (compact substructures) that are used to identify similar 3D structures, as well as links to literature, similar sequences, information about chemicals bound to the structures, and more. These connections make it possible, for example, to find 3D structures for homologs of a protein sequence of interest, then interactively view the sequence-structure relationships, active sites, bound chemicals, journal articles, and more.
Three-dimensional structures are now known within many protein families, and it is
quite likely, in searching a sequence database, that one will encounter a homolog
with known structure. The goal of Entrez's 3D-structure database is to make this
information, and the functional annotation it can provide, easily accessible to molecular biologists. To this end, Entrez's search engine provides three powerful
features: (i) Sequence and structure neighbors: one may select all sequences similar
to one of interest, for example, and link to any known 3D structures. (ii) Links
between databases: one may search by term matching in MEDLINE, for example, and
link to 3D structures reported in these articles. (iii) Sequence and structure
visualization: identifying a homolog with known structure, one may view molecular-graphic
and alignment displays to infer approximate 3D structure. In this article we
focus on two features of Entrez's Molecular Modeling Database (MMDB) not
described previously: links from individual biopolymer chains within 3D structures to a systematic taxonomy of organisms represented in molecular databases, and
links from individual chains (and compact 3D domains within them) to structure
neighbors, i.e. other chains (and 3D domains) with similar 3D structure. MMDB may be
accessed at http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Structure
SUPERFAMILY is a database of structural and functional annotation for all
proteins and genomes.[1][2][3][4][5]
The SUPERFAMILY annotation is based on a collection of hidden Markov models, which
represent structural protein domains at the SCOP superfamily level.[6]
A superfamily groups together domains which have an evolutionary relationship. The annotation is produced by
scanning protein sequences from completely sequenced genomes against the hidden Markov
models.
For each protein you can:
Submit sequences for SCOP classification
View domain organisation, sequence alignments and protein sequence details
For each genome you can:
Examine superfamily assignments, phylogenetic trees, domain organisation lists and
networks
Check for over- and under-represented superfamilies within a genome
For each superfamily you can:
Inspect SCOP classification, functional annotation, Gene Ontology annotation, InterPro
abstract and genome assignments
Explore taxonomic distribution of a superfamily across the tree of life
All annotation, models and the database dump are freely available for download to everyone.
Purpose
SUPERFAMILY classifies amino acid sequences into known structural domains, especially
into SCOP superfamilies. The superfamilies are groups of proteins which have structural
evidence to support a common evolutionary ancestor but may not have detectable
sequence homology.
Major Features
Sequence search: Submit your protein or DNA sequence for SCOP superfamily and family
level classification.
Keyword search: Search for superfamily, family or species names, plus
sequence, SCOP, PDB or hidden Markov model IDs.
Domain assignments: Domain assignments, alignments and architectures for completely
sequenced eukaryotic and prokaryotic organisms, plus sequence
collections.
Comparative genomics tools: Browse unusual (over- and under-represented) superfamilies and families, adjacent domain pair lists and graphs, unique domain pairs, domain
combinations, domain architecture co-occurrence networks and domain
distribution across taxonomic kingdoms for each organism.
Genome statistics: For each genome: number of sequences, number of sequences with assignment, percentage of sequences with assignment, percentage total
sequence coverage, number of domains assigned, number of superfamilies assigned, number of families assigned, average superfamily size,
percentage produced by duplication, average sequence length, average length matched, number of domain pairs and number of unique domain
architectures.
Gene Ontology: Domain-centric Gene Ontology (GO), automatically annotated by Hai Fang.
Phenotype Ontology: Domain-centric phenotype/anatomy ontology, including Disease
Ontology, Human Phenotype, Mouse Phenotype, Worm Phenotype, Yeast
Phenotype, Fly Phenotype, Fly Anatomy, Zebrafish Anatomy, Xenopus Anatomy, Arabidopsis Plant.
Superfamily annotation: InterPro abstracts for 1052 superfamilies and Gene Ontology (GO)
annotation for 763 superfamilies.
Functional annotation: Functional annotation of SCOP 1.73 superfamilies by Christine Vogel.
Phylogenetic trees: Trees are generated using heuristic parsimony methods and are based on
protein domain architecture data for all genomes in SUPERFAMILY.
Genome combinations or specific clades can be displayed as individual
trees.
Similar domain architectures: Find the 10 domain architectures which are most similar to a domain architecture of interest.
Hidden Markov models: Produce SCOP domain assignments for your sequences using the SUPERFAMILY models. HMM visualisation by Martin Madera,
e.g. model 0045110.
Profile comparison: Find remote domain matches when the HMM search fails to find a significant match. Profile comparison (PRC) for aligning and scoring two
profile hidden Markov models, by Martin Madera.
Web services: Distributed Annotation Server and linking to SUPERFAMILY.
Downloads: Sequences, assignments, models, MySQL database and scripts, updated weekly.
The SUPERFAMILY database provides protein domain assignments, at the SCOP superfamily
level, for the predicted protein sequences in over 400 completed genomes. A superfamily groups
together domains of different families which have a common evolutionary ancestor based on
structural, functional and sequence data. SUPERFAMILY domain assignments are generated
using an expert-curated set of profile hidden Markov models. All models and structural
assignments are available for browsing and download from http://supfam.org. The web interface
includes services such as domain architectures and alignment details for all protein assignments,
searchable domain combinations, domain occurrence network visualization, detection of over- or
under-represented superfamilies for a given genome by comparison with other genomes,
assignment of manually submitted sequences, and keyword searches. In this update we describe
the SUPERFAMILY database and outline two major developments: (i) incorporation of family
level assignments and (ii) a superfamily-level functional annotation. The SUPERFAMILY
database can be used for general protein evolution and superfamily-specific studies, genomic
annotation, and structural genomics target suggestion and assessment.
ATIDB Arabidopsis insertion database
CSHL Arabidopsis genome analysis at Cold Spring
ESSA Arabidopsis thaliana project at MIPS
Genoscope AGI in France
Kazusa Arabidopsis thaliana genome info Japan
MPSS Massively parallel signature sequencing
NASC Nottingham Arabidopsis stock center
Stanford Sequencing of the Arabidopsis genome at Stanford
TAIR Arabidopsis information resource
TIGR TIGR Arabidopsis genome annotation database
Wustl Arabidopsis genome at Washington university
Trees A forest tree genome database
Bacterial genomes
B. subtilis Bacillus subtilis database
Chlamydomonas Chlamydomonas genetics center
E. coli E. coli genome project
MGD Microbial germ plasm database
Microbial Microbial Genome Gateway
Microbial Microbial genomes
Micado Genetic maps of B. subtilis and E. coli
MycDB An integrated Mycobacterial database
Neisseria Neisseria meningitidis genome
Neurospora Neurospora crassa database
OralGen Oral pathogen database
Salmonella Salmonella information
STDGen Sexually transmitted disease database
Bass
Bass Sea Bass Mapping project
Cat (Felis catus)
Cat ArkDB Cat mapping database
Cattle (Bos taurus)
ARK Farm animals
BoLA Bovine MHC information
Bovin Bovine genome database
BovMap Mapping the bovine genome
CaDBase Genetic diversity in cattle
ComRad Comparative radiation hybrid mapping
Cow ArkDB Bovine ArkDB
GemQual Genetics of meat quality
Chicken (Gallus gallus)
Chicken Poultry gene mapping project
ChickMap Chicken genome project
Chicken ArkDB Chicken database
ChickEST Chick EST database
Poultry Poultry genome project
Cotton
Cotton Cotton data collection site
Cyanobacteria (blue-green algae)
Cyanobacteria Anabaena genome
Daphnia (Crustacea)
Daphnia pulex Daphnia genomics consortium
Deer
Deer ArkDB Deer mapping database
Dictyostelium discoideum
Dicty_cDB Dictyostelium discoideum cDNA project
DGP Dictyostelium discoideum genome project
Dictybase Online informatics resources for Dictyostelium
Dog (Canis familiaris)
Dog Dog genome project
Dog genome project
Frog (Xenopus)
Xenbase A Xenopus web resource
Xenopus Xenopus tropicalis genome
Fruit fly (Drosophila melanogaster)
ENSEMBL Drosophila Genome Browser at ENSEMBL
Fruitfly Drosophila genome project at Berkeley
FlyBase A Database of the Drosophila Genome
FlyMove A Drosophila multimedia database
FlyView A Drosophila image database
Fungus
Aspergillus Aspergillus Genomics
Candida Candida albicans information page
FungalWeb Fungi database
FGSC Fungal genetic stocks center
Goat (Capra hircus)
Goat GoatMap mapping the caprine genome
Horse (Equus caballus)
Horse ArkDB Horse mapping database
Medaka fish
Medaka Medaka fish home page
Maize
Maize Maize genome database
Malaria (Plasmodium spp)
Malaria Malaria genetics and genomics
PlasmoDB Plasmodium falciparum genome database
Parasites Parasite databases of clustered ESTs
Parasite Genome Parasite genome databases
Mosquito
Mosquito Mosquito genome web server
Mouse (Mus musculus)
ENSEMBL Mouse genome server at ENSEMBL
Jackson Lab Mouse Resources
MRC Mouse genome center at MRC UK
MGI Mouse genome informatics at Jackson Labs
MGD Mouse genome database
MGS Mouse genome sequencing at NIH
MIT Genetic and physical maps of the mouse genome
Mouse SNP Mouse SNP database
NCI Mouse repository
NIH NIH mouse initiative
ORNL Mutant mouse database
RIKEN Mouse resources
Rodentia The whole mouse catalog
Pig (Sus scrofa)
INCO Pig trait gene mapping
Pig Pig EST database
Pig Pig gene mapping project
PiGBase Pig genome mapping
Pig ArkDB Pig Ark DB
Plants
PlantGDB Resources for plant comparative genomics
Protozoa
Protozoa Protozoan genomes
Pufferfish
Fugu Puffer fish project UK site
Fugu Fugu genome project Singapore
Fugu Puffer fish project USA
Rat (Rattus norvegicus)
MIT Genetic maps of the Rat genome
NIH Rat genomics and genetics
Rat RatMap
RGD Rat genome database
Rice (Oryza sativa)
MPSS Massively parallel signature sequencing
Rice-research Rice genome sequence database
Rice Rice genome project
Rickettsia
RicBase Rickettsia genome database
Salmon
Salmon ArkDB Salmon mapping database
Sheep (Ovis aries)
Sheep Sheep gene mapping
SheepBase Sheep gene mapping
Sheep ArkDB Sheep mapping database
Soy
Soy Soybeans database
Sorghum
Sorghum Sorghum Genomics
Tetraodon
Tetraodon Tetraodon nigroviridis genome
Tetraodon Tetraodon nigroviridis genome at Whitehead
Tilapia
HCGS Tilapia genome
Tilapia ArkDB Tilapia mapping database
Turkey
Turkey ArkDB Turkey mapping database
Viruses
HIV HIV sequence database
Herpes Human herpes virus 5 database
Worm (Caenorhabditis elegans)
C. elegans C. elegans genome sequencing project
NemBase Resource for nematode sequence and functional data
WormAtlas Anatomy of C. elegans
WormBase The genome and biology of C. elegans
ACEDB A C. elegans database
WWW Server C. elegans web server
Yeast
SCPD The promoter database of Saccharomyces cerevisiae
SGD Saccharomyces genome database
S. pombe Schizosaccharomyces pombe genome project
TRIPLES Functional analysis of Yeast genome at Yale
Yeast Intron database Spliceosomal introns of the yeast
Zebra fish (Danio rerio)
ZFIN Zebrafish information network
ZGR Zebrafish genome resources
ZIS Zebrafish information server
Zebrafish Zebrafish webserver
DOMAIN DATABASE
Domains can be thought of as distinct functional and/or structural units of a
protein. These two classifications coincide rather often, as a matter of fact, and what is
found as an independently folding unit of a polypeptide chain also carries a specific
function. Domains are often identified as recurring (sequence or structure) units,
which may exist in various contexts. In molecular evolution such domains may have been utilized as building blocks, and may have been recombined in different
arrangements to modulate protein function. We can define conserved domains as
recurring units in molecular evolution, the extents of which can be determined by
sequence and structure analysis. Conserved domains contain conserved sequence patterns or motifs, which allow for their detection in polypeptide sequences.
The goal of the NCBI conserved domain curation project is to provide database users with
insights into how patterns of residue conservation and divergence in a family relate to functional
properties, and to provide useful links to more detailed information that may help to understand
those sequence/structure/function relationships. To do this, CDD curators include the following
types of information in order to supplement and enrich the traditional multiple sequence
alignments that form the foundation of domain models: 3-dimensional structures and conserved core motifs, conserved features/sites, phylogenetic organization, and links to electronic literature
resources.
CDD
Conserved Domain Database (CDD)
CDD is a protein annotation resource that consists of a collection of well-annotated multiple sequence alignment models for ancient domains and full-length proteins. These are available as position-specific score matrices (PSSMs) for fast
identification of conserved domains in protein sequences via RPS-BLAST. CDD content includes NCBI manually curated domains, which use 3D-structure information to explicitly define domain boundaries and provide insights into sequence/structure/function relationships, as well as domain models imported from a number of external source databases (Pfam, SMART, COG, PRK, TIGRFAM).
CD-Search & Batch CD-Search
CD-Search is NCBI's interface for searching the Conserved
Domain Database with protein or nucleotide query sequences. It uses RPS-BLAST, a variant of PSI-BLAST, to
quickly scan a set of pre-calculated position-specific scoring matrices (PSSMs) with a protein query. The results of CD-Search are presented as an annotation of protein domains on the user query sequence, and can be visualized as domain multiple sequence alignments with embedded user queries. High-confidence associations between a query sequence and conserved domains are shown as specific hits. The CD-Search Help provides additional details, including
information about running CD-Search locally.
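To make the idea of scoring a query against a position-specific scoring matrix concrete, here is a minimal sketch in Python. It is not RPS-BLAST: it slides the matrix along the query without gaps and without any statistical evaluation, and the function name, the default score for unlisted residues and the dictionary-per-column layout are all assumptions made for illustration.

```python
UNLISTED_SCORE = -4  # assumed penalty for residues a column does not list


def best_pssm_hit(sequence, pssm, unlisted=UNLISTED_SCORE):
    """Slide an ungapped PSSM along `sequence` and return (score, start)
    of the best-scoring window, or None if the sequence is too short.

    pssm: list with one dict per model position, mapping a residue
    letter to its score at that position (hypothetical layout).
    """
    width = len(pssm)
    best = None
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        # Sum the position-specific score of each residue in the window.
        score = sum(col.get(res, unlisted) for col, res in zip(pssm, window))
        if best is None or score > best[0]:
            best = (score, start)
    return best
```

The key point is that the same residue earns a different score at each model position, which is what distinguishes a PSSM from a single substitution matrix applied uniformly along the sequence.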
Batch CD-Search serves as both a web application and a script interface for a conserved domain search on multiple protein sequences, accepting up to 100,000 proteins in a single job. It enables you to view a graphical display of the concise or full search result for any individual
protein from your input list, or to download the results for the complete set of proteins. The Batch CD-Search Help provides additional details.
CDART: Domain Architectures
Conserved Domain Architecture Retrieval Tool (CDART) performs similarity searches of the Entrez Protein database based on domain architecture, defined as the sequential order of conserved domains in protein
queries. CDART finds protein similarities across significant evolutionary distances using sensitive domain profiles rather than direct sequence similarity. Proteins similar to
the query are grouped and scored by architecture. You can search CDART directly with a query protein sequence or, if a sequence of interest is already in the Entrez Protein database, simply retrieve the record, open its Links
menu and select Domain Relatives to see the precalculated CDART results. Relying on domain profiles allows CDART to be fast and, because it relies on annotated functional domains, informative.
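The notion of a domain architecture as "the sequential order of conserved domains" can be illustrated with a small Python sketch. This is not CDART's algorithm (which also scores similar but non-identical architectures); it only groups proteins whose architectures are identical. The function names, the input layout and the domain names in the usage example are hypothetical.

```python
from collections import defaultdict


def architecture(domain_hits):
    """Return the domain names in N- to C-terminal order.

    domain_hits: list of (start_position, domain_name) tuples for one
    protein (hypothetical layout).
    """
    return tuple(name for _, name in sorted(domain_hits))


def group_by_architecture(proteins):
    """Group protein ids that share exactly the same domain order.

    proteins: dict mapping a protein id to its list of domain hits.
    """
    groups = defaultdict(list)
    for protein_id, hits in proteins.items():
        groups[architecture(hits)].append(protein_id)
    return dict(groups)
```

For instance, two proteins that both carry an SH2 domain followed by a kinase domain fall into the same group even if their sequences have diverged too far for direct sequence comparison to relate them.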
CDTree
CDTree is a helper application for your web browser that allows you to interactively view and examine conserved domain hierarchies curated at NCBI. CDTree works with Cn3D as its alignment viewer/editor. It is used in the CDD curation process and is both a classification and research tool for functional annotation and the study of
protein and protein domain families.
Content
CDD content includes NCBI manually curated domain models and domain models imported from
a number of external source databases (Pfam, SMART, COG, PRK, TIGRFAMs). What is unique
about NCBI-curated domains is that they use 3D-structure information to explicitly define domain
boundaries, align blocks, amend alignment details, and provide insights into
sequence/structure/function relationships. Manually curated models are organized hierarchically if
they describe domain families that are clearly related by common descent. To provide a non-redundant
view of the data, CDD clusters similar domain models from various sources into
superfamilies.
Searching the database
The collection is also part of NCBI's Entrez query and retrieval system, cross-linked to numerous other resources. CDD provides annotation of domain footprints and conserved functional sites on protein sequences. Precalculated domain annotation can be retrieved for protein sequences tracked in NCBI's Entrez system, and CDD's collection of models can be queried with novel protein sequences via the CD-Search service, or via the Batch CD-Search service, which allows the computation and download of
annotation for large sets of protein queries. CDD also contains data from additional research projects, such as KOGs (a
eukaryotic counterpart to COGs) and the Library of Ancient Domains (LOAD). Accessions that start with "cl" are for superfamily cluster records and can contain domain models from one or more source databases. When searching CDD, it is possible to limit search results to domains from any given source database by using the Database Search Field.
Phylogenetic organization: Based on evidence from sequence comparison, NCBI
Conserved Domain Curators attempt to organize related domain models into
phylogenetic family hierarchies.
Links to electronic literature resources: NCBI curated domains also provide links to citations in PubMed and NCBI Bookshelf that discuss the domain. These references are selected by curators and, whenever possible, include articles that provide evidence for the biological function of the domain and/or discuss the evolution and classification of a domain family.
PFAM
Pfam 27.0 (March 2013, 14,831 families)
Proteins are generally composed of one or more functional regions, commonly termed domains. The presence of different domains in varying combinations in different proteins gives rise to the diverse repertoire of proteins found in nature. Identifying the domains present in a protein can provide insights into the function of that protein.
The Pfam database is a large collection of protein domain families. Each family is represented by multiple sequence alignments and hidden Markov models (HMMs).
There are two levels of quality to Pfam families: Pfam-A and Pfam-B. Pfam-A entries are derived from the underlying sequence database, known as Pfamseq, which is built from the most recent release of UniProtKB at a given time-point. Each Pfam-A family consists of a curated seed alignment containing a small set of representative members of the family, profile hidden Markov models (profile HMMs) built from the seed alignment, and an automatically generated full alignment, which contains all detectable protein sequences belonging to the family, as defined by profile HMM searches of primary sequence databases.
Pfam-B families are un-annotated and of lower quality, as they are generated automatically from the non-redundant clusters of the latest ADDA release. Although of lower quality, Pfam-B families can be useful for identifying functionally conserved regions when no Pfam-A entries are found.
Pfam entries are classified in one of four ways:
Family: A collection of related protein regions
Domain: A structural unit
Repeat: A short unit which is unstable in isolation but forms a stable structure when multiple copies are present
Motif: A short unit found outside globular domains
Related Pfam entries are grouped together into clans; the relationship may be defined by similarity of
sequence, structure or profile-HMM.
Pfam also generates higher-level groupings of related families, known
as clans. A clan is a collection of Pfam-A entries which are related by
similarity of sequence, structure or profile-HMM.
You can find data in Pfam in various ways:
Analyze your protein sequence for Pfam matches
View Pfam family annotation and alignments
See groups of related families
Look at the domain organisation of a protein sequence
Find the domains on a PDB structure
Query Pfam by keywords
Jump to: enter any type of accession or ID to go to the page for a Pfam family or clan,
UniProt sequence, PDB structure, etc.
Browse Pfam
You can use the links below to find lists of families, clans or proteomes which begin with the chosen letter (or number). You can
also see a list of Pfam families which are new to this release, or the list of the twenty largest families in terms of number of sequences.
Glossary of terms used in Pfam
These are some of the commonly used terms in the Pfam website
Alignment coordinates
HMMER3 reports two sets of domain coordinates for each profile HMM match. The envelope coordinates delineate the region on the sequence where the match has been probabilistically determined to lie, whereas the alignment coordinates delineate the region over which HMMER is confident that the alignment of the sequence to the profile HMM is correct. Our full alignments contain the envelope coordinates from HMMER3.
Architecture
The collection of domains that are present on a protein
Clan
A collection of related Pfam entries. The relationship may be defined by similarity of sequence, structure or profile-HMM.
Domain
A structural unit
Domain score
The score of a single domain aligned to an HMM. Note that, for HMMER2, if
there was more than one domain, the sequence score was the sum of all the domain scores for that Pfam entry. This is not quite true for HMMER3.
DUF
Domain of unknown function
Envelope coordinates
See Alignment coordinates
Family
A collection of related protein regions
Full alignment
An alignment of the set of related sequences which score higher than the manually set threshold values for the HMMs of a particular Pfam entry.
Gathering threshold (GA)
Also called the gathering cut-off, this value is the search threshold used to build the full alignment. The gathering threshold is assigned by a curator when the family is built. The GA is the minimum score a sequence must attain in order to belong to the full alignment of a Pfam entry. For each Pfam HMM we have two GA cutoff values: a sequence cutoff and a domain cutoff.
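A short Python sketch may help show how the two gathering cutoffs act together. It assumes that a match must reach both the sequence cutoff and the domain cutoff to be included; the class, field and function names are hypothetical, and the scores in any usage are made-up values rather than real Pfam thresholds.

```python
from dataclasses import dataclass


@dataclass
class DomainHit:
    sequence_id: str
    sequence_score: float  # bit score for the whole sequence
    domain_score: float    # bit score for this single domain match


def passes_gathering(hit, sequence_ga, domain_ga):
    """True if the hit reaches both gathering cutoffs (assumed rule)."""
    return hit.sequence_score >= sequence_ga and hit.domain_score >= domain_ga


def full_alignment_members(hits, sequence_ga, domain_ga):
    """Keep only the hits that would enter the full alignment."""
    return [h for h in hits if passes_gathering(h, sequence_ga, domain_ga)]
```

Because the GA is described as a minimum score, a hit that exactly equals a cutoff is kept in this sketch.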
HMMER
The suite of programs that Pfam uses to build and search HMMs. Since Pfam release 24.0 we have used HMMER version 3 to make Pfam. See the HMMER site for more information.
Hidden Markov model (HMM)
An HMM is a probabilistic model. In Pfam we use HMMs to transform the information contained within a multiple sequence alignment into a position-specific scoring system. We search our HMMs against the UniProt protein database to find homologous sequences.
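As a rough illustration of how an alignment can be turned into a position-specific scoring system, the Python sketch below computes per-column log-odds scores. It is a deliberate simplification and not what HMMER does: a real profile HMM also models insertions, deletions and state transitions, and uses more careful priors. The uniform background frequency, the pseudocount of 1 and all names are assumptions made for this example.

```python
from collections import Counter
from math import log2

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def position_scores(alignment, pseudocount=1.0):
    """Return one dict per alignment column mapping residue -> log-odds score.

    alignment: list of equal-length aligned sequences; characters that
    are not standard amino acids (e.g. gaps) are ignored.
    """
    background = 1.0 / len(AMINO_ACIDS)  # assumed uniform background
    scores = []
    for col in range(len(alignment[0])):
        counts = Counter(
            seq[col] for seq in alignment if seq[col] in AMINO_ACIDS
        )
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        scores.append({
            # Positive score: residue is more frequent here than background.
            aa: log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return scores
```

Residues that dominate a column receive positive scores and residues never observed there receive negative ones, so a conserved position contributes much more to a match than a variable one.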
HMMER3
The suite of programs that Pfam uses to build and search HMMs. See the HMMER site for more information.
iPfam
A resource that describes domain-domain interactions that are observed in PDB entries. Where two or more Pfam domains occur in a single structure, it analyses them to see if they are close enough to form an interaction. If they are close enough, it calculates the bonds forming the interaction.
Metaseq
A collection of sequences derived from various metagenomics datasets
Motif
A short unit found outside globular domains
Noise cutoff (NC)
The bit scores of the highest scoring match not in the full alignment
Pfam-A
A HMM-based, hand-curated Pfam entry which is built using a small number of representative sequences. We manually set a threshold value for each HMM and search our models against the UniProt database. All of the sequences which score above the threshold for a Pfam entry are included in the entry's full alignment.
Pfam-B
An automatically generated alignment which is formed by taking a cluster of sequences from the ADDA database and removing Pfam-A residues from them. Since Pfam-B families are automatically generated, we recommend that you verify that the sequences in a Pfam-B family are related using other methods, such as BLAST. For Pfam 24.0 we have made HMMs for the first (and therefore largest) 20,000 Pfam-B families. Users can search their sequences against the Pfam-B HMMs, in addition to the Pfam-A HMMs, when performing both single-sequence searches and batch searches on the website.
Posterior probability
HMMER3 reports a posterior probability for each residue that matches a match or insert state in the profile HMM. A high posterior probability shows that the alignment of the amino acid to the match/insert state is likely to be correct, whereas a low posterior probability indicates that there is alignment uncertainty. This is indicated on a scale with 10 being the highest certainty, down to 1 being complete uncertainty. Within Pfam we display this information as a heat map view, where green residues indicate high posterior probability and red ones indicate a lower posterior probability.
Repeat
A short unit which is unstable in isolation but forms a stable structure when multiple copies are present.
Seed alignment
An alignment of a set of representative sequences for a Pfam entry. We use this alignment to construct the HMMs for the Pfam entry.
Sequence score
The total score of a sequence aligned to a HMM. If there is more than one domain, the sequence score is the sum of all the domain scores for that Pfam entry. If there is only a single domain, the sequence and the domain scores for the protein will be identical. We use the sequence score to determine whether a sequence belongs to the full alignment of a particular Pfam entry.
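The relationship between domain scores, the sequence score and the two gathering cutoffs can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the DomainHit class and both function names are hypothetical, the summation follows the HMMER2-style rule described above (the glossary notes it is not quite true for HMMER3), and the way the sequence and domain cutoffs are combined here is an assumption rather than Pfam's actual pipeline.

```python
from dataclasses import dataclass


@dataclass
class DomainHit:
    """One domain match of a sequence to a profile HMM (hypothetical record)."""
    start: int
    end: int
    bit_score: float


def sequence_score(domains):
    """HMMER2-style sequence score: the sum of all domain bit scores."""
    return sum(d.bit_score for d in domains)


def domains_in_full_alignment(domains, seq_cutoff, dom_cutoff):
    """Return the domains that would enter the full alignment.

    Assumption: the sequence must reach the sequence cutoff, and each
    domain must additionally reach the domain cutoff.
    """
    if sequence_score(domains) < seq_cutoff:
        return []
    return [d for d in domains if d.bit_score >= dom_cutoff]


hits = [DomainHit(1, 50, 12.0), DomainHit(60, 110, 9.5)]
print(sequence_score(hits))
print(domains_in_full_alignment(hits, seq_cutoff=20.0, dom_cutoff=10.0))
```

With a single domain the sequence score and the domain score coincide, which matches the statement in the entry above.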
Trusted cutoff (TC)
The bit scores of the lowest scoring match in the full alignment
SMART (a Simple Modular Architecture Research Tool) allows the identification and annotation of genetically mobile domains and the analysis of domain architectures. More than 500 domain families found in signalling, extracellular and chromatin-associated proteins are detectable. These domains are extensively annotated with respect to phyletic distributions, functional class, tertiary structures and functionally important residues. Each domain found in a non-redundant protein database, as well as search parameters and taxonomic information, are stored in a relational database system. User interfaces to this database allow searches for proteins containing specific combinations of domains in defined taxa.
Features
You can use SMART in two different modes: normal or genomic. The main difference is in the underlying protein database used. In Normal SMART, the database contains Swiss-Prot, SP-TrEMBL and stable Ensembl proteomes. In Genomic SMART, only the proteomes of completely sequenced genomes are used: Ensembl for metazoans and Swiss-Prot for the rest.
The protein database in Normal SMART has significant redundancy, even though identical proteins are removed. If you use SMART to explore domain architectures, or want to find exact domain counts in various genomes, consider switching to Genomic mode. The numbers in the domain annotation pages will be more accurate, and there will not be many protein fragments corresponding to the same gene in the architecture query results. We should remember, though, that we are exploring a limited set of genomes.
Different color schemes are used to easily identify the mode we are in:
Normal mode Genomic mode
Alignment
Representation of a prediction of the amino acids in tertiary structures of homologues that overlay in three dimensions. Alignments held by SMART are mostly based on published observations (see domain annotations for details), but are updated and edited manually.
Alignment block
Ungapped alignments that usually represent a single secondary structure
Bits scores
Alignment scores are reported by HMMer and BLAST as bits scores. The likelihood that the query sequence is a bona fide homologue of the database sequence is compared to the likelihood that the sequence was instead generated by a random model. Taking the logarithm (to base 2) of this likelihood ratio gives the bits score.
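The definition above amounts to a single log-odds calculation. The following minimal Python sketch shows it; the function name and the example likelihoods are made up for illustration and are not taken from HMMer or BLAST.

```python
import math


def bits_score(likelihood_homologue, likelihood_random):
    """Log (base 2) of the ratio between the two likelihoods."""
    return math.log2(likelihood_homologue / likelihood_random)


# A sequence that is 8 times more likely under the homology model
# than under the random model scores about 3 bits.
print(bits_score(8e-10, 1e-10))
```

A score of 0 bits means both models explain the sequence equally well, and each additional bit doubles the likelihood ratio in favour of homology.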
BLAST: Basic local alignment search tool
An excellent database searching tool developed at the National Center for Biotechnology Information (NCBI) ([1] [2] [3] [4]). SMART uses NCBI-BLAST for detection of outlier homologues and homologues of known structure. WU-BLAST is used for nrdb searches with user-supplied sequences.
Cellular Role
Chromatin-associated
Chromatin is the tangled fibrous complex of DNA and protein within a eukaryotic nucleus.
Interaction (with the environment)
Molecules that sense cellular environmental change such as osmolarity, light flux, acidity, ion concentration, etc.
Metabolic
Enzymes that catalyze reactions in living cells that transform organic molecules.
Replication
The process of making an identical copy of a section of duplex (double-stranded) DNA, using existing DNA as a template for the synthesis of new DNA strands.
Signalling
Proteins that participate in a pathway that is initiated by a molecular stimulus and terminated by a cellular response.
Transport
Transporters (permeases) are proteins that straddle the cell membrane and carry specific nutrients, ions, etc. across the membrane.
Translation
The process in which the genetic code carried by messenger RNA directs the synthesis of proteins from amino acids.
Transcription
The synthesis of an RNA copy from a sequence of DNA (a gene); the first step in gene expression.
Coiled coils
Intimately-associated bundles of long alpha-helices ([1] [2] [3]). Coiled coils are detected in SMART using the method of Lupas et al ([4], COILS home). Coiled coil predictions are indicated on the second line in SMART's graphical output.
Domain
Conserved structural entities with distinctive secondary structure content and a hydrophobic core. In small disulphide-rich and Zn2+-binding or Ca2+-binding domains, the hydrophobic core may be provided by cystines and metal ions, respectively. Homologous domains with common functions usually show sequence similarities.
Domain composition
Proteins with the same domain composition have at least one copy of each of the domains of the query.
Domain organisation
Proteins having all the domains of the query in the same order (additional domains are allowed).
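The difference between the two query types above can be made concrete with a short Python sketch. It assumes a protein is represented as an ordered list of domain names; both function names are hypothetical and do not correspond to anything in SMART itself.

```python
def same_composition(query, target):
    """True if the target has at least one copy of each query domain."""
    return set(query) <= set(target)


def same_organisation(query, target):
    """True if the query domains occur in the target in the same order.

    Additional domains in the target are allowed, so this is a
    subsequence test: each query domain is searched for in the part
    of the target that has not been consumed yet.
    """
    remaining = iter(target)
    return all(domain in remaining for domain in query)


query = ["SH3", "SH2", "TyrKc"]
print(same_composition(query, ["SH2", "SH3", "TyrKc"]))        # same domains, different order
print(same_organisation(query, ["SH2", "SH3", "TyrKc"]))
print(same_organisation(query, ["SH3", "PH", "SH2", "TyrKc"]))  # extra PH domain is allowed
```

Every protein that satisfies the organisation test also satisfies the composition test, but not the other way round.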
Entrez
A WWW-based system that allows easy retrieval of sequence, structure, molecular biology and literature data (Entrez). SMART's domain annotation pages contain links to the Entrez system, thereby providing extensive literature, structure and sequence information.
E-value
This represents the number of sequences with a score greater than or equal to X expected absolutely by chance. The E-value connects the score (X) of an alignment between a user-supplied sequence and a database sequence, generated by any algorithm, with how many alignments with similar or greater scores would be expected from a search of a random sequence database of equivalent size. Since version 2.0, E-values are calculated using Hidden Markov Models, leading to more accurate estimates than before.
Extracellular Domains
Domain families that are most prevalent in proteins outside of the cytoplasm and the nucleus
Gap
A position in an alignment that represents a deletion within one sequence relative to another. Gap penalties are requirements for alignment algorithms in order to reduce excessively-gapped regions. Gaps in alignments represent insertions that usually occur in protruding loops or beta-bulges within protein structures.
Genomic database
Protein database used in SMART's Genomic mode. It contains data from completely sequenced genomes only. Ensembl data is used for Metazoan genomes and Swiss-Prot for others. A complete list of genomes in the database is available.
HMM: Hidden Markov model
HMMs are statistical models of the sequence consensus of a homologous family (see the docu). A particular class of HMMs has been shown to be equivalent to generalised profiles (8867839). Applications of HMMs to sequence analysis are nicely provided by HMMer and SAM.
HMM consensus
The HMM consensus is a one-line summary of the corresponding HMM. The amino acid shown for the consensus is the highest probability amino acid at that position according to the HMM. Capital letters mean highly conserved residues (probability > 0.5 for protein models) (modified from the HMMer User's Guide).
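The consensus rule described above can be sketched as follows. This is a simplified illustration, not HMMer code: the model is assumed to be given as a list of per-position emission probability dictionaries, and the function name and example probabilities are invented.

```python
def hmm_consensus(emissions, threshold=0.5):
    """Build a one-line consensus from per-position emission probabilities.

    emissions: list of dicts mapping amino acid letters to probabilities,
    one dict per match position (an assumed, simplified representation).
    The most probable residue is shown at each position, in upper case
    when its probability exceeds the threshold and in lower case otherwise.
    """
    letters = []
    for column in emissions:
        residue, probability = max(column.items(), key=lambda item: item[1])
        if probability > threshold:
            letters.append(residue.upper())
        else:
            letters.append(residue.lower())
    return "".join(letters)


model = [
    {"A": 0.70, "G": 0.30},
    {"L": 0.40, "V": 0.35, "I": 0.25},
    {"K": 0.90, "R": 0.10},
]
print(hmm_consensus(model))  # AlK
```

The second position is written in lower case because no single residue reaches the 0.5 threshold there.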
HMMer
The HMMer package ([1] [2]) provides multiple alignment and database searching capabilities. There are several programs in the package (see the docu), including one (hmmfs) that searches databases for non-overlapping LOCAL similarities (i.e. that match across at least part of the HMM) and another (hmmls) that searches databases for non-overlapping GLOBAL similarities (i.e. that match across the full HMM). These correspond approximately to profile-based searches using negative and positive profiles, respectively (see WiseTools). Database searches using hmmls or hmmfs provide alignment scores as bits scores.
Homology
Evolutionary descent from a common ancestor due to gene duplication
Intracellular Domains
Domain families that are most prevalent in proteins within the cytoplasm
Localisation
Numbers of domains that are thought from SwissProt annotations to be present in different cellular compartments (cytoplasm, extracellular space, nucleus and membrane-associated) are shown in annotation pages.
Motif
Sequence motifs are short conserved regions of polypeptides. Sets of sequence motifs need not necessarily represent homologues.
NRDB: non-redundant database
A database that contains no identical pairs of sequences. It can contain multiple sequences originating from the same gene (fragments, alternative splicing products). SMART in Standard mode uses such a database.
ORF
Open reading frame
Outlier homologues
These are often difficult to detect using HMM methodology. A complementary approach to their detection is to query a database of sequences taken from multiple sequence alignments using BLAST. Selecting this option will also activate searches against sequence databases derived from proteins of known structure. A simple BLAST search of the PDB is performed, together with a search of RPS-Blast profiles derived from SCOP. These profiles were kindly provided by Steffen Schmidt (see Schmidt et al, J Chem Inf Comput Sci 2002 (42) 405-7).
P-value
This represents a probability that, given a database of a particular size, random sequences score higher than a value X. P-values are generated by the BLAST algorithm that has been integrated into SMART.
PDB: protein data bank
PDB is an archive of experimentally-determined three-dimensional structures (Brookhaven Nat. Labs or EBI). Domain families represented in SMART and in the PDB are annotated as being of known structure; links are provided in SMART to the PDB via PDBsum and MMDB. PDBsum links can be used to access a variety of sequence-based and structure-based tools, whereas MMDB provides access to literature information and structure similarities.
PFAM
Pfam is a database of protein domain families represented as (i) multiple alignments and (ii) HMM-profiles ([1] [2]). Pfam WWW servers allow comparison of user-supplied sequences with the Pfam database (Sanger Center and Washington Univ). SMART contains a facility to search the Pfam collection using HMMer.
Prokaryotic domains
SMART now also searches for domains found in two-component regulatory systems. These can be found mainly in prokaryotes, but a few were also found in eukaryotes like yeast and plants.
Profile
A profile is a table of position-specific scores and gap penalties, representing a homologous family, that may be used to search sequence databases (Ref [1] [2] [3]). In CLUSTAL-W-derived profiles, those sequences that are more distantly related are assigned higher weights ([4] [5] [6]). Issues in profile-based database searching are discussed in Bork & Gibson (1996) [7].
ProfileScan
An excellent WWW server that allows a user to compare a protein or DNA sequence against a database of profiles (located at the ISREC).
PROSITE
This is a dictionary of protein sites and motif patterns. Some SMART domain annotations contain links to PROSITE.
Schnipsel database: domain sequence database
Schnipsel is a German word meaning snippet or fragment. The schnipsel database consists of the sequences of all domains found with SMART in NRDB. Outliers of a family often cannot be detected by a profile, yet are detectable by pairwise similarity to one or more established members of a sequence family. So searching against the schnipsel database gives complementary information to the profile searches.
Searched domains
In the first version of SMART only eukaryotic signalling domains could be searched. In 1998 we extended this set by prokaryotic signalling and extracellular domains. In the input page you can choose whether you only want to search cytoplasmic (prokaryotic + eukaryotic), extracellular or all domains.
Secondary Literature
The secondary literature is derived by the following procedure: for each of the hand-selected papers referenced by a domain, 100 neighbouring papers are retrieved using Medline. If one of these neighbouring papers is referenced from more than two original papers, it is included in the secondary literature list.
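The selection step of this procedure can be sketched in Python. The sketch assumes the neighbour lists have already been retrieved (the Medline lookup itself is not shown), and the function name, parameter names and paper identifiers are all hypothetical.

```python
from collections import Counter


def secondary_literature(neighbours_by_paper, max_neighbours=100, min_sources=3):
    """Select neighbouring papers supported by more than two original papers.

    neighbours_by_paper: dict mapping each hand-selected paper to its
    ranked list of neighbouring papers (assumed to be precomputed).
    Only the first max_neighbours entries of each list are considered.
    """
    support = Counter()
    for neighbours in neighbours_by_paper.values():
        # Count each neighbour once per original paper.
        for neighbour in set(neighbours[:max_neighbours]):
            support[neighbour] += 1
    return sorted(p for p, count in support.items() if count >= min_sources)


neighbours = {
    "P1": ["A", "B", "C"],
    "P2": ["A", "B"],
    "P3": ["A", "C"],
    "P4": ["B", "D"],
}
print(secondary_literature(neighbours))  # ['A', 'B']
```

Here "more than two original papers" is expressed as a minimum of three supporting papers, so C (two supporters) and D (one supporter) are left out.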
Seed Alignment
Alignment that contains only one of each pair of homologues that are represented in a CLUSTALW-derived phylogenetic tree linked by a branch of length less than a distance of 0.2 (see the related article).
SEG
A program of Wootton & Federhen [1] that detects regions of the query sequence that have low compositional complexity [2].
Sequence ID or ACC
Sequence identifiers or accession codes may be entered via the SMART homepage to initiate a query. You can use either Uniprot or Ensembl sequence identifiers.
Signalling Domains
The original set of domains used in SMART were collected as those that satisfied one or both of two criteria:
1. cytoplasmic domains that possess kinase, phosphatase, ubiquitin ligase or phospholipase enzymatic activities, or those that stimulate GTPase-activation or guanine nucleotide exchange
2. cytoplasmic domains that occur in at least two proteins with different domain organisations, of which one also contains a domain that satisfies criterion 1
These domains mostly mediate or regulate the transduction of an extracellular signal towards the nucleus, resulting in the initiation of a cellular response. More recently, prokaryotic two-component signalling domains have been added to the SMART set.
SignalP
This program predicts the presence and location of signal peptide cleavage sites in amino acid sequences (SignalP home page).
SMART: Simple Modular Architecture Research Tool
Our main goal in providing this tool is to allow automatic identification and annotation of domains in user-supplied protein sequences.
Species
Numbers of domains present in a variety of selected taxa (animal, archaea, bacteria, fungi, plants and protozoa) are shown in annotation pages.
SwissProt
The SwissProt database is an extensively annotated and non-redundant collection of protein sequences. SwissProt annotations have been mined for SMART-derived annotations of alignments.
TMHMM2
This program predicts the location and topology of transmembrane helices (TMHMM2).
Thresholds
For each of the domains found by SMART, a combination of thresholds is used to distinguish between true and false hits. The different thresholds are described in the SMART paper.
WiseTools
A package that is based on database searches using profiles. Profiles may be generated using PairWise and then compared with sequence databases using SearchWise. Scores are generated for alignments that match the whole of the profile (using a positive profile) or else that match at least part of the profile (using a negative profile). Only the top-scoring optimal alignment of each sequence is reported; hence SMART relies on re-iterating the search for new repeats until none are reported that score above threshold. Score thresholds have been set manually that are considered to represent a score just above the top-scoring true negative. Additional thresholds have been estimated for domains that are repeated in single polypeptides. A more recent package allows comparison of DNA sequences at the level of their conceptual translations, regardless of sequence error and introns (see Wise2).
What's new
Changes from version 6.0 to 7.0
Full text search engine
Perform a full text search of SMART and Pfam domain annotations, plus the complete protein descriptions for Uniprot and Ensembl proteins.
metaSMART
Explore and compare domain architectures in various publicly available metagenomics datasets.
iTOL export and visualization
Domain architecture analysis results can be exported and visualized in interactive Tree Of Life. The new option can be found in the protein list function select list.
User interface cleanup
Various small changes to the UI, resulting in faster and easier navigation.
Changes from version 5.1 to 6.0
Metabolic pathways information
SMART domains and proteins available in the genomic mode now have basic metabolic pathways information. It is generated by mapping our genomic mode protein database to the KEGG orthologous groups.
Updated genomic mode protein database
Greatly expanded genomic mode database now includes sequences from 630 completely sequenced species.
Changes from version 5.0 to 5.1
SMART webservice
You can access SMART using our webservice. Check the WSDL file for details. The webservice is still under active development and only works for sequences and IDs which are in our database (complete Uniref100 and stable Ensembl genomes).
SMART DAS server
SMART DAS server is available at URL http://smart.embl.de/smart/das. It provides all the protein features from our database (SMART domains, signal peptides, transmembrane regions and coiled coils) for all Uniref100 and Ensembl proteins.
If you need help with these services, or have questions/feedback, please contact us.
Changes from version 4.1 to 5.0
New protein database in Normal mode
SMART now uses Uniprot as the main source of protein sequences. All Ensembl proteomes (except pre-releases) are also included. To lower redundancy in the database, the following procedure is used:
o only one copy of 100% identical proteins is kept (different IDs are still available)
o each species' proteins are separated
o CD-HIT clustering with 96% identity cutoff is performed on each species separately
o longest member of each protein cluster is used as the representative
o only representative cluster members and single proteins (i.e. proteins which are not members of any clusters) are used in all domain architecture queries and for domain counts in the annotation pages
Even though the number of proteins in the database is almost doubled (current version has around 2.9 million proteins), the redundancy should be minimal.
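Two steps of the redundancy-reduction procedure, collapsing identical proteins and choosing the longest cluster member, can be sketched in Python. This is a rough illustration only: the clustering itself is assumed to have been done beforehand (CD-HIT is not re-implemented here), the function names and record layout are hypothetical, and identical sequences are collapsed without regard to species, which is a simplifying assumption.

```python
from collections import defaultdict


def collapse_identical(proteins):
    """Keep one record per distinct sequence while remembering every ID.

    proteins: list of (protein_id, species, sequence) tuples.
    Returns the retained records and a sequence -> list of IDs mapping.
    """
    ids_by_sequence = defaultdict(list)
    retained = {}
    for protein_id, species, sequence in proteins:
        ids_by_sequence[sequence].append(protein_id)
        retained.setdefault(sequence, (protein_id, species, sequence))
    return list(retained.values()), dict(ids_by_sequence)


def pick_representatives(clusters, lengths):
    """Choose the longest member of each cluster as its representative.

    clusters: list of lists of protein IDs (single proteins are
    one-member clusters); lengths: dict of protein ID -> sequence length.
    """
    return [max(cluster, key=lambda pid: lengths[pid]) for cluster in clusters]


records = [("P1", "human", "MKT"), ("P2", "human", "MKT"), ("P3", "human", "MKTA")]
unique, ids = collapse_identical(records)
print(unique)
print(ids["MKT"])
print(pick_representatives([["P1", "P3"], ["P4"]], {"P1": 3, "P3": 4, "P4": 10}))
```

Only the representatives returned by the second function would then be used for domain architecture queries and domain counts.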
Domain architecture invention dating
As a further step from the single domain to the understanding of multi-domain proteins, SMART now predicts the taxonomic class where the concept of a protein, that is its domain architecture, was invented. The domain architecture is defined as the linear order of all SMART domains in the protein sequence. To derive the point of its invention, all proteins with the same domain architecture are mapped onto NCBI's taxonomy. The last common ancestor of all organisms containing at least one protein with the domain architecture is defined as the point of its origin.
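The last-common-ancestor step can be illustrated with a small Python sketch. The taxonomy below is a toy child-to-parent dictionary, not the real NCBI taxonomy, and the function names are hypothetical; it only shows how a set of organisms sharing an architecture maps to a single ancestral node.

```python
def lineage(taxon, parent):
    """Return the path from the root down to the given taxon."""
    path = [taxon]
    while taxon in parent:
        taxon = parent[taxon]
        path.append(taxon)
    return path[::-1]


def last_common_ancestor(taxa, parent):
    """Deepest node shared by the lineages of all given taxa."""
    ancestor = None
    for nodes in zip(*(lineage(t, parent) for t in taxa)):
        if len(set(nodes)) != 1:
            break
        ancestor = nodes[0]
    return ancestor


# Toy taxonomy (child -> parent), for illustration only.
parent = {
    "Homo sapiens": "Mammalia",
    "Mus musculus": "Mammalia",
    "Mammalia": "Metazoa",
    "Drosophila melanogaster": "Metazoa",
    "Metazoa": "Eukaryota",
    "Saccharomyces cerevisiae": "Fungi",
    "Fungi": "Eukaryota",
}

print(last_common_ancestor(["Homo sapiens", "Mus musculus"], parent))
print(last_common_ancestor(["Homo sapiens", "Drosophila melanogaster"], parent))
print(last_common_ancestor(["Homo sapiens", "Saccharomyces cerevisiae"], parent))
```

In this toy example an architecture seen only in human and mouse would be dated to Mammalia, while one also found in yeast would be dated to Eukaryota.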
Changes from version 4.0 to 4.1
Two modes of operation: Normal and Genomic
For more details, visit the change mode page.
Intrinsic protein disorder prediction
You can now include DisEMBL prediction of protein disorder in your search parameters. The results are displayed as blue regions in protein schematics. DisEMBL's HOTLOOPS and REM465 methods are currently used. Visit the DisEMBL page for more information on the method.
Catalytic activity check
SMART now includes data on specific requirements for catalytic activity for some of our domains (50 catalytic domains at the moment; full list here). If the required amino acids are not present in the predicted domain, it will be marked as Inactive. The domain annotation page will show you the details on which amino acids are missing, and the links to relevant literature (Pubmed link).
Taxonomic trees
Architecture query results are now displayed as simple taxonomic trees. In addition to individual proteins, you can select any taxonomic node (or multiple nodes) and display all the proteins in those nodes. The Evolution section of domain annotation pages is now also represented as a taxonomic tree. The basic tree contains only several hand-picked representative species, with a link to the full tree.
User interface redesigned
SMART has been completely rewritten and all pages conform to the XHTML 1.0 Strict and CSS Level 2 standards. We recommend a modern standards-compliant browser for the best experience. Mozilla Firefox and Opera are our favorites.
Changes from version 3.5 to 4.0
Intron positions are shown in protein schematics
For proteins that match any of the Ensembl predictions, SMART will show intron positions as vertical coloured lines in graphical representations (see example). This information is retrieved from a pre-calculated mapping of Ensembl gene structures to protein sequences.
The vertical line at the end of the protein is not an actual intron, but a mark to show that intron mapping was performed. If that is the only line, there are no introns annotated. If there is no line at all, there is no data available in Ensembl for that particular sequence.
You can switch off intron display on your SMART preferences page.
Alternative splicing information
Since SMART now incorporates Ensembl genomes, the Additional information page shows a list of alternative splices of the gene encoding the analyzed protein (if there are any). It is possible to either display SMART protein annotation for any of the alternative splices, or get a graphical multiple sequence alignment of all of them.
Orthology information
There are 2 separate sets of orthologs for each Ensembl protein: 1:1 reciprocal best matches in other genomes, and orthologous groups with reciprocal best hits from all genomes analyzed (i.e. each of these proteins has exactly one ortholog in all 6 genomes). This data is displayed on the Additional information page.
Graphical multiple sequence alignments
Orthologous groups and different alternative splices can be displayed as graphical multiple sequence alignments. Proteins are aligned using ClustalW. Domains, intrinsic features and introns are mapped onto the alignment, with their positions adjusted according to gaps (black boxes).
Changes from version 3.4 to 3.5
Features of all Ensembl genomes are stored in SMART
You can use standard Ensembl protein identifiers (for example ENSP00000264122) and sequences in all queries.
Batch access
Quickly display results for hundreds of sequences. Currently limited to 500 sequences per access, 2000 per day.
Smarter handling of identical sequences
If there are multiple IDs associated with a particular sequence, an extra table will be displayed showing all of them.
Changes from version 3.3 to 3.4
Search structure based profiles using RPS-Blast
Clicking on the search schnipsel and structures checkbox will now also initiate a search of profiles based on scop domain families using RPS-Blast. These profiles were kindly provided by Steffen Schmidt (see Schmidt et al, J Chem Inf Comput Sci 2002 (42) 405-7).
Improved Architecture analysis
In addition to standard Domain selection querying, it is now possible to do queries based on GO (Gene ontology, click here for more info) terms associated with domains. In the first step you get a list of domains matching the GO terms entered. After selecting the domains of interest from the list, proteins containing those domains are displayed. Use the Taxonomic selection box to limit the results based on taxonomic ranges.
Pfam domains are stored in the database
SMART database now contains precomputed results for all Pfam domains. To use Pfam domains in the architecture queries, prepend the domain name with Pfam: (for example TyrKc AND Pfam:Fz AND TRANS).
Try our SMART Toolbar for Mozilla web browser. Click here for more info.
Changes from version 3.2 to 3.3
Fantastic new protein picture generator
Proteins are now displayed as dynamically generated PNG images. This means that you can download the entire protein representation as a single image. You have our permission to do this, and use these diagrams in any way that you like; do acknowledge us though. All domain bubbles have been script-generated using The Gimp and its Perl-Fu extension. The script is GPL'd and is available for download.
Fancy colour alignments
All alignments we generate are coloured in using Leo Goodstadt's excellent CHROMA program. CHROMA is available from here, and it's great. You may experience some problems if you're using a klunky browser. This will be fixed when you change your browser.
Excellent transmembrane domains
These are now calculated using the fine TMHMM2 program (the main web site is here), kindly provided by Anders Krogh and co-workers. You can read about the method here.
Selective SMART
We now store intrinsic features such as transmembrane domains and signal peptides in our database. This means that these can be queried for (e.g. SIGNAL AND TyrKc). The feature names are: signal peptide, SIGNAL; transmembrane domain, TRANS; coiled coil, COIL.
Technical changes
SMART code has been modified to run under the Apache mod_perl module. SMART is now using Apache::DBI for persistent database connections. The database engine has been updated to PostgreSQL 7.1. These changes resulted in significant speed improvements.
Bug fixes, as usual
The show_many_proteins script now uses the POST method, so there is no longer a limit on the number of proteins you can display. Fixed a display problem with proteins having thousands of different representations (if you try to display those, make sure you have a good browser (like Mozilla or Opera) and a bunch of RAM).
Changes from version 3.1 to 3.2
Start page
There is an indicator of the current database status in the header now. If the database is down or is being updated, you'll know immediately.
Literature
Changed literature identifiers to PMID. New secondary literature generator that parses all neighbouring papers, not just the first 100.
Numerous small bug fixes and improvements.
Changes from version 3.0 to 3.1
Startup page
The start page now includes selective SMART and allows searching for keywords in the annotation of domains.
Schnipsel Blast
The results of a schnipsel Blast search are included in the bubble diagram. Additionally, PDB is searched.
Taxonomic breakdown
When selecting multiple proteins, e.g. via selective SMART or the annotation pages, an overview of the taxonomy of all species (tax break) is offered.
Links
Links don't contain version numbers. This allows stable links from external sources.
Selective SMART
Allows searching for multiple copies of domains and is case insensitive. You can now search for e.g. Sh3 AND sH3 AND sh3.
Annotation
You can align your query sequence to the SMART alignment using hmmalign.
Update of underlying database
SMART now uses PostgreSQL 6.5.2.
Changes from version 2.0 to 3.0
Digest output
SMART now only produces a single diagram representing a best interpretation of all the annotation that has been performed. A comprehensive summary of the results is also provided in table format.
selective SMART
Selective SMART allows you to look for proteins with combinations of specific domains in different species or taxonomic ranges.
alert SMART
The SMART database gets updated about once a week. If you are interested in specific domains, or combinations of them, in specific taxonomic ranges, you can use the SMART alerting service. This provides the identities of newly-deposited proteins that match your query.
Domain queries
You can ask for proteins having the same domain order or composition as your query protein.
SMARTed Genomes
You can get the result of SMARTing the genomes of Caenorhabditis elegans and Saccharomyces cerevisiae via the annotation pages.
Faster PFAM searches
The PFAM searches now run on a PVM cluster.
Changes from version 1.03 to 2.0
Domain coverage
The original set of signalling domains has now been extended to include extracellular domains.
Default search method is now HMMer
The previously-used underlying methodology of SMART (i.e. SWise) tended to over-extend gaps, leading to problems in defining domain borders. Although a SWise search of the SMART database remains possible (using the Wise searching SMART database option available from the Home Page), the default searching method is now HMMer2. This now allows improved statistical estimates (Expectation- or E-values) of the significance of a domain hit.
Complete rewrite of the annotation pages
As with the previous version (1.03), annotation pages provide information concerning domain functions and include hyperlinks to PubMed. However, we now offer automatically-derived data showing the taxonomic range and the predicted cellular localisation of proteins containing the domain in question.
Literature database
In addition to the manually derived literature sources given in version 1.03, we now provide automatically-derived secondary literature from the annotation pages. These are extracted from PubMed and are cross-linked to additional abstracts.
The CATH domain structure database: new protocols and classification levels give a more comprehensive resource for exploring evolution
Lesley H Greene, Tony E Lewis, Sarah Addou, Alison Cuff, Tim Dallman, Mark Dibley, Oliver Redfern, Frances Pearl, Rekha Nambudiry, Adam Reid, Ian Sillitoe, Corin Yeats, Janet M Thornton and Christine A Orengo
WHAT'S NEW
We report the latest release (version 3.0) of the CATH protein domain database (http://www.cathdb.info). There has been a 20% increase in the number of structural domains classified in CATH, up to 86 151 domains. Release 3.0 comprises 1110 fold groups and 2147 homologous superfamilies. To cope with the increases in diverse structural homologues being determined by the structural genomics initiatives, more sensitive methods have been developed for identifying boundaries in multi-domain proteins and for recognising homologues. The CATH classification update is now being driven by an integrated pipeline that links these automated procedures with validation steps, which have been made easier by the provision of information-rich web pages summarising comparison scores and relevant links to external sites for each domain being classified. An analysis of the population of domains in the CATH hierarchy and several domain characteristics are presented for version 3.0. We also report an update of the CATH Dictionary of homologous structures (CATH-DHS), which now contains multiple structural alignments, consensus information and functional annotations for 1459 well populated superfamilies in CATH. CATH is directly linked to the Gene3D database, which is a projection of CATH structural data onto ~2 million sequences in completed genomes and UniProt.
GENERAL INTRODUCTION
The number of new structures being deposited in the Protein Data Bank (PDB) continues to grow at a considerable rate. In addition, structures being targeted by worldwide structural genomics initiatives are more likely to be novel or only very remotely related to domains previously classified. Only 2% of structures currently solved by conventional crystallography or NMR are likely to adopt novel folds (see Figures 1 and 2). A higher proportion of new folds is expected among structures solved by structural genomics. Although the influx of more diverse structures and subsequent analysis will inform our understanding of how domains evolve, it has resulted in increasing lags between the numbers of structures being deposited and classified. In response to this situation, we have significantly improved our automated and manual protocols for domain boundary assignment and homologue recognition.
Figure 1 Annual decrease in the percentage of new structures classified in CATH which are observed to possess a novel fold. The raw data for years 1972–2005 was fit to a single exponential equation by nonlinear regression using Sigma Plot (SPSS Version
Figure 2 Annual proportion of protein structures deposited in the PDB which are classified in CATH, rejected or pending classification. The colour scheme reflects different categories of PDB chains. Black: not accepted by the CATH criteria. Red: unprocessed chains
Significant changes have been implemented in the CATH classification protocol to achieve a more highly automated system. A seamless flow of structures between the constituent programs has been achieved by building a pipeline which integrates web services for each major comparison stage in the classification (see Figure 3). Secondly, completely automatic decisions are now being made for new protein chains with close relatives already assigned in the CATH database. There are two situations that preclude the CATH update process from being fully automated. We rely on expert manual curation for particularly challenging protein domain boundary assignments (DomChop stage) and also for classifications of remote folds and homologues (HomCheck stage). These two manual stages will remain an integral part of the system (Figure 3).
Figure 3 Flow diagram of the CATH classification pipeline. This schematic illustrates the processes involved in classifying newly determined structures in CATH. The CATH update protocol workflow from new chain to assigned domain is split into two main processes
In this paper, we report our ongoing development of the automated procedures. These critical new features should better enable CATH to keep pace with the PDB (3) and facilitate its development. Key statistics on the domain structure populations and characteristics are also presented.
A REVISED CATH CLASSIFICATION HIERARCHY: CATHSOLID
In order to provide information on the sequence diversity between superfamily members, we have introduced additional levels into the CATH hierarchy. The CATH hierarchical classification scheme now consists of nine levels. Class is derived from secondary structure content, and Architecture describes the gross orientation of secondary structures independent of connectivity. The Topology level clusters structures into fold groups according to their topological connections and numbers of secondary structures. The Homologous superfamilies cluster proteins with highly similar structures, sequences and/or functions (4,5). The new extension of the CATH classification system now includes five 'SOLID' sequence levels. S, O, L and I further divide domains within the H-level using multi-linkage clustering based on similarities in sequence identity (35, 60, 95 and 100%) (see Table 1). The D-level acts as a counter within the I-level and is appended to the classification hierarchy to ensure that every domain in CATH has a unique CATHSOLID identification code (see Table 1). Specific details on the nature of the SOLID levels can be found in the 'General Information' section of the CATH website, http://www.cathdb.info. CATH only includes experimentally determined protein structures with a 4 Å resolution or better, 40 residues in length or longer, and having 70% or more side chains resolved.
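As a minimal sketch of how a nine-level identifier could be handled in code, the Python below splits a CATHSOLID code into its named levels and recovers the superfamily (C.A.T.H) prefix. The dot-separated integer format, the function names and the example identifier are assumptions made for illustration; they are not taken from the CATH data files.

```python
from typing import Dict

# Level names in hierarchy order: C, A, T, H followed by the SOLID levels.
LEVELS = ("C", "A", "T", "H", "S", "O", "L", "I", "D")


def parse_cathsolid(identifier: str) -> Dict[str, int]:
    """Split a dot-separated nine-level code into a level -> number mapping."""
    parts = identifier.split(".")
    if len(parts) != len(LEVELS):
        raise ValueError(f"expected {len(LEVELS)} levels, got {len(parts)}")
    return {level: int(part) for level, part in zip(LEVELS, parts)}


def superfamily_of(identifier: str) -> str:
    """Return the C.A.T.H prefix shared by all members of a superfamily."""
    return ".".join(identifier.split(".")[:4])


# Hypothetical identifier, used only to exercise the functions.
example = "3.40.50.200.1.1.1.1.2"
levels = parse_cathsolid(example)
print(levels["H"], levels["D"])
print(superfamily_of(example))
```

Two domains with the same first eight numbers would differ only in the D counter, which is what makes every code unique.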
Table 1 CATH version 3.0 statistics
DOMAIN BOUNDARY ASSIGNMENTS
We have further improved our automated domain boundary prediction method, CATHEDRAL (4). This is used to search a newly determined multi-domain structure against a library of representative structures from different fold groups in the CATH database to recognise constituent domains. CATHEDRAL performs an initial rapid secondary structure comparison between structures, using graph theory to identify putative fold matches, which are then more carefully aligned using a slower, more accurate dynamic programming method. A new scoring scheme has been implemented which combines information on the class of the domains being compared, their sizes, similarity in the structural environments and the number of equivalent residues. A support vector machine is used to combine the different scores and select the best fold match for each putative domain in a new multi-domain protein.
Benchmarking against a set of 964 'difficult' multi-domain chains, whose 1593 constituent domains were remotely related to folds in CATH (<35% sequence identity) and originated from 245 distinct fold groups and 462 superfamilies, showed that 90% of domains within these chains could be assigned to the correct fold group, and for 78% of them the domain boundaries were within ±15 residues of boundaries assigned by careful manual validation. Larger variations in domain boundaries are often due to the fact that in many families significant structural variation can occur during evolution, so that distant relatives vary considerably in size. If no close relative has been classified in CATH, it is likely that the only CATHEDRAL match will be to a relative with significant structural embellishments, thus making it harder to determine the correct boundaries.
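The ±15 residue agreement test used in the benchmark can be illustrated with a short Python sketch. It assumes each assignment is given as a list of boundary residue positions and that predicted and manual boundaries are paired in sorted order; both the representation and the function name are hypothetical, since the text does not say how the comparison was implemented.

```python
from typing import Sequence


def boundaries_agree(predicted: Sequence[int],
                     manual: Sequence[int],
                     tolerance: int = 15) -> bool:
    """True if every predicted boundary lies within `tolerance` residues
    of the corresponding manually assigned boundary."""
    if len(predicted) != len(manual):
        # A different number of boundaries means a different number of domains.
        return False
    pairs = zip(sorted(predicted), sorted(manual))
    return all(abs(p - m) <= tolerance for p, m in pairs)


# Hypothetical two-boundary (three-domain) chain.
print(boundaries_agree([118, 247], [125, 240]))   # both within 15 residues
print(boundaries_agree([118, 270], [125, 240]))   # second boundary is 30 off
```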
Since domain boundary assignment of remote homologues is one of the most time-consuming stages in the classification, we combine multiple sources of information for each new structure on a web page to guide manual curation. Pages display scores from a range of algorithms, which include structure-based methods (CATHEDRAL (4), SSAP (6,7), DETECTIVE (8), PUU (9), DOMAK (10)), sequence-based methods such as hidden Markov models (HMMs) (11), and relevant literature. These pages are now viewable for information on putative boundaries for new multi-domain structures currently being classified in CATH (e.g. http://www.cathdb.info/cgi-bin/cath/Chain.pl?chain_id=2g3aA).
For protein chains which are closely related to chains that are already chopped in CATH, an automated protocol has been developed (ChopClose). ChopClose identifies any previously chopped chains that have sufficiently high sequence identity and overlap with the query chain. Using SSAP (6), the query is aligned against each of these chains in turn, and in each case the domain boundaries are inherited across the alignment. The process of inheriting the boundaries often requires some adjustments to be made to account for insertions, deletions or unresolved residues. If the inheritance from one of the chains meets various criteria (SSAP score ≥80, sequence identity >80%, RMSD ≤6.0 Å, longest end extension ≤10 residues, etc.), then the resulting boundaries are used to chop the chain automatically (AutoChop). For cases where ChopClose's best result does not meet all the criteria for automatic chopping, it is provided as support information for a manual domain boundary assignment.
Refer to Figure 3 for the location of AutoChop/ChopClose within the CATH update protocol.
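The AutoChop decision described above can be sketched as a simple threshold check. The Python below assumes the four thresholds are read as SSAP score ≥ 80, sequence identity > 80%, RMSD ≤ 6.0 Å and longest end extension ≤ 10 residues; the further criteria covered by 'etc' are not listed in the text and are left out. The class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Inheritance:
    """Result of inheriting boundaries from one previously chopped chain."""
    ssap_score: float
    sequence_identity: float      # percent
    rmsd: float                   # angstroms
    longest_end_extension: int    # residues


def meets_autochop_criteria(r: Inheritance) -> bool:
    # Only the four criteria named in the text; the real protocol has more.
    return (r.ssap_score >= 80
            and r.sequence_identity > 80
            and r.rmsd <= 6.0
            and r.longest_end_extension <= 10)


def decide(candidates: Iterable[Inheritance]) -> str:
    """'AutoChop' if any inheritance passes, otherwise hand over to curation."""
    if any(meets_autochop_criteria(c) for c in candidates):
        return "AutoChop"
    return "manual (ChopClose result shown as support)"


print(decide([Inheritance(92.0, 95.0, 1.2, 3)]))
print(decide([Inheritance(85.0, 70.0, 2.5, 4)]))
```

Keeping the check separate from the decision makes it easy to add the unlisted criteria later without touching the routing logic.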
NEW HOMOLOGUE RECOGNITION METHODS
We have assessed a number of HMM-based protocols for improving homologue recognition. A new protocol (Samosa), exploiting models built using multiple structure alignments to improve accuracy, gives some improvements in sensitivity (4–5%). However, a protocol exploiting an 8-fold expanded HMM library based on sequence relatives of structural domains gives an increase of nearly 10% in sensitivity (12). In addition, HMM–HMM-based approaches have been implemented using the PRC protocol of Madera and co-workers (http://supfam.org/PRC). These allow recognition of extremely remote homologues, some of which are not easily detected by the structure comparison methods (discussed further below). HMM-based database scans developed for the CATH classification protocol are collectively referred to as HMMscan below.
For some very remotely related homologues, confidence in an assignment can be improved by combining information from multiple prediction methods. We have investigated the benefits of using machine learning methods to do this automatically. A neural network was trained using a dataset of 14 000 diverse homologues (<35% sequence identity) and 14 000 non-homologous pairs, with data from different homologue comparison methods including structure comparison (CATHEDRAL, SSAP), sequence comparison (HMM–HMM) and information on functional similarity. The latter was obtained by comparing EC classification codes between close relatives of the distant homologues and using a semantic similarity scoring scheme for comparing GO terms based on a method developed by Lord et al. (13). On a separate validation set of 14 000 homologous pairs and 14 000 non-homologous pairs, 97% of the homologues can be recognised at an error rate of <4%.
NEW UPDATE PROTOCOL
Automatic methods
Previously, CATH data was generated using a group of independent programs and flat files. Over the past two years we have developed an update protocol for CATH that is driven by a suite of programs with a central library and a PostgreSQL database system. A classification pipeline has been established which links, in a completely automated fashion, the different programs that analyse the sequences and structures of both protein chains and domains. The CATH update protocol can essentially be divided into two parts: domain boundary assignment and domain homology classification (see Figure 3). The aim of the protocol is to minimise manual assignment and provide as much support as possible when manual validation is necessary.
Processing of both parts of the classification protocol is similar, requiring related meta-data and the triggering of the same automated algorithms. Methods include pairwise sequence similarity comparisons and scans by other homologue detection or fold recognition algorithms, such as HMMscan and CATHEDRAL, that provide data for either manual or automated assignment. Many of the automated steps in the protocol have been established as a web service, and the pipeline integrates automated steps together with 'holding stages' in which domains are held prior to processing and await the completion of manual validation of predictions (see below).
Web pages to support manual validation
For each manual stage in the classification (domain boundary assignment, DomChop, and homologue recognition, HomCheck), we have developed a suite of web pages bringing together all available meta-data from prediction algorithms (e.g. DBS, CATHEDRAL and HMMscan for DomChop; CATHEDRAL and HMMscan for HomCheck), information from the literature and from other family classifications with relevant data (e.g. Pfam). For each protein or domain shown on the pages, information on the statistical significance of matches is presented. The web pages will shortly be made viewable and will provide interim data on protein chains and domains not fully classified in CATH, for biologists interested in any entries pending classification.
OVERVIEW OF THE CURRENT RELEASE (VERSION 3.0)
Assigning domain boundaries and relationships between protein structures is computationally challenging. Since the last CATH release (version 2.6), the number of domains in the CATH database has increased by 20% in version 3.0 and now totals 86 151. This is a more than 10-fold increase in the number of domains classified in CATH since its creation. Improvements in automation, and also in the web-based resources used to aid manual validation, have allowed us to increase the proportion of hard-to-classify structures processed in CATH, and this is reflected in a significant increase in the number of folds in the database, now more than 1000. The detailed breakdown of numbers of domains in the nine CATH levels is given in Table 1. We conducted an analysis of the domains in version 3.0 and have derived statistics for several fundamental features.
Percentage of new topologies
An analysis of the percentage of new folds arising from the early 1970s to the present is shown in Figure 1. The number of new folds has been decreasing over time with respect to the number of new structures being deposited, and it can be seen that currently approximately 2% of new structures classified in CATH are observed to be novel folds. For comparison, the number of domain structures solved over time is also graphically represented in Figure 1.
Number of domains within a protein chain
Integral to the construction of the CATH database is designating domain boundaries. We conducted an analysis of the number of chains versus the number of domains in a chain. It is interesting to note that 64% of all protein structures currently solved and classified in CATH are single-domain chains (data not shown). The next most prevalent are two-domain chains (27%), and following this we find that the number of chains containing three or more domains rapidly decreases. The average size of the single-domain chains is 159 residues.
The CATH Dictionary of Homologous Superfamilies (CATH-DHS)
The CATH-DHS has also been recently updated. Data on structural similarity and superfamily variability is presented as a significant update to the Dictionary of Homologous Superfamilies (DHS) web resource (14). The DHS also provides functional annotations of domains within each H-level (superfamily) in CATH v2.5.1.
For each superfamily, pairwise structural similarity scores between relatives, measured by SSAP, are presented. The DHS now contains 3307 multiple structural alignments for 1459 superfamilies. For each superfamily, multiple alignments are generated for all the relatives and also for subgroups of structurally similar relatives and sequence-similar relatives. Alignments are performed using the residue-based CORA algorithm (15) and presented both as CORAPLOTS (14) and in the form of a 2DSEC diagram (16), alongside coordinate data of the superposed structures in PDB format. Sequence representations of the alignments are available to download in FASTA format. In the CORAPLOT images of the multiple alignment, residues in each domain are coloured according to ligand binding and residue type. EquivSEC plots are also shown that describe the variability in orientation and packing between equivalent secondary structures (16).
To identify sequence relatives for CATH superfamilies, sequences from UniProt (17) were scanned against HMMs of all CATH domains (12). Homologous sequences were identified as those hits with an E-value <0.01 and a 60% residue overlap with the CATH domain. This protocol recognised over one million domain sequences in UniProt which could be integrated in the CATH-DHS. The harvested sequences in each superfamily were compared against other relatives by BLAST (18) to determine the pairwise sequence identity and then clustered at appropriate levels of sequence identity (35 and 95%) using multi-linkage clustering. Information and links to other functional databases, ENZYME (19), GO (Gene Ontology Consortium, 2000), KEGG (20), COG (21) and SWISSPROT (22), are also included by BLASTing the sequences from each superfamily against sequences provided by these resources. Only hits with 95% sequence identity and an 80% residue overlap were used to annotate sequences.
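The hit-selection rule in this paragraph can be sketched as a filter. The Python below assumes the cut-offs are an E-value below 0.01 and an overlap of at least 60%, and it assumes that 'overlap' means the fraction of the CATH domain's residues covered by the alignment; the text does not define the overlap measure, so that definition and all names here are illustrative.

```python
from typing import List, NamedTuple


class Hit(NamedTuple):
    sequence_id: str
    evalue: float
    aligned_residues: int   # domain residues covered by the alignment
    domain_length: int      # length of the CATH domain model


def is_homologue(hit: Hit,
                 max_evalue: float = 0.01,
                 min_overlap: float = 0.60) -> bool:
    """Apply the E-value and residue-overlap cut-offs to one HMM hit."""
    overlap = hit.aligned_residues / hit.domain_length
    return hit.evalue < max_evalue and overlap >= min_overlap


def select_homologues(hits: List[Hit]) -> List[str]:
    return [h.sequence_id for h in hits if is_homologue(h)]


# Hypothetical hits against a 150-residue domain model.
hits = [
    Hit("seqA", 1e-20, 140, 150),   # strong, near full-length
    Hit("seqB", 1e-5, 60, 150),     # significant but only 40% overlap
    Hit("seqC", 0.5, 145, 150),     # full-length but not significant
]
print(select_homologues(hits))
```

The same function could be reused for the annotation step by passing stricter cut-offs, although sequence identity would then need to be added as a separate field.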
Recent analysis of structural and functional divergence in highly populated CATH superfamilies (>5 structural relatives with <35% sequence identity) has been undertaken using data from the DHS. The 2DSEC algorithm was used to analyse multiple structural alignments of families and identify highly conserved structural cores and secondary structure embellishments, or decorations, to the common core. In some large superfamilies, extensive embellishments were observed outside the core, and although these secondary structure insertions were frequently discontinuous in the protein chain, they were often co-located in 3D space (16). In many cases, manual inspection revealed that the embellishments had aggregated to form a larger structural feature that was modifying the active site of the domain or creating new surfaces for domain or protein interactions. Data collected in the DHS clearly show a relationship between structural divergence within a superfamily, sequence divergence of this superfamily amongst predicted domains in the genomes, and the number of distinct functional groups that can be identified for the superfamily (see Figure 4).
Figure 4 Relationship between sequence variability, structural variability and functional diversity in CATH superfamilies. Structural variation in a CATH superfamily as measured by the number of diverse structural subgroups (SSAP score <80 between groups)
LATERAL LINKS ACROSS THE CATH HIERARCHY TO CAPTURE EVOLUTIONARY DIVERGENCE AND EXPLORE THE STRUCTURAL CONTINUUM
Our analysis of structural divergence in CATH superfamilies (16) has revealed families where significant changes in the structures had occurred. In some cases 5-fold differences in the sizes of domains were identified, and sometimes it was apparent that the 'folds' of these very diverse relatives had effectively changed. Therefore, in these superfamilies more than one fold group can be identified, effectively breaking the hierarchical nature of the CATH classification, which implies that each relative within a CATH homologous superfamily should belong to the same CAT fold group.
In addition, an 'all versus all' HMM–HMM scan between all superfamily representatives revealed several cases of extremely remote homologues which had been classified into separate superfamilies and yet match with significant E-values. Structure comparison had failed to detect the relationship between these superfamilies because the structural divergence of the relatives was so extreme, sometimes constituting a change in architecture as well as fold group. In these cases, homology was only suggested by the HMM-based scans and then manually validated by considering functional information and detailed evidence from the literature. In order to capture information on these distant homologies, links have been created between the superfamilies, both on our web pages and in the CATH database. The data can now be found as a link from the CATH homepage (http://www.cathdb.info).
In the near future, we also plan to provide web pages presenting cases of significant structural overlaps between superfamilies or fold groups. For these cases, we are not currently able to find any additional evidence to support a distant evolutionary relationship, and these examples highlight the recurrence of large structural motifs between some folds and the existence of a structural continuum in some regions of fold space.
FEATURES
The CATH database can be accessed at http://www.cathdb.info. The web interface may be browsed or, alternatively, searched with PDB codes or CATH domain identifiers. There is also a facility for keyword searches. With the version 3.0 release, we now make the raw and processed data files available (for example CATH domain PDB files, sequences and DSSP files); they can be accessed through the CATH database main page. The Gene3D resource can be accessed through the CATH database or directly at http://www.cathdb.info/Gene3D. The DHS can be accessed through the CATH database or directly at http://www.cathdb.info/bsm/dhs.
Structural similarity is assessed using an automatic method (SSAP) (3,4), which scores 100 for identical proteins and generally returns scores above 80 for homologous proteins. More distantly related folds generally give scores above 70 (Topology or fold level), though in the absence of any sequence or functional similarity this may simply represent examples of convergent evolution, reinforcing the hypothesis that there exists a limited number of folds in nature (5,6).
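The SSAP score ranges quoted above can be turned into a rough rule of thumb. The Python sketch below is only an illustration of those ranges (100 for identical proteins, generally above 80 for homologues, generally above 70 for the same fold); the function name and the category labels are invented here, and a score alone does not establish homology without supporting sequence or functional evidence.

```python
def interpret_ssap(score: float) -> str:
    """Map an SSAP score (0-100) onto the rough ranges given in the text."""
    if not 0 <= score <= 100:
        raise ValueError("SSAP scores lie between 0 and 100")
    if score == 100:
        return "identical"
    if score > 80:
        return "likely homologous (H-level range)"
    if score > 70:
        return "similar fold (T-level range)"
    return "no clear structural relationship"


for s in (100, 86.5, 74, 52):
    print(s, "->", interpret_ssap(s))
```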
Abstract
We report the latest release (version 1.4) of the CATH protein domains database (http://www.biochem.ucl.ac.uk/bsm/cath). This is a hierarchical classification of 13 359 protein domain structures into evolutionary families and structural groupings. We currently identify 827 homologous families in which the proteins have both structural similarity and sequence and/or functional similarity. These can be further clustered into 593 fold groups and 32 distinct architectures. Using our structural classification and associated data on protein functions stored in the database (EC identifiers, SWISS-PROT keywords and information from the Enzyme database and literature), we have been able to analyse the correlation between 3D structure and function. More than 96% of folds in the PDB are associated with a single homologous family. However, within the superfolds, three or more different functions are observed. Considering enzyme functions, more than 95% of clearly homologous families exhibit either single or closely related functions, as demonstrated by the EC identifiers of their relatives. Our analysis supports the view that determining structures, for example as part of a 'structural genomics' initiative, will make a major contribution to interpreting genome data.
Introduction FOR CATH
The CATH classification of protein domain structures was established in 1993 (1) as a hierarchical clustering of protein domain structures into evolutionary families and structural groupings, depending on sequence and structure similarity. There are four major levels, corresponding to protein class, architecture, topology or fold, and homologous family (Fig. 1). Since 1995, information about these structural groups and protein families has been accessible over the Web (http://www.biochem.ucl.ac.uk/bsm/cath), together with summary information about each individual protein structure (PDBsum) (2).
CATH consists of both phylogenetic and phenetic descriptors for protein domain relationships. At the lowest levels in the hierarchy, proteins are grouped into evolutionary families (Homologous families) for having either significant sequence similarity (≥35% identity) or high structural similarity and some sequence similarity (≥20% identity).
The CATH database of protein structures contains approximately 18 000 domains organized according to their (C)lass, (A)rchitecture, (T)opology and (H)omologous superfamily. Relationships between evolutionarily related structures (homologues) within the database have been used to test the sensitivity of various sequence search methods in order to identify relatives in GenBank and other sequence databases. Subsequent application of the most sensitive and efficient algorithms, gapped BLAST and the profile-based method Position-Specific Iterated Basic Local Alignment Search Tool (PSI-BLAST), could be used to assign structural data to between 22 and 36% of microbial genomes in order to improve functional annotation and enhance understanding of biological mechanism. However, on a cautionary note, an analysis of functional conservation within fold groups and homologous superfamilies in the CATH database revealed that whilst function was conserved in nearly 55% of enzyme families, function had diverged considerably in some highly populated families. In these families, functional properties should be inherited far more cautiously, and the probable effects of substitutions in key functional residues must be carefully assessed.
Figure 1
Schematic representation of the (C)lass, (A)rchitecture and (T)opology/fold levels in the CATH database
Figure 2
Snapshot of a web page showing data available in the CATH dictionary of homologous superfamilies for the subtilisin family (CATH id 3.40.50.200). Tables display the PDB codes for non-identical relatives in the family, together with EC identifier codes and information about the enzyme reactions. The multiple structural alignment shown has been coloured according to secondary structure assignments (red for helix, blue for strands).
The Architecture level in CATH groups proteins whose folds have similar 3D arrangements of secondary structures (e.g. barrel, sandwich or propeller), regardless of their connectivity, whilst the top level, Class, simply reflects the proportion of α-helix or β-strand secondary structures. Three major classes are recognised, mainly-α, mainly-β and α–β, since analysis revealed considerable overlap between the α+β and alternating α/β classes originally described by Levitt and Chothia (7).
Before classification, multidomain proteins are first separated into their constituent folds using a consensus method which seeks agreement between three independent algorithms (8). Whilst the protocol for updating CATH is largely automatic (9), several stages require manual validation, in particular establishing domain boundaries in proteins for which no consensus could be reached and checking the relationships of very distant homologues and proteins having borderline fold similarity. Although there are plans to assign the more regular architectures automatically, all architecture groupings are currently assigned manually.
A homologous family Dictionary is now available within CATH, which contains functional data, where available, for each protein within a homologous family. This includes EC identifiers, SWISS-PROT keywords and information from the Enzyme database or the literature (Fig. 2). Multiple structure-based alignments are also available, coloured according to secondary structure assignments or residue properties, and there are schematic plots showing domain representations annotated by protein-ligand interactions (DOMPLOTS) (A.E. Todd, C.A. Orengo and J.M. Thornton, submitted to Protein Engng). The topology of each domain is illustrated by schematic TOPS diagrams (http://www3.ebi.ac.uk/tops; 10).
Figure 3
CATH wheel plot showing the population of homologous families in different fold groups, architectures and classes. The wheel is coloured according to protein class (red, mainly-α; green, mainly-β; yellow, α–β; blue, few secondary structures). The size of the outer wheel represents the number of homologous families in CATH, whilst each band in the outer wheel corresponds to a single fold family. The size of each fold 'band' therefore reflects the number of homologous families having that fold. It can be seen that most fold families contain a single homologous family. The superfold families are shown as paler bands containing many homologous families. The inner wheel shows the population of homologous families in the different architectures.
We have also recently set up a Web Server (11) which enables the user to scan the CATH database with a newly determined protein structure and identify possible fold similarities or evolutionary relationships. There are also plans to incorporate sequence searches (using BLAST or PSI-BLAST) (12) to identify a probable fold for a new sequence.
The latest release of CATH (version 1.4, April 1998) contains 9342 protein chains from the PDB (13), which divide into 13 359 domain folds. Currently 32 different architectures are recognised. Since the last release, three new architectures have been described, including the five-bladed α-β propeller. Grouping proteins on the basis of sequence, structure and functional similarity gives 827 evolutionary homologous families (H-level), whilst recognising more distant structural similarity with no accompanying sequence or function similarity gives rise to 593 different fold groups (T-level).
The population of the different levels in the CATH hierarchy is illustrated by the CATH wheel shown in Figure 3. It can be seen that several highly populated fold families, which we describe as superfolds (6) as they support a diverse range of sequences and more than three different functions, still account for nearly 30% of non-homologous structures.
Implications for Structural Genomics
As the sequence databases grow rapidly, the need to interpret these sequences and assign functions to specific genes becomes increasingly important. Many techniques exist for matching protein sequences and thereby inheriting functional information. However, for very distant homologues there is often no detectable sequence similarity despite conservation of 3D structure and function. For these cases, evolutionary relationships, and thereby functions, can only be assigned by comparing the structures. Therefore a number of structural genomics initiatives are being proposed (14) which aim to identify all the folds in nature, with the ultimate goal of being able to predict the function of a new protein from its known or probable structure. The important questions to ask are: how many more folds do we need to determine before we have the complete set, and how confident can we be in assigning function between proteins having similar structures?
In the current genomes, on average only 30–46% of sequences can be assigned to a structural family by recognising sequence similarity to a protein of known structure (15,16). With only ∼600 unique structures currently in the PDB compared with ∼20 000 sequence families, it is clear that we still need to determine many more structures if we are to understand biology at the molecular level.
However, analysis of recently deposited structural data is very revealing. Figure 4a illustrates the distribution of 2159 new structural domains classified in the 10 months from June 1997 to March 1998. A large proportion of these (79%) were clearly homologous (≥30% identity) to proteins of known structure.
Of the remaining 443 structures (Fig 4b) corresponding to newlsquo sequences we found only 8 were
novel folds the remainder resembling a previously determined structure Many of these 199 (45)
could be identified as clear homologues by having significant structure and sequence similarity (SSAP
ge80 and ge20 sequence identity) A further 169 (38) were probable homologues as although thesequence identity was below 20 they had functional similarity andor gave significant scores using
sequence search methods designed to detect very distant homologues (PSIBLAST) (12) There
remained a further 40 (9) proteins which were analogous mdash ie they had the same fold as a previous
entry but neither the sequence nor the function gave definite evidence of a common ancestor
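The breakdown of the 443 'new' structures can be checked with a few lines of arithmetic. The sketch below uses only the counts quoted above; the number of novel folds is not stated in the text, so it is inferred here as the remainder (an assumption), and the category labels are just illustrative names.

```python
# Counts quoted in the text for the 443 structures with 'new' sequences.
total_new = 443
counts = {
    "clear homologues": 199,
    "probable homologues": 169,
    "analogues": 40,
}
# Assumption: the novel-fold count is not given directly, so take the remainder.
counts["novel folds (inferred)"] = total_new - sum(counts.values())


def percentages(category_counts, total):
    """Return each category's share of the total, rounded to a whole percent."""
    return {name: round(100 * n / total) for name, n in category_counts.items()}


shares = percentages(counts, total_new)
for name, pct in shares.items():
    print(f"{name}: {counts[name]} ({pct}%)")

# Share of 'new' sequences recognisable as homologues from structure.
homologue_share = shares["clear homologues"] + shares["probable homologues"]
print(f"assignable as homologues: {homologue_share}%")
```

The inferred remainder is 35 structures, which rounds to the 8% quoted for novel folds, and the two homologue categories together account for 83% of the 'new' sequences.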
At the homologous superfamily level in CATH, a more detailed analysis of enzyme functions showed
that the majority of homologous enzyme families in CATH (>90%) contained proteins for which the
first three EC identifiers were the same. Considering those families where homologues have
significant sequence identity (≥20%) after structural alignment, 95% were found to have a single EC
identifier, whilst for families where proteins have more than 30% sequence similarity we observed
that 98% had a single EC code.
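As a rough illustration of what "the first three EC identifiers were the same" means in practice, the sketch below compares EC numbers by their leading fields. The function names and the example EC numbers are hypothetical and are not taken from the CATH analysis.

```python
def ec_prefix(ec_number, levels=3):
    """Return the first `levels` fields of an EC number such as '3.1.1.7'."""
    return tuple(ec_number.split(".")[:levels])


def shares_ec_class(ec_numbers, levels=3):
    """True if every EC number in the family agrees on its first `levels` fields."""
    return len({ec_prefix(ec, levels) for ec in ec_numbers}) == 1


# Hypothetical family: two enzymes differing only in the fourth EC field.
family = ["3.1.1.7", "3.1.1.8"]
print(shares_ec_class(family, levels=3))  # same class down to the third field
print(shares_ec_class(family, levels=4))  # but not a single full EC code
```

Comparing at three levels corresponds to the ">90%" observation, while comparing at four levels corresponds to the stricter "single EC code" criterion.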
Although assigning function on the basis of homology is common practice, it is clear that some
caution should be exercised, particularly where there is little or no sequence similarity. There are also
some clear examples where homologues with significant sequence similarity perform different
functions. The role of 'gene recruitment' is especially clear in the eye lens proteins, which function as
enzymes in other cellular environments but which are used as structural proteins in this context (17).
The extent of such 'gene recruitment' and context-sensitive function is really not known at this time.
For enzymes, it is clear that catalytic function can change and evolve, usually to act on a different but
related substrate. Similarly, within the lipocalin family (CATH id 2.40.130.10) several proteins are
found with very similar structures which bind different fatty acids in the same region at the base of
the β-barrel (e.g. retinol, bilin, biotin).
Nearly half of the homologous families where two or more different EC numbers were observed
belong to the superfolds. This suggests that if a new protein is assigned to a superfold family, more
caution should be used when inheriting functional information, as there appears to be greater tolerance
to changes in sequence, and ultimately function, for these families. However, it is interesting to note
that many of these were TIM barrel or Rossmann folds. These are superfolds in which the substrate or
ligand commonly binds in the same place: in the base of the β-barrel for the TIMs, and at the
crossover of the polypeptide chain for the doubly wound Rossmann structures.
Assignment of Function Through Structure
One of the reasons for determining structures is to derive more information to facilitate the assignment
of function. From our analysis of proteins in CATH, we suggest that structural data can help to assign
function in several ways:
(i) The structural data allow recognition of more distant homologues compared with
sequence data. In our analysis, 83% of structures with novel sequences could be assigned as
homologues in this way (note that such assignment of function is again subject to the caveats
imposed by 'gene recruitment' discussed above).
(ii) The structural data allow detailed inspection of the functional site, to suggest if and
how the function may have evolved. For example, if an enzyme has evolved to act on a
different substrate, the binding site may reveal, or at least suggest, possible changes in the
substrate.
(iii) For the superfolds, similarity of structure does not necessarily mean similarity of
function. However, the active site/binding sites are often conserved, e.g. in the TIM barrel or
Rossmann fold structures the ligand always binds at the same end of the barrel or sheet.
(iv) Some methods have already been developed, and will increasingly be the focus of
attention over the next few years, which aim to predict function ab initio from structure. For
example, enzymes can often be identified by the presence of a major cleft, which also locates
the active site (18). Similarly, critical surface patches which are used for molecular
recognition in binding other proteins or ligands may be identified using knowledge-based
approaches (19,20).
In summary, extrapolating the data from Figure 4 to a new genome, we can expect that of the 54–70%
of sequences which currently have no obvious sequence matches in the PDB, we will find nearly
80–90% to be homologous to a known family using the structural data alone. For the singlet folds this
will almost certainly reveal some clues to the function. For the superfolds, some folds will reveal
information on the functional class (e.g. enzyme for TIM barrels) or the location of the active site, if
not the specific function. Only 10–20% will be expected to be 'novel' folds. For these, the ab
initio methods referred to above may provide some clues to guide experiments. Therefore it is clear
that determining structures, as part of a 'structural genomics' initiative for example, will make a
major contribution to interpreting genome data.
JAIL
Just Another Interface Library
Interfaces of macromolecules are a valuable basis for analysing the process of molecular recognition.
JAIL classifies not only the interfaces between domain architectures but also those between protein
chains and those between proteins and nucleic acids.
Figure: Gt-alpha/Gi-alpha chimera (PDB-ID 1GOT) and the interfaces of 1GOT
Interacting proteins are difficult to crystallize and are rarely present in the Protein Data Bank.
Nevertheless, it is essential to analyse the interacting parts of the proteins to understand the
process of protein-protein docking.
To overcome this problem we have built up the JAIL database. Since interacting domains
exhibit structural features similar to those of proteins, all known interfaces between interacting
domains of the SCOP database were extracted and classified in JAIL.
Only a part of all protein structures is included in SCOP. In particular, new PDB entries are
not yet annotated. To overcome this problem, all interfaces between protein
chains were additionally calculated and included in the database. This type of interface also comprises
the interacting parts of the assumed biological units. The last important type of interface
provided here is composed of the interacting parts between proteins and nucleic acids.
Overall, the data set consists of about 180000 interfaces.
JAIL is a convenient tool for browsing the interface library and analysing single
interfaces. However, more general questions require large-scale analysis. For this purpose,
a detailed form enables the compilation of comprehensive non-redundant data sets for
download.
How is an interface defined?
A complete residue is part of an interface if at least one atom of the amino acid is located
within a range of 4.5 Ångström of any atom of the interacting domain or chain. One part of
an interface must consist of at least 5 C-alpha atoms in the case of protein chains. In the
case of nucleic acids, the relevant atoms are calculated on the basis of the phosphorus atoms
of the RNA/DNA backbone.
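The distance criterion above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not JAIL's implementation: atoms are represented as hypothetical (residue_id, (x, y, z)) tuples, the search is a plain all-against-all comparison, and the size check simply counts residues on the assumption that each selected protein residue contributes one C-alpha atom.

```python
import math

CUTOFF = 4.5  # Angstrom, the distance quoted in the text


def interface_residues(atoms_a, atoms_b, cutoff=CUTOFF):
    """Residue ids of partner A with at least one atom within `cutoff` of any atom of B.

    Each atom is a (residue_id, (x, y, z)) tuple; this layout is an assumption.
    """
    selected = set()
    for res_id, xyz in atoms_a:
        if res_id in selected:
            continue  # the whole residue is already part of the interface
        if any(math.dist(xyz, other) <= cutoff for _, other in atoms_b):
            selected.add(res_id)
    return selected


def is_large_enough(residues, minimum=5):
    """Size check for a protein side, assuming one C-alpha atom per selected residue."""
    return len(residues) >= minimum


# Toy coordinates: residue 1 has one atom 3 A from chain B, residue 2 is far away.
chain_a = [(1, (0.0, 0.0, 0.0)), (1, (10.0, 0.0, 0.0)), (2, (20.0, 0.0, 0.0))]
chain_b = [(101, (3.0, 0.0, 0.0))]
print(interface_residues(chain_a, chain_b))
```

For real structures a spatial index (for example a k-d tree) would replace the quadratic loop, but the selection rule stays the same: one close atom is enough to include the complete residue.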
What are biounits?
The primary coordinate file deposited in the PDB generally contains one asymmetric unit.
The asymmetric unit is the smallest portion of a crystal structure to which crystallographic
symmetry can be applied to generate one unit cell. The biological molecule (biounit) is believed
to be the functional unit of the protein. Frequently those units can be assumed or calculated
when additional information is available. The biological units of many proteins are deposited
in a separate section of the PDB database and can be used for interface calculations.
In which way are redundant interfaces excluded in the download section?
Redundancy is excluded in two different ways: by structure and by sequence. The
sequence clustering is based on the CD-HIT program. The structural clustering is defined by
the protein families and superfamilies of the SCOP classification. The database classifies
proteins by domain architecture.
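To give a feel for sequence-based redundancy removal, the sketch below implements a greedy incremental clustering in the spirit of CD-HIT: sequences are processed longest first, and each one either joins the first existing representative it matches at or above the identity threshold or becomes a new representative. It is only an illustration. The identity measure used here is a simple ungapped, position-by-position comparison (an assumption standing in for a real alignment), and none of CD-HIT's word-filtering heuristics are reproduced.

```python
def ungapped_identity(a, b):
    """Fraction of identical positions over the shorter sequence (no gaps allowed)."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / n


def greedy_cluster(sequences, threshold):
    """Greedy incremental clustering: longest first, join the first matching representative."""
    clusters = []  # list of (representative, members)
    for seq in sorted(sequences, key=len, reverse=True):
        for rep, members in clusters:
            if ungapped_identity(rep, seq) >= threshold:
                members.append(seq)
                break
        else:
            clusters.append((seq, [seq]))  # no match: seq starts a new cluster
    return clusters


def representatives(sequences, threshold):
    """One sequence per cluster, i.e. the non-redundant set at this threshold."""
    return [rep for rep, _ in greedy_cluster(sequences, threshold)]


# Hypothetical toy sequences, all of length 12.
seqs = ["ACDEFGHIKLMN", "ACDEFGHIKLMA", "ACDEFGWWWWWW"]
print(len(representatives(seqs, 0.95)))  # strict threshold keeps near-duplicates apart
print(len(representatives(seqs, 0.50)))  # loose threshold merges them
```

A lower identity threshold merges more sequences into each cluster, so the resulting representative set is smaller but more diverse.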
Which settings in the download section are best for my own research?
The selection of the datasets depends on the type of interactions (protein-protein or
protein-nucleic acid) and the level of diversity that is desired. A maximum sequence identity of 50%
results in a higher diversity than a setting of 95%. The default settings include interfaces of
domain-domain interactions as well as interfaces between interacting chains. All interfaces of
chains that are already covered by the SCOP domain interfaces are excluded by default.
This procedure results in a high number of interfaces that are still diverse enough for
statistical analysis.
What is meant by "show conservation" in Jmol?
The conservation of protein sequences is defined by the mutation rates at each amino acid
position. For JAIL, this information was retrieved from ConSurf. ConSurf is a derived
database merging structural and sequence information.
Database scheme
Search for:
PDB-Id, e.g. 1aay
SCOP-Id, e.g. d1az0a_
EC-Number, e.g. 3.1.1
Accession number, e.g. P03697
Protein name, e.g. capsid protein
Search in the following interface types: Domain/Domain (SCOP), Chain/Chain, Protein/Nucleic,
Biounit/Biounit, None
A full-text keyword search and a SCOP text search are also available.
Consider only interfaces having the following location: intra, inter, don't care
MMDB
Experimentally resolved structures of proteins, RNA and DNA, derived from the Protein Data
Bank (PDB), with value-added features such as explicit chemical graphs and computationally
identified 3D domains (compact substructures) that are used to identify similar 3D structures, as
well as links to literature, similar sequences, information about chemicals bound to the
structures, and more. These connections make it possible, for example, to find 3D structures
for homologs of a protein sequence of interest, then interactively view the sequence-structure
relationships, active sites, bound chemicals, journal articles and more.
Three-dimensional structures are now known within many protein families, and it is
quite likely, in searching a sequence database, that one will encounter a homolog
with known structure. The goal of Entrez's 3D-structure database is to make this
information, and the functional annotation it can provide, easily accessible to
molecular biologists. To this end, Entrez's search engine provides three powerful
features: (i) Sequence and structure neighbors: one may select all sequences similar
to one of interest, for example, and link to any known 3D structures. (ii) Links
between databases: one may search by term matching in MEDLINE, for example, and
link to 3D structures reported in these articles. (iii) Sequence and structure
visualization: identifying a homolog with known structure, one may view
molecular-graphic and alignment displays to infer approximate 3D structure. In this article we
focus on two features of Entrez's Molecular Modeling Database (MMDB) not
described previously: links from individual biopolymer chains within 3D structures
to a systematic taxonomy of organisms represented in molecular databases, and
links from individual chains (and compact 3D domains within them) to structure
neighbors, i.e. other chains (and 3D domains) with similar 3D structure. MMDB may be
accessed at http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Structure
SUPERFAMILY is a database of structural and functional annotation for all
proteins and genomes.[1][2][3][4][5]
The SUPERFAMILY annotation is based on a collection of hidden Markov models which
represent structural protein domains at the SCOP superfamily level.[6] A superfamily groups
together domains which have an evolutionary relationship. The annotation is produced by
scanning protein sequences from completely sequenced genomes against the hidden Markov
models.
For each protein you can:
Submit sequences for SCOP classification
View domain organisation, sequence alignments and protein sequence details
For each genome you can:
Examine superfamily assignments, phylogenetic trees, domain organisation lists and
networks
Check for over- and under-represented superfamilies within a genome
For each superfamily you can:
Inspect SCOP classification, functional annotation, Gene Ontology annotation, InterPro
abstract and genome assignments
Explore taxonomic distribution of a superfamily across the tree of life
All annotation, models and the database dump are freely available for download to everyone.
Purpose
SUPERFAMILY classifies amino acid sequences into known structural domains, especially
into SCOP superfamilies. The superfamilies are groups of proteins which have structural
evidence to support a common evolutionary ancestor but may not have detectable
sequence homology.
Major Features
Sequence search: Submit your protein or DNA sequence for SCOP superfamily and family
level classification.
Keyword search: Search for superfamily, family or species names, plus
sequence, SCOP, PDB or hidden Markov model IDs.
Domain assignments: Domain assignments, alignments and architectures for completely
sequenced eukaryotic and prokaryotic organisms, plus sequence collections.
Comparative genomics tools: Browse unusual (over- and under-represented) superfamilies and
families, adjacent domain pair lists and graphs, unique domain pairs, domain
combinations, domain architecture co-occurrence networks and domain
distribution across taxonomic kingdoms for each organism.
Genome statistics: For each genome: number of sequences, number of sequences with
assignment, percentage of sequences with assignment, percentage total
sequence coverage, number of domains assigned, number of superfamilies
assigned, number of families assigned, average superfamily size,
percentage produced by duplication, average sequence length, average
length matched, number of domain pairs and number of unique domain
architectures.
Gene Ontology: Domain-centric Gene Ontology (GO), automatically annotated (by Hai Fang).
Phenotype Ontology: Domain-centric phenotype/anatomy ontology including Disease
Ontology, Human Phenotype, Mouse Phenotype, Worm Phenotype, Yeast
Phenotype, Fly Phenotype, Fly Anatomy, Zebrafish Anatomy, Xenopus
Anatomy, Arabidopsis Plant.
Superfamily annotation: InterPro abstracts for 1052 superfamilies and Gene Ontology (GO)
annotation for 763 superfamilies.
Functional annotation: Functional annotation of SCOP 1.73 superfamilies (by Christine Vogel).
Phylogenetic trees: Trees are generated using heuristic parsimony methods and are based on
protein domain architecture data for all genomes in SUPERFAMILY.
Genome combinations or specific clades can be displayed as individual
trees.
Similar domain architectures: Find the 10 domain architectures which are most similar to a
domain architecture of interest.
Hidden Markov models: Produce SCOP domain assignments for your sequences using the
SUPERFAMILY models. HMM visualisation by Martin Madera, e.g. model 0045110.
Profile comparison: Find remote domain matches when the HMM search fails to find a
significant match. Profile comparison (PRC) for aligning and scoring two
profile hidden Markov models (by Martin Madera).
Web services: Distributed Annotation Server and linking to SUPERFAMILY.
Downloads: Sequences, assignments, models, MySQL database and scripts, updated
weekly.
The SUPERFAMILY database provides protein domain assignments, at the SCOP superfamily
level, for the predicted protein sequences in over 400 completed genomes. A superfamily groups
together domains of different families which have a common evolutionary ancestor based on
structural, functional and sequence data. SUPERFAMILY domain assignments are generated
using an expert-curated set of profile hidden Markov models. All models and structural
assignments are available for browsing and download from http://supfam.org. The web interface
includes services such as domain architectures and alignment details for all protein assignments,
searchable domain combinations, domain occurrence network visualization, detection of over- or
under-represented superfamilies for a given genome by comparison with other genomes,
assignment of manually submitted sequences and keyword searches. In this update we describe
the SUPERFAMILY database and outline two major developments: (i) incorporation of family
level assignments and (ii) a superfamily-level functional annotation. The SUPERFAMILY
database can be used for general protein evolution and superfamily-specific studies, genomic
annotation, and structural genomics target suggestion and assessment.
Chicken Poultry gene mapping project
ChickMap Chicken genome project
Chicken ArkDB Chicken database
ChickEST Chick EST database
Poultry Poultry genome project
Cotton
Cotton Cotton data collection site
Cyanobacteria (blue-green algae)
Cyanobacteria Anabaena genome
Daphnia (Crustacea)
Daphnia pulex Daphnia genomics consortium
Deer
Deer ArkDB Deer mapping database
Dictyostelium discoideum
Dicty_cDB Dictyostelium discoideum cDNA project
DGP Dictyostelium discoideum genome project
Dictybase Online informatics resources for Dictyostelium
Dog (Canis familiaris)
Dog Dog genome project
Dog genome project
Frog (Xenopus)
Xenbase A Xenopus web resource
Xenopus Xenopus tropicalis genome
Fruit fly (Drosophila melanogaster)
ENSEMBL Drosophila Genome Browser at ENSEMBL
Fruitfly Drosophila genome project at Berkeley
FlyBase A Database of the Drosophila Genome
FlyMove A Drosophila multimedia database
FlyView A Drosophila image database
Fungus
Aspergillus Aspergillus Genomics
Candida Candida albicans information page
FungalWeb Fungi database
FGSC Fungal genetic stocks center
Goat (Capra hircus)
Goat GoatMap mapping the caprine genome
Horse (Equus caballus)
Horse ArkDB Horse mapping database
Medaka fish
Medaka Medaka fish home page
Maize
Maize Maize genome database
Malaria (Plasmodium spp)
Malaria Malaria genetics and genomics
PlasmoDB Plasmodium falciparum genome database
Parasites Parasite databases of clustered ESTs
Parasite Genome Parasite genome databases
Mosquito
Mosquito Mosquito genome web server
Mouse (Mus musculus)
ENSEMBL Mouse genome server at ENSEMBL
Jackson Lab Mouse Resources
MRC Mouse genome center at MRC UK
MGI Mouse genome informatics at Jackson Labs
MGD Mouse genome database
MGS Mouse genome sequencing at NIH
MIT Genetic and physical maps of the mouse genome
Mouse SNP Mouse SNP database
NCI Mouse repository
NIH NIH mouse initiative
ORNL Mutant mouse database
RIKEN Mouse resources
Rodentia The whole mouse catalog
Pig (Sus scrofa)
INCO Pig trait gene mapping
Pig Pig EST database
Pig Pig gene mapping project
PiGBase Pig genome mapping
Pig ArkDB Pig Ark DB
Plants
PlantGDB Resources for plant comparative genomics
Protozoa
Protozoa Protozoan genomes
Pufferfish
Fugu Puffer fish project UK site
Fugu Fugu genome project Singapore
Fugu Puffer fish project USA
Rat (Rattus norvegicus)
MIT Genetic maps of the Rat genome
NIH Rat genomics and genetics
Rat RatMap
RGD Rat genome database
Rice (Oryza sativa)
MPSS Massively parallel signature sequencing
Rice-research Rice genome sequence database
Rice Rice genome project
Rickettsia
RicBase Rickettsia genome database
Salmon
Salmon ArkDB Salmon mapping database
Sheep (Ovis aries)
Sheep Sheep gene mapping
SheepBase Sheep gene mapping
Sheep ArkDB Sheep mapping database
Soy
Soy Soybeans database
Sorghum
Sorghum Sorghum Genomics
Tetraodon
Tetraodon Tetraodon nigroviridis genome
Tetraodon Tetraodon nigroviridis genome at Whitehead
Tilapia
HCGS Tilapia genome
Tilapia ArkDB Tilapia mapping database
Turkey
Turkey ArkDB Turkey mapping database
Viruses
HIV HIV sequence database
Herpes Human herpes virus 5 database
Worm (Caenorhabditis elegans)
C. elegans C. elegans genome sequencing project
NemBase Resource for nematode sequence and functional data
WormAtlas Anatomy of C. elegans
WormBase The genome and biology of C. elegans
ACEDB A C. elegans database
WWW Server C. elegans web server
Yeast
SCPD The promoter database of Saccharomyces cerevisiae
SGD Saccharomyces genome database
S. pombe Schizosaccharomyces pombe genome project
TRIPLES Functional analysis of Yeast genome at Yale
Yeast Intron database Spliceosomal introns of the yeast
Zebra fish (Danio rerio)
ZFIN Zebrafish information network
ZGR Zebrafish genome resources
ZIS Zebrafish information server
Zebrafish Zebrafish webserver
DOMAIN DATABASE
Domains can be thought of as distinct functional and/or structural units of a
protein. These two classifications coincide rather often, as a matter of fact, and what is
found as an independently folding unit of a polypeptide chain also carries a specific
function. Domains are often identified as recurring (sequence or structure) units
which may exist in various contexts. In molecular evolution, such domains may have
been utilized as building blocks and may have been recombined in different
arrangements to modulate protein function. We can define conserved domains as
recurring units in molecular evolution, the extents of which can be determined by
sequence and structure analysis. Conserved domains contain conserved
sequence patterns or motifs, which allow for their detection in polypeptide sequences.
The goal of the NCBI conserved domain curation project is to provide database users with
insights into how patterns of residue conservation and divergence in a family relate to functional
properties, and to provide useful links to more detailed information that may help to understand
those sequence/structure/function relationships. To do this, CDD curators include the following
types of information in order to supplement and enrich the traditional multiple sequence
alignments that form the foundation of domain models: 3-dimensional structures and conserved
core motifs, conserved features/sites, phylogenetic organization, and links to electronic literature
resources.
CDD
Conserved Domain Database (CDD)
CDD is a protein annotation resource that consists of a collection of well-annotated multiple
sequence alignment models for ancient domains and full-length proteins. These are available as
position-specific score matrices (PSSMs) for fast identification of conserved domains in protein
sequences via RPS-BLAST. CDD content includes NCBI manually curated domains, which
use 3D-structure information to explicitly define domain boundaries and provide insights
into sequence/structure/function relationships, as well as domain models imported from a number
of external source databases (Pfam, SMART, COG, PRK, TIGRFAM).
CD-Search & Batch CD-Search
CD-Search is NCBI's interface for searching the Conserved Domain Database with protein or
nucleotide query sequences. It uses RPS-BLAST, a variant of PSI-BLAST, to
quickly scan a set of pre-calculated position-specific scoring matrices (PSSMs) with a protein
query. The results of CD-Search are presented as an annotation of protein domains on the user
query sequence, and can be visualized as domain multiple sequence alignments with embedded
user queries. High-confidence associations between a query sequence and conserved domains are
shown as specific hits. The CD-Search Help provides additional details, including
information about running CD-Search locally.
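The idea of scanning a query with a position-specific scoring matrix can be illustrated with a toy example. The sketch below is not RPS-BLAST: it builds a log-odds matrix from a small ungapped alignment, assuming a uniform amino acid background and a simple pseudocount, and then slides that matrix along the query to find the best ungapped window. All names and the example sequences are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(alignment, pseudocount=1.0):
    """Log-odds matrix from equal-length, ungapped aligned sequences.

    Assumes a uniform background frequency for the 20 amino acids.
    """
    background = 1.0 / len(AMINO_ACIDS)
    pssm = []
    for column in zip(*alignment):
        counts = Counter(column)
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm


def best_hit(pssm, query):
    """Slide the matrix along the query; return (score, start) of the best window.

    Returns None if the query is shorter than the matrix. The query is assumed
    to contain only the 20 standard amino acid letters.
    """
    width = len(pssm)
    best = None
    for start in range(len(query) - width + 1):
        score = sum(pssm[i][query[start + i]] for i in range(width))
        if best is None or score > best[0]:
            best = (score, start)
    return best


# Hypothetical three-column motif and a query that contains it at position 3.
motif_alignment = ["HCH", "HCH", "HCY"]
pssm = build_pssm(motif_alignment)
score, start = best_hit(pssm, "AAAHCHAAA")
print(start, round(score, 2))
```

Residues that are common in an alignment column score positively and rare ones negatively, so the window matching the motif stands out; real tools add gapped alignment and statistical significance estimates on top of this.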
Batch CD-Search serves as both a web application and a script interface for a conserved domain
search on multiple protein sequences, accepting up to 100000 proteins in a single job. It enables
you to view a graphical display of the concise or full search result for any individual
protein from your input list, or to download the results for the complete set of proteins. The
Batch CD-Search Help provides additional details.
CDART: Domain Architectures
The Conserved Domain Architecture Retrieval Tool (CDART) performs similarity searches of the
Entrez Protein database based on domain architecture, defined as the sequential order of
conserved domains in protein queries. CDART finds protein similarities across significant
evolutionary distances using sensitive domain profiles rather than direct sequence similarity.
Proteins similar to the query are grouped and scored by architecture. You can search CDART
directly with a query protein sequence or, if a sequence of interest is already in the Entrez
Protein database, simply retrieve the record, open its Links menu and select Domain Relatives to
see the precalculated CDART results. Relying on domain profiles allows CDART to be fast and,
because it relies on annotated functional domains, informative.
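Treating a domain architecture as the sequential order of conserved domains makes it easy to group proteins that share one. The sketch below shows only that grouping step, with hypothetical protein identifiers, domain names and start positions; it does not reproduce how CDART scores or ranks architectures.

```python
from collections import defaultdict


def architecture(domain_hits):
    """Order domain names from N- to C-terminus.

    `domain_hits` is a list of (start_position, domain_name) tuples (an assumed layout).
    """
    return tuple(name for _, name in sorted(domain_hits))


def group_by_architecture(proteins):
    """Map each architecture to the list of protein ids that share it."""
    groups = defaultdict(list)
    for protein_id, hits in proteins.items():
        groups[architecture(hits)].append(protein_id)
    return dict(groups)


# Hypothetical annotations: P1 and P2 share an architecture, P3 does not.
proteins = {
    "P1": [(10, "SH3"), (80, "SH2"), (200, "Kinase")],
    "P2": [(150, "Kinase"), (5, "SH3"), (60, "SH2")],
    "P3": [(1, "Kinase")],
}
for arch, members in group_by_architecture(proteins).items():
    print(" - ".join(arch), members)
```

Because the comparison is made on domain order rather than on residues, two proteins can fall into the same group even when their sequences have diverged beyond direct recognition.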
CDTree
CDTree is a helper application for your web browser that allows you to interactively view and
examine conserved domain hierarchies curated at NCBI. CDTree works with Cn3D as its alignment
viewer/editor; it is used in the CDD curation process and is both a classification and a
research tool for functional annotation and the study of protein and protein domain families.
Content
CDD content includes NCBI manually curated domain models and domain models imported from
a number of external source databases (Pfam, SMART, COG, PRK, TIGRFAMs). What is unique
about NCBI-curated domains is that they use 3D-structure information to explicitly define domain
boundaries, align blocks, amend alignment details, and provide insights into
sequence/structure/function relationships. Manually curated models are organized hierarchically if
they describe domain families that are clearly related by common descent. To provide a
non-redundant view of the data, CDD clusters similar domain models from various sources into
superfamilies.
Searching the database
The collection is also part of NCBI's Entrez query and retrieval system, crosslinked to numerous
other resources. CDD provides annotation of domain footprints and conserved functional
sites on protein sequences. Precalculated domain annotation can be retrieved for protein
sequences tracked in NCBI's Entrez system, and CDD's collection of models can be
queried with novel protein sequences via the CD-Search service, or via Batch CD-Search, which
allows the computation and download of annotation for large sets of protein queries. CDD also
contains data from additional research projects, such as KOGs (a eukaryotic counterpart to COGs)
and the Library of Ancient Domains (LOAD). Accessions that start with "cl" are for superfamily
cluster records and can contain domain models from one or more source databases. When searching
CDD, it is possible to limit search results to domains from any given source database by using
the Database Search Field.
Phylogenetic organization: Based on evidence from sequence comparison, NCBI
Conserved Domain Curators attempt to organize related domain models into
phylogenetic family hierarchies.
Links to electronic literature resources: NCBI-curated domains also provide links to
citations in PubMed and NCBI Bookshelf that discuss the domain. These references
are selected by curators and, whenever possible, include articles that provide
evidence for the biological function of the domain and/or discuss the evolution and
classification of a domain family.
PFAM
Pfam 27.0 (March 2013, 14831 families)
Proteins are generally composed of one or more functional regions, commonly termed domains. The
presence of different domains in varying combinations in different proteins gives rise to the
diverse repertoire of proteins found in nature. Identifying the domains present in a protein can
provide insights into the function of that protein.
The Pfam database is a large collection of protein domain families. Each family is represented
by multiple sequence alignments and hidden Markov models (HMMs).
There are two levels of quality to Pfam families: Pfam-A and Pfam-B. Pfam-A entries are derived
from the underlying sequence database, known as Pfamseq, which is built from the most recent
release of UniProtKB at a given time-point. Each Pfam-A family consists of a curated seed
alignment containing a small set of representative members of the family, profile hidden Markov
models (profile HMMs) built from the seed alignment, and an automatically generated full
alignment, which contains all detectable protein sequences belonging to the family, as defined by
profile HMM searches of primary sequence databases.
Pfam-B families are un-annotated and of lower quality, as they are generated automatically from
the non-redundant clusters of the latest ADDA release. Although of lower quality, Pfam-B families
can be useful for identifying functionally conserved regions when no Pfam-A entries are found.
Pfam entries are classified in one of four ways:
Family: A collection of related protein regions
Domain: A structural unit
Repeat: A short unit which is unstable in isolation but forms a stable structure when multiple
copies are present
Motif: A short unit found outside globular domains
Pfam also generates higher-level groupings of related families, known
as clans. A clan is a collection of Pfam-A entries which are related by
similarity of sequence, structure or profile-HMM.
You can find data in Pfam in various ways:
Analyze your protein sequence for Pfam matches
View Pfam family annotation and alignments
See groups of related families
Look at the domain organisation of a protein sequence
Find the domains on a PDB structure
Query Pfam by keywords
You can also enter any type of accession or ID to jump to the page for a Pfam family or clan,
UniProt sequence, PDB structure, etc.
Browse Pfam
You can browse lists of families, clans or proteomes which begin with a chosen letter (or
number). You can also see a list of Pfam families which are new to this release, or the list
of the twenty largest families in terms of number of sequences.
Glossary of terms used in Pfam
These are some of the commonly used terms in the Pfam website.
Alignment coordinates
HMMER3 reports two sets of domain coordinates for each profile HMM match. The envelope coordinates delineate the region on the sequence where the match has been probabilistically determined to lie, whereas the alignment coordinates delineate the region over which HMMER is confident that the alignment of the sequence to the profile HMM is correct. Our full alignments contain the envelope coordinates from HMMER3.
Architecture
The collection of domains that are present on a protein.
Clan
A collection of related Pfam entries. The relationship may be defined by similarity of sequence, structure or profile-HMM.
Domain
A structural unit.
Domain score
The score of a single domain aligned to an HMM. Note that, for HMMER2, if there was more than one domain, the sequence score was the sum of all the domain scores for that Pfam entry. This is not quite true for HMMER3.
DUF
Domain of unknown function.
Envelope coordinates
See Alignment coordinates.
Family
A collection of related protein regions.
Full alignment
An alignment of the set of related sequences which score higher than the manually set threshold values for the HMMs of a particular Pfam entry.
Gathering threshold (GA)
Also called the gathering cut-off, this value is the search threshold used to build the full alignment. The gathering threshold is assigned by a curator when the family is built. The GA is the minimum score a sequence must attain in order to belong to the full alignment of a Pfam entry. For each Pfam HMM we have two GA cutoff values: a sequence cutoff and a domain cutoff.
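As an illustration of how the two GA cutoffs could be applied, here is a minimal Python sketch. The `Hit` class, the example scores and the cutoff values are hypothetical, and the rule used (the sequence score must reach the sequence cutoff, and each reported domain must reach the domain cutoff) is an assumption about how the two values combine, not a description of Pfam's own code.

```python
from dataclasses import dataclass, field

@dataclass
class Hit:
    sequence_id: str
    sequence_score: float                     # bit score for the whole sequence
    domain_scores: list = field(default_factory=list)  # bit scores per domain match

def domains_passing_ga(hit, seq_ga, dom_ga):
    """Return the domain scores that pass both cutoffs ([] if the sequence fails)."""
    if hit.sequence_score < seq_ga:
        return []
    return [score for score in hit.domain_scores if score >= dom_ga]

# Hypothetical hits and cutoffs, for illustration only.
hits = [
    Hit("seqA", 45.2, [30.1, 15.1]),
    Hit("seqB", 18.0, [18.0]),
]
accepted = {h.sequence_id: domains_passing_ga(h, seq_ga=25.0, dom_ga=20.0) for h in hits}
print(accepted)
```

Here seqA passes the sequence cutoff but only one of its two domains passes the domain cutoff, while seqB fails the sequence cutoff outright.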
HMMER
The suite of programs that Pfam uses to build and search HMMs. Since Pfam release 24.0 we have used HMMER version 3 to make Pfam. See the HMMER site for more information.
Hidden Markov model (HMM)
An HMM is a probabilistic model. In Pfam we use HMMs to transform the information contained within a multiple sequence alignment into a position-specific scoring system. We search our HMMs against the UniProt protein database to find homologous sequences.
HMMER3
The suite of programs that Pfam uses to build and search HMMs. See the HMMER site for more information.
iPfam
A resource that describes domain-domain interactions that are observed in PDB entries. Where two or more Pfam domains occur in a single structure, it analyses them to see if they are close enough to form an interaction. If they are close enough, it calculates the bonds forming the interaction.
Metaseq
A collection of sequences derived from various metagenomics datasets.
Motif
A short unit found outside globular domains.
Noise cutoff (NC)
The bit score of the highest scoring match not in the full alignment.
Pfam-A
An HMM-based, hand-curated Pfam entry which is built using a small number of representative sequences. We manually set a threshold value for each HMM and search our models against the UniProt database. All of the sequences which score above the threshold for a Pfam entry are included in the entry's full alignment.
Pfam-B
An automatically generated alignment which is formed by taking a cluster of sequences from the ADDA database and removing Pfam-A residues from them. Since Pfam-B families are automatically generated, we recommend that you verify that the sequences in a Pfam-B family are related using other methods, such as BLAST. For Pfam 24.0 we have made HMMs for the first (and therefore largest) 20,000 Pfam-B families. Users can search their sequences against the Pfam-B HMMs in addition to the Pfam-A HMMs when performing both single-sequence searches and batch searches on the website.
Posterior probability
HMMER3 reports a posterior probability for each residue that matches a match or insert state in the profile HMM. A high posterior probability shows that the alignment of the amino acid to the match/insert state is likely to be correct, whereas a low posterior probability indicates that there is alignment uncertainty. This is indicated on a scale, with 10 being the highest certainty down to 1 being complete uncertainty. Within Pfam we display this information as a heat map view, where green residues indicate high posterior probability and red ones indicate a lower posterior probability.
Repeat
A short unit which is unstable in isolation but forms a stable structure when multiple copies are present.
Seed alignment
An alignment of a set of representative sequences for a Pfam entry. We use this alignment to construct the HMMs for the Pfam entry.
Sequence score
The total score of a sequence aligned to an HMM. If there is more than one domain, the sequence score is the sum of all the domain scores for that Pfam entry. If there is only a single domain, the sequence and the domain score for the protein will be identical. We use the sequence score to determine whether a sequence belongs to the full alignment of a particular Pfam entry.
Trusted cutoff (TC)
The bit score of the lowest scoring match in the full alignment.
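The relationship between the gathering threshold, the trusted cutoff and the noise cutoff can be sketched in a few lines of Python. The function name and the example scores are hypothetical; the sketch only assumes, as the definitions in this glossary state, that a match belongs to the full alignment when its score is at least the GA.

```python
def trusted_and_noise_cutoffs(scores, ga):
    """Given bit scores and a gathering threshold, return (TC, NC).

    TC is the lowest score inside the full alignment (score >= GA),
    NC is the highest score outside it (score < GA).
    """
    inside = [s for s in scores if s >= ga]
    outside = [s for s in scores if s < ga]
    tc = min(inside) if inside else None
    nc = max(outside) if outside else None
    return tc, nc

# Hypothetical bit scores for matches to one family.
scores = [52.3, 31.0, 27.4, 19.8, 12.5]
tc, nc = trusted_and_noise_cutoffs(scores, ga=25.0)
print(tc, nc)
```

By construction the GA always lies between the NC and the TC.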
SMART (a Simple Modular Architecture Research Tool) allows the identification and annotation of genetically mobile domains and the analysis of domain architectures. More than 500 domain families found in signalling, extracellular and chromatin-associated proteins are detectable. These domains are extensively annotated with respect to phyletic distributions, functional class, tertiary structures and functionally important residues. Each domain found in a non-redundant protein database, as well as search parameters and taxonomic information, are stored in a relational database system. User interfaces to this database allow searches for proteins containing specific combinations of domains in defined taxa.
Features
You can use SMART in two different modes: normal or genomic. The main difference is in the underlying protein database used. In Normal SMART, the database contains Swiss-Prot, SP-TrEMBL and stable Ensembl proteomes. In Genomic SMART, only the proteomes of completely sequenced genomes are used: Ensembl for metazoans and Swiss-Prot for the rest.
The protein database in Normal SMART has significant redundancy, even though identical proteins are removed. If you use SMART to explore domain architectures, or want to find exact domain counts in various genomes, consider switching to Genomic mode. The numbers in the domain annotation pages will be more accurate, and there will not be many protein fragments corresponding to the same gene in the architecture query results. We should remember that we are exploring a limited set of genomes, though.
Different color schemes are used to easily identify the mode we are in.
Normal mode Genomic mode
Alignment
Representation of a prediction of the amino acids in tertiary structures of homologues that overlay in three dimensions. Alignments held by SMART are mostly based on published observations (see domain annotations for details), but are updated and edited manually.
Alignment block
Ungapped alignments that usually represent a single secondary structure.
Bits scores
Alignment scores are reported by HMMer and BLAST as bits scores. The likelihood that the query sequence is a bona fide homologue of the database sequence is compared to the likelihood that the sequence was instead generated by a random model. Taking the logarithm (to base 2) of this likelihood ratio gives the bits score.
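The bits score definition above is a base-2 log-likelihood ratio, which can be written out directly. This is a minimal sketch: the function name and the two example likelihoods are made up for illustration, and real tools compute these likelihoods internally rather than taking them as inputs.

```python
import math

def bits_score(p_model, p_random):
    """Base-2 log of the likelihood ratio: homologue model versus random model."""
    return math.log2(p_model / p_random)

# Hypothetical likelihoods: the sequence is 1024 times more likely
# under the homologue model than under the random model.
score = bits_score(p_model=2.0 ** -40, p_random=2.0 ** -50)
print(score)  # 10.0
```

Each additional bit therefore corresponds to a doubling of the likelihood ratio in favour of homology.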
BLAST: Basic local alignment search tool
An excellent database searching tool developed at the National Center for Biotechnology Information (NCBI) ([1], [2], [3], [4]). SMART uses NCBI-BLAST for detection of outlier homologues and homologues of known structure. WU-BLAST is used for nrdb searches with user-supplied sequences.
Cellular Role
Chromatin-associated: Chromatin is the tangled fibrous complex of DNA and protein within a eukaryotic nucleus.
Interaction (with the environment): Molecules that sense cellular environmental change, such as osmolarity, light flux, acidity, ion concentration, etc.
Metabolic: Enzymes that catalyze reactions in living cells that transform organic molecules.
Replication: The process of making an identical copy of a section of duplex (double-stranded) DNA, using existing DNA as a template for the synthesis of new DNA strands.
Signalling: Proteins that participate in a pathway that is initiated by a molecular stimulus and terminated by a cellular response.
Transport: Transporters (permeases) are proteins that straddle the cell membrane and carry specific nutrients, ions, etc. across the membrane.
Translation: The process in which the genetic code carried by messenger RNA directs the synthesis of proteins from amino acids.
Transcription: The synthesis of an RNA copy from a sequence of DNA (a gene); the first step in gene expression.
Coiled coils
Intimately-associated bundles of long alpha-helices ([1], [2], [3]). Coiled coils are detected in SMART using the method of Lupas et al. ([4], COILS home). Coiled coil predictions are indicated on the second line in SMART's graphical output.
Domain
Conserved structural entities with distinctive secondary structure content and a hydrophobic core. In small disulphide-rich and Zn2+-binding or Ca2+-binding domains, the hydrophobic core may be provided by cystines and metal ions, respectively. Homologous domains with common functions usually show sequence similarities.
Domain composition
Proteins with the same domain composition have at least one copy of each of the domains of the query.
Domain organisation
Proteins having all the domains of the query in the same order (additional domains are allowed).
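The difference between domain composition and domain organisation can be made concrete with a small Python sketch. The function names and the example architectures are hypothetical; the sketch reads "same composition" as a set-inclusion test and "same organisation" as an ordered subsequence test, which is one plausible interpretation of the two definitions above.

```python
def same_composition(query, protein):
    """True if the protein has at least one copy of each query domain."""
    return set(query) <= set(protein)

def same_organisation(query, protein):
    """True if all query domains occur in the protein in the same order.

    Additional domains may sit between or around them.
    """
    remaining = iter(protein)
    # `domain in remaining` consumes the iterator up to the first match,
    # so later query domains must be found further along the protein.
    return all(domain in remaining for domain in query)

query = ["SH3", "SH2", "TyrKc"]
candidate_1 = ["SH3", "PH", "SH2", "TyrKc"]   # extra PH domain, same order
candidate_2 = ["TyrKc", "SH2", "SH3"]         # same domains, reversed order

print(same_composition(query, candidate_1), same_organisation(query, candidate_1))
print(same_composition(query, candidate_2), same_organisation(query, candidate_2))
```

Both candidates share the query's composition, but only the first shares its organisation.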
Entrez
A WWW-based system that allows easy retrieval of sequence, structure, molecular biology and literature data (Entrez). SMART's domain annotation pages contain links to the Entrez system, thereby providing extensive literature, structure and sequence information.
E-value
This represents the number of sequences with a score greater than or equal to X that are expected absolutely by chance. The E-value connects the score (X) of an alignment between a user-supplied sequence and a database sequence, generated by any algorithm, with how many alignments with similar or greater scores would be expected from a search of a random sequence database of equivalent size. Since version 2.0, E-values are calculated using Hidden Markov Models, leading to more accurate estimates than before.
Extracellular Domains
Domain families that are most prevalent in proteins outside of the cytoplasm and the nucleus.
Gap
A position in an alignment that represents a deletion within one sequence relative to another. Gap penalties are requirements for alignment algorithms in order to reduce excessively-gapped regions. Gaps in alignments represent insertions that usually occur in protruding loops or beta-bulges within protein structures.
Genomic database
Protein database used in SMART's Genomic mode. It contains data from completely sequenced genomes only. Ensembl data is used for Metazoan genomes and Swiss-Prot for others. A complete list of genomes in the database is available.
HMM: Hidden Markov model
HMMs are statistical models of the sequence consensus of a homologous family (see the docu). A particular class of HMMs has been shown to be equivalent to generalised profiles (8867839). Applications of HMMs to sequence analysis are nicely provided by HMMer and SAM.
HMM consensus
The HMM consensus is a one-line summary of the corresponding HMM. The amino acid shown for the consensus is the highest probability amino acid at that position according to the HMM. Capital letters mean highly conserved residues (probability > 0.5 for protein models) (modified from the HMMer User's Guide).
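The consensus rule just described can be sketched as follows. The input format (one dictionary of residue probabilities per model position) and the example probabilities are assumptions made for illustration; a real HMM file stores its emission parameters differently.

```python
def hmm_consensus(position_probs, threshold=0.5):
    """Build a one-line consensus from per-position residue probabilities.

    The most probable residue is shown at each position; it is written in
    upper case only when its probability exceeds the threshold.
    """
    letters = []
    for probs in position_probs:
        residue, p = max(probs.items(), key=lambda item: item[1])
        letters.append(residue.upper() if p > threshold else residue.lower())
    return "".join(letters)

# Hypothetical emission probabilities for a three-position model.
emissions = [
    {"A": 0.7, "G": 0.2, "S": 0.1},    # highly conserved alanine
    {"L": 0.4, "I": 0.35, "V": 0.25},  # leucine is most likely, but below 0.5
    {"W": 0.9, "F": 0.1},              # highly conserved tryptophan
]
consensus = hmm_consensus(emissions)
print(consensus)  # AlW
```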
HMMer
The HMMer package ([1], [2]) provides multiple alignment and database searching capabilities. There are several programs in the package (see the docu), including one (hmmfs) that searches databases for non-overlapping LOCAL similarities (i.e. that match across at least part of the HMM) and another (hmmls) that searches databases for non-overlapping GLOBAL similarities (i.e. that match across the full HMM). These correspond approximately to profile-based searches using negative and positive profiles, respectively (see WiseTools). Database searches using hmmls or hmmfs provide alignment scores as bits scores.
Homology
Evolutionary descent from a common ancestor due to gene duplication.
Intracellular Domains
Domain families that are most prevalent in proteins within the cytoplasm.
Localisation
Numbers of domains that are thought, from SwissProt annotations, to be present in different cellular compartments (cytoplasm, extracellular space, nucleus and membrane-associated) are shown in annotation pages.
Motif
Sequence motifs are short conserved regions of polypeptides. Sets of sequence motifs need not necessarily represent homologues.
NRDB: non-redundant database
A database that contains no identical pairs of sequences. It can contain multiple sequences originating from the same gene (fragments, alternative splicing products). SMART in Standard mode uses such a database.
ORF
Open reading frame.
Outlier homologues
These are often difficult to detect using HMM methodology. A complementary approach to their detection is to query a database of sequences taken from multiple sequence alignments using BLAST. Selecting this option will also activate searches against sequence databases derived from proteins of known structure. A simple BLAST search of the PDB is performed, together with a search of RPS_Blast profiles derived from SCOP. These profiles were kindly provided by Steffen Schmidt (see Schmidt et al., J Chem Inf Comput Sci 2002 (42) 405-7).
P-value
This represents a probability that, given a database of a particular size, random sequences score higher than a value X. P-values are generated by the BLAST algorithm that has been integrated into SMART.
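P-values and E-values are closely related. The sketch below uses the standard Poisson relationship from BLAST statistics, P = 1 - exp(-E), which is not stated in these notes and is brought in here as an assumption; the function name is hypothetical.

```python
import math

def p_value_from_e_value(e_value):
    """Probability of at least one chance hit, given the expected number E.

    Assumes chance hits follow a Poisson distribution with mean E,
    so P = 1 - exp(-E). expm1 keeps precision for very small E.
    """
    return -math.expm1(-e_value)

for e in (0.001, 1.0, 5.0):
    print(e, p_value_from_e_value(e))
```

For small values (below roughly 0.01) the two numbers are nearly identical, which is why they are often used interchangeably for significant hits.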
PDB: protein data bank
PDB is an archive of experimentally-determined three-dimensional structures (Brookhaven Nat. Labs or EBI). Domain families represented in SMART and in the PDB are annotated as being of known structure; links are provided in SMART to the PDB via PDBsum and MMDB. PDBsum links can be used to access a variety of sequence-based and structure-based tools, whereas MMDB provides access to literature information and structure similarities.
PFAM
Pfam is a database of protein domain families represented as (i) multiple alignments and (ii) HMM-profiles ([1], [2]). Pfam WWW servers allow comparison of user-supplied sequences with the Pfam database (Sanger Center and Washington Univ.). SMART contains a facility to search the Pfam collection using HMMer.
Prokaryotic domains
SMART now also searches for domains found in two-component regulatory systems. These can be found mainly in prokaryotes, but a few were also found in eukaryotes like yeast and plants.
Profile
A profile is a table of position-specific scores and gap penalties, representing a homologous family, that may be used to search sequence databases (Ref. [1], [2], [3]). In CLUSTAL-W-derived profiles, those sequences that are more distantly related are assigned higher weights ([4], [5], [6]). Issues in profile-based database searching are discussed in Bork & Gibson (1996) [7].
ProfileScan
An excellent WWW server that allows a user to compare a protein or DNA sequence against a database of profiles (located at the ISREC).
PROSITE
This is a dictionary of protein sites and motif patterns. Some SMART domain annotations contain links to PROSITE.
Schnipsel database: domain sequence database
Schnipsel is a German word meaning snippet or fragment. The schnipsel database consists of the sequences of all domains found with SMART in NRDB. Outliers of a family often cannot be detected by a profile, yet are detectable by pairwise similarity to one or more established members of a sequence family. So searching against the schnipsel database gives complementary information to the profile searches.
Searched domains
In the first version of SMART, only eukaryotic signalling domains could be searched. In 1998 we extended this set with prokaryotic signalling and extracellular domains. In the input page you can choose whether you want to search only cytoplasmic (prokaryotic + eukaryotic), extracellular or all domains.
Secondary Literature
The secondary literature is derived by the following procedure: for each of the hand-selected papers referenced by a domain, 100 neighbouring papers are retrieved using Medline. If one of these neighbouring papers is referenced from more than two original papers, it is included in the secondary literature list.
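The secondary literature procedure amounts to counting how many original papers point to each neighbouring paper. A minimal sketch follows, assuming the neighbour lists have already been retrieved; the function name and the paper identifiers are hypothetical, and the Medline retrieval step itself is not shown.

```python
from collections import Counter

def secondary_literature(neighbours_by_paper, min_sources=3):
    """Return neighbouring papers reached from more than two original papers.

    neighbours_by_paper maps each hand-selected paper to the list of its
    neighbouring papers. min_sources=3 encodes "more than two".
    """
    counts = Counter()
    for neighbours in neighbours_by_paper.values():
        counts.update(set(neighbours))  # count each original paper once
    return sorted(paper for paper, n in counts.items() if n >= min_sources)

# Hypothetical identifiers, for illustration only.
neighbours = {
    "P1": ["N1", "N2"],
    "P2": ["N1", "N3"],
    "P3": ["N1", "N2"],
}
selected = secondary_literature(neighbours)
print(selected)  # ['N1']
```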
Seed Alignment
Alignment that contains only one of each pair of homologues that are represented in a CLUSTALW-derived phylogenetic tree linked by a branch of length less than a distance of 0.2 (see the related article).
SEG
A program of Wootton & Federhen [1] that detects regions of the query sequence that have low compositional complexity [2].
Sequence ID or ACC
Sequence identifiers or accession codes may be entered via the SMART homepage to initiate a query. You can use either Uniprot or Ensembl sequence identifiers.
Signalling Domains
The original set of domains used in SMART were collected as those that satisfied one or both of two criteria:
1. cytoplasmic domains that possess kinase, phosphatase, ubiquitin ligase or phospholipase enzymatic activities, or those that stimulate GTPase-activation or guanine nucleotide exchange;
2. cytoplasmic domains that occur in at least two proteins with different domain organisations, of which one also contains a domain that satisfies criterion 1.
These domains mostly mediate or regulate the transduction of an extracellular signal towards the nucleus, resulting in the initiation of a cellular response. More recently, prokaryotic two-component signalling domains have been added to the SMART set.
SignalP
This program predicts the presence and location of signal peptide cleavage sites in amino acid sequences (SignalP home page).
SMART: Simple Modular Architecture Research Tool
Our main goal in providing this tool is to allow automatic identification and annotation of domains in user-supplied protein sequences.
Species
Numbers of domains present in a variety of selected taxa (animal, archaea, bacteria, fungi, plants and protozoa) are shown in annotation pages.
SwissProt
The SwissProt database is an extensively annotated and non-redundant collection of protein sequences. SwissProt annotations have been mined for SMART-derived annotations of alignments.
TMHMM2
This program predicts the location and topology of transmembrane helices (TMHMM2).
Thresholds
For each of the domains found by SMART, a combination of thresholds is used to distinguish between true and false hits. The different thresholds are described in the SMART paper.
WiseTools
A package that is based on database searches using profiles. Profiles may be generated using PairWise and then compared with sequence databases using SearchWise. Scores are generated for alignments that match the whole of the profile (using a positive profile) or else that match at least part of the profile (using a negative profile). Only the top-scoring optimal alignment of each sequence is reported, hence SMART relies on re-iterating the search for new repeats until none are reported that score above threshold. Score thresholds have been set manually that are considered to represent a score just above the top-scoring true negative. Additional thresholds have been estimated for domains that are repeated in single polypeptides. A more recent package allows comparison of DNA sequences at the level of their conceptual translations, regardless of sequence error and introns (see Wise2).
What's new
Changes from version 6.0 to 7.0
Full text search engine
Perform a full text search of SMART and Pfam domain annotations, plus the complete protein descriptions for Uniprot and Ensembl proteins.
metaSMART
Explore and compare domain architectures in various publicly available metagenomics datasets.
iTOL export and visualization
Domain architecture analysis results can be exported and visualized in interactive Tree Of Life. The new option can be found in the protein list function select list.
User interface cleanup
Various small changes to the UI, resulting in faster and easier navigation.
Changes from version 5.1 to 6.0
Metabolic pathways information
SMART domains and proteins available in the genomic mode now have basic metabolic pathways information. It is generated by mapping our genomic mode protein database to the KEGG orthologous groups.
Updated genomic mode protein database
The greatly expanded genomic mode database now includes sequences from 630 completely sequenced species.
Changes from version 5.0 to 5.1
SMART webservice
You can access SMART using our webservice. Check the WSDL file for details. The webservice is still under active development and only works for sequences and IDs which are in our database (complete Uniref100 and stable Ensembl genomes).
SMART DAS server
The SMART DAS server is available at URL http://smart.embl.de/smart/das. It provides all the protein features from our database (SMART domains, signal peptides, transmembrane regions and coiled coils) for all Uniref100 and Ensembl proteins.
If you need help with these services or have questions or feedback, please contact us.
Changes from version 4.1 to 5.0
New protein database in Normal mode
SMART now uses Uniprot as the main source of protein sequences. All Ensembl proteomes (except pre-releases) are also included. To lower redundancy in the database, the following procedure is used:
o only one copy of 100% identical proteins is kept (different IDs are still available)
o each species' proteins are separated
o CD-HIT clustering with a 96% identity cutoff is performed on each species separately
o the longest member of each protein cluster is used as the representative
o only representative cluster members and single proteins (i.e. proteins which are not members of any clusters) are used in all domain architecture queries and for domain counts in the annotation pages
Even though the number of proteins in the database is almost doubled (current version has around 29 million proteins), the redundancy should be minimal.
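The redundancy-reduction steps listed above can be sketched in Python. This is only an outline under stated assumptions: identical sequences are collapsed within a species, the clustering step is passed in as a function because CD-HIT is an external program, and `toy_cluster` is a hypothetical stand-in that does not perform real 96% identity clustering. All identifiers and sequences are made up.

```python
from collections import defaultdict

def reduce_redundancy(proteins, cluster_fn):
    """proteins: iterable of (protein_id, species, sequence) tuples.
    cluster_fn: function taking a list of sequences and returning clusters
    (lists of sequences); in practice this role is played by CD-HIT.
    Returns one (species, protein_id, sequence) representative per cluster.
    """
    # Steps 1 and 2: separate by species and keep one copy of identical
    # sequences, remembering every ID that shares the sequence.
    unique = defaultdict(dict)
    for protein_id, species, sequence in proteins:
        unique[species].setdefault(sequence, []).append(protein_id)

    # Steps 3 to 5: cluster each species separately and keep the longest
    # member of each cluster (singletons are clusters of size one).
    representatives = []
    for species, ids_by_sequence in unique.items():
        for cluster in cluster_fn(list(ids_by_sequence)):
            longest = max(cluster, key=len)
            representatives.append((species, ids_by_sequence[longest][0], longest))
    return representatives

def toy_cluster(sequences):
    """Hypothetical stand-in for CD-HIT: groups sequences by their first 3 residues."""
    groups = defaultdict(list)
    for seq in sequences:
        groups[seq[:3]].append(seq)
    return list(groups.values())

proteins = [
    ("P1", "human", "MKTAYIAK"),
    ("P2", "human", "MKTAYIAK"),       # identical to P1
    ("P3", "human", "MKTAYIAKQRQ"),    # longer variant in the same toy cluster
    ("P4", "yeast", "MSLLNEQ"),
]
reps = reduce_redundancy(proteins, toy_cluster)
print(reps)
```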
Domain architecture invention dating
As a further step from the single domain to the understanding of multi-domain proteins, SMART now predicts the taxonomic class where the concept of a protein, that is, its domain architecture, was invented. The domain architecture is defined as the linear order of all SMART domains in the protein sequence. To derive the point of its invention, all proteins with the same domain architecture are mapped onto NCBI's taxonomy. The last common ancestor of all organisms containing at least one protein with the domain architecture is defined as the point of its origin.
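Finding the last common ancestor reduces to taking the longest shared prefix of the organisms' lineages. The sketch below assumes each lineage is given as a root-to-species list of taxon names; the function name is hypothetical and the lineages shown are heavily simplified, not the full NCBI taxonomy paths.

```python
def last_common_ancestor(lineages):
    """Return the deepest taxon shared by all root-to-species lineages."""
    ancestor = None
    for taxa_at_depth in zip(*lineages):
        if len(set(taxa_at_depth)) != 1:
            break                      # lineages diverge at this depth
        ancestor = taxa_at_depth[0]
    return ancestor

# Simplified, illustrative lineages of three organisms that all contain
# a protein with the same domain architecture.
lineages = [
    ["cellular organisms", "Eukaryota", "Metazoa", "Chordata", "Homo sapiens"],
    ["cellular organisms", "Eukaryota", "Metazoa", "Arthropoda", "Drosophila melanogaster"],
    ["cellular organisms", "Eukaryota", "Fungi", "Saccharomyces cerevisiae"],
]
origin = last_common_ancestor(lineages)
print(origin)  # Eukaryota
```

Adding a bacterial lineage to the list would push the inferred point of origin back to "cellular organisms".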
Changes from version 4.0 to 4.1
Two modes of operation: Normal and Genomic
For more details, visit the change mode page.
Intrinsic protein disorder prediction
You can now include DisEMBL prediction of protein disorder in your search parameters. The results are displayed as blue regions in protein schematics. DisEMBL's HOTLOOPS and REM465 methods are currently used. Visit the DisEMBL page for more information on the method.
Catalytic activity check
SMART now includes data on specific requirements for catalytic activity for some of our domains (50 catalytic domains at the moment). If the required amino acids are not present in the predicted domain, it will be marked as Inactive. The domain annotation page will show you the details on which amino acids are missing and the links to relevant literature.
Taxonomic trees
Architecture query results are now displayed as simple taxonomic trees. In addition to individual proteins, you can select any taxonomic node (or multiple nodes) and display all the proteins in those nodes. The Evolution section of domain annotation pages is now also represented as a taxonomic tree. The basic tree contains only several hand-picked representative species, with a link to the full tree.
User interface redesigned
SMART has been completely rewritten and all pages conform to the XHTML 1.0 Strict and CSS Level 2 standards. We recommend a modern, standards-compliant browser for the best experience. Mozilla Firefox and Opera are our favorites.
Changes from version 3.5 to 4.0
Intron positions are shown in protein schematics
For proteins that match any of the Ensembl predictions, SMART will show intron positions as vertical coloured lines in graphical representations (see example). This information is retrieved from a pre-calculated mapping of Ensembl gene structures to protein sequences.
The vertical line at the end of the protein is not an actual intron, but a mark to show that intron mapping was performed. If that is the only line, there are no introns annotated. If there is no line at all, there is no data available in Ensembl for that particular sequence.
You can switch off intron display on your SMART preferences page.
Alternative splicing information
Since SMART now incorporates Ensembl genomes, the Additional information page shows a list of alternative splices of the gene encoding the analyzed protein (if there are any). It is possible to either display SMART protein annotation for any of the alternative splices, or get a graphical multiple sequence alignment of all of them.
Orthology information
There are 2 separate sets of orthologs for each Ensembl protein: 1:1 reciprocal best matches in other genomes, and orthologous groups with reciprocal best hits from all genomes analyzed (i.e. each of these proteins has exactly one ortholog in all 6 genomes). This data is displayed on the Additional information page.
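A reciprocal best match between two genomes means that each protein is the other's top hit. The following sketch assumes the best hits in each direction have already been computed (for example from similarity searches); the function name and the protein identifiers are hypothetical.

```python
def reciprocal_best_hits(best_a_to_b, best_b_to_a):
    """Return (a, b) pairs where a's best hit is b and b's best hit is a."""
    return sorted(
        (a, b) for a, b in best_a_to_b.items() if best_b_to_a.get(b) == a
    )

# Hypothetical best hits between two genomes.
best_human_to_mouse = {"H1": "M1", "H2": "M2", "H3": "M2"}
best_mouse_to_human = {"M1": "H1", "M2": "H2"}

pairs = reciprocal_best_hits(best_human_to_mouse, best_mouse_to_human)
print(pairs)  # [('H1', 'M1'), ('H2', 'M2')]
```

H3 is excluded because its best hit M2 prefers H2, so the relationship is not reciprocal.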
Graphical multiple sequence alignments
Orthologous groups and different alternative splices can be displayed as graphical multiple sequence alignments. Proteins are aligned using ClustalW. Domains, intrinsic features and introns are mapped onto the alignment, with their positions adjusted according to gaps (black boxes).
Changes from version 3.4 to 3.5
Features of all Ensembl genomes are stored in SMART
You can use standard Ensembl protein identifiers (for example ENSP00000264122) and sequences in all queries.
Batch access
Quickly display results for hundreds of sequences. Currently limited to 500 sequences per access, 2000 per day.
Smarter handling of identical sequences
If there are multiple IDs associated with a particular sequence, an extra table will be displayed showing all of them.
Changes from version 3.3 to 3.4
Search structure based profiles using RPS-Blast
Clicking on the search schnipsel and structures checkbox will now also initiate a search of profiles based on scop domain families using RPS-Blast. These profiles were kindly provided by Steffen Schmidt (see Schmidt et al., J Chem Inf Comput Sci 2002 (42) 405-7).
Improved Architecture analysis
In addition to standard Domain selection querying, it is now possible to do queries based on GO (Gene Ontology) terms associated with domains. In the first step you get a list of domains matching the GO terms entered. After selecting the domains of interest from the list, proteins containing those domains are displayed. Use the Taxonomic selection box to limit the results based on taxonomic ranges.
Pfam domains are stored in the database
The SMART database now contains precomputed results for all Pfam domains. To use Pfam domains in the architecture queries, prepend the domain name with Pfam: (for example, TyrKc AND Pfam:Fz AND TRANS).
Try our SMART Toolbar for the Mozilla web browser.
Changes from version 3.2 to 3.3
Fantastic new protein picture generator
Proteins are now displayed as dynamically generated PNG images. This means that you can download the entire protein representation as a single image. You have our permission to do this and use these diagrams in any way that you like; do acknowledge us, though. All domain bubbles have been script-generated using The Gimp and its Perl-Fu extension. The script is GPL'd and is available for download.
Fancy colour alignments
All alignments we generate are coloured in using Leo Goodstadt's excellent CHROMA program. CHROMA is available for download, and it's great. You may experience some problems if you're using a klunky browser. This will be fixed when you change your browser.
Excellent transmembrane domains
These are now calculated using the fine TMHMM2 program, kindly provided by Anders Krogh and co-workers.
Selective SMART
We now store intrinsic features such as transmembrane domains and signal peptides in our database. This means that these can be queried for (e.g. SIGNAL AND TyrKc). The feature names are: signal peptide, SIGNAL; transmembrane domain, TRANS; coiled coil, COIL.
Technical changes
SMART code has been modified to run under the Apache mod_perl module. SMART is now using Apache::DBI for persistent database connections. The database engine has been updated to PostgreSQL 7.1. These changes resulted in significant speed improvements.
Bug fixes
As usual. The show_many_proteins script now uses the POST method, so there is no longer a limit on the number of proteins you can display. Fixed a display problem with proteins having thousands of different representations (if you try to display those, make sure you have a good browser, like Mozilla or Opera, and a bunch of RAM).
Changes from version 3.1 to 3.2
Start page
There is an indicator of the current database status in the header now. If the database is down or is being updated, you'll know immediately.
Literature
Changed literature identifiers to PMID. New secondary literature generator that parses all neighbouring papers, not just the first 100.
Numerous small bug fixes and improvements.
Changes from version 3.0 to 3.1
Startup page
The start page now includes selective SMART and allows searching for keywords in the annotation of domains.
Schnipsel Blast
The results of a schnipsel Blast search are included in the bubble diagram. Additionally, PDB is searched.
Taxonomic breakdown
When selecting multiple proteins, e.g. via selective SMART or the annotation pages, an overview of the taxonomy of all species (tax break) is offered.
Links
Links don't contain version numbers. This allows stable links from external sources.
Selective SMART
Allows searching for multiple copies of domains and is case insensitive. You can now search for e.g. Sh3 AND sH3 AND sh3.
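The behaviour described for selective SMART queries (case-insensitive names, repeated terms meaning multiple copies) can be mimicked with a counter. This is only a sketch of the matching logic under those two stated rules: the function name is hypothetical, only the AND operator is handled, and the real SMART query syntax may support more than is shown here.

```python
import re
from collections import Counter

def matches_query(query, features):
    """True if the protein's features satisfy an AND-only query.

    Names are compared case-insensitively; a name repeated n times in the
    query requires at least n copies of that domain or feature.
    """
    terms = [t.strip().lower() for t in re.split(r"\s+AND\s+", query.strip())]
    required = Counter(terms)
    present = Counter(f.lower() for f in features)
    return all(present[name] >= count for name, count in required.items())

three_sh3 = ["SH3", "SH2", "SH3", "SH3"]
two_sh3 = ["SH3", "SH3", "TyrKc"]

print(matches_query("Sh3 AND sH3 AND sh3", three_sh3))  # True
print(matches_query("Sh3 AND sH3 AND sh3", two_sh3))    # False
print(matches_query("SIGNAL AND TyrKc", ["SIGNAL", "TyrKc", "TRANS"]))  # True
```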
Annotation
You can align your query sequence to the SMART alignment using hmmalign.
Update of underlying database
SMART now uses PostgreSQL 6.5.2.
Changes from version 2.0 to 3.0
Digest output
SMART now only produces a single diagram representing a best interpretation of all the annotation that has been performed. A comprehensive summary of the results is also provided in table format.
Selective SMART
Selective SMART allows you to look for proteins with combinations of specific domains in different species or taxonomic ranges.
Alert SMART
The SMART database gets updated about once a week. If you are interested in specific domains or combinations of them in specific taxonomic ranges, you can use the SMART alerting service. This provides the identities of newly-deposited proteins that match your query.
Domain queries
You can ask for proteins having the same domain order or composition as your query protein.
SMARTed Genomes
You can get the result of SMARTing the genomes of Caenorhabditis elegans and Saccharomyces cerevisiae via the annotation pages.
Faster PFAM searches
The PFAM searches now run on a PVM cluster.
Changes from version 1.03 to 2.0
Domain coverage
The original set of signalling domains has now been extended to include extracellular domains.
Default search method is now HMMer
The previously-used underlying methodology of SMART (i.e. SWise) tended to over-extend gaps, leading to problems in defining domain borders. Although a SWise search of the SMART database remains possible (using the Wise searching SMART database option available from the Home Page), the default searching method is now HMMer2. This now allows improved statistical estimates (Expectation- or E-values) of the significance of a domain hit.
Complete rewrite of the annotation pages
As with the previous version (1.03), annotation pages provide information concerning domain functions and include hyperlinks to PubMed. However, we now offer automatically-derived data showing the taxonomic range and the predicted cellular localisation of proteins containing the domain in question.
Literature database
In addition to the manually derived literature sources given in version 1.03, we now provide automatically-derived secondary literature from the annotation pages. These are extracted from PubMed and are cross-linked to additional abstracts.
The CATH domain structure database: new protocols and classification levels give a more comprehensive resource for exploring evolution
Lesley H. Greene, Tony E. Lewis, Sarah Addou, Alison Cuff, Tim Dallman, Mark Dibley, Oliver Redfern, Frances Pearl, Rekha Nambudiry, Adam Reid, Ian Sillitoe, Corin Yeats, Janet M. Thornton and Christine A. Orengo
WHAT'S NEW
We report the latest release (version 3.0) of the CATH protein domain database (http://www.cathdb.info). There has been a 20% increase in the number of structural domains classified in CATH, up to 86 151 domains. Release 3.0 comprises 1110 fold groups and 2147 homologous superfamilies. To cope with the increases in diverse structural homologues being determined by the structural genomics initiatives, more sensitive methods have been developed for identifying boundaries in multi-domain proteins and for recognising homologues. The CATH classification update is now being driven by an integrated pipeline that links these automated procedures with validation steps that have been made easier by the provision of information-rich web pages summarising comparison scores and relevant links to external sites for each domain being classified. An analysis of the population of domains in the CATH hierarchy and several domain characteristics are presented for version 3.0. We also report an update of the CATH Dictionary of homologous structures (CATH-DHS), which now contains multiple structural alignments, consensus information and functional annotations for 1459 well-populated superfamilies in CATH. CATH is directly linked to the Gene3D database, which is a projection of CATH structural data onto ~2 million sequences in completed genomes and UniProt.
GENERAL INTRODUCTION
The number of new structures being deposited in the Protein Data Bank (PDB) continues to grow at a considerable rate. In addition, structures being targeted by worldwide structural genomics initiatives are more likely to be novel or only very remotely related to domains previously classified. Only 2% of structures currently solved by conventional crystallography or NMR are likely to adopt novel folds (see Figures 1 and 2). A higher proportion of new folds is expected among structures solved by structural genomics. Although the influx of more diverse structures and subsequent analysis will inform our understanding of how domains evolve, it has resulted in increasing lags between the numbers of structures being deposited and classified. In response to this situation, we have significantly improved our automated and manual protocols for domain boundary assignment and homologue recognition.
Figure 1 Annual decrease in the percentage of new structures classified in CATH which are observed to possess a novel fold. The raw data for years 1972–2005 was fit to a single exponential equation by nonlinear regression using Sigma Plot (SPSS Version
Figure 2 Annual proportion of protein structures deposited in the PDB which are classified in CATH, rejected or pending classification. The colour scheme reflects different categories of PDB chains. Black: not accepted by the CATH criteria. Red: unprocessed chains
Significant changes have been implemented in the CATH classification protocol to achieve a more highly automated system. First, a seamless flow of structures between the constituent programs has been achieved by building a pipeline which integrates web services for each major comparison stage in the classification (see Figure 3). Secondly, completely automatic decisions are now being made for new protein chains with close relatives already assigned in the CATH database. There are two situations that preclude the CATH update process from being fully automated. We rely on expert manual curation for particularly challenging protein domain boundary assignments (DomChop stage) and also for classifications of remote folds and homologues (HomCheck stage). These two manual stages will remain an integral part of the system (Figure 3).
Figure 3 Flow diagram of the CATH classification pipeline. This schematic illustrates the processes involved in classifying newly determined structures in CATH. The CATH update protocol workflow from new chain to assigned domain is split into two main processes
In this paper we report our ongoing development of the automated procedures. These critical new features should better enable CATH to keep pace with the PDB (3) and facilitate its development. Key statistics on the domain structure populations and characteristics are also presented.
A REVISED CATH CLASSIFICATION HIERARCHY: CATHSOLID
In order to provide information on the sequence diversity between superfamily members, we have introduced additional levels into the CATH hierarchy. The CATH hierarchical classification scheme now consists of nine levels. Class is derived from secondary structure content, and Architecture describes the gross orientation of secondary structures, independent of connectivity. The Topology level clusters structures into fold groups according to their topological connections and numbers of secondary structures. The Homologous superfamilies cluster proteins with highly similar structures, sequences and/or functions (4,5). The new extension of the CATH classification system now includes five 'SOLID' sequence levels. S, O, L and I further divide domains within the H-level using multi-linkage clustering based on similarities in sequence identity (35%, 60%, 95% and 100%) (see Table 1). The D-level acts as a counter within the I-level and is appended to the classification hierarchy to ensure that every domain in CATH has a unique CATHSOLID identification code (see Table 1). Specific details on the nature of the SOLID-levels can be found in the 'General Information' section of the CATH website, http://www.cathdb.info. CATH only includes experimentally determined protein structures with a resolution of 4 Å or better, 40 residues or longer, and with 70% or more of side chains resolved.
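The nine-level scheme can be illustrated with a small Python sketch. It assumes, purely for illustration, that a CATHSOLID code is written as nine dot-separated integers in the order C, A, T, H, S, O, L, I, D; the example codes used with it are hypothetical and are not taken from the CATH database.

```python
# Level names in hierarchy order: C, A, T, H, then the five SOLID levels.
LEVELS = ("C", "A", "T", "H", "S", "O", "L", "I", "D")

# Sequence-identity cut-offs (%) quoted in the text for the S, O, L and I levels.
IDENTITY_CUTOFFS = {"S": 35, "O": 60, "L": 95, "I": 100}


def parse_cathsolid(code: str) -> dict:
    """Split a dot-separated nine-field code into named integer levels.

    The dot-separated layout is an assumption made for this sketch.
    """
    fields = code.split(".")
    if len(fields) != len(LEVELS):
        raise ValueError(f"expected {len(LEVELS)} fields, got {len(fields)}")
    return {level: int(value) for level, value in zip(LEVELS, fields)}


def deepest_shared_level(code_a: str, code_b: str) -> str:
    """Return the deepest level at which two domains fall in the same cluster.

    Returns an empty string when the domains differ already at the Class level.
    """
    a, b = parse_cathsolid(code_a), parse_cathsolid(code_b)
    shared = ""
    for level in LEVELS:
        if a[level] != b[level]:
            break
        shared = level
    return shared
```

Under these assumptions, two domains whose codes first differ in the sixth field share the S level, i.e. they fall in the same 35% sequence-identity cluster but in different 60% clusters.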
Table 1 CATH version 3.0 statistics
DOMAIN BOUNDARY ASSIGNMENTS
We have further improved our automated domain boundary prediction method, CATHEDRAL (4). This is used to search a newly determined multi-domain structure against a library of representative structures from different fold groups in the CATH database to recognise constituent domains. CATHEDRAL performs an initial rapid secondary structure comparison between structures, using graph theory, to identify putative fold matches, which are then more carefully aligned using a slower, more accurate dynamic programming method. A new scoring scheme has been implemented which combines information on the class of the domains being compared, their sizes, similarity in the structural environments and the number of equivalent residues. A support vector machine is used to combine the different scores and select the best fold match for each putative domain in a new multi-domain protein.
Benchmarking against a set of 964 'difficult' multi-domain chains, whose 1593 constituent domains were remotely related to folds in CATH (<35% sequence identity) and originated from 245 distinct fold groups and 462 superfamilies, showed that 90% of domains within these chains could be assigned to the correct fold group, and for 78% of them the domain boundaries were within ±15 residues of boundaries assigned by careful manual validation. Larger variations in domain boundaries are often due to the fact that in many families significant structural variation can occur during evolution, so that distant relatives vary considerably in size. If no close relative has been classified in CATH, it is likely that the only CATHEDRAL match will be to a relative with significant structural embellishments, thus making it harder to determine the correct boundaries.
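The ±15 residue tolerance used in the benchmark can be pictured with a minimal Python sketch. The text does not say how boundaries are paired or scored, so the representation here (boundaries as residue positions, compared pairwise in sorted order) is an assumption, and the function name is hypothetical.

```python
def boundaries_agree(predicted, reference, tolerance=15):
    """Check whether every predicted boundary lies within `tolerance` residues
    of the corresponding manually assigned boundary.

    Boundaries are residue positions; they are paired after sorting, which is
    an assumption of this sketch rather than the published benchmark procedure.
    """
    if len(predicted) != len(reference):
        return False  # a different number of boundaries cannot agree
    pairs = zip(sorted(predicted), sorted(reference))
    return all(abs(p - r) <= tolerance for p, r in pairs)
```

For example, predicted boundaries at residues 120 and 250 would agree with manual boundaries at 110 and 262, whereas a single boundary predicted at 120 would not agree with a manual boundary at 140.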
Since domain boundary assignment of remote homologues is one of the most time-consuming stages in the classification, we combine multiple sources of information for each new structure on a web page to guide manual curation. Pages display scores from a range of algorithms, which include structure-based methods, CATHEDRAL (4), SSAP (6,7), DETECTIVE (8), PUU (9) and DOMAK (10); sequence-based methods such as hidden Markov models (HMMs) (11); and relevant literature. These pages are now viewable for information on putative boundaries for new multi-domain structures currently
being classified in CATH (e.g. http://www.cathdb.info/cgi-bin/cath/Chain.pl?chain_id=2g3aA).
For protein chains which are closely related to chains that are already chopped in CATH, an automated protocol has been developed (ChopClose). ChopClose identifies any previously chopped chains that have sufficiently high sequence identity and overlap with the query chain. Using SSAP (6), the query is aligned against each of these chains in turn, and in each case the domain boundaries are inherited across the alignment. The process of inheriting the boundaries often requires some adjustments to be made to account for insertions, deletions or unresolved residues. If the inheritance from one of the chains meets various criteria (SSAP score ≥80, sequence identity >80%, RMSD ≤6.0 Å, longest end extension ≤10 residues, etc.), then the resulting boundaries are used to chop the chain automatically (AutoChop). For cases where ChopClose's best result does not meet all the criteria for automatic chopping, it is provided as support information for a manual domain boundary assignment. Refer to Figure 3 for the location of AutoChop/ChopClose within the CATH update protocol.
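The AutoChop decision can be sketched in Python as follows. This is only an illustration under stated assumptions: it applies the four thresholds named in the text (reading the RMSD limit as 6.0 Å), it does not model the further criteria covered by "etc.", the record type and function names are hypothetical, and "best result" is taken to mean the highest SSAP score, which the text does not define.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple


@dataclass
class InheritedChop:
    """Scores for boundaries inherited from one previously chopped chain
    (a hypothetical record used only in this sketch)."""
    source_chain: str
    ssap_score: float
    sequence_identity: float    # percent
    rmsd: float                 # angstroms
    longest_end_extension: int  # residues


def meets_autochop_criteria(result: InheritedChop) -> bool:
    """Apply only the four thresholds named in the text."""
    return (
        result.ssap_score >= 80
        and result.sequence_identity > 80
        and result.rmsd <= 6.0
        and result.longest_end_extension <= 10
    )


def decide(results: Iterable[InheritedChop]) -> Tuple[str, Optional[InheritedChop]]:
    """Return ('AutoChop', result) if any inheritance passes every threshold,
    otherwise ('manual', best result) as support for manual curation."""
    results = list(results)
    for result in results:
        if meets_autochop_criteria(result):
            return "AutoChop", result
    # Assumption: 'best' means the highest SSAP score.
    best = max(results, key=lambda r: r.ssap_score, default=None)
    return "manual", best
```

A chain whose only inherited result has, say, 70% sequence identity would be routed to manual assignment, with that result attached as supporting information.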
NEW HOMOLOGUE RECOGNITION METHODS
We have assessed a number of HMM-based protocols for improving homologue recognition. A new protocol (Samosa), exploiting models built using multiple structure alignments to improve accuracy, gives some improvements in sensitivity (4–5%). However, a protocol exploiting an 8-fold expanded HMM library based on sequence relatives of structural domains gives an increase of nearly 10% in sensitivity (12). In addition, HMM–HMM-based approaches have been implemented using the PRC protocol of Madera and co-workers (http://supfam.org/PRC). These allow recognition of extremely remote homologues, some of which are not easily detected by the structure comparison methods (discussed further below). HMM-based database scans developed for the CATH classification protocol are collectively referred to as HMMscan below.
For some very remotely related homologues, confidence in an assignment can be improved by combining information from multiple prediction methods. We have investigated the benefits of using machine learning methods to do this automatically. A neural network was trained using a dataset of 14 000 diverse homologues (<35% sequence identity) and 14 000 non-homologous pairs, with data from different homologue comparison methods including structure comparison (CATHEDRAL, SSAP), sequence comparison (HMM–HMM) and information on functional similarity. The latter was obtained by comparing EC classification codes between close relatives of the distant homologues and using a semantic similarity scoring
scheme for comparing GO terms based on a method developed by Lord et al. (13). On a separate validation set of 14 000 homologous pairs and 14 000 non-homologous pairs, 97% of the homologues can be recognised at an error rate of <4%.
NEW UPDATE PROTOCOL
Automatic methods
Previously, CATH data was generated using a group of independent programs and flat files. Over the past two years we have developed an update protocol for CATH that is driven by a suite of programs with a central library and a PostgreSQL database system. A classification pipeline has been established which links, in a completely automated fashion, the different programs that analyse the sequences and structures of both protein chains and domains. The CATH update protocol can essentially be divided into two parts: domain boundary assignment and domain homology classification (see Figure 3). The aim of the protocol is to minimise manual assignment and provide as much support as possible when manual validation is necessary.
Processing of both parts of the classification protocol is similar, requiring related meta-data and the triggering of the same automated algorithms. Methods include pairwise sequence similarity comparisons and scans by other homologue detection or fold recognition algorithms, such as HMMscan and CATHEDRAL, that provide data for either manual or automated assignment. Many of the automated steps in the protocol have been established as web services, and the pipeline integrates automated steps together with 'holding stages' in which domains are held prior to processing and await the completion of manual validation of predictions (see below).
Web pages to support manual validation
For each manual stage in the classification (domain boundary assignment, DomChop, and homologue recognition, HomCheck) we have developed a suite of web pages bringing together all available meta-data from prediction algorithms (e.g. DBS, CATHEDRAL and HMMscan for DomChop; CATHEDRAL and HMMscan for HomCheck), information from the literature, and other family classifications with relevant data (e.g. Pfam). For each protein or domain shown on the pages, information on the statistical significance of matches is presented. The web pages will shortly be made viewable and will provide interim data on protein chains and domains not fully classified in CATH, for biologists interested in any entries pending classification.
OVERVIEW OF THE CURRENT RELEASE (VERSION 3.0)
Assigning domain boundaries and relationships between protein structures is computationally challenging. Since the last CATH release (version 2.6), the number of domains in the CATH database has increased by 20% in version 3.0 and now totals 86 151. This is a more than 10-fold increase in the number of domains classified in CATH since its creation. Improvements in automation, and also in the web-based resources used to aid manual validation, have allowed us to increase the proportion of hard-to-classify structures processed in CATH, and this is reflected in a significant increase in the number of folds in the database, now more than 1000. The detailed breakdown of numbers of domains in the nine CATH levels is given in Table 1. We conducted an analysis of the domains in version 3.0 and have derived statistics for several fundamental features.
Percentage of new topologies
An analysis of the percentage of new folds arising from the early 1970s to the present is shown in Figure 1. The number of new folds has been decreasing over time with respect to the number of new structures being deposited, and it can be seen that currently approximately 2% of new structures classified in CATH are observed to be novel folds. For comparison, the number of domain structures solved over time is also graphically represented in Figure 1.
Number of domains within a protein chain
Integral to the construction of the CATH database is designating domain boundaries. We conducted an analysis of the number of chains versus the number of domains in a chain. It is interesting to note that 64% of all protein structures currently solved and classified in CATH are single-domain chains (data not shown). The next most prevalent are two-domain chains (27%), and beyond this the number of chains containing three or more domains rapidly decreases. The average size of the single-domain chains is 159 residues.
The CATH Dictionary of Homologous Superfamilies (CATH-DHS)
The CATH-DHS has also been recently updated. Data on structural similarity and superfamily variability is presented as a significant update to the Dictionary of Homologous Superfamilies (DHS) web resource (14). The DHS also provides functional annotations of domains within each H-level (superfamily) in CATH v2.5.1.
For each superfamily, pair-wise structural similarity scores between relatives, measured by SSAP, are presented. The DHS now contains 3307 multiple structural alignments for 1459 superfamilies. For each superfamily, multiple alignments are generated for all the relatives and also for subgroups of structurally similar relatives and sequence-similar relatives. Alignments are performed using the residue-based CORA algorithm (15) and presented both as CORAPLOTS (14) and in the form of a 2DSEC diagram (16), alongside co-ordinate data of the superposed structures in PDB format. Sequence representations of the alignments are available to download in FASTA format. In the CORAPLOT images of the multiple alignment, residues in each domain are coloured according to ligand binding and residue type. EquivSEC plots are also shown that describe the variability in orientation and packing between equivalent secondary structures (16).
To identify sequence relatives for CATH superfamilies, sequences from UniProt (17) were scanned against HMMs of all CATH domains (12). Homologous sequences were identified as those hits with an E-value < 0.01 and a 60% residue overlap with the CATH domain. This protocol recognised over one million domain sequences in UniProt, which could be integrated in the CATH-DHS. The harvested sequences in each superfamily were compared against other relatives by BLAST (18) to determine the pair-wise sequence identity and then clustered at appropriate levels of sequence identity (35% and 95%) using multi-linkage clustering. Information and links to other functional databases, ENZYME (19), GO (Gene Ontology Consortium, 2000), KEGG (20), COG (21) and SWISSPROT (22), are also included by BLASTing the sequences from each superfamily against sequences provided by these resources. Only hits with 95% sequence identity and an 80% residue overlap were used to annotate sequences.
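The hit-filtering rule for harvesting sequence relatives can be sketched in Python. The sketch reads the thresholds as an E-value below 0.01 and at least 60% overlap, and it assumes that overlap is measured as the fraction of the CATH domain's residues covered by the hit; the text does not specify how overlap was computed, and the function name is hypothetical.

```python
def is_homologous_hit(e_value, aligned_residues, domain_length,
                      max_e_value=0.01, min_overlap=0.60):
    """Decide whether an HMM hit counts as a sequence relative of a CATH domain.

    `aligned_residues` is the number of domain residues covered by the hit and
    `domain_length` the length of the CATH domain; expressing overlap as their
    ratio is an assumption of this sketch.
    """
    if domain_length <= 0:
        raise ValueError("domain_length must be positive")
    overlap = aligned_residues / domain_length
    return e_value < max_e_value and overlap >= min_overlap
```

With these defaults, a hit with E-value 1e-5 covering 90 of 120 domain residues is accepted, while the same hit covering only 60 residues, or a hit with E-value 0.05, is rejected.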
Recent analysis of structural and functional divergence in highly populated CATH superfamilies (>5 structural relatives with <35% sequence identity) has been undertaken using data from the DHS. The 2DSEC algorithm was used to analyse multiple structural alignments of families and identify highly conserved structural cores and secondary structure embellishments or decorations to the common core. In some large superfamilies extensive embellishments were observed outside the core, and although these secondary structure insertions were frequently discontinuous in the protein chain, they were often co-located in 3D space (16). In many cases manual inspection revealed that the embellishment had aggregated to form a larger structural feature that was modifying the active site of the domain or creating new surfaces for domain or protein interactions. Data collected in the DHS clearly shows a relationship between structural divergence within a superfamily, sequence divergence of this superfamily amongst predicted domains in the genomes, and the number of distinct functional groups that can be identified for the superfamily (see Figure 4).
Figure 4 Relationship between sequence variability, structural variability and functional diversity in CATH superfamilies. Structural variation in a CATH superfamily, as measured by the number of diverse structural subgroups (SSAP score <80 between groups)
LATERAL LINKS ACROSS THE CATH HIERARCHY TO CAPTURE EVOLUTIONARY DIVERGENCE AND EXPLORE THE STRUCTURAL CONTINUUM
Our analysis of structural divergence in CATH superfamilies (16) has revealed families where significant changes in the structures had occurred; in some cases 5-fold differences in the sizes of domains were identified, and sometimes it was apparent that the 'folds' of these very diverse relatives had effectively changed. Therefore, in these superfamilies more than one fold group can be identified, effectively breaking the hierarchical nature of the CATH classification, which implies that each relative within a CATH homologous superfamily should belong to the same CAT fold group.
In addition, an 'all versus all' HMM–HMM scan between all superfamily representatives revealed several cases of extremely remote homologues which had been classified into separate superfamilies and yet match with significant E-values. Structure comparison had failed to detect the relationship between these superfamilies because the structural divergence of the relatives was so extreme, sometimes constituting a change in architecture as well as fold group. In these cases homology was only suggested by the HMM-based scans and then manually validated by considering functional information and detailed evidence from literature. In order to capture information on these distant homologies, links have been created between the superfamilies, both on our web pages and in the CATH database. The data can now be found as a link from the CATH homepage (http://www.cathdb.info).
In the near future we also plan to provide web pages presenting cases of significant structural overlaps between superfamilies or fold groups. For these cases we are not currently able to find any additional evidence to support a distant evolutionary relationship, and these examples highlight the recurrence of large structural motifs between some folds and the existence of a structural continuum in some regions of fold space.
FEATURES
The CATH database can be accessed at http://www.cathdb.info. The web interface may be browsed or, alternatively, searched with PDB codes or CATH domain identifiers. There is also a facility for keyword searches. With the version 3.0 release we now make the raw and processed data files available, which include, for example, CATH domain PDB files, sequences and dssp files; they can be accessed through the CATH database main page. The Gene3D resource can be accessed through the CATH database or directly at http://www.cathdb.info/Gene3D. The DHS can be accessed through the CATH database or directly at http://www.cathdb.info/bsm/dhs.
Structural similarity is assessed using an automatic method (SSAP) (3,4), which scores 100 for identical proteins and generally returns scores above 80 for homologous proteins. More distantly related folds generally give scores above 70 (Topology or fold level), though in the absence of any sequence or functional similarity this may simply represent examples of convergent evolution, reinforcing the hypothesis that there exists a limited number of folds in nature (5,6).
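The SSAP score ranges quoted above can be summarised as a small lookup. This Python sketch only restates the rules of thumb from the text; the category labels and the function name are illustrative choices, and a score band alone does not establish homology.

```python
def interpret_ssap(score: float) -> str:
    """Map an SSAP score (0 to 100) to the rough interpretation given in the text."""
    if not 0 <= score <= 100:
        raise ValueError("SSAP scores range from 0 to 100")
    if score == 100:
        return "identical"
    if score > 80:
        return "likely homologous"
    if score > 70:
        # May reflect convergent evolution if there is no sequence or
        # functional similarity.
        return "similar fold"
    return "no clear structural relationship"
```

A score of 85 would therefore be read as likely homologous, and a score of 75 as a fold-level similarity that needs sequence or functional evidence before homology is inferred.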
Abstract
We report the latest release (version 1.4) of the CATH protein domains database (http://www.biochem.ucl.ac.uk/bsm/cath). This is a hierarchical classification of 13 359 protein domain structures into evolutionary families and structural groupings. We currently identify 827 homologous families in which the proteins have both structural similarity and sequence and/or functional similarity. These can be further clustered into 593 fold groups and 32 distinct architectures. Using our structural classification and associated data on protein functions stored in the database (EC identifiers, SWISS-PROT keywords and information from the Enzyme database and literature), we have been able to analyse the correlation between the 3D structure and function. More than 96% of folds in the PDB are associated with a single homologous family. However, within the superfolds three or more different functions are observed. Considering enzyme functions, more than 95% of clearly homologous families exhibit either single or closely related functions, as demonstrated by the EC identifiers of their relatives. Our analysis supports the view that determining structures, for example as part of a 'structural genomics' initiative, will make a major contribution to interpreting genome data.
Introduction FOR CATH
The CATH classification of protein domain structures was established in 1993 (1) as a hierarchical clustering of protein domain structures into evolutionary families and structural groupings, depending on sequence and structure similarity. There are four major levels, corresponding to protein class, architecture, topology or fold, and homologous family (Fig. 1). Since 1995, information about these structural groups and protein families has been accessible over the Web (http://www.biochem.ucl.ac.uk/bsm/cath), together with summary information about each individual protein structure (PDBsum) (2).
CATH consists of both phylogenetic and phenetic descriptors for protein domain relationships. At the lowest levels in the hierarchy, proteins are grouped into evolutionary families
(Homologous families) for having either significant sequence similarity (≥35% identity) or high structural similarity and some sequence similarity (≥20% identity).
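The grouping rule for homologous families can be written as a short Python predicate. The text does not say here what counts as "high structural similarity"; this sketch borrows the SSAP ≥80 cut-off that the notes quote further on for clear homologues, so that threshold, like the function name, should be read as an assumption.

```python
def same_homologous_family(sequence_identity: float, ssap_score: float,
                           ssap_cutoff: float = 80.0) -> bool:
    """Return True if two domains would be grouped into one homologous family.

    Rule sketched from the text: at least 35% sequence identity, or high
    structural similarity (assumed here to mean SSAP >= `ssap_cutoff`)
    together with at least 20% sequence identity.
    """
    if sequence_identity >= 35:
        return True
    return ssap_score >= ssap_cutoff and sequence_identity >= 20
```

Under these assumptions, a pair with 25% identity and an SSAP score of 85 is grouped together, whereas a pair with 15% identity is not, however similar the structures are.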
The CATH database of protein structures contains approximately 18 000 domains organized according to their (C)lass, (A)rchitecture, (T)opology and (H)omologous superfamily. Relationships between evolutionarily related structures (homologues) within the database have been used to test the sensitivity of various sequence search methods in order to identify relatives in GenBank and other sequence databases. Subsequent application of the most sensitive and efficient algorithms, gapped BLAST and the profile-based method Position Specific Iterated Basic Local Alignment Tool (PSI-BLAST), could be used to assign structural data to between 22% and 36% of microbial genomes in order to improve functional annotation and enhance understanding of biological mechanism. However, on a cautionary note, an analysis of functional conservation within fold groups and homologous superfamilies in the CATH database revealed that whilst function was conserved in nearly 55% of enzyme families, function had diverged considerably in some highly populated families. In these families functional properties should be inherited far more cautiously, and the probable effects of substitutions in key functional residues must be carefully assessed.
Figure 1
Schematic representation of the (C)lass, (A)rchitecture and (T)opology/fold levels in the CATH database
Figure 2
Snapshot of a web page showing data available in the CATH dictionary of homologous superfamilies for the subtilisin family (CATH id 3.40.50.200). Tables display the PDB codes for non-identical relatives in the family, together with EC identifier codes and information about the enzyme reactions. The multiple structural alignment shown has been coloured according to secondary structure assignments (red for helix, blue for strands).
The Architecture level in CATH groups proteins whose folds have similar 3D arrangements of secondary structures (e.g. barrel, sandwich or propellor) regardless of their connectivity, whilst the top level, Class, simply reflects the proportion of α-helix or β-strand secondary structures. Three major classes are recognised, mainly-α, mainly-β and α-β, since analysis revealed considerable overlap between the α+β and alternating α/β classes originally described by Levitt and Chothia (7).
Before classification, multidomain proteins are first separated into their constituent folds using a consensus method which seeks agreement between three independent algorithms (8). Whilst the protocol for updating CATH is largely automatic (9), several stages require manual validation, in particular establishing domain boundaries in proteins for which no consensus could be reached and checking the relationships of very distant homologues and proteins having borderline fold similarity. Although there are plans to assign the more regular architectures automatically, all architecture groupings are currently assigned manually.
A homologous family Dictionary is now available within CATH, which contains functional data, where available, for each protein within a homologous family. This includes EC identifiers, SWISS-PROT keywords and information from the Enzyme database or the literature (Fig. 2). Multiple structure-based alignments are also available, coloured according to secondary structure assignments or residue properties, and there are schematic plots showing domain representations annotated by protein-ligand interactions (DOMPLOTS) (A.E. Todd, C.A. Orengo and J.M. Thornton, submitted to Protein Engng). The topology of each domain is illustrated by schematic TOPS diagrams (http://www3.ebi.ac.uk/tops; 10).
Figure 3
CATH wheel plot showing the population of homologous families in different fold groups, architectures and classes. The wheel is coloured according to protein class (red, mainly-α; green, mainly-β; yellow, αβ; blue, few secondary structures). The size of the outer wheel represents the number of homologous families in CATH, whilst each band in the outer wheel corresponds to a single fold family. The size of each fold 'band' therefore reflects the number of homologous families having that fold. It can be seen that most fold families contain a single homologous family. The superfold families are shown as paler bands containing many homologous families. The inner wheel shows the population of homologous families in the different architectures.
We have also recently set up a Web Server (11) which enables the user to scan the CATH database with a newly determined protein structure and identify possible fold similarities or evolutionary relationships. There are also plans to incorporate sequence searches (using BLAST or PSI-BLAST) (12) to identify a probable fold for a new sequence.
The latest release of CATH (version 1.4, April 1998) contains 9342 protein chains from the PDB (13), which divide into 13 359 domain folds. Currently 32 different architectures are recognised. Since the last release, three new architectures have been described, including the five-bladed α-β propellor. Grouping proteins on the basis of sequence, structure and functional similarity gives 827 evolutionary homologous families (H-level), whilst recognising more distant structural similarity with no accompanying sequence or function similarity gives rise to 593 different fold groups (T-level).
The population of the different levels in the CATH hierarchy is illustrated by the CATH wheel shown in Figure 3. It can be seen that several highly populated fold families, which we describe as superfolds (6) as they support a diverse range of sequences and more than three different functions, still account for nearly 30% of non-homologous structures.
Implications for Structural Genomics
As the sequence databases grow rapidly, the need to interpret these sequences and assign functions to specific genes becomes increasingly important. Many techniques exist for matching protein sequences and thereby inheriting functional information. However, for very distant homologues there is often no
detectable sequence similarity despite conservation of 3D structure and function. For these cases, evolutionary relationships, and thereby functions, can only be assigned by comparing the structures.
Therefore a number of structural genomics initiatives are being proposed (14) which aim to identify all the folds in nature, with the ultimate goal of being able to predict the function of a new protein from its known or probable structure. The important questions to ask are: how many more folds do we need to determine before we have the complete set, and how confident can we be in assigning function between proteins having similar structures?
In the current genomes, on average only 30–46% of sequences can be assigned to a structural family by recognising sequence similarity to a protein of known structure (15,16). With only ~600 unique structures currently in the PDB, compared with ~20 000 sequence families, it is clear that we still need to determine many more structures if we are to understand biology at the molecular level.
However, analysis of recently deposited structural data is very revealing. Figure 4a illustrates the distribution of 2159 new structural domains classified in the 10 months from June 1997 to March 1998. A large proportion of these (79%) were clearly homologous (≥30% identity) to proteins of known structure.
Of the remaining 443 structures (Fig. 4b), corresponding to 'new' sequences, we found only 8% were novel folds, the remainder resembling a previously determined structure. Many of these, 199 (45%), could be identified as clear homologues by having significant structure and sequence similarity (SSAP ≥80 and ≥20% sequence identity). A further 169 (38%) were probable homologues: although the sequence identity was below 20%, they had functional similarity and/or gave significant scores using sequence search methods designed to detect very distant homologues (PSI-BLAST) (12). There remained a further 40 (9%) proteins which were analogous, i.e. they had the same fold as a previous entry but neither the sequence nor the function gave definite evidence of a common ancestor.
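The breakdown of the 443 new-sequence domains can be checked with a few lines of arithmetic. The sketch below uses only the counts quoted above; the number of novel folds is not stated in the text and is inferred here as the remainder, which is an assumption.

```python
# Counts quoted in the text for the 443 domains with 'new' sequences (Fig. 4b).
TOTAL = 443
counts = {
    "clear homologues": 199,
    "probable homologues": 169,
    "analogues": 40,
}
# Assumption: novel folds are whatever is left over; the text gives only a percentage.
counts["novel folds"] = TOTAL - sum(counts.values())

# Percentage of the 443 domains in each class, rounded to whole numbers.
percent = {name: round(100 * n / TOTAL) for name, n in counts.items()}

# Share that structure comparison could assign to a known homologous family.
homologous_percent = percent["clear homologues"] + percent["probable homologues"]

for name, n in counts.items():
    print(f"{name}: {n} ({percent[name]}%)")
print(f"assigned as homologues: {homologous_percent}%")
```

The rounded percentages agree with those quoted in the text, and the two homologue classes together give the 83% figure used later in the discussion of function assignment.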
At the homologous superfamily level in CATH, a more detailed analysis of enzyme functions showed that the majority of homologous enzyme families in CATH (>90%) contained proteins for which the first three EC identifiers were the same. Considering those families where homologues have significant sequence identity (≥20%) after structural alignment, 95% were found to have a single EC identifier, whilst for families where proteins have more than 30% sequence similarity we observed that 98% had a single EC code.
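The test "the first three EC identifiers were the same" amounts to comparing EC numbers after dropping the fourth field. A minimal sketch follows; the function names and the example EC numbers are illustrative choices, not taken from the CATH analysis.

```python
def ec_prefix(ec_number, levels=3):
    """Return the first `levels` fields of an EC number such as '3.4.21.4'."""
    return tuple(ec_number.split(".")[:levels])


def shares_ec_prefix(ec_numbers, levels=3):
    """True if every EC number in the family agrees on the first `levels` fields."""
    return len({ec_prefix(ec, levels) for ec in ec_numbers}) == 1


# Illustrative family: two enzymes that differ only in the fourth EC field.
family_a = ["3.4.21.4", "3.4.21.1"]
# Illustrative family whose members belong to different EC classes.
family_b = ["3.4.21.4", "1.1.1.1"]

print(shares_ec_prefix(family_a))  # same first three fields
print(shares_ec_prefix(family_b))  # different first three fields
```

Setting `levels=4` gives the stricter "single EC code" test mentioned for the families with higher sequence identity.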
Although assigning function on the basis of homology is common practice, it is clear that some caution should be exercised, particularly where there is little or no sequence similarity. There are also some clear examples where homologues with significant sequence similarity perform different functions. The role of 'gene recruitment' is especially clear in the eye lens proteins, which function as enzymes in other cellular environments but which are used as structural proteins in this context (17). The extent of such 'gene recruitment' and context-sensitive function is really not known at this time. For enzymes, it is clear that catalytic function can change and evolve, usually to act on a different but related substrate. Similarly, within the lipocalin family (CATH id 2.40.130.10) several proteins are found with very similar structures which bind different fatty acids in the same region at the base of the β-barrel (e.g. retinol, bilin, biotin).
Nearly half of the homologous families where two or more different EC numbers were observed belong to the superfolds. This suggests that if a new protein is assigned to a superfold family, more caution should be used when inheriting functional information, as there appears to be greater tolerance to changes in sequence, and ultimately function, for these families. However, it is interesting to note that many of these were TIM barrel or Rossmann folds. These are superfolds in which the substrate or ligand commonly binds in the same place: in the base of the β-barrel for the TIMs and at the crossover of the polypeptide chain for the doubly wound Rossmann structures.
Assignment of Function Through Structure
One of the reasons for determining structures is to derive more information to facilitate the assignment of function. From our analysis of proteins in CATH we suggest that structural data can help to assign function in several ways:
i. The structural data allow recognition of more distant homologues compared with sequence data; in our analysis 83% of structures with novel sequences could be assigned as homologues in this way (note that such assignment of function is again subject to the caveats imposed by 'gene recruitment' discussed above).
ii. The structural data allow detailed inspection of the functional site, to suggest if and how the function may have evolved. For example, if an enzyme has evolved to act on a different substrate, the binding site may reveal or at least suggest possible changes in the substrate.
iii. For the superfolds, similarity of structure does not necessarily mean similarity of function. However, the active site/binding sites are often conserved, e.g. in the TIM barrel or Rossmann fold structures the ligand always binds at the same end of the barrel or sheet.
iv. Some methods have already been developed, and will increasingly be the focus of attention over the next few years, which aim to predict function ab initio from structure. For example, enzymes can often be identified by the presence of a major cleft which also locates the active site (18). Similarly, critical surface patches which are used for molecular recognition in binding other proteins or ligands may be identified using knowledge-based approaches (19,20).
In summary, extrapolating the data from Figure 4 to a new genome, we can expect that of the 54–70% of sequences which currently have no obvious sequence matches in the PDB, we will find nearly 80–90% to be homologous to a known family using the structural data alone. For the singlet folds this
will almost certainly reveal some clues to the function. For the superfolds, some folds will reveal information on the functional class (e.g. enzyme for TIM barrels) or the location of the active site, if not the specific function. Only 10–20% will be expected to be 'novel' folds. For these, the ab initio methods referred to above may provide some clues to guide experiments. Therefore it is clear that determining structures, as part of a 'structural genomics' initiative for example, will make a major contribution to interpreting genome data.
JAIL
Just Another Interface Library
Interfaces of macromolecules are a valuable basis to analyse the process of molecular recognition. JAIL classifies not only the interfaces between domain architectures but also those between protein chains and those between proteins and nucleic acids.
Figure: Gt-alpha/Gi-alpha chimera (PDB-ID 1GOT) and the interfaces of 1GOT.
Interacting proteins are difficult to crystallize and are rarely present in the Protein Data Bank. Nevertheless, it is essential to analyse the interacting parts of the proteins to understand the process of protein-protein docking.
To overcome this problem we have built up the JAIL database. Since interacting domains exhibit structural features similar to those of proteins, all known interfaces between interacting domains of the SCOP database were extracted and classified in JAIL.
Only a part of all protein structures is included in SCOP. In particular, new PDB entries are not yet annotated. To overcome this problem, all interfaces between protein chains were additionally calculated and included in the database. This type of interface also comprises the interacting parts of the assumed biological units. The last important type of interface provided here is composed of the interacting parts between proteins and nucleic acids. Overall the data set consists of about 180000 interfaces.
JAIL is a convenient tool for browsing the interface library and analysing single interfaces. However, more general questions require large-scale analysis. For this purpose, a detailed form enables the compiling of comprehensive non-redundant data sets for download.
How is an interface defined?
A complete residue is part of an interface if at least one atom of the amino acid is located within a range of 4.5 Angstrom of any atom of the interacting domain or chain. One part of an interface must consist of at least 5 C-alpha atoms in the case of protein chains. In the case of nucleic acids, the relevant atoms are calculated on the basis of the phosphorus atoms of the RNA/DNA backbone.
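The distance rule above can be sketched directly. This is only an illustration of the stated definition, not the JAIL implementation: the atom representation, the function names and the toy coordinates are assumptions, and "at least 5 C-alpha atoms" is interpreted here as "at least 5 residues", since each residue carries one C-alpha atom.

```python
CUTOFF = 4.5       # Angstrom, the distance quoted in the definition
MIN_RESIDUES = 5   # assumption: one C-alpha per residue, so 5 C-alpha atoms = 5 residues


def within(a, b, cutoff):
    """True if points a and b (x, y, z tuples) are no further apart than cutoff."""
    return sum((p - q) ** 2 for p, q in zip(a, b)) <= cutoff ** 2


def interface_residues(atoms_a, atoms_b, cutoff=CUTOFF):
    """Residue ids of side A having at least one atom within cutoff of any atom of side B.

    Each atom is a (residue_id, (x, y, z)) pair; this layout is hypothetical.
    """
    hits = set()
    for res_id, xyz in atoms_a:
        if res_id in hits:
            continue  # the whole residue already counts as interface
        if any(within(xyz, other_xyz, cutoff) for _, other_xyz in atoms_b):
            hits.add(res_id)
    return hits


def is_valid_part(residues, min_residues=MIN_RESIDUES):
    """Apply the minimum-size rule to one side of a protein-protein interface."""
    return len(residues) >= min_residues


# Toy coordinates: residue A1 lies 3 Angstrom from chain B, residue A2 lies 7 Angstrom away.
chain_a = [("A1", (0.0, 0.0, 0.0)), ("A2", (10.0, 0.0, 0.0))]
chain_b = [("B1", (3.0, 0.0, 0.0))]
part = interface_residues(chain_a, chain_b)
print(part, is_valid_part(part))
```

The all-against-all comparison is quadratic in the number of atoms; for whole PDB entries a spatial grid or neighbour search would normally replace the inner loop.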
What are biounits?
The primary coordinate file deposited in the PDB generally contains one asymmetric unit. The asymmetric unit is the smallest portion of a crystal structure to which crystallographic symmetry can be applied to generate one unit cell. The biological molecule (biounit) is believed to be the functional unit of the protein. Frequently those units can be assumed or calculated when additional information is available. The biological units of many proteins are deposited in a separate section of the PDB database and can be used for interface calculations.
In which way are redundant interfaces excluded in the download section?
Redundancy is excluded in two different ways: by structure and by sequence. The sequential clustering is based on the CD-HIT program. The structural clustering is defined by the protein families and superfamilies of the SCOP classification. The database classifies proteins by domain architecture.
Which settings in the download section are best for my own research?
The selection of the datasets depends on the type of interactions (protein-protein or protein-nucleic acid) and the level of diversity that is desired. A maximum sequence identity of 50% results in a higher diversity than a setting of 95%. The default settings include interfaces of domain-domain interactions as well as interfaces between interacting chains. All interfaces of chains that were already covered by the SCOP domain interfaces are excluded by default. This procedure results in a high number of interfaces that are still diverse enough for statistical analysis.
What is meant by "show conservation" in Jmol?
The conservation of protein sequences is defined by the mutation rates at each amino acid
position. For JAIL this information was retrieved from ConSurf. ConSurf is a derived database merging structural and sequence information.
The JAIL search form accepts the following query types:
PDB-Id, e.g. 1aay
SCOP-Id, e.g. d1az0a_
EC number, e.g. 3.1.1
Accession number, e.g. P03697
Protein name, e.g. capsid protein
Searches can be restricted to the following interface types:
Domain/Domain (SCOP)
Chain/Chain
Protein/Nucleic
Biounit/Biounit
Full-text and SCOP text keyword searches are also available, and results can be limited to interfaces with an intra or inter location.
MMDB
Experimentally resolved structures of proteins, RNA and DNA, derived from the Protein Data Bank (PDB), with value-added features such as explicit chemical graphs, computationally identified 3D domains (compact substructures) that are used to identify similar 3D structures, as well as links to literature, similar sequences, information about chemicals bound to the structures, and more. These connections make it possible, for example, to find 3D structures for homologs of a protein sequence of interest, then interactively view the sequence-structure relationships, active sites, bound chemicals, journal articles and more.
Three-dimensional structures are now known within many protein families, and it is quite likely, in searching a sequence database, that one will encounter a homolog with known structure. The goal of Entrez's 3D-structure database is to make this information, and the functional annotation it can provide, easily accessible to molecular biologists. To this end Entrez's search engine provides three powerful features: (i) Sequence and structure neighbors: one may select all sequences similar to one of interest, for example, and link to any known 3D structures. (ii) Links between databases: one may search by term matching in MEDLINE, for example, and link to 3D structures reported in these articles. (iii) Sequence and structure visualization: identifying a homolog with known structure, one may view molecular-graphic and alignment displays to infer approximate 3D structure. In this article we focus on two features of Entrez's Molecular Modeling Database (MMDB) not described previously: links from individual biopolymer chains within 3D structures to a systematic taxonomy of organisms represented in molecular databases, and links from individual chains (and compact 3D domains within them) to structure neighbors, i.e. other chains (and 3D domains) with similar 3D structure. MMDB may be accessed at http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Structure
SUPERFAMILY is a database of structural and functional annotation for all proteins and genomes.[1][2][3][4][5] The SUPERFAMILY annotation is based on a collection of hidden Markov models, which represent structural protein domains at the SCOP superfamily level.[6] A superfamily groups together domains which have an evolutionary relationship. The annotation is produced by scanning protein sequences from completely sequenced genomes against the hidden Markov models.
For each protein you can:
Submit sequences for SCOP classification
View domain organisation, sequence alignments and protein sequence details
For each genome you can:
Examine superfamily assignments, phylogenetic trees, domain organisation lists and networks
Check for over- and under-represented superfamilies within a genome
For each superfamily you can:
Inspect SCOP classification, functional annotation, Gene Ontology annotation, InterPro abstract and genome assignments
Explore taxonomic distribution of a superfamily across the tree of life
All annotation, models and the database dump are freely available for download to everyone.
Purpose
SUPERFAMILY classifies amino acid sequences into known structural domains, especially into SCOP superfamilies. The superfamilies are groups of proteins which have structural evidence to support a common evolutionary ancestor but may not have detectable sequence homology.
Major Features
Sequence search: Submit your protein or DNA sequence for SCOP superfamily and family level classification.
Keyword search: Search for superfamily, family or species names, plus sequence, SCOP, PDB or hidden Markov model IDs.
Domain assignments: Domain assignments, alignments and architectures for completely sequenced eukaryotic and prokaryotic organisms, plus sequence collections.
Comparative genomics tools: Browse unusual (over- and under-represented) superfamilies and families, adjacent domain pair lists and graphs, unique domain pairs, domain combinations, domain architecture co-occurrence networks and domain distribution across taxonomic kingdoms for each organism.
Genome statistics: For each genome: number of sequences, number of sequences with assignment, percentage of sequences with assignment, percentage total sequence coverage, number of domains assigned, number of superfamilies assigned, number of families assigned, average superfamily size, percentage produced by duplication, average sequence length, average length matched, number of domain pairs and number of unique domain architectures.
Gene Ontology: Domain-centric Gene Ontology (GO), automatically annotated, by Hai Fang.
Phenotype Ontology: Domain-centric phenotype/anatomy ontology, including Disease Ontology, Human Phenotype, Mouse Phenotype, Worm Phenotype, Yeast Phenotype, Fly Phenotype, Fly Anatomy, Zebrafish Anatomy, Xenopus Anatomy, Arabidopsis Plant.
Superfamily annotation: InterPro abstracts for 1052 superfamilies and Gene Ontology (GO) annotation for 763 superfamilies.
Functional annotation: Functional annotation of SCOP 1.73 superfamilies by Christine Vogel.
Phylogenetic trees: Trees are generated using heuristic parsimony methods and are based on protein domain architecture data for all genomes in SUPERFAMILY. Genome combinations or specific clades can be displayed as individual trees.
Similar domain architectures: Find the 10 domain architectures which are most similar to a domain architecture of interest.
Hidden Markov models: Produce SCOP domain assignments for your sequences using the SUPERFAMILY models. HMM visualisation by Martin Madera, e.g. model 0045110.
Profile comparison: Find remote domain matches when the HMM search fails to find a significant match. Profile comparison (PRC), for aligning and scoring two profile hidden Markov models, by Martin Madera.
Web services: Distributed Annotation Server and linking to SUPERFAMILY.
Downloads: Sequences, assignments, models, MySQL database and scripts, updated weekly.
The SUPERFAMILY database provides protein domain assignments, at the SCOP superfamily level, for the predicted protein sequences in over 400 completed genomes. A superfamily groups together domains of different families which have a common evolutionary ancestor based on structural, functional and sequence data. SUPERFAMILY domain assignments are generated using an expert curated set of profile hidden Markov models. All models and structural assignments are available for browsing and download from http://supfam.org. The web interface includes services such as domain architectures and alignment details for all protein assignments, searchable domain combinations, domain occurrence network visualization, detection of over- or under-represented superfamilies for a given genome by comparison with other genomes, assignment of manually submitted sequences and keyword searches. In this update we describe the SUPERFAMILY database and outline two major developments: (i) incorporation of family level assignments and (ii) a superfamily-level functional annotation. The SUPERFAMILY database can be used for general protein evolution and superfamily-specific studies, genomic annotation, and structural genomics target suggestion and assessment.
Goat GoatMap mapping the caprine genome
Horse (Equus caballus)
Horse ArkDB Horse mapping database
Medaka Fish
Medaka Medaka fish home page
Maize
Maize Maize genome database
Malaria (Plasmodium spp)
Malaria Malaria genetics and genomics
PlasmoDB Plasmodium falciparum genome database
Parasites Parasite databases of clustered ESTs
Parasite Genome Parasite genome databases
Mosquito
Mosquito Mosquito genome web server
Mouse (Mus musculus)
ENSEMBL Mouse genome server at ENSEMBL
Jackson Lab Mouse Resources
MRC Mouse genome center at MRC UK
MGI Mouse genome informatics at Jackson Labs
MGD Mouse genome database
MGS Mouse genome sequencing at NIH
MIT Genetic and physical maps of the mouse genome
Mouse SNP Mouse SNP database
NCI Mouse repository
NIH NIH mouse initiative
ORNL Mutant mouse database
RIKEN Mouse resources
Rodentia The whole mouse catalog
Pig (Sus scrofa)
INCO Pig trait gene mapping
Pig Pig EST database
Pig Pig gene mapping project
PiGBase Pig genome mapping
Pig ArkDB Pig Ark DB
Plants
PlantGDB Resources for plant comparative genomics
Protozoa
Protozoa Protozoan genomes
Pufferfish
Fugu Puffer fish project UK site
Fugu Fugu genome project Singapore
Fugu Puffer fish project USA
Rat (Rattus norvegicus)
MIT Genetic maps of the Rat genome
NIH Rat genomics and genetics
Rat RatMap
RGD Rat genome database
Rice (Oryza sativa)
MPSS Massively parallel signature sequencing
Rice-research Rice genome sequence database
Rice Rice genome project
Rickettsia
RicBase Rickettsia genome database
Salmon
Salmon ArkDB Salmon mapping database
Sheep (Ovis aries)
Sheep Sheep gene mapping
SheepBase Sheep gene mapping
Sheep ArkDB Sheep mapping database
Soy
Soy Soybeans database
Sorghum
Sorghum Sorghum Genomics
Tetraodon
Tetraodon Tetraodon nigroviridis genome
Tetraodon Tetraodon nigroviridis genome at Whitehead
Tilapia
HCGS Tilapia genome
Tilapia ArkDB Tilapia mapping database
Turkey
Turkey ArkDB Turkey mapping database
Viruses
HIV HIV sequence database
Herpes Human herpes virus 5 database
Worm (Caenorhabditis elegans)
C. elegans C. elegans genome sequencing project
NemBase Resource for nematode sequence and functional data
WormAtlas Anatomy of C. elegans
WormBase The genome and biology of C. elegans
ACEDB A C. elegans database
WWW Server C. elegans web server
Yeast
SCPD The promoter database of Saccharomyces cerevisiae
SGD Saccharomyces genome database
S. pombe Schizosaccharomyces pombe genome project
TRIPLES Functional analysis of Yeast genome at Yale
Yeast Intron database Spliceosomal introns of the yeast
Zebra fish (Danio rerio)
ZFIN Zebrafish information network
ZGR Zebrafish genome resources
ZIS Zebrafish information server
Zebrafish Zebrafish webserver
DOMAIN DATABASE
Domains can be thought of as distinct functional and/or structural units of a protein. These two classifications coincide rather often, as a matter of fact, and what is found as an independently folding unit of a polypeptide chain often also carries a specific function. Domains are often identified as recurring (sequence or structure) units, which may exist in various contexts. In molecular evolution such domains may have been utilized as building blocks, and may have been recombined in different arrangements to modulate protein function. We can define conserved domains as recurring units in molecular evolution, the extents of which can be determined by sequence and structure analysis. Conserved domains contain conserved sequence patterns or motifs, which allow for their detection in polypeptide sequences.
The goal of the NCBI conserved domain curation project is to provide database users with insights into how patterns of residue conservation and divergence in a family relate to functional properties, and to provide useful links to more detailed information that may help to understand those sequence/structure/function relationships. To do this, CDD curators include the following types of information in order to supplement and enrich the traditional multiple sequence alignments that form the foundation of domain models: 3-dimensional structures and conserved core motifs, conserved features/sites, phylogenetic organization, and links to electronic literature resources.
CDD
Conserved Domain Database (CDD)
CDD is a protein annotation resource that consists of a collection of well-annotated multiple sequence alignment models for ancient domains and full-length proteins. These are available as position-specific score matrices (PSSMs) for fast identification of conserved domains in protein sequences via RPS-BLAST. CDD content includes NCBI manually curated domains, which use 3D-structure information to explicitly define domain boundaries and provide insights into sequence/structure/function relationships, as well as domain models imported from a number of external source databases (Pfam, SMART, COG, PRK, TIGRFAM).
CD-Search & Batch CD-Search
CD-Search is NCBI's interface to searching the Conserved Domain Database with protein or nucleotide query sequences. It uses RPS-BLAST, a variant of PSI-BLAST, to quickly scan a set of pre-calculated position-specific scoring matrices (PSSMs) with a protein query. The results of CD-Search are presented as an annotation of protein domains on the user query sequence, and can be visualized as domain multiple sequence alignments with embedded user queries. High-confidence associations between a query sequence and conserved domains are shown as specific hits. The CD-Search Help provides additional details, including information about running CD-Search locally.
Batch CD-Search serves as both a web application and a script interface for a conserved domain search on multiple protein sequences, accepting up to 100000 proteins in a single job. It enables you to view a graphical display of the concise or full search result for any individual protein from your input list, or to download the results for the complete set of proteins. The Batch CD-Search Help provides additional details.
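To make the idea of scanning a query with a position-specific scoring matrix concrete, here is a deliberately simplified sketch. It is not RPS-BLAST, which uses heuristics, gapped alignment and statistical significance estimates; it only slides an ungapped toy matrix along a sequence. The matrix values, the zero score for unlisted residues and the function name are all assumptions made for illustration.

```python
def best_window(query, pssm):
    """Slide an ungapped PSSM along the query and return (start, score) of the best window.

    pssm is a list of dicts, one per column, mapping residue letter -> score.
    Residues missing from a column score 0 (an assumption of this sketch).
    Returns None if the query is shorter than the matrix.
    """
    width = len(pssm)
    best = None
    for start in range(len(query) - width + 1):
        window = query[start:start + width]
        score = sum(column.get(residue, 0) for column, residue in zip(pssm, window))
        if best is None or score > best[1]:
            best = (start, score)
    return best


# Toy three-column matrix favouring the pattern A-C-(D or E).
toy_pssm = [
    {"A": 2, "G": -1},
    {"C": 3},
    {"D": 1, "E": 1},
]

hit = best_window("GGACDA", toy_pssm)
print(hit)
```

Because each column has its own scores, the same residue can be rewarded at one position and penalised at another, which is what distinguishes a PSSM from a single substitution matrix.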
CDART: Domain Architectures
Conserved Domain Architecture Retrieval Tool (CDART) performs similarity searches of the Entrez Protein database based on domain architecture, defined as the sequential order of conserved domains in protein
queries. CDART finds protein similarities across significant evolutionary distances using sensitive domain profiles rather than direct sequence similarity. Proteins similar to the query are grouped and scored by architecture. You can search CDART directly with a query protein sequence or, if a sequence of interest is already in the Entrez Protein database, simply retrieve the record, open its Links menu and select Domain Relatives to see the precalculated CDART results. Relying on domain profiles allows CDART to be fast and, because it relies on annotated functional domains, informative.
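Since a domain architecture is defined above as the sequential order of conserved domains, grouping proteins by architecture reduces to sorting each protein's domain hits by position and using the resulting tuple as a key. The sketch below illustrates only that grouping step; the protein identifiers, domain names and coordinates are hypothetical, and CDART's actual scoring is not reproduced.

```python
from collections import defaultdict


def architecture(domain_hits):
    """Return domain names in N- to C-terminal order.

    domain_hits is a list of (start_position, domain_name) pairs.
    """
    return tuple(name for _, name in sorted(domain_hits))


def group_by_architecture(proteins):
    """Map each architecture to the list of protein ids that share it."""
    groups = defaultdict(list)
    for protein_id, hits in proteins.items():
        groups[architecture(hits)].append(protein_id)
    return dict(groups)


# Hypothetical annotations: protA and protB share an architecture, protC does not.
proteins = {
    "protA": [(150, "SH2"), (80, "SH3"), (270, "Pkinase")],
    "protB": [(60, "SH3"), (140, "SH2"), (250, "Pkinase")],
    "protC": [(30, "SH2"), (200, "Pkinase")],
}

groups = group_by_architecture(proteins)
for arch, members in groups.items():
    print(" - ".join(arch), members)
```

Comparing tuples of domain names rather than sequences is what lets proteins with little direct sequence similarity land in the same group.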
CDTree
CDTree is a helper application for your web browser that allows you to interactively view and examine conserved domain hierarchies curated at NCBI. CDTree works with Cn3D as its alignment viewer/editor; it is used in the CDD curation process and is both a classification and research tool for functional annotation and the study of protein and protein domain families.
Content
CDD content includes NCBI manually curated domain models and domain models imported from a number of external source databases (Pfam, SMART, COG, PRK, TIGRFAMs). What is unique about NCBI-curated domains is that they use 3D-structure information to explicitly define domain boundaries, align blocks, amend alignment details, and provide insights into sequence/structure/function relationships. Manually curated models are organized hierarchically if they describe domain families that are clearly related by common descent. To provide a non-redundant view of the data, CDD clusters similar domain models from various sources into superfamilies.
Searching the database
The collection is also part of NCBI's Entrez query and retrieval system, crosslinked to numerous other resources. CDD provides annotation of domain footprints and conserved functional sites on protein sequences. Precalculated domain annotation can be retrieved for protein sequences tracked in NCBI's Entrez system, and CDD's collection of models can be queried with novel protein sequences via the CD-Search service, or via the Batch CD-Search, which allows the computation and download of annotation for large sets of protein queries. CDD also contains data from additional research projects, such as KOGs (a eukaryotic counterpart to COGs) and the Library of Ancient Domains (LOAD). Accessions that start with cl are for superfamily cluster records and can contain domain models from one or more source databases. When searching CDD, it is possible to limit search results to domains from any given source database by using the Database Search Field.
Phylogenetic organization: Based on evidence from sequence comparison, NCBI
Conserved Domain Curators attempt to organize related domain models into phylogenetic family hierarchies.
Links to electronic literature resources: NCBI curated domains also provide links to citations in PubMed and NCBI Bookshelf that discuss the domain. These references are selected by curators and, whenever possible, include articles that provide evidence for the biological function of the domain and/or discuss the evolution and classification of a domain family.
PFAM
Pfam 27.0 (Mar 2013, 14831 families)
Proteins are generally comprised of one or more functional regions, commonly termed domains. The presence of different domains in varying combinations in different proteins gives rise to the diverse repertoire of proteins found in nature. Identifying the domains present in a protein can provide insights into the function of that protein.
The Pfam database is a large collection of protein domain families. Each family is represented by multiple sequence alignments and hidden Markov models (HMMs).
There are two levels of quality to Pfam families: Pfam-A and Pfam-B. Pfam-A entries are derived from the underlying sequence database, known as Pfamseq, which is built from the most recent release of UniProtKB at a given time-point. Each Pfam-A family consists of a curated seed alignment containing a small set of representative members of the family, profile hidden Markov models (profile HMMs) built from the seed alignment, and an automatically generated full alignment, which contains all detectable protein sequences belonging to the family, as defined by profile HMM searches of primary sequence databases.
Pfam-B families are un-annotated and of lower quality, as they are generated automatically from the non-redundant clusters of the latest ADDA release. Although of lower quality, Pfam-B families can be useful for identifying functionally conserved regions when no Pfam-A entries are found.
Pfam entries are classified in one of four ways:
Family: A collection of related protein regions
Domain: A structural unit
Repeat: A short unit which is unstable in isolation but forms a stable structure when multiple copies are present
Motifs: A short unit found outside globular domains
Related Pfam entries are grouped together into clans; the relationship may be defined by similarity of sequence, structure or profile-HMM.
Pfam also generates higher-level groupings of related families, known as clans. A clan is a collection of Pfam-A entries which are related by similarity of sequence, structure or profile-HMM.
You can find data in Pfam in various ways:
Analyze your protein sequence for Pfam matches
View Pfam family annotation and alignments
See groups of related families
Look at the domain organisation of a protein sequence
Find the domains on a PDB structure
Query Pfam by keywords
Enter any type of accession or ID to jump to the page for a Pfam family or clan, UniProt sequence, PDB structure, etc.
Browse Pfam
You can use the links below to find lists of families, clans or proteomes which begin with the chosen letter (or number). You can also see a list of Pfam families which are new to this release, or the list of the twenty largest families in terms of number of sequences.
Glossary of terms used in Pfam
These are some of the commonly used terms in the Pfam website
Alignment coordinates
HMMER3 reports two sets of domain coordinates for each profile HMM match. The envelope coordinates delineate the region on the sequence where the match has been probabilistically determined to lie, whereas the alignment coordinates delineate the region over which HMMER is confident that the alignment of the sequence to the profile HMM is correct. Our full alignments contain the envelope coordinates from HMMER3.
Architecture
The collection of domains that are present on a protein
Clan
A collection of related Pfam entries. The relationship may be defined by similarity of sequence, structure or profile-HMM.
Domain
A structural unit
Domain score
The score of a single domain aligned to an HMM. Note that, for HMMER2, if there was more than one domain, the sequence score was the sum of all the domain scores for that Pfam entry. This is not quite true for HMMER3.
DUF
Domain of unknown function
Envelope coordinates
See Alignment coordinates
Family
A collection of related protein regions
Full alignment
An alignment of the set of related sequences which score higher than the manually set threshold values for the HMMs of a particular Pfam entry.
Gathering threshold (GA)
Also called the gathering cut-off, this value is the search threshold used to build the full alignment. The gathering threshold is assigned by a curator when the family is built. The GA is the minimum score a sequence must attain in order to belong to the full alignment of a Pfam entry. For each Pfam HMM we have two GA cutoff values: a sequence cutoff and a domain cutoff.
HMMER
The suite of programs that Pfam uses to build and search HMMs. Since Pfam release 24.0 we have used HMMER version 3 to make Pfam. See the HMMER site for more information.
Hidden Markov model (HMM)
An HMM is a probabilistic model. In Pfam we use HMMs to transform the information contained within a multiple sequence alignment into a position-specific scoring system. We search our HMMs against the UniProt protein database to find homologous sequences.
HMMER3
The suite of programs that Pfam uses to build and search HMMs. See the HMMER site for more information.
iPfam
A resource that describes domain-domain interactions that are observed in PDB entries. Where two or more Pfam domains occur in a single structure, it analyses them to see if they are close enough to form an interaction. If they are close enough, it calculates the bonds forming the interaction.
Metaseq
A collection of sequences derived from various metagenomics datasets
Motif
A short unit found outside globular domains
Noise cutoff (NC)
The bit scores of the highest scoring match not in the full alignment
Pfam-A
An HMM-based, hand-curated Pfam entry which is built using a small number of representative sequences. We manually set a threshold value for each HMM and search our models against the UniProt database. All of the sequences which score above the threshold for a Pfam entry are included in the entry's full alignment.
Pfam-B
An automatically generated alignment which is formed by taking a cluster of sequences from the ADDA database and removing Pfam-A residues from them. Since Pfam-B families are automatically generated, we recommend that you verify that the sequences in a Pfam-B family are related using other methods, such as BLAST. For Pfam 24.0 we have made HMMs for the first (and therefore largest) 20,000 Pfam-B families. Users can search their sequences against the Pfam-B HMMs, in addition to the Pfam-A HMMs, when performing both single-sequence searches and batch searches on the website.
Posterior probability
HMMER3 reports a posterior probability for each residue that matches a match or insert state in the profile HMM. A high posterior probability shows that the alignment of the amino acid to the match/insert state is likely to be correct, whereas a low posterior probability indicates that there is alignment uncertainty. This is indicated on a scale with '*' being 10, the highest certainty, down to 1 being complete uncertainty. Within Pfam we display this information as a heat map view, where green residues indicate high posterior probability and red ones indicate a lower posterior probability.
Repeat
A short unit which is unstable in isolation but forms a stable structure when
multiple copies are present
Seed alignment
An alignment of a set of representative sequences for a Pfam entry. We use this alignment to construct the HMMs for the Pfam entry.
Sequence score
The total score of a sequence aligned to an HMM. If there is more than one domain, the sequence score is the sum of all the domain scores for that Pfam entry. If there is only a single domain, the sequence and the domain scores for the protein will be identical. We use the sequence score to determine whether a sequence belongs to the full alignment of a particular Pfam entry.
Trusted cutoff (TC)
The bit scores of the lowest scoring match in the full alignment
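As a minimal sketch of how the gathering threshold, trusted cutoff and noise cutoff relate to each other: the full alignment holds every match scoring at or above GA, TC is the lowest score inside it, and NC is the highest score outside it. The function name and the toy bit scores are illustrative assumptions, not Pfam code, and the sketch uses a single cutoff although Pfam keeps separate sequence and domain cutoffs.

```python
def cutoffs(scores, ga):
    """Split match bit scores at the gathering threshold (GA).

    Returns (full, tc, nc): the scores kept in the full alignment,
    the trusted cutoff (lowest kept score) and the noise cutoff
    (highest rejected score). tc or nc is None if that side is empty.
    """
    full = [s for s in scores if s >= ga]
    rejected = [s for s in scores if s < ga]
    tc = min(full) if full else None
    nc = max(rejected) if rejected else None
    return full, tc, nc


# Hypothetical bit scores for matches to one Pfam HMM.
scores = [87.3, 41.0, 25.2, 19.8, 12.4]
full, tc, nc = cutoffs(scores, ga=22.0)
print(full, tc, nc)
```

By construction NC < GA <= TC, which is why the two values bracket the curator's threshold.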
SMART (a Simple Modular Architecture Research Tool) allows the identification and annotation of genetically mobile domains and the analysis of domain architectures. More than 500 domain families found in signalling, extracellular and chromatin-associated proteins are detectable. These domains are extensively annotated with respect to phyletic distributions, functional class, tertiary structures and functionally important residues. Each domain found in a non-redundant protein database, as well as search parameters and taxonomic information, are stored in a relational database system. User interfaces to this database allow searches for proteins containing specific combinations of domains in defined taxa.
Features
You can use SMART in two different modes: normal or genomic. The main difference is in the underlying protein database used. In Normal SMART, the database contains Swiss-Prot, SP-TrEMBL and stable Ensembl proteomes. In Genomic SMART, only the proteomes of completely sequenced genomes are used: Ensembl for metazoans and Swiss-Prot for the rest.
The protein database in Normal SMART has significant redundancy, even though identical proteins are removed. If you use SMART to explore domain architectures, or want to find exact domain counts in various genomes, consider switching to Genomic mode. The numbers in the domain annotation pages will be more accurate, and there will not be many protein fragments corresponding to the same gene in the architecture query results. Remember that you are exploring a limited set of genomes, though.
Different color schemes are used to easily identify the current mode (Normal mode or Genomic mode).
Alignment
Representation of a prediction of the amino acids in tertiary structures of homologues that overlay in three dimensions. Alignments held by SMART are mostly based on published observations (see domain annotations for details), but are updated and edited manually.
Alignment block
Ungapped alignments that usually represent a single secondary structure
Bits scores
Alignment scores are reported by HMMer and BLAST as bits scores. The likelihood that the query sequence is a bona fide homologue of the database sequence is compared to the likelihood that the sequence was instead generated by a random model. Taking the logarithm (to base 2) of this likelihood ratio gives the bits score.
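The definition above amounts to a base-2 log-odds ratio. A minimal sketch, assuming the two likelihoods are already available as plain numbers (the function name is hypothetical; real tools work in log space to avoid underflow):

```python
import math


def bits_score(p_model, p_random):
    """Log-odds score in bits: log2 of the likelihood ratio.

    p_model  : likelihood of the sequence under the homology model
    p_random : likelihood of the same sequence under the random model
    """
    return math.log2(p_model / p_random)


# A sequence that is 8 times more likely under the model scores 3 bits.
print(bits_score(8.0, 1.0))
```

A score of 0 bits means the two explanations are equally likely; each additional bit doubles the odds in favour of homology.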
BLAST: Basic local alignment search tool
An excellent database searching tool developed at the National Center for Biotechnology Information (NCBI) ([1] [2] [3] [4]). SMART uses NCBI-BLAST for detection of outlier homologues and homologues of known structure. WU-BLAST is used for nrdb searches with user-supplied sequences.
Cellular Role
Chromatin-associated: Chromatin is the tangled fibrous complex of DNA and protein within a eukaryotic nucleus.
Interaction (with the environment): Molecules that sense cellular environmental change, such as osmolarity, light flux, acidity, ion concentration, etc.
Metabolic: Enzymes that catalyze reactions in living cells that transform organic molecules.
Replication: The process of making an identical copy of a section of duplex (double-stranded) DNA, using existing DNA as a template for the synthesis of new DNA strands.
Signalling: Proteins that participate in a pathway that is initiated by a molecular stimulus and terminated by a cellular response.
Transport: Transporters (permeases) are proteins that straddle the cell membrane and carry specific nutrients, ions, etc. across the membrane.
Translation: The process in which the genetic code carried by messenger RNA directs the synthesis of proteins from amino acids.
Transcription: The synthesis of an RNA copy from a sequence of DNA (a gene); the first step in gene expression.
Coiled coils
Intimately-associated bundles of long alpha-helices ([1] [2] [3]). Coiled coils are detected in SMART using the method of Lupas et al. ([4], COILS home). Coiled coil predictions are indicated on the second line in SMART's graphical output.
Domain
Conserved structural entities with distinctive secondary structure content and a hydrophobic core. In small disulphide-rich and Zn2+-binding or Ca2+-binding domains, the hydrophobic core may be provided by cystines and metal ions, respectively. Homologous domains with common functions usually show sequence similarities.
Domain composition
Proteins with the same domain composition have at least one copy of each of the domains of the query.
Domain organisation
Proteins having all the domains of the query in the same order (additional domains are allowed).
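The difference between the two definitions can be illustrated with a small sketch. It assumes a protein's architecture is represented as an ordered list of domain names; the function names and example architectures are hypothetical, not part of SMART.

```python
def same_composition(query, protein):
    """True if protein has at least one copy of every query domain."""
    return set(query) <= set(protein)


def same_organisation(query, protein):
    """True if the query domains occur in protein in the same order.

    Additional domains may be interspersed, so this is a subsequence
    test: each membership check consumes the iterator up to the match.
    """
    remaining = iter(protein)
    return all(domain in remaining for domain in query)


query = ["SH3", "SH2", "TyrKc"]
with_extra = ["SH3", "PH", "SH2", "TyrKc"]   # same order, one extra domain
shuffled = ["TyrKc", "SH2", "SH3"]           # same domains, other order

print(same_composition(query, with_extra), same_organisation(query, with_extra))
print(same_composition(query, shuffled), same_organisation(query, shuffled))
```

Every protein with the same organisation also has the same composition, but not the other way round.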
Entrez
A WWW-based system that allows easy retrieval of sequence, structure, molecular biology and literature data (Entrez). SMART's domain annotation pages contain links to the Entrez system, thereby providing extensive literature, structure and sequence information.
E-value
This represents the number of sequences with a score greater than or equal to X expected absolutely by chance. The E-value connects the score (X) of an alignment between a user-supplied sequence and a database sequence, generated by any algorithm, with how many alignments with similar or greater scores would be expected from a search of a random sequence database of equivalent size. Since version 2.0, E-values are calculated using Hidden Markov Models, leading to more accurate estimates than before.
Extracellular Domains
Domain families that are most prevalent in proteins outside of the cytoplasm and the nucleus
Gap
A position in an alignment that represents a deletion within one sequence relative to another. Gap penalties are requirements for alignment algorithms in order to reduce excessively-gapped regions. Gaps in alignments represent insertions that usually occur in protruding loops or beta-bulges within protein structures.
Genomic database
Protein database used in SMART's Genomic mode. It contains data from completely sequenced genomes only. Ensembl data is used for Metazoan genomes and Swiss-Prot for others. A complete list of genomes in the database is available.
HMM: Hidden Markov model
HMMs are statistical models of the sequence consensus of a homologous family (see the documentation). A particular class of HMMs has been shown to be equivalent to generalised profiles (8867839). Applications of HMMs to sequence analysis are nicely provided by HMMer and SAM.
HMM consensus
The HMM consensus is a one-line summary of the corresponding HMM. The amino acid shown for the consensus is the highest probability amino acid at that position according to the HMM. Capital letters mean highly conserved residues (probability > 0.5 for protein models) (modified from the HMMer User's Guide).
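The rule described above can be sketched as follows. The input format (one dictionary of emission probabilities per match position) and the function name are assumptions made for illustration; this is not HMMer's own code.

```python
def hmm_consensus(match_emissions, threshold=0.5):
    """Build a one-line consensus from per-position emission probabilities.

    match_emissions: list of dicts mapping amino acid -> probability,
    one dict per match position (a simplified, hypothetical format).
    The most probable residue is shown at each position, in upper case
    when its probability exceeds the threshold and lower case otherwise.
    """
    line = []
    for probs in match_emissions:
        residue, p = max(probs.items(), key=lambda item: item[1])
        line.append(residue.upper() if p > threshold else residue.lower())
    return "".join(line)


emissions = [
    {"G": 0.9, "A": 0.1},               # strongly conserved glycine
    {"L": 0.4, "I": 0.35, "V": 0.25},   # weakly conserved hydrophobic site
    {"K": 0.6, "R": 0.4},
]
print(hmm_consensus(emissions))  # GlK
```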
HMMer
The HMMer package ([1] [2]) provides multiple alignment and database searching capabilities. There are several programs in the package (see the documentation), including one (hmmfs) that searches databases for non-overlapping LOCAL similarities (i.e. that match across at least part of the HMM) and another (hmmls) that searches databases for non-overlapping GLOBAL similarities (i.e. that match across the full HMM). These correspond approximately to profile-based searches using negative and positive profiles, respectively (see WiseTools). Database searches using hmmls or hmmfs provide alignment scores as bits scores.
Homology
Evolutionary descent from a common ancestor due to gene duplication
Intracellular Domains
Domain families that are most prevalent in proteins within the cytoplasm
Localisation
Numbers of domains that are thought, from SwissProt annotations, to be present in different cellular compartments (cytoplasm, extracellular space, nucleus and membrane-associated) are shown in annotation pages.
Motif
Sequence motifs are short conserved regions of polypeptides. Sets of sequence motifs need not necessarily represent homologues.
NRDB: non-redundant database
A database that contains no identical pairs of sequences. It can contain multiple sequences originating from the same gene (fragments, alternative splicing products). SMART in Standard mode uses such a database.
ORF
Open reading frame
Outlier homologues
These are often difficult to detect using HMM methodology. A complementary approach to their detection is to query a database of sequences taken from multiple sequence alignments using BLAST. Selecting this option will also activate searches against sequence databases derived from proteins of known structure. A simple BLAST search of the PDB is performed, together with a search of RPS_Blast profiles derived from SCOP. These profiles were kindly provided by Steffen Schmidt (see Schmidt et al., J Chem Inf Comput Sci 2002 (42) 405-7).
P-value
This represents a probability that, given a database of a particular size, random sequences score higher than a value X. P-values are generated by the BLAST algorithm that has been integrated into SMART.
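E-values and P-values are two views of the same statistic. A minimal sketch of the conversion commonly used in BLAST statistics, assuming chance hits follow a Poisson distribution, so that P = 1 - exp(-E); the notes do not state which formula SMART itself applies, and the function name is hypothetical.

```python
import math


def p_from_e(e_value):
    """P-value implied by an E-value under a Poisson model of chance hits.

    Computes 1 - exp(-E) via expm1, which stays accurate for tiny E.
    """
    return -math.expm1(-e_value)


for e in (1e-10, 0.01, 1.0, 10.0):
    print(e, p_from_e(e))
```

For small values the two numbers are almost identical, which is why they are often used interchangeably for significant hits; they diverge as E approaches and exceeds 1, because a probability cannot exceed 1.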
PDB: protein data bank
PDB is an archive of experimentally-determined three-dimensional structures (Brookhaven Nat. Labs or EBI). Domain families represented in SMART and in the PDB are annotated as being of known structure; links are provided in SMART to the PDB via PDBsum and MMDB. PDBsum links can be used to access a variety of sequence-based and structure-based tools, whereas MMDB provides access to literature information and structure similarities.
PFAM
Pfam is a database of protein domain families represented as (i) multiple alignments and (ii) HMM-profiles ([1] [2]). Pfam WWW servers allow comparison of user-supplied sequences with the Pfam database (Sanger Center and Washington Univ.). SMART contains a facility to search the Pfam collection using HMMer.
Prokaryotic domains
SMART now also searches for domains found in two-component regulatory systems. These can be found mainly in prokaryotes, but a few were also found in eukaryotes like yeast and plants.
Profile
A profile is a table of position-specific scores and gap penalties, representing a homologous family, that may be used to search sequence databases (Ref. [1] [2] [3]). In CLUSTAL-W-derived profiles, those sequences that are more distantly related are assigned higher weights ([4] [5] [6]). Issues in profile-based database searching are discussed in Bork & Gibson (1996) [7].
ProfileScan
An excellent WWW server that allows a user to compare a protein or DNA sequence against a database of profiles (located at the ISREC).
PROSITE
This is a dictionary of protein sites and motif patterns. Some SMART domain annotations contain links to PROSITE.
Schnipsel database: domain sequence database
Schnipsel is a German word meaning snippet or fragment. The schnipsel database consists of the sequences of all domains found with SMART in NRDB. Outliers of a family often cannot be detected by a profile, yet are detectable by pairwise similarity to one or more established members of a sequence family. So searching against the schnipsel database gives complementary information to the profile searches.
Searched domains
In the first version of SMART, only eukaryotic signalling domains could be searched. In 1998 we extended this set with prokaryotic signalling and extracellular domains. On the input page you can choose whether you want to search only cytoplasmic (prokaryotic + eukaryotic), extracellular or all domains.
Secondary Literature
The secondary literature is derived by the following procedure: for each of the hand-selected papers referenced by a domain, 100 neighbouring papers are retrieved using Medline. If one of these neighbouring papers is referenced from more than two original papers, it is included in the secondary literature list.
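The selection rule in this procedure can be sketched as a simple count. The data structure (a mapping from each hand-selected paper to its list of neighbouring papers) and all identifiers are hypothetical; retrieving the neighbours from Medline is outside the sketch.

```python
from collections import Counter


def secondary_literature(neighbours_by_paper, min_sources=3):
    """Select neighbouring papers linked to more than two original papers.

    neighbours_by_paper maps each hand-selected paper ID to the list of
    its neighbouring paper IDs (up to 100 each in the procedure above).
    "More than two" originals corresponds to min_sources=3.
    """
    counts = Counter()
    for neighbours in neighbours_by_paper.values():
        counts.update(set(neighbours))  # count each original paper once
    return sorted(p for p, n in counts.items() if n >= min_sources)


# Toy example with made-up identifiers.
neighbours = {
    "A": ["x", "y"],
    "B": ["x", "z"],
    "C": ["x", "y"],
    "D": ["y"],
}
print(secondary_literature(neighbours))  # ['x', 'y']
```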
Seed Alignment
Alignment that contains only one of each pair of homologues that are represented in a CLUSTALW-derived phylogenetic tree linked by a branch of length less than a distance of 0.2 (see the related article).
SEG
A program of Wootton & Federhen [1] that detects regions of the query sequence that have low compositional complexity [2].
Sequence ID or ACC
Sequence identifiers or accession codes may be entered via the SMART homepage to initiate a query. You can use either Uniprot or Ensembl sequence identifiers.
Signalling Domains
The original set of domains used in SMART were collected as those that satisfied one or both of two criteria:
1. cytoplasmic domains that possess kinase, phosphatase, ubiquitin ligase or phospholipase enzymatic activities, or those that stimulate GTPase-activation or guanine nucleotide exchange
2. cytoplasmic domains that occur in at least two proteins with different domain organisations, of which one also contains a domain that satisfies criterion 1
These domains mostly mediate or regulate the transduction of an extracellular signal towards the nucleus, resulting in the initiation of a cellular response. More recently, prokaryotic two-component signalling domains have been added to the SMART set.
SignalP
This program predicts the presence and location of signal peptide cleavage sites in amino acid sequences (SignalP home page).
SMART: Simple Modular Architecture Research Tool
Our main goal in providing this tool is to allow automatic identification and annotation of domains in user-supplied protein sequences.
Species
Numbers of domains present in a variety of selected taxa (animal, archaea, bacteria, fungi, plants and protozoa) are shown in annotation pages.
SwissProt
The SwissProt database is an extensively annotated and non-redundant collection of protein sequences. SwissProt annotations have been mined for SMART-derived annotations of alignments.
TMHMM2
This program predicts the location and topology of transmembrane helices (TMHMM2)
Thresholds
For each of the domains found by SMART, a combination of thresholds is used to distinguish between true and false hits. The different thresholds are described in the SMART paper.
WiseTools
A package that is based on database searches using profiles. Profiles may be generated using PairWise and then compared with sequence databases using SearchWise. Scores are generated for alignments that match the whole of the profile (using a positive profile) or else that match at least part of the profile (using a negative profile). Only the top-scoring optimal alignment of each sequence is reported, hence SMART relies on re-iterating the search for new repeats until none are reported that score above threshold. Score thresholds have been set manually that are considered to represent a score just above the top-scoring true negative. Additional thresholds have been estimated for domains that are repeated in single polypeptides. A more recent package allows comparison of DNA sequences at the level of their conceptual translations, regardless of sequence error and introns (see Wise2).
What's new
Changes from version 6.0 to 7.0
Full text search engine
Perform a full text search of SMART and Pfam domain annotations, plus the complete protein descriptions for Uniprot and Ensembl proteins.
metaSMART
Explore and compare domain architectures in various publicly available metagenomics datasets.
iTOL export and visualization
Domain architecture analysis results can be exported and visualized in interactive Tree Of Life. The new option can be found in the protein list function select list.
User interface cleanup
Various small changes to the UI, resulting in faster and easier navigation.
Changes from version 5.1 to 6.0
Metabolic pathways information
SMART domains and proteins available in the genomic mode now have basic metabolic pathways information. It is generated by mapping our genomic mode protein database to the KEGG orthologous groups.
Updated genomic mode protein database
The greatly expanded genomic mode database now includes sequences from 630 completely sequenced species.
Changes from version 5.0 to 5.1
SMART webservice
You can access SMART using our webservice. Check the WSDL file for details. The webservice is still under active development and only works for sequences and IDs which are in our database (complete Uniref100 and stable Ensembl genomes).
SMART DAS server
The SMART DAS server is available at URL http://smart.embl.de/smart/das. It provides all the protein features from our database (SMART domains, signal peptides, transmembrane regions and coiled coils) for all Uniref100 and Ensembl proteins.
If you need help with these services, or have questions/feedback, please contact us.
Changes from version 4.1 to 5.0
New protein database in Normal mode
SMART now uses Uniprot as the main source of protein sequences. All Ensembl proteomes (except pre-releases) are also included. To lower redundancy in the database, the following procedure is used:
o only one copy of 100% identical proteins is kept (different IDs are still available)
o each species' proteins are separated
o CD-HIT clustering with 96% identity cutoff is performed on each species separately
o longest member of each protein cluster is used as the representative
o only representative cluster members and single proteins (i.e. proteins which are not members of any clusters) are used in all domain architecture queries and for domain counts in the annotation pages
Even though the number of proteins in the database is almost doubled (current version has around 2.9 million proteins), the redundancy should be minimal.
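Two of the redundancy-reduction steps listed above (collapsing identical sequences while keeping their IDs, and choosing the longest member of each cluster as representative) can be sketched as follows. The clustering itself is not implemented: the cluster lists below are hypothetical stand-ins for per-species CD-HIT output, and all names and sequences are made up for illustration.

```python
from collections import defaultdict


def collapse_identical(proteins):
    """Keep one copy of each distinct sequence, remembering every ID.

    proteins: iterable of (protein_id, sequence) pairs.
    Returns a dict mapping sequence -> list of IDs sharing it.
    """
    ids_by_sequence = defaultdict(list)
    for protein_id, sequence in proteins:
        ids_by_sequence[sequence].append(protein_id)
    return dict(ids_by_sequence)


def representatives(clusters):
    """Pick the longest sequence of each cluster as its representative."""
    return [max(cluster, key=len) for cluster in clusters]


proteins = [
    ("P1", "MKTAYIAK"),
    ("P2", "MKTAYIAK"),   # identical to P1, so only the ID is kept
    ("P3", "MKTAYI"),
    ("P4", "GSHMLE"),
]
unique = collapse_identical(proteins)

# Hypothetical clusters for one species; single proteins form their own cluster.
clusters = [["MKTAYIAK", "MKTAYI"], ["GSHMLE"]]
reps = representatives(clusters)
print(unique)
print(reps)
```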
Domain architecture invention dating
As a further step from the single domain to the understanding of multi-domain proteins, SMART now predicts the taxonomic class where the concept of a protein, that is its domain architecture, was invented. The domain architecture is defined as the linear order of all SMART domains in the protein sequence. To derive the point of its invention, all proteins with the same domain architecture are mapped onto NCBI's taxonomy. The last common ancestor of all organisms containing at least one protein with the domain architecture is defined as the point of its origin.
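The last-common-ancestor step can be sketched as a shared-prefix computation over taxonomic lineages. The lineages below are heavily simplified, hand-written examples rather than real NCBI taxonomy records, and the function name is an assumption.

```python
def last_common_ancestor(lineages):
    """Return the deepest taxon shared by all lineages.

    lineages: list of root-to-leaf lists of taxon names, one per organism
    that contains a protein with the architecture of interest.
    Returns None if the list is empty or nothing is shared.
    """
    ancestor = None
    for level in zip(*lineages):       # walk down all lineages in parallel
        if len(set(level)) != 1:       # the lineages diverge here
            break
        ancestor = level[0]
    return ancestor


# Simplified, illustrative lineages.
human = ["cellular organisms", "Eukaryota", "Opisthokonta", "Metazoa", "Chordata"]
fly = ["cellular organisms", "Eukaryota", "Opisthokonta", "Metazoa", "Arthropoda"]
yeast = ["cellular organisms", "Eukaryota", "Opisthokonta", "Fungi"]

print(last_common_ancestor([human, fly]))         # Metazoa
print(last_common_ancestor([human, fly, yeast]))  # Opisthokonta
```

Finding the architecture in one additional, more distant organism moves the predicted point of origin towards the root.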
Changes from version 4.0 to 4.1
Two modes of operation: Normal and Genomic
For more details, visit the change mode page.
Intrinsic protein disorder prediction
You can now include DisEMBL prediction of protein disorder in your search parameters. The results are displayed as blue regions in protein schematics. DisEMBL's HOTLOOPS and REM465 methods are currently used. Visit the DisEMBL page for more information on the method.
Catalytic activity check
SMART now includes data on specific requirements for catalytic activity for some of our domains (50 catalytic domains at the moment). If the required amino acids are not present in the predicted domain, it will be marked as 'Inactive'. The domain annotation page will show you the details on which amino acids are missing and the links to relevant literature (Pubmed link).
Taxonomic trees
Architecture query results are now displayed as simple taxonomic trees. In addition to individual proteins, you can select any taxonomic node (or multiple nodes) and display all the proteins in those nodes. The Evolution section of domain annotation pages is now also represented as a taxonomic tree. The basic tree contains only several hand-picked representative species, with a link to the full tree.
User interface redesigned
SMART has been completely rewritten and all pages conform to the XHTML 1.0 Strict and CSS Level 2 standards. We recommend a modern standards-compliant browser for the best experience. Mozilla Firefox and Opera are our favorites.
Changes from version 3.5 to 4.0
Intron positions are shown in protein schematics
For proteins that match any of the Ensembl predictions, SMART will show intron positions as vertical coloured lines in graphical representations (see example). This information is retrieved from a pre-calculated mapping of Ensembl gene structures to protein sequences.
The vertical line at the end of the protein is not an actual intron, but a mark to show that intron mapping was performed. If that is the only line, there are no introns annotated. If there is no line at all, there is no data available in Ensembl for that particular sequence.
You can switch off intron display on your SMART preferences page.
Alternative splicing information
Since SMART now incorporates Ensembl genomes, the Additional information page shows a list of alternative splices of the gene encoding the analyzed protein (if there are any). It is possible to either display SMART protein annotation for any of the alternative splices, or get a graphical multiple sequence alignment of all of them.
Orthology information
There are 2 separate sets of orthologs for each Ensembl protein: 1:1 reciprocal best matches in other genomes, and orthologous groups with reciprocal best hits from all genomes analyzed (i.e. each of these proteins has exactly one ortholog in all 6 genomes). This data is displayed on the Additional information page.
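A 1:1 reciprocal best match between two genomes can be sketched as below. It assumes the best-scoring match of every protein in the other genome has already been computed (for example from similarity searches); the dictionaries and protein names are hypothetical.

```python
def reciprocal_best_hits(best_a_to_b, best_b_to_a):
    """Return (a, b) pairs that are each other's best match.

    best_a_to_b maps each protein of genome A to its best-scoring match
    in genome B; best_b_to_a is the same in the other direction.
    """
    return sorted(
        (a, b)
        for a, b in best_a_to_b.items()
        if best_b_to_a.get(b) == a
    )


best_a_to_b = {"a1": "b1", "a2": "b2", "a3": "b2"}
best_b_to_a = {"b1": "a1", "b2": "a3"}

# a2 is dropped: its best match b2 prefers a3.
print(reciprocal_best_hits(best_a_to_b, best_b_to_a))
```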
Graphical multiple sequence alignments
Orthologous groups and different alternative splices can be displayed as graphical multiple sequence alignments. Proteins are aligned using ClustalW. Domains, intrinsic features and introns are mapped onto the alignment, with their positions adjusted according to gaps (black boxes).
Changes from version 3.4 to 3.5
Features of all Ensembl genomes are stored in SMART
You can use standard Ensembl protein identifiers (for example ENSP00000264122) and sequences in all queries.
Batch access
Quickly display results for hundreds of sequences. Currently limited to 500 sequences per access, 2000 per day.
Smarter handling of identical sequences
If there are multiple IDs associated with a particular sequence, an extra table will be displayed showing all of them.
Changes from version 3.3 to 3.4
Search structure based profiles using RPS-Blast
Clicking on the 'search schnipsel and structures' checkbox will now also initiate a search of profiles based on scop domain families using RPS-Blast. These profiles were kindly provided by Steffen Schmidt (see Schmidt et al., J Chem Inf Comput Sci 2002 (42) 405-7).
Improved Architecture analysis
In addition to standard Domain selection querying, it is now possible to do queries based on GO (Gene Ontology) terms associated with domains. In the first step, you get a list of domains matching the GO terms entered. After selecting the domains of interest from the list, proteins containing those domains are displayed. Use the Taxonomic selection box to limit the results based on taxonomic ranges.
Pfam domains are stored in the database
The SMART database now contains precomputed results for all Pfam domains. To use Pfam domains in the architecture queries, prepend the domain name with 'Pfam:' (for example: TyrKc AND Pfam:Fz AND TRANS).
Try our SMART Toolbar for the Mozilla web browser.
Changes from version 3.2 to 3.3
Fantastic new protein picture generator
Proteins are now displayed as dynamically generated PNG images. This means that you can download the entire protein representation as a single image. You have our permission to do this and use these diagrams in any way that you like; do acknowledge us though. All domain bubbles have been script-generated using The Gimp and its Perl-Fu extension. The script is GPL'd and is available for download.
Fancy colour alignments
All alignments we generate are coloured in using Leo Goodstadt's excellent CHROMA program. CHROMA is available for download and it's great. You may experience some problems if you're using a klunky browser. This will be fixed when you change your browser.
Excellent transmembrane domains
These are now calculated using the fine TMHMM2 program, kindly provided by Anders Krogh and co-workers.
Selective SMART
We now store intrinsic features such as transmembrane domains and signal peptides in our database. This means that these can be queried for (e.g. SIGNAL AND TyrKc). The feature names are: signal peptide, SIGNAL; transmembrane domain, TRANS; coiled coil, COIL.
Technical changes
SMART code has been modified to run under the Apache mod_perl module. SMART is now using Apache::DBI for persistent database connections. The database engine has been updated to PostgreSQL 7.1. These changes resulted in significant speed improvements.
Bug fixes
As usual. The show_many_proteins script now uses the POST method, so there is no longer a limit on the number of proteins you can display. Fixed a display problem with proteins having thousands of different representations (if you try to display those, make sure you have a good browser, like Mozilla or Opera, and a bunch of RAM).
Changes from version 3.1 to 3.2
Start page
There is an indicator of the current database status in the header now. If the database is down or is being updated, you'll know immediately.
Literature
Changed literature identifiers to PMID. New secondary literature generator that parses all neighbouring papers, not just the first 100.
Numerous small bug fixes and improvements.
Changes from version 3.0 to 3.1
Startup page
The start page now includes selective SMART and allows searching for keywords in the annotation of domains.
Schnipsel Blast
The results of a schnipsel Blast search are included in the bubble diagram. Additionally, PDB is searched.
Taxonomic breakdown
When selecting multiple proteins, e.g. via selective SMART or the annotation pages, an overview of the taxonomy of all species (tax break) is offered.
Links
Links don't contain version numbers. This allows stable links from external sources.
Selective SMART
Allows searching for multiple copies of domains and is case insensitive. You can now search for e.g. Sh3 AND sH3 AND sh3.
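The behaviour described for selective SMART queries (AND-combined terms, repeated terms meaning multiple copies, case-insensitive matching) can be sketched as below. This is an illustration of the semantics only, not SMART's query engine: the function name is hypothetical and only the AND operator is handled.

```python
from collections import Counter


def matches_and_query(query, features):
    """Check one protein against an AND-only selective query.

    query    : string such as 'Sh3 AND sH3 AND TyrKc'; a term repeated
               n times requires at least n copies of that feature.
    features : list of domain or intrinsic feature names of one protein.
    Matching is case insensitive.
    """
    required = Counter(term.strip().lower() for term in query.split(" AND "))
    present = Counter(name.lower() for name in features)
    return all(present[name] >= copies for name, copies in required.items())


three_sh3 = ["SH3", "SH3", "SH3", "TyrKc"]
two_sh3 = ["SH3", "SH3", "TyrKc"]

print(matches_and_query("Sh3 AND sH3 AND sh3", three_sh3))   # True
print(matches_and_query("Sh3 AND sH3 AND sh3", two_sh3))     # False
print(matches_and_query("SIGNAL AND TyrKc", ["SIGNAL", "TyrKc", "TRANS"]))  # True
```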
Annotation
You can align your query sequence to the SMART alignment using hmmalign.
Update of underlying database
SMART now uses PostgreSQL 6.5.2.
Changes from version 2.0 to 3.0
Digest output
SMART now only produces a single diagram representing a best interpretation of all the annotation that has been performed. A comprehensive summary of the results is also provided in table format.
Selective SMART
Selective SMART allows you to look for proteins with combinations of specific domains in different species or taxonomic ranges.
Alert SMART
The SMART database gets updated about once a week. If you are interested in specific domains, or combinations of them, in specific taxonomic ranges, you can use the SMART alerting service. This provides the identities of newly-deposited proteins that match your query.
Domain queries
You can ask for proteins having the same domain order or composition as your query protein.
SMARTed Genomes
You can get the result of SMARTing the genomes of Caenorhabditis elegans and Saccharomyces cerevisiae via the annotation pages.
Faster PFAM searches
The PFAM searches now run on a PVM cluster.
Changes from version 103 to 20
Domain coverage
The original set of signalling domains has now been extended to include extracellular domains
Default search method is now HMMer The previously-used underlying methodology of SMART (ie SWise) tended to over-extend gapsleading to problems in defining domain borders Although a SWise search of the SMART databaseremains possible (using the Wise searching SMART database option available from the HomePage) the default searching method is now HMMer2 This now allows improved statisticalestimates (Expectation- or E-values) of the significance of a domain hit
Complete rewrite of the annotation pages As with the previous version (103) annotation pages provide information concerning domainfunctions and include hyperlinks to PubMed However we now offer automatically-derived datashowing the taxonomic range and the predicted cellular localisation of proteins containing thedomain in question
Literature database
In addition to the manual derived literature sources given in version 103 we now provideautomatically-derived secondary literature from the annotation pages These are extracted fromPubMed and are cross-linked to additional abstracts
The CATH domain structure database: new protocols and classification levels give a more comprehensive resource for exploring evolution
Lesley H. Greene, Tony E. Lewis, Sarah Addou, Alison Cuff, Tim Dallman, Mark Dibley, Oliver Redfern, Frances Pearl, Rekha Nambudiry, Adam Reid, Ian Sillitoe, Corin Yeats, Janet M. Thornton and Christine A. Orengo
WHAT'S NEW
We report the latest release (version 3.0) of the CATH protein domain database (http://www.cathdb.info). There has been a 20% increase in the number of structural domains classified in CATH, up to 86 151 domains. Release 3.0 comprises 1110 fold groups and 2147 homologous superfamilies. To cope with the increases in diverse structural homologues being determined by the structural genomics initiatives, more sensitive methods have been developed for identifying boundaries in multi-domain proteins and for recognising homologues. The CATH classification update is now being driven by an integrated pipeline that links these automated procedures with validation steps that have been made easier by the provision of information-rich web pages summarising comparison scores and relevant links to external sites for each domain being classified. An analysis of the population of domains in the CATH hierarchy and several domain characteristics are presented for version 3.0. We also report an update of the CATH Dictionary of homologous structures (CATH-DHS), which now contains multiple structural alignments, consensus information and functional annotations for 1459 well populated superfamilies in CATH. CATH is directly linked to the Gene3D database, which is a projection of CATH structural data onto ∼2 million sequences in completed genomes and UniProt.
GENERAL INTRODUCTION
The number of new structures being deposited in the Protein Data Bank (PDB) continues to grow at a considerable rate. In addition, structures being targeted by worldwide structural genomics initiatives are more likely to be novel or only very remotely related to domains previously classified. Only 2% of structures currently solved by conventional crystallography or NMR are likely to adopt novel folds (see Figures 1 and 2). A higher proportion of new folds is expected among structural genomics structures. Although the influx of more diverse structures and subsequent analysis will inform our understanding of how domains evolve, it has resulted in increasing lags between the numbers of structures being deposited and classified. In response to this situation we have significantly improved our automated and manual protocols for domain boundary assignment and homologue recognition.
Figure 1. Annual decrease in the percentage of new structures classified in CATH which are observed to possess a novel fold. The raw data for years 1972–2005 was fit to a single exponential equation by nonlinear regression using Sigma Plot (SPSS Version
Figure 2. Annual proportion of protein structures deposited in the PDB which are classified in CATH, rejected or pending classification. The colour scheme reflects different categories of PDB chains. Black: not accepted by the CATH criteria. Red: unprocessed chains
Significant changes have been implemented in the CATH classification protocol to achieve a more highly automated system. A seamless flow of structures between the constituent programs has been achieved by building a pipeline which integrates web services for each major comparison stage in the classification (see Figure 3). In addition, completely automatic decisions are now being made for new protein chains with close relatives already assigned in the CATH database. There are two situations that preclude the CATH update process from being fully automated. We rely on expert manual curation for particularly challenging protein domain boundary assignments (DomChop stage) and also for classifications of remote folds and homologues (HomCheck stage). These two manual stages will remain an integral part of the system (Figure 3).
Figure 3. Flow diagram of the CATH classification pipeline. This schematic illustrates the processes involved in classifying newly determined structures in CATH. The CATH update protocol workflow from new chain to assigned domain is split into two main processes
In this paper we report our ongoing development of the automated procedures. These critical new features should better enable CATH to keep pace with the PDB (3) and facilitate its development. Key statistics on the domain structure populations and characteristics are also presented.
A REVISED CATH CLASSIFICATION HIERARCHY: CATHSOLID
In order to provide information on the sequence diversity between superfamily members, we have introduced additional levels into the CATH hierarchy. The CATH hierarchical classification scheme now consists of nine levels. Class is derived from secondary structure content, and Architecture describes the gross orientation of secondary structures, independent of connectivity. The Topology level clusters structures into fold groups according to their topological connections and numbers of secondary structures. The Homologous superfamilies cluster proteins with highly similar structures, sequences and/or functions (4,5). The new extension of the CATH classification system now includes five 'SOLID' sequence levels. S, O, L and I further divide domains within the H-level using multi-linkage clustering based on similarities in sequence identity (35, 60, 95 and 100%) (see Table 1). The D-level acts as a counter within the I-level and is appended to the classification hierarchy to ensure that every domain in CATH has a unique CATHsolid identification code (see Table 1). Specific details on the nature of the SOLID-levels can be found in the 'General Information' section of the CATH website, http://www.cathdb.info. CATH only includes experimentally determined protein structures with a 4 Å resolution or better, 40 residues in length or longer and having 70% or more side chains resolved.
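As an illustration of the nine-level code and the inclusion criteria just described, here is a minimal sketch. It assumes a CATHsolid code is written as nine dot-separated integers in C.A.T.H.S.O.L.I.D order; that textual layout, the function names and the example code are assumptions made for illustration.

```python
LEVELS = ("C", "A", "T", "H", "S", "O", "L", "I", "D")


def parse_cathsolid(code):
    """Split a dot-separated code into its nine named levels.

    The dot-separated layout is an assumption for this sketch.
    """
    parts = code.split(".")
    if len(parts) != len(LEVELS):
        raise ValueError("expected %d levels, got %d" % (len(LEVELS), len(parts)))
    return dict(zip(LEVELS, (int(p) for p in parts)))


def meets_inclusion_criteria(resolution_angstrom, n_residues, fraction_side_chains):
    """Apply the three thresholds quoted in the text: 4 A resolution or
    better, at least 40 residues, at least 70% of side chains resolved."""
    return (
        resolution_angstrom <= 4.0
        and n_residues >= 40
        and fraction_side_chains >= 0.70
    )


# Made-up code: the last five numbers are placeholders, not real CATH data.
levels = parse_cathsolid("3.40.50.200.1.1.1.1.1")
```

Because the D-level is only a counter within the I-level, two domains that share the first eight levels differ only in the final number.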
Table 1. CATH version 3.0 statistics
DOMAIN BOUNDARY ASSIGNMENTS
We have further improved our automated domain boundary prediction method, CATHEDRAL (4). This is used to search a newly determined multi-domain structure against a library of representative structures from different fold groups in the CATH database to recognise constituent domains. CATHEDRAL performs an initial rapid secondary structure comparison between structures, using graph theory to identify putative fold matches, which are then more carefully aligned using a slower, more accurate dynamic programming method. A new scoring scheme has been implemented which combines information on the class of the domains being compared, their sizes, similarity in the structural environments and the number of equivalent residues. A support vector machine is used to combine the different scores and select the best fold match for each putative domain in a new multi-domain protein.
Benchmarking against a set of 964 'difficult' multi-domain chains, whose 1593 constituent domains were remotely related to folds in CATH (<35% sequence identity) and originated from 245 distinct fold groups and 462 superfamilies, showed that 90% of domains within these chains could be assigned to the correct fold group, and for 78% of them the domain boundaries were within ±15 residues of boundaries assigned by careful manual validation. Larger variations in domain boundaries are often due to the fact that in many families significant structural variation can occur during evolution, so that distant relatives vary considerably in size. If no close relative has been classified in CATH, it is likely that the only CATHEDRAL match will be to a relative with significant structural embellishments, thus making it harder to determine the correct boundaries.
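The ±15 residue tolerance used in the benchmark can be sketched as a simple comparison. The text does not say exactly how agreement was scored, so this sketch assumes that boundaries are given as residue positions and that every predicted boundary must lie within the tolerance of its manually assigned counterpart; the function name is hypothetical.

```python
def boundaries_agree(predicted, reference, tolerance=15):
    """Return True if each predicted boundary lies within `tolerance`
    residues of the matching manually assigned boundary.

    Assumes both lists describe the same number of boundaries; a
    different count is treated as disagreement.
    """
    if len(predicted) != len(reference):
        return False
    pairs = zip(sorted(predicted), sorted(reference))
    return all(abs(p - r) <= tolerance for p, r in pairs)


# Hypothetical three-domain chain with boundaries after residues 110 and 262.
close_enough = boundaries_agree([120, 250], [110, 262])
too_far = boundaries_agree([120, 250], [110, 290])
```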
Since domain boundary assignment of remote homologues is one of the most time consuming stages in the classification, we combine multiple sources of information for each new structure on a web page to guide manual curation. Pages display scores from a range of algorithms, which include structure based methods, CATHEDRAL (4), SSAP (6,7), DETECTIVE (8), PUU (9), DOMAK (10); sequence based methods such as hidden Markov Models (HMMs) (11); and relevant literature. These pages are now viewable for information on putative boundaries for new multi-domain structures currently
being classified in CATH (e.g. http://www.cathdb.info/cgi-bin/cath/Chain.pl?chain_id=2g3aA).
For protein chains which are closely related to chains that are already chopped in CATH, an automated protocol has been developed (ChopClose). ChopClose identifies any previously chopped chains that have sufficiently high sequence identity and overlap with the query chain. Using SSAP (6), the query is aligned against each of these chains in turn, and in each case the domain boundaries are inherited across the alignment. The process of inheriting the boundaries often requires some adjustments to be made to account for insertions, deletions or unresolved residues. If the inheritance from one of the chains meets various criteria (SSAP score ≥80, sequence identity >80%, RMSD ≤6.0 Å, longest end extension ≤10 residues, etc.), then the resulting boundaries are used to chop the chain automatically (AutoChop). For cases where ChopClose's best result does not meet all the criteria for automatic chopping, it is provided as support information for a manual domain boundary assignment.
Refer to Figure 3 for the location of AutoChop/ChopClose within the CATH update protocol.
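The decision between automatic and manual chopping can be sketched as a threshold check. Only the four criteria named in the text are encoded; the text's "etc." indicates that the real protocol applies further criteria that are not listed, so this is an incomplete illustration. The class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class InheritedChopping:
    """Scores for boundaries inherited from one previously chopped chain."""
    ssap_score: float
    sequence_identity: float      # percent
    rmsd: float                   # Angstrom
    longest_end_extension: int    # residues


def passes_autochop(result):
    """Check the four thresholds quoted in the text (others are omitted)."""
    return (
        result.ssap_score >= 80
        and result.sequence_identity > 80
        and result.rmsd <= 6.0
        and result.longest_end_extension <= 10
    )


def choose_route(results):
    """Return 'AutoChop' if any inheritance passes, otherwise 'manual'."""
    return "AutoChop" if any(passes_autochop(r) for r in results) else "manual"


# Hypothetical scores for two candidate parent chains.
route = choose_route([
    InheritedChopping(78.0, 92.0, 2.1, 4),   # fails on SSAP score
    InheritedChopping(91.5, 95.0, 1.3, 2),   # passes all four checks
])
```

Note that the sequence identity test is a strict inequality while the others are inclusive, mirroring the symbols given in the text.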
NEW HOMOLOGUE RECOGNITION METHODS
We have assessed a number of HMM based protocols for improving homologue recognition. A new protocol (Samosa), exploiting models built using multiple structure alignments to improve accuracy, gives some improvements in sensitivity (4–5%). However, a protocol exploiting an 8-fold expanded HMM library based on sequence relatives of structural domains gives an increase of nearly 10% in sensitivity (12). In addition, HMM–HMM-based approaches have been implemented using the PRC protocol of Madera and co-workers (http://supfam.org/PRC). These allow recognition of extremely remote homologues, some of which are not easily detected by the structure comparison methods (discussed further below). HMM based database scans developed for the CATH classification protocol are collectively referred to as HMMscan below.
For some very remotely related homologues, confidence in an assignment can be improved by combining information from multiple prediction methods. We have investigated the benefits of using machine learning methods to do this automatically. A neural network was trained using a dataset of 14 000 diverse homologues (<35% sequence identity) and 14 000 non-homologous pairs, with data from different homologue comparison methods, including structure comparison (CATHEDRAL, SSAP), sequence comparison (HMM–HMM) and information on functional similarity. The latter was obtained by comparing EC classification codes between close relatives of the distant homologues and using a semantic similarity scoring
scheme for comparing GO terms based on a method developed by Lord et al. (13). On a separate validation set of 14 000 homologous pairs and 14 000 non-homologous pairs, 97% of the homologues can be recognised at an error rate of <4%.
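A figure such as "97% recognised at an error rate of <4%" can be read off a validation set as sketched below. The text does not define the error rate, so this sketch assumes it is the fraction of non-homologous pairs scoring at or above the acceptance threshold; the function name and the toy scores are hypothetical.

```python
def sensitivity_at_error_rate(homolog_scores, non_homolog_scores, max_error_rate):
    """Best fraction of homologous pairs accepted by any score threshold
    whose error rate stays strictly below `max_error_rate`.

    Error rate is taken here as the fraction of non-homologous pairs
    accepted, which is an assumption of this sketch.
    """
    best = 0.0
    for threshold in sorted(set(homolog_scores) | set(non_homolog_scores)):
        false_hits = sum(s >= threshold for s in non_homolog_scores)
        if false_hits / len(non_homolog_scores) < max_error_rate:
            true_hits = sum(s >= threshold for s in homolog_scores)
            best = max(best, true_hits / len(homolog_scores))
    return best


# Toy network outputs, not real data.
toy = sensitivity_at_error_rate([0.9, 0.8, 0.7, 0.4], [0.6, 0.3, 0.2, 0.1], 0.25)
```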
NEW UPDATE PROTOCOL
Automatic methods
Previously, CATH data was generated using a group of independent programs and flat files. Over the past two years we have developed an update protocol for CATH that is driven by a suite of programs with a central library and a PostgreSQL database system. A classification pipeline has been established which links, in a completely automated fashion, the different programs that analyse the sequences and structures of both protein chains and domains. The CATH update protocol can essentially be divided into two parts: domain boundary assignment and domain homology classification (see Figure 3). The aim of the protocol is to minimise manual assignment and provide as much support as possible when manual validation is necessary.
Processing of both parts of the classification protocol is similar, requiring related meta-data and the triggering of the same automated algorithms. Methods include pairwise sequence similarity comparisons and scans by other homologue detection or fold recognition algorithms, such as HMMscan and CATHEDRAL, that provide data for either manual or automated assignment. Many of the automated steps in the protocol have been established as a web service, and the pipeline integrates automated steps together with 'holding stages' in which domains are held prior to processing and await the completion of manual validation of predictions (see below).
Web pages to support manual validation
For each manual stage in the classification (domain boundary assignment, DomChop, and homologue recognition, HomCheck) we have developed a suite of web pages bringing together all available meta-data from prediction algorithms (e.g. DBS, CATHEDRAL, HMMscan for DomChop; CATHEDRAL, HMMscan for HomCheck) and information from the literature and from other family classifications with relevant data (e.g. Pfam). For each protein or domain shown on the pages, information on the statistical significance of matches is presented. The web pages will shortly be made viewable and will provide interim data on protein chains and domains not fully classified in CATH, for biologists interested in any entries pending classification.
OVERVIEW OF THE CURRENT RELEASE (VERSION 3.0)
Assigning domain boundaries and relationships between protein structures is computationally challenging. Since the last CATH release (version 2.6), the number of domains in the CATH database has increased by 20% in version 3.0 and now totals 86 151. This is a more than 10-fold increase in the number of domains classified in CATH since its creation. Improvements in automation and also in the web based resources used to aid manual validation have allowed us to increase the proportion of hard-to-classify structures processed in CATH, and this is reflected in a significant increase in the number of new folds in the database, now more than 1000. The detailed breakdown of numbers of domains in the nine CATH levels is given in Table 1. We conducted an analysis of the domains in version 3.0 and have derived statistics for several fundamental features.
Percentage of new topologies
An analysis of the percentage of new folds arising since the early 1970s to the present is shown in Figure 1. The number of new folds has been decreasing over time with respect to the number of new structures being deposited, and it can be seen that currently approximately 2% of new structures classified in CATH are observed to be novel folds. For comparison, the number of domain structures solved over time is also graphically represented in Figure 1.
Number of domains within a protein chain
Integral to the construction of the CATH database is designating domain boundaries. We conducted an analysis of the number of chains versus number of domains in a chain. It is interesting to note that 64% of all protein structures currently solved and classified in CATH are single domain chains (data not shown). The next most prevalent are two domain chains (27%), and following this we find that the number of chains containing three or more domains rapidly decreases. The average size of the single domain chains is 159 residues in length.
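The chains-versus-domains analysis amounts to a histogram over the number of domains assigned to each chain. The following minimal sketch assumes the classification is available as a mapping from chain identifier to a list of domain identifiers; the identifiers shown are invented and the function name is hypothetical.

```python
from collections import Counter


def domain_count_distribution(chain_to_domains):
    """Return {number of domains: percentage of chains} for a mapping of
    chain identifier to its list of domain identifiers."""
    counts = Counter(len(domains) for domains in chain_to_domains.values())
    total = sum(counts.values())
    return {n: 100.0 * c / total for n, c in sorted(counts.items())}


# Invented toy data: three single-domain chains and one two-domain chain.
toy_chains = {
    "1abcA": ["1abcA01"],
    "2defA": ["2defA01"],
    "3ghiB": ["3ghiB01"],
    "4jklA": ["4jklA01", "4jklA02"],
}
distribution = domain_count_distribution(toy_chains)
```

Applied to the full database, the entries for one and two domains would correspond to the 64% and 27% figures quoted in the text.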
The CATH Dictionary of Homologous Superfamilies (CATH-DHS)
The CATH-DHS has also been recently updated. Data on structural similarity and superfamily variability is presented as a significant update to the Dictionary of Homologous Superfamilies (DHS) web resource (14). The DHS also provides functional annotations of domains within each H-level (superfamily) in CATH v2.5.1.
For each superfamily, pair-wise structural similarity scores between relatives, measured by SSAP, are presented. The DHS now contains 3307 multiple structural alignments for 1459 superfamilies. For each superfamily, multiple alignments are generated for all the relatives and also for subgroups of structurally similar relatives and sequence similar relatives. Alignments are performed using the residue-based CORA algorithm (15) and presented both as CORAPLOTS (14) and in the form of a 2DSEC diagram (16), alongside co-ordinate data of the superposed structures in PDB format. Sequence representations of the alignments are available to download in FASTA format. In the CORAPLOT images of the multiple alignment, residues in each domain are coloured according to ligand binding and residue type. EquivSEC plots are also shown that describe the variability in orientation and packing between equivalent secondary structures (16).
To identify sequence relatives for CATH superfamilies, sequences from UniProt (17) were scanned against HMMs of all CATH domains (12). Homologous sequences were identified as those hits with an E-value <0.01 and a 60% residue overlap with the CATH domain. This protocol recognised over one million domain sequences in UniProt which could be integrated in the CATH-DHS. The harvested sequences in each superfamily were compared against other relatives by BLAST (18) to determine the pair-wise sequence identity and then clustered at appropriate levels of sequence identity (35 and 95%) using multi-linkage clustering. Information and links to other functional databases, ENZYME (19), GO (Gene Ontology Consortium, 2000), KEGG (20), COG (21) and SWISSPROT (22), are also included by BLASTing the sequences from each superfamily against sequences provided by these resources. Only hits with 95% sequence identity and an 80% residue overlap were used to annotate sequences.
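The acceptance rule for HMM hits can be sketched as a two-part filter. The thresholds are passed as parameters with defaults read from the text; how the residue overlap was measured is not stated, so this sketch assumes it is the fraction of the CATH domain length covered by the aligned region, with 1-based inclusive positions. The function names are hypothetical.

```python
def overlap_fraction(align_start, align_end, domain_length):
    """Fraction of the domain covered by the aligned region
    (1-based, inclusive positions; an assumption of this sketch)."""
    return (align_end - align_start + 1) / domain_length


def is_homologous_hit(e_value, align_start, align_end, domain_length,
                      max_e_value=0.01, min_overlap=0.60):
    """Accept a hit only if it is significant and covers enough of the domain."""
    covered = overlap_fraction(align_start, align_end, domain_length)
    return e_value < max_e_value and covered >= min_overlap


# A significant hit covering 90 of 120 domain residues is accepted.
accepted = is_homologous_hit(1e-5, 11, 100, 120)
# The same coverage with a weak E-value is rejected.
rejected = is_homologous_hit(0.5, 11, 100, 120)
```

The stricter annotation step described in the same paragraph (95% identity, 80% overlap) could reuse the overlap function with different thresholds.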
Recent analysis of structural and functional divergence in highly populated CATH superfamilies (>5 structural relatives with <35% sequence identity) has been undertaken using data from the DHS. The 2DSEC algorithm was used to analyse multiple structural alignments of families and identify highly conserved structural cores and secondary structure embellishments or decorations to the common core. In some large superfamilies, extensive embellishments were observed outside the core, and although these secondary structure insertions were frequently discontinuous in the protein chain, they were often co-located in 3D space (16). In many cases, manual inspection revealed that the embellishment had aggregated to form a larger structural feature that was modifying the active site of the domain or creating new surfaces for domain or protein interactions. Data collected in the DHS clearly shows a relationship between structural divergence within a superfamily, sequence divergence of this superfamily amongst predicted domains in the genomes, and the number of distinct functional groups that can be identified for the superfamily (see Figure 4).
Figure 4. Relationship between sequence variability, structural variability and functional diversity in CATH superfamilies. Structural variation in a CATH superfamily as measured by the number of diverse structural subgroups (SSAP score <80 between groups)
LATERAL LINKS ACROSS THE CATH HIERARCHY TO CAPTURE EVOLUTIONARY DIVERGENCE AND EXPLORE THE STRUCTURAL CONTINUUM
Our analysis of structural divergence in CATH superfamilies (16) has revealed families where significant changes in the structures had occurred. In some cases 5-fold differences in the sizes of domains were identified, and sometimes it was apparent that the 'folds' of these very diverse relatives had effectively changed. Therefore, in these superfamilies more than one fold group can be identified, effectively breaking the hierarchical nature of the CATH classification, which implies that each relative within a CATH homologous superfamily should belong to the same CAT fold group.
In addition, an 'all versus all' HMM–HMM scan between all superfamily representatives revealed several cases of extremely remote homologues which had been classified into separate superfamilies and yet match with significant E-values. Structure comparison had failed to detect the relationship between these superfamilies because the structural divergence of the relatives was so extreme, sometimes constituting a change in architecture as well as fold group. In these cases homology was only suggested by the HMM-based scans and then manually validated by considering functional information and detailed evidence from literature. In order to capture information on these distant homologies, links have been created between the superfamilies, both on our web pages and in the CATH database. The data can now be found as a link from the CATH homepage (http://www.cathdb.info).
In the near future we also plan to provide web pages presenting cases of significant structural overlaps between superfamilies or fold groups. For these cases we are not currently able to find any additional evidence to support a distant evolutionary relationship, and these examples highlight the recurrence of large structural motifs between some folds and the existence of a structural continuum in some regions of fold space.
FEATURES
The CATH database can be accessed at http://www.cathdb.info. The web interface may be browsed or, alternatively, searched with PDB codes or CATH domain identifiers. There is also a facility for keyword searches. With the version 3.0 release we now make the raw and processed data files available, which include, for example, CATH domain PDB files, sequences and dssp files, and they can be accessed through the CATH database main page. The Gene3D resource can be accessed through the CATH database or directly at http://www.cathdb.info/Gene3D. The DHS can be accessed through the CATH database or directly at http://www.cathdb.info/bsm/dhs.
Structural similarity is assessed using an automatic method (SSAP) (3,4), which scores 100 for identical proteins and generally returns scores above 80 for homologous proteins. More distantly related folds generally give scores above 70 (Topology or fold level), though in the absence of any sequence or functional similarity this may simply represent examples of convergent evolution, reinforcing the hypothesis that there exists a limited number of folds in nature (5,6).
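The SSAP score ranges quoted above can be summarised as a small lookup. The text says scores "generally" fall in these ranges, so the cut-offs are rough guides rather than strict rules; the function name and the wording of the labels are illustrative assumptions.

```python
def interpret_ssap(score):
    """Map an SSAP score (0-100) to the rough interpretation given in the
    text. The cut-offs are guides; real classification also weighs
    sequence and functional evidence."""
    if score >= 100:
        return "identical"
    if score > 80:
        return "likely homologous"
    if score > 70:
        return "similar fold (Topology level)"
    return "no clear structural relationship"


labels = [interpret_ssap(s) for s in (100, 86.5, 74.0, 55.0)]
```

A score in the 70 to 80 band supports a shared fold only; as the text notes, without sequence or functional similarity it may reflect convergence rather than common ancestry.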
Abstract
We report the latest release (version 1.4) of the CATH protein domains database (http://www.biochem.ucl.ac.uk/bsm/cath). This is a hierarchical classification of 13 359 protein domain structures into evolutionary families and structural groupings. We currently identify 827 homologous families in which the proteins have both structural similarity and sequence and/or functional similarity. These can be further clustered into 593 fold groups and 32 distinct architectures. Using our structural classification and associated data on protein functions stored in the database (EC identifiers, SWISS-PROT keywords and information from the Enzyme database and literature), we have been able to analyse the correlation between the 3D structure and function. More than 96% of folds in the PDB are associated with a single homologous family. However, within the superfolds, three or more different functions are observed. Considering enzyme functions, more than 95% of clearly homologous families exhibit either single or closely related functions, as demonstrated by the EC identifiers of their relatives. Our analysis supports the view that determining structures, for example as part of a 'structural genomics' initiative, will make a major contribution to interpreting genome data.
Introduction FOR CATH
The CATH classification of protein domain structures was established in 1993 (1) as a hierarchical clustering of protein domain structures into evolutionary families and structural groupings, depending on sequence and structure similarity. There are four major levels, corresponding to protein class, architecture, topology or fold, and homologous family (Fig. 1). Since 1995, information about these structural groups and protein families has been accessible over the Web (http://www.biochem.ucl.ac.uk/bsm/cath), together with summary information about each individual protein structure (PDBsum) (2).
CATH consists of both phylogenetic and phenetic descriptors for protein domain relationships. At the lowest levels in the hierarchy, proteins are grouped into evolutionary families
(Homologous families) for having either significant sequence similarity (≥35% identity) or high structural similarity and some sequence similarity (≥20% identity).
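The two alternative routes into a homologous family can be written as one boolean rule. The text gives no numerical cut-off for "high structural similarity" at this point, so the sketch takes that judgement as an input flag rather than inventing a threshold; the function name is hypothetical.

```python
def same_homologous_family(sequence_identity, high_structural_similarity):
    """Apply the grouping rule from the text: identity of 35% or more on
    its own, or 20% or more together with high structural similarity.

    `high_structural_similarity` is supplied by the caller because the
    text does not quote a cut-off for it here.
    """
    if sequence_identity >= 35:
        return True
    return high_structural_similarity and sequence_identity >= 20


grouped = same_homologous_family(24.0, True)       # structure rescues low identity
not_grouped = same_homologous_family(24.0, False)  # identity alone is too low
```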
The CATH database of protein structures contains approximately 18 000 domains organized according to their (C)lass, (A)rchitecture, (T)opology and (H)omologous superfamily. Relationships between evolutionarily related structures (homologues) within the database have been used to test the sensitivity of various sequence search methods in order to identify relatives in Genbank and other sequence databases. Subsequent application of the most sensitive and efficient algorithms, gapped BLAST and the profile based method Position Specific Iterated Basic Local Alignment Tool (PSI-BLAST), could be used to assign structural data to between 22 and 36% of microbial genomes in order to improve functional annotation and enhance understanding of biological mechanism. However, on a cautionary note, an analysis of functional conservation within fold groups and homologous superfamilies in the CATH database revealed that whilst function was conserved in nearly 55% of enzyme families, function had diverged considerably in some highly populated families. In these families, functional properties should be inherited far more cautiously, and the probable effects of substitutions in key functional residues must be carefully assessed.
Figure 1
Schematic representation of the (C)lass, (A)rchitecture and (T)opology/fold levels in the CATH database.
Figure 2
Snapshot of a web page showing data available in the CATH dictionary of homologous superfamilies for the subtilisin family (CATH id 3.40.50.200). Tables display the PDB codes for non-identical relatives in the family, together with EC identifier codes and information about the enzyme reactions. The multiple structural alignment shown has been coloured according to secondary structure assignments (red for helix, blue for strands).
The Architecture level in CATH groups proteins whose folds have similar 3D arrangements of secondary structures (e.g. barrel, sandwich or propellor), regardless of their connectivity, whilst the top level, Class, simply reflects the proportion of α-helix or β-strand secondary structures. Three major classes are recognised, mainly-α, mainly-β and α–β, since analysis revealed considerable overlap between the α+β and alternating α/β classes originally described by Levitt and Chothia (7).
Before classification, multidomain proteins are first separated into their constituent folds using a consensus method which seeks agreement between three independent algorithms (8). Whilst the protocol for updating CATH is largely automatic (9), several stages require manual validation, in particular establishing domain boundaries in proteins for which no consensus could be reached and checking the relationships of very distant homologues and proteins having borderline fold similarity. Although there are plans to assign the more regular architectures automatically, all architecture groupings are currently assigned manually.
A homologous family Dictionary is now available within CATH, which contains functional data, where available, for each protein within a homologous family. This includes EC identifiers, SWISS-PROT keywords and information from the Enzyme database or the literature (Fig. 2). Multiple structure based alignments are also available, coloured according to secondary structure assignments or residue properties, and there are schematic plots showing domain representations annotated by protein-ligand interactions (DOMPLOTS) (A.E. Todd, C.A. Orengo and J.M. Thornton, submitted to Protein Engng). The topology of each domain is illustrated by schematic TOPS diagrams (http://www3.ebi.ac.uk/tops; 10).
Figure 3
CATH wheel plot showing the population of homologous families in different fold groupsarchitectures and classes The wheel is coloured according to protein class (red mainly-α green
mainly-β yellow αβ blue few secondary structures) The size of the outer wheel represents the
number of homologous families in CATH whilst each band in the outer wheel corresponds to a single
fold family The size of each fold bandlsquo therefore reflects the number of homologous families having
that fold It can be seen that most fold families contain a single homologous family The superfoldfamilies are shown as paler bands containing many homologous families The inner wheel shows the
population of homologous families in the different architectures
We have also recently set up a Web Server (11) which enables the user to scan the CATH database with a newly determined protein structure and identify possible fold similarities or evolutionary relationships. There are also plans to incorporate sequence searches (using BLAST or PSI-BLAST)
(12) to identify a probable fold for a new sequence.
The latest release of CATH (version 1.4, April 1998) contains 9342 protein chains from the PDB (13),
which divide into 13 359 domains. Currently 32 different architectures are recognised. Since the
last release, three new architectures have been described, including the five-bladed α-β propellor.
Grouping proteins on the basis of sequence, structure and functional similarity gives 827 evolutionary
homologous families (H-level), whilst recognising more distant structural similarity with no
accompanying sequence or function similarity gives rise to 593 different fold groups (T-level).
The population of the different levels in the CATH hierarchy is illustrated by the CATH wheel shown
in Figure 3. It can be seen that several highly populated fold families, which we describe as superfolds
(6) as they support a diverse range of sequences and more than three different functions, still account
for nearly 30% of non-homologous structures.
Implications for Structural Genomics
As the sequence databases grow rapidly, the need to interpret these sequences and assign functions to
specific genes becomes increasingly important. Many techniques exist for matching protein sequences and thereby inheriting functional information. However, for very distant homologues there is often no
detectable sequence similarity, despite conservation of 3D structure and function. For these cases,
evolutionary relationships, and thereby functions, can only be assigned by comparing the structures.
Therefore, a number of structural genomics initiatives are being proposed (14) which aim to identify
all the folds in nature, with the ultimate goal of being able to predict the function of a new protein from
its known or probable structure. The important questions to ask are: how many more folds do we need
to determine before we have the complete set, and how confident can we be in assigning function between proteins having similar structures?
In the current genomes, on average only 30–46% of sequences can be assigned to a structural family
by recognising sequence similarity to a protein of known structure (15,16). With only ∼600 unique
structures currently in the PDB compared with ∼20 000 sequence families, it is clear that we still
need to determine many more structures if we are to understand biology at the molecular level.
However, analysis of recently deposited structural data is very revealing. Figure 4a illustrates the
distribution of 2159 new structural domains classified in the 10 months from June 1997 to March
1998. A large proportion of these (79%) were clearly homologous (≥30% identity) to proteins of known structure.
Of the remaining 443 structures (Fig. 4b), corresponding to 'new' sequences, we found only 8% were
novel folds, the remainder resembling a previously determined structure. Many of these, 199 (45%),
could be identified as clear homologues by having significant structure and sequence similarity (SSAP
≥80 and ≥20% sequence identity). A further 169 (38%) were probable homologues as, although the sequence identity was below 20%, they had functional similarity and/or gave significant scores using
sequence search methods designed to detect very distant homologues (PSI-BLAST) (12). There
remained a further 40 (9%) proteins which were analogous, i.e. they had the same fold as a previous
entry but neither the sequence nor the function gave definite evidence of a common ancestor.
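The breakdown of the 443 structures can be checked with a little arithmetic. The sketch below is only an illustration: the three counts are the ones quoted in the text, while the novel-fold count is not stated explicitly and is inferred here as the remainder, which is an assumption.

```python
# Counts quoted in the text for the 443 structures with 'new' sequences.
TOTAL_NEW = 443
counts = {
    "clear homologues": 199,
    "probable homologues": 169,
    "analogues": 40,
}
# Assumption: the novel folds are whatever is left after the three groups above.
counts["novel folds (inferred)"] = TOTAL_NEW - sum(counts.values())


def percentages(group_counts, total):
    """Return each group's share of `total`, rounded to a whole percent."""
    return {name: round(100 * n / total) for name, n in group_counts.items()}


shares = percentages(counts, TOTAL_NEW)
for name, pct in shares.items():
    print(f"{name}: {counts[name]} of {TOTAL_NEW} = {pct}%")
```

The rounded shares agree with the percentages given in the paragraph, and the first two groups together give the 83% of new sequences that could be assigned as homologues.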
At the homologous superfamily level in CATH, a more detailed analysis of enzyme functions showed
that the majority of homologous enzyme families in CATH (>90%) contained proteins for which the
first three EC identifiers were the same. Considering those families where homologues have
significant sequence identity (≥20%) after structural alignment, 95% were found to have a single EC
identifier, whilst for families where proteins have more than 30% sequence similarity we observed
that 98% had a single EC code.
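To make the "first three EC identifiers" comparison concrete, here is a minimal sketch of how one might test whether all members of a family share the same EC class down to a chosen level. The function names and the example families are hypothetical and are not taken from CATH.

```python
def ec_prefix(ec_number, levels=3):
    """Return the first `levels` fields of an EC number such as '3.1.1.4'."""
    return tuple(ec_number.split(".")[:levels])


def shares_ec_class(ec_numbers, levels=3):
    """True if every EC number in the family agrees on its first `levels` fields."""
    return len({ec_prefix(ec, levels) for ec in ec_numbers}) == 1


# Hypothetical families, for illustration only.
family_a = ["3.1.1.4", "3.1.1.7"]   # differ only in the fourth field
family_b = ["3.1.1.4", "2.7.1.1"]   # differ already in the first field

print(shares_ec_class(family_a))
print(shares_ec_class(family_b))
print(shares_ec_class(family_a, levels=4))
```

Setting `levels=4` corresponds to the stricter "single EC code" criterion mentioned in the text.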
Although assigning function on the basis of homology is common practice, it is clear that some
caution should be exercised, particularly where there is little or no sequence similarity. There are also
some clear examples where homologues with significant sequence similarity perform different
functions. The role of 'gene recruitment' is especially clear in the eye lens proteins, which function as
enzymes in other cellular environments but which are used as structural proteins in this context (17).
The extent of such 'gene recruitment' and context-sensitive function is really not known at this time.
For enzymes, it is clear that catalytic function can change and evolve, usually to act on a different but
related substrate. Similarly, within the lipocalin family (CATH id 2.40.130.10) several proteins are found with very similar structures which bind different fatty acids in the same region at the base of
the β-barrel (e.g. retinol, bilin, biotin).
Nearly half of the homologous families where two or more different EC numbers were observed
belong to the superfolds. This suggests that if a new protein is assigned to a superfold family, more
caution should be used when inheriting functional information, as there appears to be greater tolerance to changes in sequence and ultimately function for these families. However, it is interesting to note
that many of these were TIM barrel or Rossmann folds. These are superfolds in which the substrate or
ligand commonly binds in the same place. This is in the base of the β-barrel for the TIMs and at the
crossover of the polypeptide chain for the doubly wound Rossmann structures.
Assignment of Function Through Structure
One of the reasons for determining structures is to derive more information to facilitate the assignment of function. From our analysis of proteins in CATH we suggest that structural data can help to assign
function in several ways:
i. The structural data allow recognition of more distant homologues compared with sequence data; in our analysis 83% of structures with novel sequences could be assigned as homologues in this way (note that such assignment of function is again subject to the caveats
imposed by 'gene recruitment' discussed above).
ii. The structural data allow detailed inspection of the functional site, to suggest if and
how the function may have evolved. For example, if an enzyme has evolved to act on a
different substrate, the binding site may reveal or at least suggest possible changes in the
substrate.
iii. For the superfolds, similarity of structure does not necessarily mean similarity of
function. However, the active site/binding sites are often conserved, e.g. in the TIM barrel or
Rossmann fold structures the ligand always binds at the same end of the barrel or sheet.
iv. Some methods have already been developed, and will increasingly be the focus of attention over the next few years, which aim to predict function ab initio from structure. For
example, enzymes can often be identified by the presence of a major cleft which also locates
the active site (18). Similarly, critical surface patches which are used for molecular
recognition in binding other proteins or ligands may be identified using knowledge-based approaches (19,20).
In summary, extrapolating the data from Figure 4 to a new genome, we can expect that of the 54–70%
of sequences which currently have no obvious sequence matches in the PDB, we will find nearly 80–90% to be homologous to a known family using the structural data alone. For the singlet folds this
will almost certainly reveal some clues to the function. For the superfolds, some folds will reveal
information on the functional class (e.g. enzyme for TIM barrels) or the location of the active site, if
not the specific function. Only 10–20% will be expected to be 'novel' folds. For these, the ab
initio methods referred to above may provide some clues to guide experiments. Therefore, it is clear
that determining structures, as part of a 'structural genomics' initiative for example, will make a
major contribution to interpreting genome data.
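As a rough worked example of this extrapolation, the two percentage ranges can be combined to estimate what share of a whole genome falls into each group. This is only a sketch: pairing the lower bounds with each other and the upper bounds with each other is an assumption made here to get an approximate range, not something the text states.

```python
def share_of_genome(unmatched_pct, fraction_pct):
    """Percent of all sequences: `fraction_pct` percent of the `unmatched_pct` percent."""
    return unmatched_pct * fraction_pct / 100


# Sequences with no sequence match in the PDB: 54-70% of a genome (from the text).
# Of those, 80-90% are expected to be homologous to a known family by structure,
# and 10-20% are expected to be novel folds.
homologous_low = share_of_genome(54, 80)
homologous_high = share_of_genome(70, 90)
novel_low = share_of_genome(54, 10)
novel_high = share_of_genome(70, 20)

print(f"assignable by structure alone: about {homologous_low:.1f}-{homologous_high:.1f}% of the genome")
print(f"novel folds: about {novel_low:.1f}-{novel_high:.1f}% of the genome")
```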
JAIL
Just Another Interface Library
Interfaces of macromolecules are a valuable basis to analyse the process of molecular recognition. JAIL classifies not only the interfaces between domain architectures but also those between protein chains and those between proteins and nucleic acids.
Figure: Gt-alpha/Gi-alpha chimera (PDB-ID 1GOT) and the interfaces of 1GOT.
Interacting proteins are difficult to crystallize and are rarely present within the Protein Data Bank.
Nevertheless, it is essential to analyse the interacting parts of the proteins to understand the
process of protein-protein docking.
To overcome this problem we have built up the JAIL database. Since interacting domains
exhibit structural features similar to those of proteins, all known interfaces between interacting
domains of the SCOP database were extracted and classified in JAIL.
Only a part of all protein structures are included in SCOP. In particular, new PDB entries are
not yet annotated. To overcome this problem, all interfaces between protein
chains were additionally calculated and included in the database. This type of interface also comprises
the interacting parts of the assumed biological units. The last important type of interface
provided here is composed of the interacting parts between proteins and nucleic acids.
Overall, the data set consists of about 180 000 interfaces.
JAIL is a convenient tool to browse through the interface library and to analyze single
interfaces. However, more general questions require large-scale analysis. For this purpose,
a detailed form enables the compiling of comprehensive non-redundant data sets for
download.
How is an interface defined?
A complete residue is part of an interface if at least one atom of the amino acid is located
within a range of 4.5 Ångström of any atom of the interacting domain or chain. One part of
an interface must consist of at least 5 C-alpha atoms in the case of protein chains. In the case of nucleic acids, the relevant atoms are calculated on the basis of the phosphorus atoms
of the RNA/DNA backbone.
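The distance rule above can be sketched in a few lines of Python. This is a toy illustration, not JAIL's actual code: the data layout (a dict from residue id to a list of atom coordinates), the function names, and the use of "one residue = one C-alpha" for the minimum-size check are all assumptions, and the nucleic acid case is not covered.

```python
import math

CUTOFF = 4.5  # Ångström, the distance threshold quoted above
MIN_CA = 5    # minimum number of C-alpha atoms per protein side


def interface_residues(side_a, side_b, cutoff=CUTOFF):
    """Ids of residues in side_a with at least one atom within `cutoff` of any atom in side_b.

    Each side is a dict: residue id -> list of (x, y, z) atom coordinates.
    """
    atoms_b = [atom for atoms in side_b.values() for atom in atoms]
    return {
        res_id
        for res_id, atoms in side_a.items()
        if any(math.dist(a, b) <= cutoff for a in atoms for b in atoms_b)
    }


def is_interface(side_a, side_b, cutoff=CUTOFF, min_ca=MIN_CA):
    """Both parts must contribute at least `min_ca` residues (one C-alpha each, assumed)."""
    part_a = interface_residues(side_a, side_b, cutoff)
    part_b = interface_residues(side_b, side_a, cutoff)
    return len(part_a) >= min_ca and len(part_b) >= min_ca


# Hypothetical one-atom-per-residue chains: B lies 4.0 Å from A, C lies 10.0 Å away.
chain_a = {("A", i): [(3.8 * i, 0.0, 0.0)] for i in range(6)}
chain_b = {("B", i): [(3.8 * i, 4.0, 0.0)] for i in range(6)}
chain_c = {("C", i): [(3.8 * i, 10.0, 0.0)] for i in range(6)}

print(sorted(interface_residues(chain_a, chain_b)))
print(is_interface(chain_a, chain_b), is_interface(chain_a, chain_c))
```

The all-against-all distance loop is fine for a sketch; for whole PDB entries a spatial index (for example a k-d tree) would normally be used instead.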
What are biounits?
The primary coordinate file deposited in the PDB generally contains one asymmetric unit.
The asymmetric unit is the smallest portion of a crystal structure to which crystallographic
symmetry can be applied to generate one unit cell. The biological molecule (biounit) is believed
to be the functional unit of the protein. Frequently those units can be assumed or calculated when additional information is available. The biological units of many proteins are deposited
in a separate section of the PDB database and can be used for interface calculations.
In which way are redundant interfaces excluded in the download section?
Redundancy is excluded in two different ways: by structure and by sequence. The
sequence clustering is based on the CD-HIT program. The structural clustering is defined by
the protein families and superfamilies of the SCOP classification, a database that classifies
proteins by domain architecture.
Which settings in the download section are best for my own research?
The selection of the datasets depends on the type of interactions (protein-protein or protein-nucleic
acid) and the level of diversity that is desired. A maximum sequence identity of 50%
results in a higher diversity than a setting of 95%. The default settings include interfaces of
domain-domain interactions as well as interfaces between interacting chains. All interfaces of
chains that are already covered by the SCOP domain interfaces are excluded by default.
This procedure results in a high number of interfaces that are still diverse enough for
statistical analysis.
What is meant by "show conservation" in Jmol?
The conservation of protein sequences is defined by the mutation rates at each amino acid
position. For JAIL this information was retrieved from ConSurf. ConSurf is a derived
database merging structural and sequence information.
Database scheme
Search for:
PDB-Id (e.g. 1aay)
SCOP-Id (e.g. d1az0a_)
EC-Number (e.g. 3.1.1)
Accession number (e.g. P03697)
Protein name (e.g. capsid protein)
Search in the following interface types: Domain/Domain (SCOP), Chain/Chain, Protein/Nucleic, Biounit/Biounit, None
Full-text search and SCOP text search (by keyword)
Consider only interfaces having the following location: intra, inter, don't care
MMDB
Experimentally resolved structures of proteins, RNA and DNA, derived from the Protein Data Bank (PDB), with value-added features such as explicit chemical graphs, computationally
identified 3D domains (compact substructures) that are used to identify similar 3D structures, as well as links to literature, similar sequences, information about chemicals bound to the structures, and more. These connections make it possible, for example, to find 3D structures for homologs of a protein sequence of interest, then interactively view the sequence-structure relationships, active sites, bound chemicals, journal articles, and more.
Three-dimensional structures are now known within many protein families and it is
quite likely, in searching a sequence database, that one will encounter a homolog
with known structure. The goal of Entrez's 3D-structure database is to make this
information, and the functional annotation it can provide, easily accessible to molecular biologists. To this end, Entrez's search engine provides three powerful
features: (i) Sequence and structure neighbors; one may select all sequences similar
to one of interest, for example, and link to any known 3D structures. (ii) Links
between databases; one may search by term matching in MEDLINE, for example, and
link to 3D structures reported in these articles. (iii) Sequence and structure
visualization; identifying a homolog with known structure, one may view molecular-graphic
and alignment displays to infer approximate 3D structure. In this article we
focus on two features of Entrez's Molecular Modeling Database (MMDB) not
described previously: links from individual biopolymer chains within 3D structures to a systematic taxonomy of organisms represented in molecular databases, and
links from individual chains (and compact 3D domains within them) to structure
neighbors, other chains (and 3D domains) with similar 3D structure. MMDB may be
accessed at http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Structure
SUPERFAMILY is a database of structural and functional annotation for all
proteins and genomes.[1][2][3][4][5]
The SUPERFAMILY annotation is based on a collection of hidden Markov models, which
represent structural protein domains at the SCOP superfamily level.[6]
A superfamily groups
together domains which have an evolutionary relationship. The annotation is produced by
scanning protein sequences from completely sequenced genomes against the hidden Markov
models.
For each protein you can:
Submit sequences for SCOP classification
View domain organisation, sequence alignments and protein sequence details
For each genome you can:
Examine superfamily assignments, phylogenetic trees, domain organisation lists and
networks
Check for over- and under-represented superfamilies within a genome
For each superfamily you can:
Inspect SCOP classification, functional annotation, Gene Ontology annotation, InterPro
abstract and genome assignments
Explore taxonomic distribution of a superfamily across the tree of life
All annotation, models and the database dump are freely available for download to everyone.
Purpose
SUPERFAMILY classifies amino acid sequences into known structural domains, especially
into SCOP superfamilies. The superfamilies are groups of proteins which have structural
evidence to support a common evolutionary ancestor but may not have detectable
sequence homology.
Major Features
Sequence search: Submit your protein or DNA sequence for SCOP superfamily and family level classification.
Keyword search: Search for superfamily, family or species names, plus sequence, SCOP, PDB or hidden Markov model IDs.
Domain assignments: Domain assignments, alignments and architectures for completely sequenced eukaryotic and prokaryotic organisms, plus sequence collections.
Comparative genomics tools: Browse unusual (over- and under-represented) superfamilies and families, adjacent domain pair lists and graphs, unique domain pairs, domain combinations, domain architecture co-occurrence networks and domain distribution across taxonomic kingdoms for each organism.
Genome statistics: For each genome: number of sequences, number of sequences with assignment, percentage of sequences with assignment, percentage total sequence coverage, number of domains assigned, number of superfamilies assigned, number of families assigned, average superfamily size, percentage produced by duplication, average sequence length, average length matched, number of domain pairs and number of unique domain architectures.
Gene Ontology: Domain-centric Gene Ontology (GO), automatically annotated, by Hai Fang.
Phenotype Ontology: Domain-centric phenotype/anatomy ontology, including Disease Ontology, Human Phenotype, Mouse Phenotype, Worm Phenotype, Yeast Phenotype, Fly Phenotype, Fly Anatomy, Zebrafish Anatomy, Xenopus Anatomy and Arabidopsis Plant.
Superfamily annotation: InterPro abstracts for 1052 superfamilies and Gene Ontology (GO) annotation for 763 superfamilies.
Functional annotation: Functional annotation of SCOP 1.73 superfamilies, by Christine Vogel.
Phylogenetic trees: Trees are generated using heuristic parsimony methods and are based on protein domain architecture data for all genomes in SUPERFAMILY. Genome combinations or specific clades can be displayed as individual trees.
Similar domain architectures: Find the 10 domain architectures which are most similar to a domain architecture of interest.
Hidden Markov models: Produce SCOP domain assignments for your sequences using the SUPERFAMILY models. HMM visualisation by Martin Madera, e.g. model 0045110.
Profile comparison: Find remote domain matches when the HMM search fails to find a significant match. Profile comparison (PRC), for aligning and scoring two profile hidden Markov models, by Martin Madera.
Web services: Distributed Annotation Server and linking to SUPERFAMILY.
Downloads: Sequences, assignments, models, MySQL database and scripts, updated weekly.
The SUPERFAMILY database provides protein domain assignments, at the SCOP superfamily
level, for the predicted protein sequences in over 400 completed genomes. A superfamily groups
together domains of different families which have a common evolutionary ancestor based on
structural, functional and sequence data. SUPERFAMILY domain assignments are generated
using an expert curated set of profile hidden Markov models. All models and structural
assignments are available for browsing and download from http://supfam.org. The web interface
includes services such as domain architectures and alignment details for all protein assignments,
searchable domain combinations, domain occurrence network visualization, detection of over- or
under-represented superfamilies for a given genome by comparison with other genomes,
assignment of manually submitted sequences and keyword searches. In this update we describe
the SUPERFAMILY database and outline two major developments: (i) incorporation of family
level assignments and (ii) a superfamily-level functional annotation. The SUPERFAMILY
database can be used for general protein evolution and superfamily-specific studies, genomic
annotation, and structural genomics target suggestion and assessment.
Protozoa
Protozoa Protozoan genomes
Pufferfish
Fugu Puffer fish project UK site
Fugu Fugu genome project Singapore
Fugu Puffer fish project USA
Rat (Rattus norvegicus)
MIT Genetic maps of the Rat genome
NIH Rat genomics and genetics
Rat RatMap
RGD Rat genome database
Rice (Oryza sativa)
MPSS Massively parallel signature sequencing
Rice-research Rice genome sequence database
Rice Rice genome project
Rickettsia
RicBase Rickettsia genome database
Salmon
Salmon ArkDB Salmon mapping database
Sheep (Ovis aries)
Sheep Sheep gene mapping
SheepBase Sheep gene mapping
Sheep ArkDB Sheep mapping database
Soy
Soy Soybeans database
Sorghum
Sorghum Sorghum Genomics
Tetraodon
Tetraodon Tetraodon nigroviridis genome
Tetraodon Tetraodon nigroviridis genome at Whitehead
Tilapia
HCGS Tilapia genome
Tilapia ArkDB Tilapia mapping database
Turkey
Turkey ArkDB Turkey mapping database
Viruses
HIV HIV sequence database
Herpes Human herpes virus 5 database
Worm (Caenorhabditis elegans)
C. elegans C. elegans genome sequencing project
NemBase Resource for nematode sequence and functional data
WormAtlas Anatomy of C. elegans
WormBase The genome and biology of C. elegans
ACEDB A C. elegans database
WWW Server C. elegans web server
Yeast
SCPD The promoter database of Saccharomyces cerevisiae
SGD Saccharomyces genome database
S. pombe Schizosaccharomyces pombe genome project
TRIPLES Functional analysis of Yeast genome at Yale
Yeast Intron database Spliceosomal introns of the yeast
Zebra fish (Danio rerio)
ZFIN Zebrafish information network
ZGR Zebrafish genome resources
ZIS Zebrafish information server
Zebrafish Zebrafish webserver
DOMAIN DATABASE
Domains can be thought of as distinct functional and/or structural units of a
protein. These two classifications coincide rather often, as a matter of fact, and what is
found as an independently folding unit of a polypeptide chain often also carries a specific
function. Domains are often identified as recurring (sequence or structure) units
which may exist in various contexts. In molecular evolution such domains may have been utilized as building blocks, and may have been recombined in different
arrangements to modulate protein function. We can define conserved domains as
recurring units in molecular evolution, the extents of which can be determined by
sequence and structure analysis. Conserved domains contain conserved sequence patterns or motifs, which allow for their detection in polypeptide sequences.
The goal of the NCBI conserved domain curation project is to provide database users with
insights into how patterns of residue conservation and divergence in a family relate to functional
properties, and to provide useful links to more detailed information that may help to understand
those sequence/structure/function relationships. To do this, CDD Curators include the following
types of information in order to supplement and enrich the traditional multiple sequence
alignments that form the foundation of domain models: 3-dimensional structures and conserved core motifs, conserved features/sites, phylogenetic organization, and links to electronic literature
resources.
CDD
Conserved Domain Database (CDD)
CDD is a protein annotation resource that consists of a collection of well-annotated multiple sequence alignment models for ancient domains and full-length proteins. These are available as position-specific score matrices (PSSMs) for fast
identification of conserved domains in protein sequences via RPS-BLAST. CDD content includes NCBI manually curated domains, which use 3D-structure information to explicitly define domain boundaries and provide insights into sequence/structure/function relationships, as well as domain models imported from a number of external source databases (Pfam, SMART, COG, PRK, TIGRFAM).
CD-Search & Batch CD-Search
CD-Search is NCBI's interface to searching the Conserved
Domain Database with protein or nucleotide query sequences. It uses RPS-BLAST, a variant of PSI-BLAST, to
quickly scan a set of pre-calculated position-specific scoring matrices (PSSMs) with a protein query. The results of CD-Search are presented as an annotation of protein domains on the user query sequence, and can be visualized as domain multiple sequence alignments with embedded user queries. High confidence associations between a query sequence and conserved domains are shown as specific hits. The CD-Search Help provides additional details, including
information about running CD-Search locally.
Batch CD-Search serves as both a web application and a script interface for a conserved domain search on multiple protein sequences, accepting up to 100 000 proteins in a single job. It enables you to view a graphical display of the concise or full search result for any individual
protein from your input list, or to download the results for the complete set of proteins. The Batch CD-Search Help provides additional details.
CDART: Domain Architectures
Conserved Domain Architecture Retrieval Tool (CDART) performs similarity searches of the Entrez Protein database based on domain architecture, defined as the sequential order of conserved domains in protein
queries. CDART finds protein similarities across significant evolutionary distances using sensitive domain profiles rather than direct sequence similarity. Proteins similar to
the query are grouped and scored by architecture. You can search CDART directly with a query protein sequence or, if a sequence of interest is already in the Entrez Protein database, simply retrieve the record, open its "Links"
menu and select "Domain Relatives" to see the precalculated CDART results. Relying on domain profiles allows CDART to be fast and, because it relies on annotated functional domains, informative.
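The idea of grouping and scoring proteins by domain architecture can be illustrated with a small sketch. This is not CDART's algorithm: the score used here (the number of distinct domains shared with the query) is an assumption chosen for simplicity, and the protein ids and their architectures are hypothetical.

```python
from collections import defaultdict


def group_by_architecture(proteins):
    """Map each architecture (ordered tuple of domain names) to the proteins that have it."""
    groups = defaultdict(list)
    for protein_id, domains in proteins.items():
        groups[tuple(domains)].append(protein_id)
    return dict(groups)


def rank_architectures(query_domains, proteins):
    """Rank architectures by how many distinct domains they share with the query (toy score)."""
    query = set(query_domains)
    scored = [
        (len(query & set(arch)), arch, ids)
        for arch, ids in group_by_architecture(proteins).items()
    ]
    # Keep only architectures that share at least one domain; best score first.
    return sorted((s for s in scored if s[0] > 0), key=lambda s: (-s[0], s[1]))


# Hypothetical proteins, each with its domains in sequential order.
proteins = {
    "P1": ["SH3", "SH2", "Pkinase"],
    "P2": ["SH3", "SH2", "Pkinase"],
    "P3": ["SH2", "Pkinase"],
    "P4": ["PDZ"],
}

ranking = rank_architectures(["SH2", "Pkinase"], proteins)
for score, arch, ids in ranking:
    print(score, "-".join(arch), ids)
```

Because proteins are compared through their domain composition rather than residue by residue, two proteins with little direct sequence similarity can still end up in the same group.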
CDTree
CDTree is a helper application for your web browser that allows you to interactively view and examine conserved domain hierarchies curated at NCBI. CDTree works with Cn3D as its alignment viewer/editor; it is used in the CDD curation process and is both a classification and research tool for functional annotation and the study of
protein and protein domain families.
Content
CDD content includes NCBI manually curated domain models and domain models imported from
a number of external source databases (Pfam, SMART, COG, PRK, TIGRFAMs). What is unique
about NCBI-curated domains is that they use 3D-structure information to explicitly define domain
boundaries, align blocks, amend alignment details, and provide insights into
sequence/structure/function relationships. Manually curated models are organized hierarchically if
they describe domain families that are clearly related by common descent. To provide a non-redundant
view of the data, CDD clusters similar domain models from various sources into
superfamilies.
Searching the database
The collection is also part of NCBI's Entrez query and retrieval system, crosslinked to numerous other resources. CDD provides annotation of domain footprints and conserved functional sites on protein sequences. Precalculated domain annotation can be retrieved for protein sequences tracked in NCBI's Entrez system, and CDD's collection of models can be queried with novel protein sequences via the CD-Search service, or via the Batch CD-Search service, which allows the computation and download of
annotation for large sets of protein queries. CDD also contains data from additional research projects, such as KOGs (a
eukaryotic counterpart to COGs) and the Library of Ancient Domains (LOAD). Accessions that start with "cl" are for superfamily cluster records and can contain domain models from one or more source databases. When searching CDD, it is possible to limit search results to domains from any given source database by using the Database Search Field.
Phylogenetic organization: Based on evidence from sequence comparison, NCBI
Conserved Domain Curators attempt to organize related domain models into
phylogenetic family hierarchies.
Links to electronic literature resources: NCBI curated domains also provide links to citations in PubMed and NCBI Bookshelf that discuss the domain. These references are selected by curators and, whenever possible, include articles that provide evidence for the biological function of the domain and/or discuss the evolution and classification of a domain family.
PFAM
Pfam 27.0 (Mar 2013, 14831 families)
Proteins are generally comprised of one or more functional regions, commonly termed domains. The presence of different domains in varying combinations in different proteins gives rise to the diverse repertoire of proteins found in nature. Identifying the domains present in a protein can provide insights into the function of that protein.
The Pfam database is a large collection of protein domain families. Each family is represented by multiple sequence alignments and hidden Markov models (HMMs).
There are two levels of quality to Pfam families: Pfam-A and Pfam-B. Pfam-A entries are derived from the underlying sequence database, known as Pfamseq, which is built from the most recent release of UniProtKB at a given time-point. Each Pfam-A family consists of a curated seed alignment containing a small set of representative members of the family, profile hidden Markov models (profile HMMs) built from the seed alignment, and an automatically generated full alignment, which contains all detectable protein sequences belonging to the family, as defined by profile HMM searches of primary sequence databases.
Pfam-B families are un-annotated and of lower quality, as they are generated automatically from the non-redundant clusters of the latest ADDA release. Although of lower quality, Pfam-B families can be useful for identifying functionally conserved regions when no Pfam-A entries are found.
Pfam entries are classified in one of four ways:
Family: A collection of related protein regions
Domain: A structural unit
Repeat: A short unit which is unstable in isolation but forms a stable structure when multiple copies are present
Motif: A short unit found outside globular domains
Related Pfam entries are grouped together into clans; the relationship may be defined by similarity of
sequence, structure or profile-HMM.
Pfam 27.0 (March 2013, 14831 families)
Pfam also generates higher-level groupings of related families, known
as clans. A clan is a collection of Pfam-A entries which are related by
similarity of sequence, structure or profile-HMM.
You can find data in Pfam in various ways:
Analyze your protein sequence for Pfam matches
View Pfam family annotation and alignments
See groups of related families
Look at the domain organisation of a protein sequence
Find the domains on a PDB structure
Query Pfam by keywords
Enter any type of accession or ID to jump to the page for a Pfam family or clan, UniProt sequence, PDB structure, etc.
Glossary of terms used in Pfam
These are some of the commonly used terms in the Pfam website
Alignment coordinates
HMMER3 reports two sets of domain coordinates for each profile HMM matchThe envelope coordinates delineate the region on the sequence where thematch has been probabilistically determined to lie whereas the alignment coordinates delineate the region over which HMMER is confident that thealignment of the sequence to the profile HMM is correct Our full alignmentscontain the envelope coordinates from HMMER3
Architecture
The collection of domains that are present on a protein
Clan
A collection of related Pfam entries. The relationship may be defined by similarity of sequence, structure or profile-HMM.
Domain
A structural unit
Domain score
The score of a single domain aligned to an HMM. Note that, for HMMER2, if
there was more than one domain, the sequence score was the sum of all the domain scores for that Pfam entry. This is not quite true for HMMER3.
DUF
Domain of unknown function
Envelope coordinates
See Alignment coordinates
Family
A collection of related protein regions
Full alignment
An alignment of the set of related sequences which score higher than the manually set threshold values for the HMMs of a particular Pfam entry.
Gathering threshold (GA)
Also called the gathering cut-off, this value is the search threshold used to build the full alignment. The gathering threshold is assigned by a curator when the family is built. The GA is the minimum score a sequence must attain in order to belong to the full alignment of a Pfam entry. For each Pfam HMM we have two GA cutoff values, a sequence cutoff and a domain cutoff.
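The two GA cutoffs can be illustrated with a small sketch. It assumes (this is not stated in the glossary) that a domain match enters the full alignment only when the sequence score reaches the sequence cutoff and the domain score reaches the domain cutoff. The class name, function names and scores are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class DomainHit:
    sequence_id: str
    sequence_score: float  # bit score of the whole sequence against the HMM
    domain_score: float    # bit score of this single domain match


def passes_gathering_threshold(hit, ga_sequence, ga_domain):
    """Assumed rule: both scores must reach their respective GA cutoff."""
    return hit.sequence_score >= ga_sequence and hit.domain_score >= ga_domain


def full_alignment_members(hits, ga_sequence, ga_domain):
    """Keep only the hits that would belong to the full alignment."""
    return [h for h in hits if passes_gathering_threshold(h, ga_sequence, ga_domain)]


# Hypothetical scores, not taken from any real Pfam entry.
hits = [
    DomainHit("seqA", 45.0, 30.0),  # passes both cutoffs
    DomainHit("seqB", 45.0, 15.0),  # domain score too low
    DomainHit("seqC", 18.0, 18.0),  # sequence score too low
]
members = full_alignment_members(hits, ga_sequence=25.0, ga_domain=20.0)
print([h.sequence_id for h in members])
```

With these made-up numbers only seqA is kept, because seqB fails the domain cutoff and seqC fails the sequence cutoff.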
HMMER
The suite of programs that Pfam uses to build and search HMMs. Since Pfam release 24.0 we have used HMMER version 3 to make Pfam. See the HMMER site for more information.
Hidden Markov model (HMM)
An HMM is a probabilistic model. In Pfam we use HMMs to transform the information contained within a multiple sequence alignment into a position-specific scoring system. We search our HMMs against the UniProt protein database to find homologous sequences.
HMMER3
The suite of programs that Pfam uses to build and search HMMs. See the HMMER site for more information.
iPfam
A resource that describes domain-domain interactions that are observed in PDB entries. Where two or more Pfam domains occur in a single structure, it analyses them to see if they are close enough to form an interaction. If they are close enough, it calculates the bonds forming the interaction.
Metaseq
A collection of sequences derived from various metagenomics datasets
Motif
A short unit found outside globular domains
Noise cutoff (NC)
The bit scores of the highest scoring match not in the full alignment
Pfam-A
An HMM-based, hand-curated Pfam entry which is built using a small number of
representative sequences. We manually set a threshold value for each HMM and search our models against the UniProt database. All of the sequences which score above the threshold for a Pfam entry are included in the entry's full alignment.
Pfam-B
An automatically generated alignment which is formed by taking a cluster of sequences from the ADDA database and removing Pfam-A residues from them. Since Pfam-B families are automatically generated, we recommend that you verify that the sequences in a Pfam-B family are related using other methods, such as BLAST. For Pfam 24.0 we have made HMMs for the first (and therefore largest) 20000 Pfam-B families. Users can search their sequences against the Pfam-B HMMs in addition to the Pfam-A HMMs when performing both single-sequence searches and batch searches on the website.
Posterior probability
HMMER3 reports a posterior probability for each residue that matches a match or insert state in the profile HMM. A high posterior probability shows that the alignment of the amino acid to the match/insert state is likely to be correct, whereas a low posterior probability indicates that there is alignment uncertainty. This is indicated on a scale from 10, the highest certainty, down to 1, complete uncertainty. Within Pfam we display this information as a heat map view, where green residues indicate high posterior probability and red ones indicate a lower posterior probability.
Repeat
A short unit which is unstable in isolation but forms a stable structure when
multiple copies are present
Seed alignment
An alignment of a set of representative sequences for a Pfam entry. We use this alignment to construct the HMMs for the Pfam entry.
Sequence score
The total score of a sequence aligned to an HMM. If there is more than one domain, the sequence score is the sum of all the domain scores for that Pfam entry. If there is only a single domain, the sequence and the domain scores for the protein will be identical. We use the sequence score to determine whether a sequence belongs to the full alignment of a particular Pfam entry.
Trusted cutoff (TC)
The bit scores of the lowest scoring match in the full alignment
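The trusted cutoff (TC) and noise cutoff (NC) defined above bracket the gathering threshold. The sketch below derives both from a list of bit scores and a GA value. The function name and the scores are hypothetical, and a single score per match is assumed for simplicity.

```python
def trusted_and_noise_cutoffs(scores, gathering_threshold):
    """Return (TC, NC) for one Pfam entry.

    TC is the lowest score among matches in the full alignment (score >= GA).
    NC is the highest score among matches left out of it (score < GA).
    Either value is None when the corresponding group is empty.
    """
    included = [s for s in scores if s >= gathering_threshold]
    excluded = [s for s in scores if s < gathering_threshold]
    trusted = min(included) if included else None
    noise = max(excluded) if excluded else None
    return trusted, noise


# Hypothetical bit scores for matches to one HMM.
scores = [88.2, 41.7, 27.3, 24.9, 11.0]
tc, nc = trusted_and_noise_cutoffs(scores, gathering_threshold=25.0)
print(tc, nc)  # 27.3 24.9
```

By construction NC is below GA and TC is at or above it, which is why the pair gives a quick picture of how cleanly the threshold separates members from non-members.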
SMART (a Simple Modular Architecture Research Tool) allows the identification and annotation of genetically mobile domains and the analysis of domain architectures. More than 500 domain families found in signalling, extracellular and chromatin-associated proteins are detectable. These domains are extensively annotated with respect to phyletic distributions, functional class, tertiary structures and functionally important residues. Each domain found in a non-redundant protein database, as well as search parameters and
taxonomic information, are stored in a relational database system. User interfaces to this database allow searches for proteins containing specific combinations of domains in defined taxa.
Features
You can use SMART in two different modes: normal or genomic. The main difference is in the underlying
protein database used. In Normal SMART, the database contains Swiss-Prot, SP-TrEMBL and stable
Ensembl proteomes. In Genomic SMART, only the proteomes of completely sequenced genomes are used:
Ensembl for metazoans and Swiss-Prot for the rest.
The protein database in Normal SMART has significant redundancy, even though identical proteins are
removed. If you use SMART to explore domain architectures, or want to find exact domain counts in various
genomes, consider switching to Genomic mode. The numbers in the domain annotation pages will be more
accurate, and there will not be many protein fragments corresponding to the same gene in the architecture
query results. Remember that we are exploring a limited set of genomes, though.
Different color schemes are used to easily identify the mode we are in
Normal mode Genomic mode
Alignment
Representation of a prediction of the amino acids in tertiary structures of homologues that overlay in three dimensions. Alignments held by SMART are mostly based on published observations (see domain annotations for details), but are updated and edited manually.
Alignment block
Ungapped alignments that usually represent a single secondary structure
Bits scores
Alignment scores are reported by HMMer and BLAST as bits scores. The likelihood that the query sequence is a bona fide homologue of the database sequence is compared to the likelihood that the sequence was instead generated by a random model. Taking the logarithm (to base 2) of this likelihood ratio gives the bits score.
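The definition above is a base-2 log-likelihood ratio, which can be written out as a minimal sketch. The function name and the example likelihoods are illustrative assumptions; real tools compute the two likelihoods from the model and the sequence.

```python
import math


def bits_score(likelihood_homologue, likelihood_random):
    """Return log2 of the likelihood ratio (homologue model vs random model)."""
    if likelihood_homologue <= 0 or likelihood_random <= 0:
        raise ValueError("likelihoods must be positive")
    return math.log2(likelihood_homologue / likelihood_random)


# A sequence that is 8 times more likely under the homologue model scores 3 bits.
example_score = bits_score(8.0, 1.0)
print(example_score)  # 3.0
```

Each additional bit therefore corresponds to a doubling of the likelihood ratio, and a negative score means the random model explains the sequence better.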
BLAST (Basic Local Alignment Search Tool)
An excellent database searching tool developed at the National Center for Biotechnology Information (NCBI) ([1] [2] [3] [4]). SMART uses NCBI-BLAST for detection of outlier homologues and homologues of known structure. WU-BLAST is used for nrdb searches with user-supplied sequences.
Cellular Role
Chromatin-associated: Chromatin is the tangled fibrous complex of DNA and protein within a eukaryotic nucleus.
Interaction (with the environment): Molecules that sense cellular environmental change, such as osmolarity, light flux, acidity, ion concentration, etc.
Metabolic: Enzymes that catalyze reactions in living cells that transform organic molecules.
Replication: The process of making an identical copy of a section of duplex (double-stranded) DNA, using existing DNA as a template for the synthesis of new DNA strands.
Signalling: Proteins that participate in a pathway that is initiated by a molecular stimulus and terminated by a cellular response.
Transport: Transporters (permeases) are proteins that straddle the cell membrane and carry specific nutrients, ions, etc. across the membrane.
Translation: The process in which the genetic code carried by messenger RNA directs the synthesis of proteins from amino acids.
Transcription: The synthesis of an RNA copy from a sequence of DNA (a gene); the first step in gene expression.
Coiled coils
Intimately associated bundles of long alpha-helices ([1] [2] [3]). Coiled coils are detected in SMART using the method of Lupas et al. ([4] COILS home). Coiled-coil predictions are indicated on the second line in SMART's graphical output.
Domain
Conserved structural entities with distinctive secondary structure content and a hydrophobic core. In small disulphide-rich and Zn2+-binding or Ca2+-binding domains, the hydrophobic core may be
provided by cystines and metal ions, respectively. Homologous domains with common functions
usually show sequence similarities.
Domain composition
Proteins with the same domain composition have at least one copy of each of the domains of the query.
Domain organisation
Proteins having all the domains of the query in the same order. (Additional domains are allowed.)
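The difference between the two query types can be sketched as two small predicates over lists of domain names. The sketch assumes that "additional domains are allowed" means extra domains may appear anywhere in the target, including between query domains. The function names are hypothetical and the example architectures are made up.

```python
def same_domain_composition(query, target):
    """True if the target has at least one copy of every query domain."""
    return set(query) <= set(target)


def same_domain_organisation(query, target):
    """True if the query domains occur in the target in the same order.

    Implemented as a subsequence test: each `in` check consumes the
    target iterator up to and including the matching domain.
    """
    remaining = iter(target)
    return all(domain in remaining for domain in query)


query = ["SH3", "SH2", "TyrKc"]
in_order = ["SH3", "SH2", "PH", "TyrKc"]   # extra PH domain, same order
shuffled = ["TyrKc", "SH2", "SH3"]         # same domains, different order

print(same_domain_composition(query, in_order), same_domain_organisation(query, in_order))
print(same_domain_composition(query, shuffled), same_domain_organisation(query, shuffled))
```

The shuffled protein matches on composition but not on organisation, which is the distinction the two glossary entries draw.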
Entrez
A WWW-based system that allows easy retrieval of sequence, structure, molecular biology and literature data (Entrez). SMART's domain annotation pages contain links to the Entrez system, thereby providing extensive literature, structure and sequence information.
E-value
This represents the number of sequences with a score greater than or equal to X expected absolutely by chance. The E-value connects the score (X) of an alignment between a user-supplied sequence and a database sequence, generated by any algorithm, with how many alignments with similar or greater scores would be expected from a search of a random sequence database of equivalent size. Since version 2.0, E-values are calculated using Hidden Markov Models, leading to more accurate estimates than before.
Extracellular Domains
Domain families that are most prevalent in proteins outside of the cytoplasm and the nucleus.
Gap
A position in an alignment that represents a deletion within one sequence relative to another. Gap penalties are requirements for alignment algorithms in order to reduce excessively gapped regions. Gaps in alignments represent insertions that usually occur in protruding loops or beta-bulges within protein structures.
Genomic database
Protein database used in SMART's Genomic mode. It contains data from completely sequenced genomes only. Ensembl data is used for Metazoan genomes and Swiss-Prot for others. A complete list of genomes in the database is available.
HMM (Hidden Markov model)
HMMs are statistical models of the sequence consensus of a homologous family (see the docu). A particular class of HMMs has been shown to be equivalent to generalised profiles (8867839).
Applications of HMMs to sequence analysis are nicely provided by HMMer and SAM.
HMM consensus
The HMM consensus is a one-line summary of the corresponding HMM. The amino acid shown for the consensus is the highest-probability amino acid at that position according to the HMM. Capital letters mean highly conserved residues (probability > 0.5 for protein models) (modified from the HMMer User's Guide).
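The consensus rule described above can be sketched as follows. The input format (one dictionary of emission probabilities per match position) and the function name are assumptions made for illustration, not the HMMer file format, and the probabilities are invented.

```python
def hmm_consensus(match_emissions, threshold=0.5):
    """Build a one-line consensus from per-position emission probabilities.

    The most probable residue is reported at each position, in upper case
    when its probability exceeds the threshold and in lower case otherwise.
    """
    consensus = []
    for column in match_emissions:
        residue, probability = max(column.items(), key=lambda item: item[1])
        if probability > threshold:
            consensus.append(residue.upper())
        else:
            consensus.append(residue.lower())
    return "".join(consensus)


# Hypothetical three-position protein model.
columns = [
    {"G": 0.80, "A": 0.15, "S": 0.05},  # strongly conserved glycine
    {"L": 0.40, "I": 0.35, "V": 0.25},  # weakly conserved hydrophobic position
    {"K": 0.55, "R": 0.45},             # just above the 0.5 threshold
]
print(hmm_consensus(columns))  # GlK
```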
HMMer
The HMMer package ([1] [2]) provides multiple alignment and database searching capabilities. There are several programs in the package (see the docu), including one (hmmfs) that searches databases for non-overlapping LOCAL similarities (i.e. that match across at least part of the HMM) and another (hmmls) that searches databases for non-overlapping GLOBAL similarities (i.e. that match across the full HMM). These correspond approximately to profile-based searches using negative and positive profiles, respectively (see WiseTools). Database searches using hmmls or hmmfs provide alignment scores as bits scores.
Homology
Evolutionary descent from a common ancestor due to gene duplication
Intracellular Domains
Domain families that are most prevalent in proteins within the cytoplasm
Localisation
Numbers of domains that are thought, from SwissProt annotations, to be present in different cellular compartments (cytoplasm, extracellular space, nucleus and membrane-associated) are shown in annotation pages.
Motif
Sequence motifs are short conserved regions of polypeptides. Sets of sequence motifs need not necessarily represent homologues.
NRDB (non-redundant database)
A database that contains no identical pairs of sequences. It can contain multiple sequences originating from the same gene (fragments, alternative splicing products). SMART in Standard mode uses such a database.
ORF
Open reading frame
Outlier homologues
These are often difficult to detect using HMM methodology. A complementary approach to their
detection is to query a database of sequences taken from multiple sequence alignments using BLAST. Selecting this option will also activate searches against sequence databases derived from proteins of known structure. A simple BLAST search of the PDB is performed, together with a search of RPS_Blast profiles derived from SCOP. These profiles were kindly provided by Steffen Schmidt (see Schmidt et al., J Chem Inf Comput Sci 2002 (42) 405-7).
P-value
This represents a probability that, given a database of a particular size, random sequences score higher than a value X. P-values are generated by the BLAST algorithm that has been integrated into SMART.
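P-values and E-values are closely related. Under the usual assumption that the number of chance hits scoring at least X follows a Poisson distribution with mean E (an assumption that is standard for BLAST statistics but is not stated in these notes), the probability of seeing at least one such hit is P = 1 - exp(-E). A minimal sketch, with a hypothetical function name:

```python
import math


def p_value_from_e_value(e_value):
    """Convert an E-value to a P-value, assuming Poisson-distributed chance hits.

    P = 1 - exp(-E); expm1 keeps precision when E is very small.
    """
    if e_value < 0:
        raise ValueError("E-value cannot be negative")
    return -math.expm1(-e_value)


for e in (1e-5, 0.1, 1.0, 10.0):
    print(e, p_value_from_e_value(e))
```

For small values the two numbers are almost identical, so the distinction only matters for weak hits with E close to or above 1.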
PDB (Protein Data Bank)
PDB is an archive of experimentally determined three-dimensional structures (Brookhaven Nat. Labs or EBI). Domain families represented in SMART and in the PDB are annotated as being of known structure; links are provided in SMART to the PDB via PDBsum and MMDB. PDBsum links can be used to access a variety of sequence-based and structure-based tools, whereas MMDB provides access to literature information and structure similarities.
PFAM
Pfam is a database of protein domain families represented as (i) multiple alignments and (ii) HMM-profiles ([1] [2]). Pfam WWW servers allow comparison of user-supplied sequences with the Pfam database (Sanger Center and Washington Univ.). SMART contains a facility to search the Pfam collection using HMMer.
Prokaryotic domains
SMART now also searches for domains found in two-component regulatory systems. These can be found mainly in prokaryotes, but a few were also found in eukaryotes like yeast and plants.
Profile
A profile is a table of position-specific scores and gap penalties representing a homologous family, which may be used to search sequence databases (Ref [1] [2] [3]). In CLUSTAL-W-derived profiles, those sequences that are more distantly related are assigned higher weights ([4] [5] [6]). Issues in profile-based database searching are discussed in Bork & Gibson (1996) [7].
ProfileScan
An excellent WWW server that allows a user to compare a protein or DNA sequence against a database of profiles (located at the ISREC).
PROSITE
This is a dictionary of protein sites and motif patterns. Some SMART domain annotations contain links to PROSITE.
Schnipsel database (domain sequence database)
Schnipsel is a German word meaning snippet or fragment. The schnipsel database consists of the sequences of all domains found with SMART in NRDB. Outliers of a family often cannot be detected by a profile, yet are detectable by pairwise similarity to one or more established members of a sequence family. So searching against the schnipsel database gives complementary information to the profile searches.
Searched domains
In the first version of SMART, only eukaryotic signalling domains could be searched. In 1998 we extended this set with prokaryotic signalling and extracellular domains. On the input page you can choose whether you want to search only cytoplasmic (prokaryotic + eukaryotic), extracellular or all domains.
Secondary Literature
The secondary literature is derived by the following procedure: For each of the hand-selected papers referenced by a domain, 100 neighbouring papers are retrieved using Medline. If one of these neighbouring papers is referenced from more than two original papers, it is included in the secondary literature list.
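The selection rule in this procedure can be sketched as a simple counting step. The sketch assumes that the neighbour lists have already been retrieved (the Medline lookup itself is not shown) and that "more than two" means at least three distinct original papers. The function name and paper identifiers are hypothetical.

```python
from collections import Counter


def secondary_literature(neighbours_by_paper, min_sources=3, max_neighbours=100):
    """Select neighbouring papers shared by at least `min_sources` original papers.

    neighbours_by_paper maps each hand-selected paper to its ranked list of
    neighbouring papers; only the first `max_neighbours` are considered.
    """
    counts = Counter()
    for neighbours in neighbours_by_paper.values():
        # A neighbour counts once per original paper, even if listed twice.
        counts.update(set(neighbours[:max_neighbours]))
    return sorted(paper for paper, n in counts.items() if n >= min_sources)


# Made-up identifiers: P* are hand-selected papers, N* their neighbours.
neighbours = {
    "P1": ["N1", "N2", "N3"],
    "P2": ["N1", "N3"],
    "P3": ["N1", "N2"],
    "P4": ["N3", "N3"],
}
print(secondary_literature(neighbours))  # ['N1', 'N3']
```

N2 is dropped because only two original papers list it, while the duplicate N3 under P4 is counted once.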
Seed Alignment
Alignment that contains only one of each pair of homologues that are represented in a CLUSTALW-derived phylogenetic tree linked by a branch of length less than a distance of 0.2 (see the related article).
SEG
A program of Wootton & Federhen [1] that detects regions of the query sequence that have low compositional complexity [2].
Sequence ID or ACC
Sequence identifiers or accession codes may be entered via the SMART homepage to initiate aquery You can use either Uniprot or Ensembl sequence identifiers
Signalling Domains
The original set of domains used in SMART were collected as those that satisfied one or both of two criteria:
1. cytoplasmic domains that possess kinase, phosphatase, ubiquitin ligase or phospholipase enzymatic activities, or those that stimulate GTPase activation or guanine nucleotide exchange
2. cytoplasmic domains that occur in at least two proteins with different domain
organisations, of which one also contains a domain that satisfies criterion 1
These domains mostly mediate or regulate the transduction of an extracellular signal towards the nucleus, resulting in the initiation of a cellular response. More recently, prokaryotic two-component signalling domains have been added to the SMART set.
SignalP
This program predicts the presence and location of signal peptide cleavage sites in amino acid sequences (SignalP home page).
SMART (Simple Modular Architecture Research Tool)
Our main goal in providing this tool is to allow automatic identification and annotation of domains in user-supplied protein sequences.
Species
Numbers of domains present in a variety of selected taxa (animal, archaea, bacteria, fungi, plants and protozoa) are shown in annotation pages.
SwissProt
The SwissProt database is an extensively annotated and non-redundant collection of protein sequences. SwissProt annotations have been mined for SMART-derived annotations of alignments.
TMHMM2
This program predicts the location and topology of transmembrane helices (TMHMM2).
Thresholds
For each of the domains found by SMART, a combination of thresholds is used to distinguish between true and false hits. The different thresholds are described in the SMART paper.
WiseTools
A package that is based on database searches using profiles. Profiles may be generated using PairWise and then compared with sequence databases using SearchWise. Scores are generated for alignments that match the whole of the profile (using a positive profile) or else that match at least part of the profile (using a negative profile). Only the top-scoring optimal alignment of each
sequence is reported; hence SMART relies on re-iterating the search for new repeats until none are reported that score above threshold. Score thresholds have been set manually that are considered to represent a score just above the top-scoring true negative. Additional thresholds have been estimated for domains that are repeated in single polypeptides. A more recent package allows comparison of DNA sequences at the level of their conceptual translations, regardless of sequence error and introns (see Wise2).
What's new
Changes from version 6.0 to 7.0
Full text search engine
Perform a full text search of SMART and Pfam domain annotations, plus the complete protein descriptions for Uniprot and Ensembl proteins.
metaSMART
Explore and compare domain architectures in various publicly available metagenomics datasets.
iTOL export and visualization
Domain architecture analysis results can be exported and visualized in interactive Tree Of
Life. The new option can be found in the protein list function select list.
User interface cleanup
Various small changes to the UI, resulting in faster and easier navigation.
Changes from version 5.1 to 6.0
Metabolic pathways information
SMART domains and proteins available in the genomic mode now have basic metabolic pathways information. It is generated by mapping our genomic mode protein database to the KEGG orthologous groups.
Updated genomic mode protein database
The greatly expanded genomic mode database now includes sequences from 630 completely sequenced species.
Changes from version 5.0 to 5.1
SMART webservice
You can access SMART using our webservice. Check the WSDL file for details. The webservice is still under active development and only works for sequences and IDs which are in our database (complete Uniref100 and stable Ensembl genomes).
SMART DAS server
SMART DAS server is available at URL httpsmartembldesmartdas. It provides all the protein features from our database (SMART domains, signal peptides, transmembrane regions and coiled coils) for all Uniref100 and Ensembl proteins.
If you need help with these services, or have questions/feedback, please contact us.
Changes from version 4.1 to 5.0
New protein database in Normal mode
SMART now uses Uniprot as the main source of protein sequences. All Ensembl proteomes (except pre-releases) are also included. To lower redundancy in the database, the following procedure is used:
o only one copy of 100% identical proteins is kept (different IDs are still available)
o each species' proteins are separated
o CD-HIT clustering with 96% identity cutoff is performed on each species separately
o longest member of each protein cluster is used as the representative
o only representative cluster members and single proteins (i.e. proteins which are not
members of any clusters) are used in all domain architecture queries and for domain counts in the annotation pages
Even though the number of proteins in the database is almost doubled (current version has around 2.9 million proteins), the redundancy should be minimal.
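The redundancy-reduction procedure above can be sketched in two steps: collapsing identical sequences, then picking the longest member of each cluster. The clustering itself is not reproduced here; the sketch assumes the per-species clusters have already been produced by a tool such as CD-HIT and are passed in as lists of IDs, with single proteins appearing as clusters of size one. All function names, IDs and sequences are hypothetical.

```python
def collapse_identical(sequences):
    """Group protein IDs that share exactly the same sequence."""
    ids_by_sequence = {}
    for protein_id, sequence in sequences.items():
        ids_by_sequence.setdefault(sequence, []).append(protein_id)
    return ids_by_sequence


def choose_representatives(clusters, sequences):
    """Return the longest member of each cluster as its representative."""
    return [max(cluster, key=lambda pid: len(sequences[pid])) for cluster in clusters]


# Toy sequences for one species.
sequences = {"a": "MKV", "b": "MKV", "c": "MKVLAA", "d": "MST"}
print(collapse_identical(sequences))

# Hypothetical clustering output: a and c cluster together, d is a single protein.
clusters = [["a", "c"], ["d"]]
print(choose_representatives(clusters, sequences))  # ['c', 'd']
```

Only the representatives returned by the second step would then be used for architecture queries and domain counts, as the list above describes.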
Domain architecture invention dating
As a further step from the single domain to the understanding of multi-domain proteins, SMART now predicts the taxonomic class where the concept of a protein, that is its domain architecture, was invented. The domain architecture is defined as the linear order of all SMART domains in the protein sequence. To derive the point of its invention, all proteins with the same domain architecture are mapped onto NCBI's taxonomy. The last common ancestor of all organisms containing at least one protein with the domain architecture is defined as the point of its origin.
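The last-common-ancestor step can be sketched as finding the deepest taxon shared by all lineages. The sketch assumes each organism is given as a root-first list of taxon names; the lineages below are heavily simplified for illustration and the function name is hypothetical.

```python
def last_common_ancestor(lineages):
    """Return the deepest taxon shared by all root-first lineages, or None."""
    if not lineages:
        return None
    ancestor = None
    # zip stops at the shortest lineage, so only shared depths are compared.
    for taxa_at_depth in zip(*lineages):
        if all(taxon == taxa_at_depth[0] for taxon in taxa_at_depth):
            ancestor = taxa_at_depth[0]
        else:
            break
    return ancestor


# Simplified lineages of organisms that contain a given domain architecture.
human = ["cellular organisms", "Eukaryota", "Opisthokonta", "Metazoa", "Chordata"]
fly = ["cellular organisms", "Eukaryota", "Opisthokonta", "Metazoa", "Arthropoda"]
yeast = ["cellular organisms", "Eukaryota", "Opisthokonta", "Fungi"]

print(last_common_ancestor([human, fly]))         # Metazoa
print(last_common_ancestor([human, fly, yeast]))  # Opisthokonta
```

Adding a single yeast protein with the same architecture moves the inferred point of origin from Metazoa back to Opisthokonta, which shows how sensitive the dating is to the set of organisms sampled.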
Changes from version 4.0 to 4.1
Two modes of operation: Normal and Genomic
For more details, visit the change mode page.
Intrinsic protein disorder prediction
You can now include DisEMBL prediction of protein disorder in your search parameters. The results are displayed as blue regions in protein schematics. DisEMBL's HOTLOOPS and REM465 methods are currently used. Visit the DisEMBL page for more information on the method.
Catalytic activity check
SMART now includes data on specific requirements for catalytic activity for some of our domains (50 catalytic domains at the moment, full list here). If the required amino acids are not present in the predicted domain, it will be marked as Inactive. The domain annotation page will show you the details on which amino acids are missing and the links to relevant literature (PubMed link).
Taxonomic trees
Architecture query results are now displayed as simple taxonomic trees. In addition to individual proteins, you can select any taxonomic node (or multiple nodes) and display all the proteins in those nodes. The Evolution section of domain annotation pages is now also represented as a taxonomic tree.
The basic tree contains only several hand-picked representative species, with a link to the full tree.
User interface redesigned
SMART has been completely rewritten and all pages conform to the XHTML 1.0 Strict and CSS Level 2 standards. We recommend a modern standards-compliant browser for the best experience. Mozilla Firefox and Opera are our favorites.
Changes from version 3.5 to 4.0
Intron positions are shown in protein schematics
For proteins that match any of the Ensembl predictions, SMART will show intron positions as vertical coloured lines in graphical representations (see example). This information is retrieved from a pre-calculated mapping of Ensembl gene structures to protein sequences.
The vertical line at the end of the protein is not an actual intron, but a mark to show that intron mapping was performed. If that is the only line, there are no introns annotated. If there is no line at all, there is no data available in Ensembl for that particular sequence.
You can switch off intron display on your SMART preferences page.
Alternative splicing information
Since SMART now incorporates Ensembl genomes, the Additional information page shows a list of alternative splices of the gene encoding the analyzed protein (if there are any). It is possible to either display SMART protein annotation for any of the alternative splices or get a graphical multiple sequence alignment of all of them.
Orthology information
There are 2 separate sets of orthologs for each Ensembl protein: 1:1 reciprocal best matches in other genomes, and orthologous groups with reciprocal best hits from all genomes analyzed (i.e. each of these proteins has exactly one ortholog in all 6 genomes). This data is displayed on the Additional information page.
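The idea of a 1:1 reciprocal best match can be sketched for a pair of genomes. The sketch assumes the best hit of every protein in the other genome has already been computed (the similarity search is not shown). The function name and protein IDs are hypothetical.

```python
def reciprocal_best_hits(best_a_to_b, best_b_to_a):
    """Return sorted (a, b) pairs where a's best hit is b and b's best hit is a."""
    return sorted(
        (a, b) for a, b in best_a_to_b.items() if best_b_to_a.get(b) == a
    )


# Made-up best-hit tables between genome A and genome B.
best_a_to_b = {"a1": "b1", "a2": "b2", "a3": "b2"}
best_b_to_a = {"b1": "a1", "b2": "a3"}

print(reciprocal_best_hits(best_a_to_b, best_b_to_a))  # [('a1', 'b1'), ('a3', 'b2')]
```

Here a2 is not paired, because its best hit b2 points back to a3 rather than to a2. An orthologous group across several genomes would require such a reciprocal pair between every pair of genomes.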
Graphical multiple sequence alignments
Orthologous groups and different alternative splices can be displayed as graphical multiple
sequence alignments. Proteins are aligned using ClustalW. Domains, intrinsic features and introns are mapped onto the alignment, with their positions adjusted according to gaps (black boxes).
Changes from version 3.4 to 3.5
Features of all Ensembl genomes are stored in SMART
You can use standard Ensembl protein identifiers (for example ENSP00000264122) and sequences in all queries.
Batch access
Quickly display results for hundreds of sequences. Currently limited to 500 sequences per access, 2000 per day.
Smarter handling of identical sequences
If there are multiple IDs associated with a particular sequence, an extra table will be displayed
showing all of them.
Changes from version 3.3 to 3.4
Search structure based profiles using RPS-Blast
Clicking on the search schnipsel and structures checkbox will now also initiate a search of profiles based on scop domain families using RPS-Blast. These profiles were kindly provided by Steffen Schmidt (see Schmidt et al., J Chem Inf Comput Sci 2002 (42) 405-7).
Improved Architecture analysis
In addition to standard Domain selection querying, it is now possible to do queries based on GO (Gene Ontology, click here for more info) terms associated with domains. In the first step you get a list of domains matching the GO terms entered. After selecting the domains of interest from the list, proteins containing those domains are displayed. Use the Taxonomic selection box to limit the results based on taxonomic ranges.
Pfam domains are stored in the database
SMART database now contains precomputed results for all Pfam domains. To use Pfam domains in the architecture queries, prepend the domain name with "Pfam:" (for example, TyrKc AND Pfam:Fz AND TRANS).
Try our SMART Toolbar for Mozilla web browser. Click here for more info.
Changes from version 3.2 to 3.3
Fantastic new protein picture generator
Proteins are now displayed as dynamically generated PNG images. This means that you can download the entire protein representation as a single image. You have our permission to do this and use these diagrams in any way that you like; do acknowledge us though. All domain bubbles have been script-generated using The Gimp and its Perl-Fu extension. The script is GPL'd and is available for download.
Fancy colour alignments
All alignments we generate are coloured in using Leo Goodstadt's excellent CHROMA program. CHROMA is available from here and it's great. You may experience some problems if you're using a klunky browser. This will be fixed when you change your browser.
Excellent transmembrane domains
These are now calculated using the fine TMHMM2 program (the main web site is here), kindly provided by Anders Krogh and co-workers. You can read about the method here.
Selective SMART
We now store intrinsic features such as transmembrane domains and signal peptides in our database. This means that these can be queried for (e.g. SIGNAL AND TyrKc). The feature names are: signal peptide SIGNAL, transmembrane domain TRANS, coiled coil COIL.
Technical changes
SMART code has been modified to run under the Apache mod_perl module. SMART is now using Apache::DBI for persistent database connections. The database engine has been updated to PostgreSQL 7.1. These changes resulted in significant speed improvements.
Bug fixes as usual
The show_many_proteins script now uses the POST method, so there is no longer a limit on the number of proteins you can display. Fixed a display problem with proteins having thousands of different representations (if you try to display those, make sure you have a good browser (like Mozilla or Opera) and a bunch of RAM).
Changes from version 3.1 to 3.2
Start page
There is an indicator of the current database status in the header now. If the database is down or is being updated, you'll know immediately.
Literature
Changed literature identifiers to PMID. New secondary literature generator that parses all
neighbouring papers, not just the first 100.
Numerous small bug fixes and improvements.
Changes from version 3.0 to 3.1
Startup page
The start page now includes selective SMART and allows searching for keywords in the annotation of domains.
Schnipsel Blast
The results of a schnipsel Blast search are included in the bubble diagram. Additionally, PDB is searched.
Taxonomic breakdown
When selecting multiple proteins, e.g. via selective SMART or the annotation pages, an overview of the taxonomy of all species (tax break) is offered.
Links
Links don't contain version numbers. This allows stable links from external sources.
Selective SMART
Allows searching for multiple copies of domains and is case insensitive. You can now search for e.g. Sh3 AND sH3 AND sh3.
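The behaviour described for selective SMART queries (case-insensitive names, repeated names meaning multiple copies) can be sketched as below. This is an illustrative re-implementation under stated assumptions, not SMART's own code: it handles only the AND operator, and the function name and example proteins are hypothetical.

```python
import re
from collections import Counter


def matches_and_query(query, protein_features):
    """Check an AND-only query against the domains and features of one protein.

    Names are compared case-insensitively; a name repeated n times in the
    query requires at least n copies in the protein.
    """
    terms = re.split(r"\s+AND\s+", query.strip())
    required = Counter(term.lower() for term in terms)
    present = Counter(feature.lower() for feature in protein_features)
    return all(present[name] >= count for name, count in required.items())


print(matches_and_query("Sh3 AND sH3 AND sh3", ["SH3", "SH3", "SH2", "SH3"]))  # True
print(matches_and_query("Sh3 AND sH3 AND sh3", ["SH3", "SH3"]))                # False
print(matches_and_query("SIGNAL AND TyrKc", ["SIGNAL", "TyrKc", "TRANS"]))     # True
```

Counting the query terms rather than treating them as a set is what lets the three-fold SH3 query distinguish a protein with three SH3 domains from one with only two.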
Annotation
You can align your query sequence to the SMART alignment using hmmalign.
Update of underlying database
SMART now uses PostgreSQL 6.5.2.
Changes from version 2.0 to 3.0
Digest output
SMART now only produces a single diagram representing a best interpretation of all the
annotation that has been performed. A comprehensive summary of the results is also provided in table format.
selective SMART
Selective SMART allows you to look for proteins with combinations of specific domains in different species or taxonomic ranges.
alert SMART
The SMART database gets updated about once a week. If you are interested in specific domains, or combinations of them, in specific taxonomic ranges, you can use the SMART alerting service. This provides the identities of newly deposited proteins that match your query.
Domain queries
You can ask for proteins having the same domain order/composition as your query protein.
SMARTed Genomes
You can get the result of SMARTing the genomes of Caenorhabditis elegans and Saccharomyces cerevisiae via the annotation pages.
Faster PFAM searches
The PFAM searches now run on a PVM cluster.
Changes from version 1.03 to 2.0
Domain coverage
The original set of signalling domains has now been extended to include extracellular domains.
Default search method is now HMMer
The previously used underlying methodology of SMART (i.e. SWise) tended to over-extend gaps, leading to problems in defining domain borders. Although a SWise search of the SMART database remains possible (using the Wise searching SMART database option available from the Home Page), the default searching method is now HMMer2. This now allows improved statistical estimates (Expectation- or E-values) of the significance of a domain hit.
Complete rewrite of the annotation pages
As with the previous version (1.03), annotation pages provide information concerning domain functions and include hyperlinks to PubMed. However, we now offer automatically derived data showing the taxonomic range and the predicted cellular localisation of proteins containing the domain in question.
Literature database
In addition to the manually derived literature sources given in version 1.03, we now provide automatically derived secondary literature from the annotation pages. These are extracted from PubMed and are cross-linked to additional abstracts.
The CATH domain structure database: new protocols and classification levels give a more comprehensive resource for exploring evolution
Lesley H. Greene, Tony E. Lewis, Sarah Addou, Alison Cuff, Tim Dallman, Mark Dibley, Oliver Redfern, Frances Pearl, Rekha Nambudiry, Adam Reid, Ian Sillitoe, Corin Yeats, Janet M. Thornton and Christine A. Orengo
WHAT'S NEW
We report the latest release (version 3.0) of the CATH protein domain database (http://www.cathdb.info). There has been a 20% increase in the number of structural domains classified in CATH, up to 86 151 domains. Release 3.0 comprises 1110 fold groups and 2147 homologous superfamilies. To cope with the increases in diverse structural homologues being determined by the structural genomics initiatives, more sensitive methods have been developed for identifying boundaries in multi-domain proteins and for recognising homologues. The CATH classification update is now being driven by an integrated pipeline that links these automated procedures with validation steps that have been made easier by the provision of information-rich web pages summarising comparison scores and relevant links to external sites for each domain being classified. An analysis of the population of domains in the CATH hierarchy and several domain characteristics are presented for version 3.0. We also report an update of the CATH Dictionary of homologous structures (CATH-DHS), which now contains multiple structural alignments, consensus information and functional annotations for 1459 well populated superfamilies in CATH. CATH is directly linked to the Gene3D database, which is a projection of CATH structural data onto ∼2 million sequences in completed genomes and UniProt.
GENERAL INTRODUCTION
The number of new structures being deposited in the Protein Data Bank (PDB) continues to grow at a considerable rate. In addition, structures being targeted by worldwide structural genomics initiatives are more likely to be novel or only very remotely related to domains previously classified. Only 2% of structures currently solved by conventional crystallography or NMR are likely to adopt novel folds (see Figures 1 and 2). A higher proportion of new folds is expected among structural genomics structures. Although the influx of more diverse structures and subsequent analysis will inform our understanding of how domains evolve, it has resulted in increasing lags between the numbers of structures being deposited and classified. In response to this situation we have significantly improved our automated and manual protocols for domain boundary assignment and homologue recognition.
Figure 1 Annual decrease in the percentage of new structures classified in CATH which are observed to possess a novel fold. The raw data for years 1972–2005 was fit to a single exponential equation by nonlinear regression using Sigma Plot (SPSS Version
Figure 2 Annual proportion of protein structures deposited in the PDB which are classified in CATH, rejected or pending classification. The colour scheme reflects different categories of PDB chains. Black: not accepted by the CATH criteria. Red: unprocessed chains
Significant changes have been implemented in the CATH classification protocol to achieve a more highly automated system. A seamless flow of structures between the constituent programs has been achieved by building a pipeline which integrates web services for each major comparison stage in the classification (see Figure 3). Secondly, completely automatic decisions are now being made for new protein chains with close relatives already assigned in the CATH database. There are two situations that preclude the CATH update process from being fully automated. We rely on expert manual curation for particularly challenging protein domain boundary assignments (DomChop stage) and also for classifications of remote folds and homologues (HomCheck stage). These two manual stages will remain an integral part of the system (Figure 3).
Figure 3 Flow diagram of the CATH classification pipeline. This schematic illustrates the processes involved in classifying newly determined structures in CATH. The CATH update protocol workflow from new chain to assigned domain is split into two main processes
In this paper we report our ongoing development of the automated procedures. These critical new features should better enable CATH to keep pace with the PDB (3) and facilitate its development. Key statistics on the domain structure populations and characteristics are also presented.
A REVISED CATH CLASSIFICATION HIERARCHY: CATHSOLID
In order to provide information on the sequence diversity between superfamily members, we have introduced additional levels into the CATH hierarchy. The CATH hierarchical classification scheme now consists of nine levels. Class is derived from secondary structure content, and Architecture describes the gross orientation of secondary structures, independent of connectivity. The Topology level clusters structures into fold groups according to their topological connections and numbers of secondary structures. The Homologous superfamilies cluster proteins with highly similar structures, sequences and/or functions (4,5). The new extension of the CATH classification system now includes five 'SOLID' sequence levels. S, O, L and I further divide domains within the H-level using multi-linkage clustering based on similarities in sequence identity (35, 60, 95 and 100%) (see Table 1). The D-level acts as a counter within the I-level and is appended to the classification hierarchy to ensure that every domain in CATH has a unique CATHsolid identification code (see Table 1). Specific details on the nature of the SOLID-levels can be found in the 'General Information' section of the CATH website, http://www.cathdb.info. CATH only includes experimentally determined protein structures with a 4 Å resolution or better, 40 residues in length or longer and having 70% or more side chains resolved.
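The nine-level code can be pictured as nine integers, one per level from C to D. The sketch below assumes the code is written as dot-separated integers; the example code and the helper names are illustrative assumptions, not values taken from Table 1 or from CATH software.

```python
from typing import NamedTuple

# Level names in hierarchy order: C, A, T, H, then the five SOLID levels.
LEVELS = ("C", "A", "T", "H", "S", "O", "L", "I", "D")


class CathSolid(NamedTuple):
    C: int  # Class
    A: int  # Architecture
    T: int  # Topology (fold group)
    H: int  # Homologous superfamily
    S: int  # sequence cluster, 35% identity
    O: int  # sequence cluster, 60% identity
    L: int  # sequence cluster, 95% identity
    I: int  # sequence cluster, 100% identity
    D: int  # counter within the I-level, making the code unique

    def superfamily(self) -> str:
        """Return the first four levels (C.A.T.H) as a dotted string."""
        return ".".join(str(n) for n in self[:4])


def parse_cathsolid(code: str) -> CathSolid:
    """Parse a dotted nine-level code (assumed format) into its levels."""
    parts = code.split(".")
    if len(parts) != len(LEVELS):
        raise ValueError(f"expected {len(LEVELS)} levels, got {len(parts)}")
    return CathSolid(*(int(p) for p in parts))


# Hypothetical identifier used only to exercise the parser.
domain = parse_cathsolid("3.40.50.200.1.1.1.1.2")
print(domain.superfamily(), domain.D)
```

Two domains share a superfamily when their first four levels match, and share a 35% sequence cluster when the first five match, and so on down the hierarchy.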
Table 1 CATH version 3.0 statistics
DOMAIN BOUNDARY ASSIGNMENTS
We have further improved our automated domain boundary prediction method, CATHEDRAL (4). This is used to search a newly determined multi-domain structure against a library of representative structures from different fold groups in the CATH database to recognise constituent domains. CATHEDRAL performs an initial rapid secondary structure comparison between structures, using graph theory to identify putative fold matches, which are then more carefully aligned using a slower, more accurate dynamic programming method. A new scoring scheme has been implemented which combines information on the class of the domains being compared, their sizes, similarity in the structural environments and the number of equivalent residues. A support vector machine is used to combine the different scores and select the best fold match for each putative domain in a new multi-domain protein.
Benchmarking against a set of 964 'difficult' multi-domain chains, whose 1593 constituent domains were remotely related to folds in CATH (<35% sequence identity) and originated from 245 distinct fold groups and 462 superfamilies, showed that 90% of domains within these chains could be assigned to the correct fold group and that for 78% of them the domain boundaries were within ±15 residues of boundaries assigned by careful manual validation. Larger variations in domain boundaries are often due to the fact that in many families significant structural variation can occur during evolution, so that distant relatives vary considerably in size. If no close relative has been classified in CATH, it is likely that the only CATHEDRAL match will be to a relative with significant structural embellishments, thus making it harder to determine the correct boundaries.
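The ±15-residue criterion can be made concrete with a small check. This sketch assumes that each domain is a single continuous segment given as (start, end) residue numbers and that both ends must fall within the tolerance; the published benchmark may have treated discontinuous domains differently. The coordinates are invented for illustration.

```python
def boundaries_agree(predicted, reference, tolerance=15):
    """True if both ends of the predicted segment lie within `tolerance`
    residues of the manually validated segment."""
    return (abs(predicted[0] - reference[0]) <= tolerance
            and abs(predicted[1] - reference[1]) <= tolerance)


def fraction_in_agreement(pairs, tolerance=15):
    """Fraction of (predicted, reference) domain pairs whose boundaries agree."""
    agreeing = sum(boundaries_agree(p, r, tolerance) for p, r in pairs)
    return agreeing / len(pairs)


# Invented (predicted, reference) boundaries for four domains.
pairs = [
    ((1, 120), (5, 110)),     # both ends within 15 residues
    ((121, 260), (111, 300)), # end differs by 40 residues
    ((10, 95), (10, 95)),     # identical
    ((40, 180), (70, 185)),   # start differs by 30 residues
]
fraction = fraction_in_agreement(pairs)
print(fraction)
```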
Since domain boundary assignment of remote homologues is one of the most time-consuming stages in the classification, we combine multiple sources of information for each new structure on a web page to guide manual curation. Pages display scores from a range of algorithms, which include structure-based methods, CATHEDRAL (4), SSAP (6,7), DETECTIVE (8), PUU (9) and DOMAK (10), sequence-based methods such as hidden Markov models (HMMs) (11), and relevant literature. These pages are now viewable for information on putative boundaries for new multi-domain structures currently being classified in CATH (e.g. http://www.cathdb.info/cgi-bin/cath/Chain.pl?chain_id=2g3aA).
For protein chains which are closely related to chains that are already chopped in CATH, an automated protocol has been developed (ChopClose). ChopClose identifies any previously chopped chains that have sufficiently high sequence identity and overlap with the query chain. Using SSAP (6), the query is aligned against each of these chains in turn, and in each case the domain boundaries are inherited across the alignment. The process of inheriting the boundaries often requires some adjustments to be made to account for insertions, deletions or unresolved residues. If the inheritance from one of the chains meets various criteria (SSAP score ≥80, sequence identity >80%, RMSD ≤6.0 Å, longest end extension ≤10 residues, etc.), then the resulting boundaries are used to chop the chain automatically (AutoChop). For cases where ChopClose's best result does not meet all the criteria for automatic chopping, it is provided as support information for a manual domain boundary assignment.
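The AutoChop decision amounts to checking an inherited chopping against a set of thresholds. The sketch below encodes only the four criteria named in the text; the further criteria covered by "etc." are not specified here and are therefore omitted. The class and function names are assumptions for illustration and do not come from the CATH code base.

```python
from dataclasses import dataclass


@dataclass
class InheritedChop:
    """Summary of one boundary inheritance from a previously chopped chain."""
    ssap_score: float           # SSAP structural similarity, 0 to 100
    sequence_identity: float    # percent identity to the chopped chain
    rmsd: float                 # in angstroms
    longest_end_extension: int  # residues


def passes_autochop(result: InheritedChop) -> bool:
    """Apply the four thresholds quoted in the text (others are not listed)."""
    return (result.ssap_score >= 80
            and result.sequence_identity > 80
            and result.rmsd <= 6.0
            and result.longest_end_extension <= 10)


# Invented examples: the first passes, the second fails on end extension.
close_relative = InheritedChop(92.0, 97.5, 1.2, 3)
ragged_relative = InheritedChop(88.0, 85.0, 2.5, 24)
print(passes_autochop(close_relative), passes_autochop(ragged_relative))
```

A result that fails any one test is not discarded; as the text notes, it is passed on as supporting information for manual assignment.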
Refer to Figure 3 for the location of AutoChop/ChopClose within the CATH update protocol.
NEW HOMOLOGUE RECOGNITION METHODS
We have assessed a number of HMM-based protocols for improving homologue recognition. A new protocol (Samosa), exploiting models built using multiple structure alignments to improve accuracy, gives some improvements in sensitivity (4–5%). However, a protocol exploiting an 8-fold expanded HMM library based on sequence relatives of structural domains gives an increase of nearly 10% in sensitivity (12). In addition, HMM–HMM-based approaches have been implemented using the PRC protocol of Madera and co-workers (http://supfam.org/PRC). These allow recognition of extremely remote homologues, some of which are not easily detected by the structure comparison methods (discussed further below). HMM-based database scans developed for the CATH classification protocol are collectively referred to as HMMscan below.
For some very remotely related homologues, confidence in an assignment can be improved by combining information from multiple prediction methods. We have investigated the benefits of using machine learning methods to do this automatically. A neural network was trained using a dataset of 14 000 diverse homologues (<35% sequence identity) and 14 000 non-homologous pairs, with data from different homologue comparison methods including structure comparison (CATHEDRAL, SSAP), sequence comparison (HMM–HMM) and information on functional similarity. The latter was obtained by comparing EC classification codes between close relatives of the distant homologues and using a semantic similarity scoring scheme for comparing GO terms based on a method developed by Lord et al. (13). On a separate validation set of 14 000 homologous pairs and 14 000 non-homologous pairs, 97% of the homologues can be recognised at an error rate of <4%.
NEW UPDATE PROTOCOL
Automatic methods
Previously, CATH data was generated using a group of independent programs and flat files. Over the past two years we have developed an update protocol for CATH that is driven by a suite of programs with a central library and a PostgreSQL database system. A classification pipeline has been established which links, in a completely automated fashion, the different programs that analyse the sequences and structures of both protein chains and domains. The CATH update protocol can essentially be divided into two parts: domain boundary assignment and domain homology classification (see Figure 3). The aim of the protocol is to minimise manual assignment and provide as much support as possible when manual validation is necessary.
Processing of both parts of the classification protocol is similar, requiring related meta-data and the triggering of the same automated algorithms. Methods include pairwise sequence similarity comparisons and scans by other homologue detection or fold recognition algorithms, such as HMMscan and CATHEDRAL, that provide data for either manual or automated assignment. Many of the automated steps in the protocol have been established as web services, and the pipeline integrates these automated steps with 'holding stages' in which domains are held prior to processing and await the completion of manual validation of predictions (see below).
Web pages to support manual validation
For each manual stage in the classification (domain boundary assignment, DomChop, and homologue recognition, HomCheck) we have developed a suite of web pages bringing together all available meta-data from prediction algorithms (e.g. DBS, CATHEDRAL and HMMscan for DomChop; CATHEDRAL and HMMscan for HomCheck), information from the literature and information from other family classifications with relevant data (e.g. Pfam). For each protein or domain shown on the pages, information on the statistical significance of matches is presented. The web pages will shortly be made viewable and will provide interim data on protein chains and domains not fully classified in CATH for biologists interested in any entries pending classification.
OVERVIEW OF THE CURRENT RELEASE (VERSION 3.0)
Assigning domain boundaries and relationships between protein structures is computationally challenging. Since the last CATH release (version 2.6), the number of domains in the CATH database has increased by 20% in version 3.0 and now totals 86 151. This is a more than 10-fold increase in the number of domains classified in CATH since its creation. Improvements in automation, and also in the web-based resources used to aid manual validation, have allowed us to increase the proportion of hard-to-classify structures processed in CATH, and this is reflected in a significant increase in the number of folds in the database, now more than 1000. The detailed breakdown of numbers of domains in the nine CATH levels is given in Table 1. We conducted an analysis of the domains in version 3.0 and have derived statistics for several fundamental features.
Percentage of new topologies
An analysis of the percentage of new folds arising from the early 1970s to the present is shown in Figure 1. The number of new folds has been decreasing over time with respect to the number of new structures being deposited, and it can be seen that currently approximately 2% of new structures classified in CATH are observed to be novel folds. For comparison, the number of domain structures solved over time is also graphically represented in Figure 1.
Number of domains within a protein chain
Integral to the construction of the CATH database is designating domain boundaries. We conducted an analysis of the number of chains versus the number of domains in a chain. It is interesting to note that 64% of all protein structures currently solved and classified in CATH are single-domain chains (data not shown). The next most prevalent are two-domain chains (27%), and beyond this the number of chains containing three or more domains rapidly decreases. The average size of the single-domain chains is 159 residues.
The CATH Dictionary of Homologous Superfamilies (CATH-DHS)
The CATH-DHS has also been recently updated. Data on structural similarity and superfamily variability is presented as a significant update to the Dictionary of Homologous Superfamilies (DHS) web resource (14). The DHS also provides functional annotations of domains within each H-level (superfamily) in CATH v2.5.1.
For each superfamily, pair-wise structural similarity scores between relatives, measured by SSAP, are presented. The DHS now contains 3307 multiple structural alignments for 1459 superfamilies. For each superfamily, multiple alignments are generated for all the relatives and also for subgroups of structurally similar relatives and sequence-similar relatives. Alignments are performed using the residue-based CORA algorithm (15) and presented both as CORAPLOTS (14) and in the form of a 2DSEC diagram (16), alongside co-ordinate data of the superposed structures in PDB format. Sequence representations of the alignments are available to download in FASTA format. In the CORAPLOT images of the multiple alignment, residues in each domain are coloured according to ligand binding and residue type. EquivSEC plots are also shown that describe the variability in orientation and packing between equivalent secondary structures (16).
To identify sequence relatives for CATH superfamilies, sequences from UniProt (17) were scanned against HMMs of all CATH domains (12). Homologous sequences were identified as those hits with an E-value <0.01 and a 60% residue overlap with the CATH domain. This protocol recognised over one million domain sequences in UniProt which could be integrated in the CATH-DHS. The harvested sequences in each superfamily were compared against other relatives by BLAST (18) to determine the pair-wise sequence identity and then clustered at appropriate levels of sequence identity (35 and 95%) using multi-linkage clustering. Information and links to other functional databases, ENZYME (19), GO (Gene Ontology Consortium 2000), KEGG (20), COG (21) and SWISSPROT (22), are also included by BLASTing the sequences from each superfamily against sequences provided by these resources. Only hits with 95% sequence identity and an 80% residue overlap were used to annotate sequences.
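The acceptance rule for HMM hits combines an E-value cutoff with a minimum overlap. The sketch below is a simplified reading of it: the E-value cutoff is taken here as 0.01 and the overlap as the fraction of the CATH domain's residues covered by the alignment, both of which are assumptions about how the thresholds were applied. The record type and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class HmmHit:
    sequence_id: str
    e_value: float
    aligned_residues: int  # residues of the CATH domain covered by the hit
    domain_length: int     # total residues in the CATH domain


def is_homologue(hit: HmmHit, max_e_value: float = 0.01,
                 min_overlap: float = 0.60) -> bool:
    """Accept a hit only if it is significant and covers enough of the domain."""
    overlap = hit.aligned_residues / hit.domain_length
    return hit.e_value < max_e_value and overlap >= min_overlap


# Invented hits: only the first satisfies both conditions.
hits = [
    HmmHit("seq_a", 1e-5, 90, 120),   # significant, 75% overlap
    HmmHit("seq_b", 1e-5, 60, 120),   # significant, but only 50% overlap
    HmmHit("seq_c", 0.5, 120, 120),   # full overlap, but not significant
]
accepted = [h.sequence_id for h in hits if is_homologue(h)]
print(accepted)
```

Requiring both conditions guards against short, high-scoring fragments being counted as whole-domain relatives.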
Recent analysis of structural and functional divergence in highly populated CATH superfamilies (>5 structural relatives with <35% sequence identity) has been undertaken using data from the DHS. The 2DSEC algorithm was used to analyse multiple structural alignments of families and identify highly conserved structural cores and secondary structure embellishments, or decorations, to the common core. In some large superfamilies, extensive embellishments were observed outside the core and, although these secondary structure insertions were frequently discontinuous in the protein chain, they were often co-located in 3D space (16). In many cases, manual inspection revealed that the embellishments had aggregated to form a larger structural feature that was modifying the active site of the domain or creating new surfaces for domain or protein interactions. Data collected in the DHS clearly show a relationship between structural divergence within a superfamily, sequence divergence of this superfamily amongst predicted domains in the genomes, and the number of distinct functional groups that can be identified for the superfamily (see Figure 4).
Figure 4 Relationship between sequence variability, structural variability and functional diversity in CATH superfamilies. Structural variation in a CATH superfamily, as measured by the number of diverse structural subgroups (SSAP score <80 between groups)
LATERAL LINKS ACROSS THE CATH HIERARCHY TO CAPTURE EVOLUTIONARY DIVERGENCE AND EXPLORE THE STRUCTURAL CONTINUUM
Our analysis of structural divergence in CATH superfamilies (16) has revealed families where significant changes in the structures had occurred. In some cases, 5-fold differences in the sizes of domains were identified, and sometimes it was apparent that the 'folds' of these very diverse relatives had effectively changed. Therefore, in these superfamilies more than one fold group can be identified, effectively breaking the hierarchical nature of the CATH classification, which implies that each relative within a CATH homologous superfamily should belong to the same CAT fold group.
In addition, an 'all versus all' HMM–HMM scan between all superfamily representatives revealed several cases of extremely remote homologues which had been classified into separate superfamilies and yet match with significant E-values. Structure comparison had failed to detect the relationship between these superfamilies because the structural divergence of the relatives was so extreme, sometimes constituting a change in architecture as well as fold group. In these cases, homology was only suggested by the HMM-based scans and then manually validated by considering functional information and detailed evidence from the literature. In order to capture information on these distant homologies, links have been created between the superfamilies, both on our web pages and in the CATH database. The data can now be found as a link from the CATH homepage (http://www.cathdb.info).
In the near future we also plan to provide web pages presenting cases of significant structural overlaps between superfamilies or fold groups. For these cases we are not currently able to find any additional evidence to support a distant evolutionary relationship, and these examples highlight the recurrence of large structural motifs between some folds and the existence of a structural continuum in some regions of fold space.
FEATURES
The CATH database can be accessed at http://www.cathdb.info. The web interface may be browsed or, alternatively, searched with PDB codes or CATH domain identifiers. There is also a facility for keyword searches. With the version 3.0 release we now make the raw and processed data files available, which include, for example, CATH domain PDB files, sequences and dssp files; they can be accessed through the CATH database main page. The Gene3D resource can be accessed through the CATH database or directly at http://www.cathdb.info/Gene3D. The DHS can be accessed through the CATH database or directly at http://www.cathdb.info/bsm/dhs.
Structural similarity is assessed using an automatic method (SSAP) (3,4), which scores 100 for identical proteins and generally returns scores above 80 for homologous proteins. More distantly related folds generally give scores above 70 (Topology or fold level), though in the absence of any sequence or functional similarity this may simply represent examples of convergent evolution, reinforcing the hypothesis that there exists a limited number of folds in nature (5,6).
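These score bands can be written as a simple lookup. Since the text says the scores "generally" fall in these ranges, the cutoffs below are rules of thumb rather than strict definitions, and the function name and labels are assumptions for illustration.

```python
def interpret_ssap(score: float) -> str:
    """Map an SSAP score (0 to 100) to the rough band described in the text."""
    if not 0 <= score <= 100:
        raise ValueError("SSAP scores range from 0 to 100")
    if score > 80:
        return "likely homologous"
    if score > 70:
        # Could also reflect convergence if no sequence or functional similarity.
        return "similar fold"
    return "no clear structural relationship"


for score in (100, 85, 75, 60):
    print(score, interpret_ssap(score))
```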
Abstract
We report the latest release (version 1.4) of the CATH protein domains database (http://www.biochem.ucl.ac.uk/bsm/cath). This is a hierarchical classification of 13 359 protein domain structures into evolutionary families and structural groupings. We currently identify 827 homologous families in which the proteins have both structural similarity and sequence and/or functional similarity. These can be further clustered into 593 fold groups and 32 distinct architectures. Using our structural classification and associated data on protein functions stored in the database (EC identifiers, SWISS-PROT keywords and information from the Enzyme database and literature), we have been able to analyse the correlation between the 3D structure and function. More than 96% of folds in the PDB are associated with a single homologous family. However, within the superfolds, three or more different functions are observed. Considering enzyme functions, more than 95% of clearly homologous families exhibit either single or closely related functions, as demonstrated by the EC identifiers of their relatives. Our analysis supports the view that determining structures, for example as part of a 'structural genomics' initiative, will make a major contribution to interpreting genome data.
Introduction FOR CATH
The CATH classification of protein domain structures was established in 1993 (1) as a hierarchical clustering of protein domain structures into evolutionary families and structural groupings, depending on sequence and structure similarity. There are four major levels, corresponding to protein class, architecture, topology or fold, and homologous family (Fig. 1). Since 1995, information about these structural groups and protein families has been accessible over the Web (http://www.biochem.ucl.ac.uk/bsm/cath), together with summary information about each individual protein structure (PDBsum) (2).
CATH consists of both phylogenetic and phenetic descriptors for protein domain relationships. At the lowest levels in the hierarchy, proteins are grouped into evolutionary families (Homologous families) for having either significant sequence similarity (≥35% identity) or high structural similarity and some sequence similarity (≥20% identity).
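The two routes into a homologous family can be expressed as one boolean rule. The text does not give a number for "high structural similarity" at this point; the sketch assumes an SSAP score of 80 or more, a cutoff that appears elsewhere in these notes, and exposes it as a parameter. The function name is hypothetical.

```python
def same_homologous_family(sequence_identity: float, ssap_score: float,
                           ssap_cutoff: float = 80.0) -> bool:
    """Decide family membership from percent identity and SSAP score.

    Route 1: significant sequence similarity alone (>= 35% identity).
    Route 2: high structural similarity (assumed SSAP >= ssap_cutoff)
             together with some sequence similarity (>= 20% identity).
    """
    if sequence_identity >= 35:
        return True
    return ssap_score >= ssap_cutoff and sequence_identity >= 20


# Invented (identity, SSAP) pairs covering each branch of the rule.
examples = [(40, 50), (25, 85), (25, 70), (10, 95)]
results = [same_homologous_family(identity, ssap) for identity, ssap in examples]
print(results)
```

The last example shows why structure alone is not enough under this rule: a pair with a very high SSAP score but under 20% identity is left for other evidence, such as function, to resolve.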
The CATH database of protein structures contains approximately 18 000 domains organized according to their (C)lass, (A)rchitecture, (T)opology and (H)omologous superfamily. Relationships between evolutionarily related structures (homologues) within the database have been used to test the sensitivity of various sequence search methods in order to identify relatives in GenBank and other sequence databases. Subsequent application of the most sensitive and efficient algorithms, gapped BLAST and the profile-based method Position-Specific Iterated Basic Local Alignment Search Tool (PSI-BLAST), could be used to assign structural data to between 22 and 36% of microbial genomes in order to improve functional annotation and enhance understanding of biological mechanism. However, on a cautionary note, an analysis of functional conservation within fold groups and homologous superfamilies in the CATH database revealed that whilst function was conserved in nearly 55% of enzyme families, function had diverged considerably in some highly populated families. In these families, functional properties should be inherited far more cautiously, and the probable effects of substitutions in key functional residues must be carefully assessed.
Figure 1
Schematic representation of the (C)lass, (A)rchitecture and (T)opology/fold levels in the CATH database.
Figure 2
Snapshot of a web page showing data available in the CATH dictionary of homologous superfamilies for the subtilisin family (CATH id 3.40.50.200). Tables display the PDB codes for non-identical relatives in the family, together with EC identifier codes and information about the enzyme reactions. The multiple structural alignment shown has been coloured according to secondary structure assignments (red for helix, blue for strands).
The Architecture level in CATH groups proteins whose folds have similar 3D arrangements of secondary structures (e.g. barrel, sandwich or propellor), regardless of their connectivity, whilst the top level, Class, simply reflects the proportion of α-helix or β-strand secondary structures. Three major classes are recognised, mainly-α, mainly-β and α-β, since analysis revealed considerable overlap between the α+β and alternating α/β classes originally described by Levitt and Chothia (7).
Before classification, multidomain proteins are first separated into their constituent folds using a consensus method which seeks agreement between three independent algorithms (8). Whilst the protocol for updating CATH is largely automatic (9), several stages require manual validation, in particular establishing domain boundaries in proteins for which no consensus could be reached and checking the relationships of very distant homologues and proteins having borderline fold similarity. Although there are plans to assign the more regular architectures automatically, all architecture groupings are currently assigned manually.
A homologous family Dictionary is now available within CATH, which contains functional data, where available, for each protein within a homologous family. This includes EC identifiers, SWISS-PROT keywords and information from the Enzyme database or the literature (Fig. 2). Multiple structure-based alignments are also available, coloured according to secondary structure assignments or residue properties, and there are schematic plots showing domain representations annotated by protein-ligand interactions (DOMPLOTS) (A.E. Todd, C.A. Orengo and J.M. Thornton, submitted to Protein Engng). The topology of each domain is illustrated by schematic TOPS diagrams (http://www3.ebi.ac.uk/tops; 10).
Figure 3
CATH wheel plot showing the population of homologous families in different fold groups, architectures and classes. The wheel is coloured according to protein class (red, mainly-α; green, mainly-β; yellow, α-β; blue, few secondary structures). The size of the outer wheel represents the number of homologous families in CATH, whilst each band in the outer wheel corresponds to a single fold family. The size of each fold 'band' therefore reflects the number of homologous families having that fold. It can be seen that most fold families contain a single homologous family. The superfold families are shown as paler bands containing many homologous families. The inner wheel shows the population of homologous families in the different architectures.
We have also recently set up a Web Server (11) which enables the user to scan the CATH database with a newly determined protein structure and identify possible fold similarities or evolutionary relationships. There are also plans to incorporate sequence searches (using BLAST or PSI-BLAST) (12) to identify a probable fold for a new sequence.
The latest release of CATH (version 1.4, April 1998) contains 9342 protein chains from the PDB (13), which divide into 13 359 domain folds. Currently 32 different architectures are recognised. Since the last release, three new architectures have been described, including the five-bladed α-β propellor. Grouping proteins on the basis of sequence, structure and functional similarity gives 827 evolutionary homologous families (H-level), whilst recognising more distant structural similarity with no accompanying sequence or function similarity gives rise to 593 different fold groups (T-level).
The population of the different levels in the CATH hierarchy is illustrated by the CATH wheel shown in Figure 3. It can be seen that several highly populated fold families, which we describe as superfolds (6) as they support a diverse range of sequences and more than three different functions, still account for nearly 30% of non-homologous structures.
Implications for Structural Genomics
As the sequence databases grow rapidly, the need to interpret these sequences and assign functions to specific genes becomes increasingly important. Many techniques exist for matching protein sequences and thereby inheriting functional information. However, for very distant homologues there is often no detectable sequence similarity, despite conservation of 3D structure and function. For these cases, evolutionary relationships, and thereby functions, can only be assigned by comparing the structures. Therefore a number of structural genomics initiatives are being proposed (14) which aim to identify all the folds in nature, with the ultimate goal of being able to predict the function of a new protein from its known or probable structure. The important questions to ask are: how many more folds do we need to determine before we have the complete set, and how confident can we be in assigning function between proteins having similar structures?
In the current genomes, on average only 30-46% of sequences can be assigned to a structural family by recognising sequence similarity to a protein of known structure (15,16). With only ~600 unique structures currently in the PDB compared with ~20 000 sequence families, it is clear that we still need to determine many more structures if we are to understand biology at the molecular level.
However, analysis of recently deposited structural data is very revealing. Figure 4a illustrates the distribution of 2159 new structural domains classified in the 10 months from June 1997 to March 1998. A large proportion of these (79%) were clearly homologous (≥30% identity) to proteins of known structure.
Of the remaining 443 structures (Fig. 4b), corresponding to 'new' sequences, we found only 8% were novel folds, the remainder resembling a previously determined structure. Many of these, 199 (45%), could be identified as clear homologues by having significant structure and sequence similarity (SSAP ≥80 and ≥20% sequence identity). A further 169 (38%) were probable homologues as, although the sequence identity was below 20%, they had functional similarity and/or gave significant scores using sequence search methods designed to detect very distant homologues (PSI-BLAST) (12). There remained a further 40 (9%) proteins which were analogous, i.e. they had the same fold as a previous entry but neither the sequence nor the function gave definite evidence of a common ancestor.
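As a quick arithmetic check on the figures above, the short Python sketch below recomputes the percentages from the quoted counts. The count of novel folds is not given explicitly in the text, so it is inferred here as the remainder of the 443 structures; the variable names are illustrative only.

```python
# Counts quoted in the text for the domains with "new" sequences.
total_new = 443
clear_homologues = 199      # significant structure and sequence similarity
probable_homologues = 169   # low identity, but functional or PSI-BLAST evidence
analogues = 40              # same fold, no definite evidence of common ancestry

# Assumption: the novel folds are whatever remains of the 443 structures.
novel_folds = total_new - (clear_homologues + probable_homologues + analogues)


def percent(count, total=total_new):
    """Percentage of `total`, rounded to the nearest whole number."""
    return round(100 * count / total)


for label, count in [
    ("clear homologues", clear_homologues),
    ("probable homologues", probable_homologues),
    ("analogues", analogues),
    ("novel folds (inferred)", novel_folds),
]:
    print(f"{label}: {count} ({percent(count)}%)")

# Clear plus probable homologues taken together.
print("all homologues:", percent(clear_homologues + probable_homologues), "%")
```

The rounded values (45, 38, 9 and 8) agree with the figures quoted, and clear plus probable homologues together come to 83 percent of the 443 structures.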
At the homologous superfamily level in CATH, a more detailed analysis of enzyme functions showed that the majority of homologous enzyme families in CATH (over 90%) contained proteins for which the first three EC identifiers were the same. Considering those families where homologues have significant sequence identity (≥20%) after structural alignment, 95% were found to have a single EC identifier, whilst for families where proteins have more than 30% sequence similarity we observed that 98% had a single EC code.
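The comparison of EC identifiers described above can be sketched in a few lines of Python. This is only an illustration, assuming each family is given as a list of four-field EC numbers; the function names and the example families are hypothetical and are not taken from CATH.

```python
def ec_prefix(ec_number, levels):
    """Return the first `levels` fields of an EC number such as '3.1.1.7'."""
    return tuple(ec_number.split(".")[:levels])


def family_is_consistent(ec_numbers, levels=3):
    """True if every member of the family shares the same EC prefix."""
    return len({ec_prefix(ec, levels) for ec in ec_numbers}) == 1


def fraction_consistent(families, levels=3):
    """Fraction of families whose members agree at the given EC depth."""
    consistent = sum(family_is_consistent(ecs, levels) for ecs in families.values())
    return consistent / len(families)


# Hypothetical example data: family name -> EC numbers of its members.
example_families = {
    "family_a": ["3.1.1.7", "3.1.1.8"],   # same first three fields
    "family_b": ["1.1.1.1", "2.7.1.1"],   # different enzyme classes
}

print(fraction_consistent(example_families, levels=3))
print(fraction_consistent(example_families, levels=4))
```

Setting `levels=3` mirrors the "first three EC identifiers" criterion, while `levels=4` asks for a single full EC code.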
Although assigning function on the basis of homology is common practice, it is clear that some caution should be exercised, particularly where there is little or no sequence similarity. There are also some clear examples where homologues with significant sequence similarity perform different functions. The role of 'gene recruitment' is especially clear in the eye lens proteins, which function as enzymes in other cellular environments but which are used as structural proteins in this context (17). The extent of such 'gene recruitment' and context-sensitive function is really not known at this time. For enzymes it is clear that catalytic function can change and evolve, usually to act on a different but related substrate. Similarly, within the lipocalin family (CATH id 24013010) several proteins are found with very similar structures which bind different fatty acids in the same region at the base of the β-barrel (e.g. retinol, bilin, biotin).
Nearly half of the homologous families where two or more different EC numbers were observed belong to the superfolds. This suggests that if a new protein is assigned to a superfold family, more caution should be used when inheriting functional information, as there appears to be greater tolerance to changes in sequence and ultimately function for these families. However, it is interesting to note that many of these were TIM barrel or Rossmann folds. These are superfolds in which the substrate or ligand commonly binds in the same place. This is in the base of the β-barrel for the TIMs and at the crossover of the polypeptide chain for the doubly wound Rossmann structures.
Assignment of Function Through Structure
One of the reasons for determining structures is to derive more information to facilitate the assignment of function. From our analysis of proteins in CATH we suggest that structural data can help to assign function in several ways:
i. The structural data allow recognition of more distant homologues compared with sequence data; in our analysis 83% of structures with novel sequences could be assigned as homologues in this way (note that such assignment of function is again subject to the caveats imposed by 'gene recruitment' discussed above).
ii. The structural data allow detailed inspection of the functional site, to suggest if and how the function may have evolved. For example, if an enzyme has evolved to act on a different substrate, the binding site may reveal or at least suggest possible changes in the substrate.
iii. For the superfolds, similarity of structure does not necessarily mean similarity of function. However, the active site/binding sites are often conserved, e.g. in the TIM barrel or Rossmann fold structures the ligand always binds at the same end of the barrel or sheet.
iv. Some methods have already been developed, and will increasingly be the focus of attention over the next few years, which aim to predict function ab initio from structure. For example, enzymes can often be identified by the presence of a major cleft, which also locates the active site (18). Similarly, critical surface patches which are used for molecular recognition in binding other proteins or ligands may be identified using knowledge-based approaches (19,20).
In summary, extrapolating the data from Figure 4 to a new genome, we can expect that of the 54-70% of sequences which currently have no obvious sequence matches in the PDB, we will find nearly 80-90% to be homologous to a known family using the structural data alone. For the singlet folds this will almost certainly reveal some clues to the function. For the superfolds, some folds will reveal information on the functional class (e.g. enzyme for TIM barrels) or the location of the active site, if not the specific function. Only 10-20% will be expected to be 'novel' folds. For these the ab initio methods referred to above may provide some clues to guide experiments. Therefore it is clear that determining structures, as part of a 'structural genomics' initiative for example, will make a major contribution to interpreting genome data.
JAIL
Just Another Interface Library
Interfaces of macromolecules are a valuable basis for analysing the process of molecular recognition. JAIL classifies not only the interfaces between domain architectures but also those between protein chains and those between proteins and nucleic acids.
[Figure: Gt-alpha/Gi-alpha chimera (PDB-ID 1GOT) and the interfaces of 1GOT]
Interacting proteins are difficult to crystallize and are rarely present in the Protein Data Bank. Nevertheless, it is essential to analyse the interacting parts of the proteins to understand the process of protein-protein docking.
To overcome this problem we have built the JAIL database. Since interacting domains exhibit structural features similar to those of interacting proteins, all known interfaces between interacting domains of the SCOP database were extracted and classified in JAIL.
Only a part of all protein structures is included in SCOP. In particular, new PDB entries are not yet annotated. To overcome this problem, all interfaces between protein chains were additionally calculated and included in the database. This type of interface also comprises the interacting parts of the assumed biological units. The last important type of interface provided here is composed of the interacting parts between proteins and nucleic acids. Overall, the data set consists of about 180000 interfaces.
JAIL is a convenient tool for browsing the interface library and analysing single interfaces. However, more general questions require large-scale analysis. For this purpose, a detailed form enables the compilation of comprehensive non-redundant data sets for download.
How is an interface defined?
A complete residue is part of an interface if at least one atom of the amino acid is located within 4.5 Ångström of any atom of the interacting domain or chain. One part of an interface must consist of at least 5 C-alpha atoms in the case of protein chains. In the case of nucleic acids, the relevant atoms are calculated on the basis of the phosphorus atoms of the RNA/DNA backbone.
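The distance criterion above can be illustrated with a minimal Python sketch. It assumes each chain is supplied as a plain list of (residue id, coordinate) atom records and that the cutoff is 4.5 Ångström; the function names, the data layout and the toy coordinates are hypothetical, and this is not the code used by JAIL.

```python
import math

CUTOFF = 4.5       # Angstrom; assumed distance cutoff from the definition above
MIN_RESIDUES = 5   # at least 5 C-alpha atoms, i.e. 5 residues, per interface part


def interface_residues(chain_a, chain_b, cutoff=CUTOFF):
    """Residues of chain_a with at least one atom within `cutoff` of chain_b.

    Each chain is a list of (residue_id, (x, y, z)) atom records.
    """
    found = set()
    for residue_id, xyz in chain_a:
        if residue_id in found:
            continue  # the whole residue already counts as interface
        if any(math.dist(xyz, other_xyz) <= cutoff for _, other_xyz in chain_b):
            found.add(residue_id)
    return found


def is_valid_interface(chain_a, chain_b, min_residues=MIN_RESIDUES):
    """Both parts of the interface must reach the minimum size."""
    side_a = interface_residues(chain_a, chain_b)
    side_b = interface_residues(chain_b, chain_a)
    return len(side_a) >= min_residues and len(side_b) >= min_residues


# Toy coordinates, invented for illustration only.
chain_a = [(1, (0.0, 0.0, 0.0)), (1, (1.0, 0.0, 0.0)), (2, (10.0, 0.0, 0.0))]
chain_b = [(101, (0.0, 0.0, 4.0)), (102, (30.0, 0.0, 0.0))]

print(interface_residues(chain_a, chain_b))
print(is_valid_interface(chain_a, chain_b))
```

The all-against-all loop is quadratic in the number of atoms; for whole-database calculations a spatial index would normally be used instead, but the criterion itself stays the same.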
What are biounits?
The primary coordinate file deposited in the PDB generally contains one asymmetric unit. The asymmetric unit is the smallest portion of a crystal structure to which crystallographic symmetry can be applied to generate one unit cell. The biological molecule (biounit) is believed to be the functional unit of the protein. Frequently those units can be assumed or calculated when additional information is available. The biological units of many proteins are deposited in a separate section of the PDB database and can be used for interface calculations.
In which way are redundant interfaces excluded in the download section?
Redundancy is excluded in two different ways: by structure and by sequence. The sequence clustering is based on the CD-HIT program. The structural clustering is defined by the protein families and superfamilies of the SCOP classification. The database classifies proteins by domain architecture.
Which settings in the download section are best for my own research?
The selection of the datasets depends on the type of interactions (protein-protein or protein-nucleic acid) and the level of diversity that is desired. A maximum sequence identity of 50% results in a higher diversity than a setting of 95%. The default settings include interfaces of domain-domain interactions as well as interfaces between interacting chains. All interfaces of chains that were already covered by the SCOP domain interfaces are excluded by default. This procedure results in a high number of interfaces that are still diverse enough for statistical analysis.
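To make the effect of the sequence identity setting concrete, here is a small Python sketch of greedy redundancy removal. It is not the CD-HIT program: the identity measure is a deliberately crude, alignment-free comparison of positions, and the function names and toy sequences are assumptions made for illustration.

```python
def identity(seq_a, seq_b):
    """Crude identity: matching positions over the shorter length (no alignment)."""
    shorter = min(len(seq_a), len(seq_b))
    if shorter == 0:
        return 0.0
    return sum(a == b for a, b in zip(seq_a, seq_b)) / shorter


def greedy_representatives(sequences, max_identity):
    """Keep a sequence only if it is at most `max_identity` identical
    to every representative chosen so far (longest sequences first)."""
    representatives = []
    for seq in sorted(sequences, key=len, reverse=True):
        if all(identity(seq, rep) <= max_identity for rep in representatives):
            representatives.append(seq)
    return representatives


# Toy sequences, invented for illustration only.
toy = ["AAAAAAAAAA", "AAAAAAAACC", "AAAACCCCCC"]

print(len(greedy_representatives(toy, 0.50)))
print(len(greedy_representatives(toy, 0.95)))
```

A stricter cutoff such as 50% keeps fewer but mutually more dissimilar sequences, which is the sense in which it gives a more diverse data set than 95%.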
What is meant by 'show conservation' in Jmol?
The conservation of protein sequences is defined by the mutation rates at each amino acid position. For JAIL this information was retrieved from ConSurf. ConSurf is a derived database merging structural and sequence information.
[Figure: database scheme]
Search options
Search for: PDB-Id (e.g. 1aay), SCOP-Id (e.g. d1az0a_), EC-Number (e.g. 3.1.1), accession number (e.g. P03697), protein name (e.g. capsid protein)
Interface types: Domain/Domain (SCOP), Chain/Chain, Protein/Nucleic, Biounit/Biounit
Other options: full-text keyword search, SCOP text search, and restriction of interfaces by location (intra, inter, don't care)
MMDB
MMDB contains experimentally resolved structures of proteins, RNA and DNA, derived from the Protein Data Bank (PDB), with value-added features such as explicit chemical graphs, computationally identified 3D domains (compact substructures) that are used to identify similar 3D structures, as well as links to literature, similar sequences, information about chemicals bound to the structures, and more. These connections make it possible, for example, to find 3D structures for homologs of a protein sequence of interest, then interactively view the sequence-structure relationships, active sites, bound chemicals, journal articles, and more.
Three-dimensional structures are now known within many protein families, and it is quite likely, in searching a sequence database, that one will encounter a homolog with known structure. The goal of Entrez's 3D-structure database is to make this information, and the functional annotation it can provide, easily accessible to molecular biologists. To this end, Entrez's search engine provides three powerful features: (i) Sequence and structure neighbors: one may select all sequences similar to one of interest, for example, and link to any known 3D structures. (ii) Links between databases: one may search by term matching in MEDLINE, for example, and link to 3D structures reported in these articles. (iii) Sequence and structure visualization: identifying a homolog with known structure, one may view molecular-graphic and alignment displays to infer approximate 3D structure. In this article we focus on two features of Entrez's Molecular Modeling Database (MMDB) not described previously: links from individual biopolymer chains within 3D structures to a systematic taxonomy of organisms represented in molecular databases, and links from individual chains (and compact 3D domains within them) to structure neighbors, that is, other chains (and 3D domains) with similar 3D structure. MMDB may be accessed at http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Structure
SUPERFAMILY is a database of structural and functional annotation for all proteins and genomes [1][2][3][4][5]. The SUPERFAMILY annotation is based on a collection of hidden Markov models which represent structural protein domains at the SCOP superfamily level [6]. A superfamily groups together domains which have an evolutionary relationship. The annotation is produced by scanning protein sequences from completely sequenced genomes against the hidden Markov models.
For each protein you can:
Submit sequences for SCOP classification
View domain organisation, sequence alignments and protein sequence details
For each genome you can:
Examine superfamily assignments, phylogenetic trees, domain organisation lists and networks
Check for over- and under-represented superfamilies within a genome
For each superfamily you can:
Inspect SCOP classification, functional annotation, Gene Ontology annotation, InterPro abstract and genome assignments
Explore taxonomic distribution of a superfamily across the tree of life
All annotation, models and the database dump are freely available for download to everyone.
Purpose
SUPERFAMILY classifies amino acid sequences into known structural domains, especially into SCOP superfamilies. The superfamilies are groups of proteins which have structural evidence to support a common evolutionary ancestor but may not have detectable sequence homology.
Major Features
Sequence search: Submit your protein or DNA sequence for SCOP superfamily and family level classification.
Keyword search: Search for superfamily, family or species names plus sequence, SCOP, PDB or hidden Markov model IDs.
Domain assignments: Domain assignments, alignments and architectures for completely sequenced eukaryotic and prokaryotic organisms, plus sequence collections.
Comparative genomics tools: Browse unusual (over- and under-represented) superfamilies and families, adjacent domain pair lists and graphs, unique domain pairs, domain combinations, domain architecture co-occurrence networks and domain distribution across taxonomic kingdoms for each organism.
Genome statistics: For each genome: number of sequences, number of sequences with assignment, percentage of sequences with assignment, percentage total sequence coverage, number of domains assigned, number of superfamilies assigned, number of families assigned, average superfamily size, percentage produced by duplication, average sequence length, average length matched, number of domain pairs and number of unique domain architectures.
Gene Ontology: Domain-centric Gene Ontology (GO), automatically annotated, by Hai Fang.
Phenotype Ontology: Domain-centric phenotype/anatomy ontology including Disease Ontology, Human Phenotype, Mouse Phenotype, Worm Phenotype, Yeast Phenotype, Fly Phenotype, Fly Anatomy, Zebrafish Anatomy, Xenopus Anatomy, Arabidopsis Plant.
Superfamily annotation: InterPro abstracts for 1052 superfamilies and Gene Ontology (GO) annotation for 763 superfamilies.
Functional annotation: Functional annotation of SCOP 1.73 superfamilies by Christine Vogel.
Phylogenetic trees: Trees are generated using heuristic parsimony methods and are based on protein domain architecture data for all genomes in SUPERFAMILY. Genome combinations or specific clades can be displayed as individual trees.
Similar domain architectures: Find the 10 domain architectures which are most similar to a domain architecture of interest.
Hidden Markov models: Produce SCOP domain assignments for your sequences using the SUPERFAMILY models. HMM visualisation by Martin Madera, e.g. model 0045110.
Profile comparison: Find remote domain matches when the HMM search fails to find a significant match. Profile comparison (PRC), for aligning and scoring two profile hidden Markov models, by Martin Madera.
Web services: Distributed Annotation Server and linking to SUPERFAMILY.
Downloads: Sequences, assignments, models, MySQL database and scripts, updated weekly.
The SUPERFAMILY database provides protein domain assignments, at the SCOP superfamily level, for the predicted protein sequences in over 400 completed genomes. A superfamily groups together domains of different families which have a common evolutionary ancestor based on structural, functional and sequence data. SUPERFAMILY domain assignments are generated using an expert curated set of profile hidden Markov models. All models and structural assignments are available for browsing and download from http://supfam.org. The web interface includes services such as domain architectures and alignment details for all protein assignments, searchable domain combinations, domain occurrence network visualization, detection of over- or under-represented superfamilies for a given genome by comparison with other genomes, assignment of manually submitted sequences and keyword searches. In this update we describe the SUPERFAMILY database and outline two major developments: (i) incorporation of family level assignments and (ii) a superfamily-level functional annotation. The SUPERFAMILY database can be used for general protein evolution and superfamily-specific studies, genomic annotation, and structural genomics target suggestion and assessment.
Turkey
ArkDB: Turkey mapping database
Viruses
HIV: HIV sequence database
Herpes: Human herpes virus 5 database
Worm (Caenorhabditis elegans)
C. elegans: C. elegans genome sequencing project
NemBase: Resource for nematode sequence and functional data
WormAtlas: Anatomy of C. elegans
WormBase: The genome and biology of C. elegans
ACEDB: A C. elegans database
WWW Server: C. elegans web server
Yeast
SCPD: The promoter database of Saccharomyces cerevisiae
SGD: Saccharomyces genome database
S. pombe: Schizosaccharomyces pombe genome project
TRIPLES: Functional analysis of the yeast genome at Yale
Yeast Intron database: Spliceosomal introns of yeast
Zebra fish (Danio rerio)
ZFIN: Zebrafish information network
ZGR: Zebrafish genome resources
ZIS: Zebrafish information server
Zebrafish: Zebrafish webserver
DOMAIN DATABASE
Domains can be thought of as distinct functional and/or structural units of a protein. These two classifications coincide rather often, as a matter of fact, and what is found as an independently folding unit of a polypeptide chain also carries a specific function. Domains are often identified as recurring (sequence or structure) units, which may exist in various contexts. In molecular evolution such domains may have been utilized as building blocks, and may have been recombined in different arrangements to modulate protein function. We can define conserved domains as recurring units in molecular evolution, the extents of which can be determined by sequence and structure analysis. Conserved domains contain conserved sequence patterns or motifs, which allow for their detection in polypeptide sequences.
The goal of the NCBI conserved domain curation project is to provide database users with insights into how patterns of residue conservation and divergence in a family relate to functional properties, and to provide useful links to more detailed information that may help to understand those sequence/structure/function relationships. To do this, CDD curators include the following types of information in order to supplement and enrich the traditional multiple sequence alignments that form the foundation of domain models: 3-dimensional structures and conserved core motifs, conserved features/sites, phylogenetic organization, and links to electronic literature resources.
CDD
Conserved Domain Database (CDD)
CDD is a protein annotation resource that consists of a collection of well-annotated multiple sequence alignment models for ancient domains and full-length proteins. These are available as position-specific score matrices (PSSMs) for fast identification of conserved domains in protein sequences via RPS-BLAST. CDD content includes NCBI manually curated domains, which use 3D-structure information to explicitly define domain boundaries and provide insights into sequence/structure/function relationships, as well as domain models imported from a number of external source databases (Pfam, SMART, COG, PRK, TIGRFAM).
CD-Search and Batch CD-Search
CD-Search is NCBI's interface for searching the Conserved Domain Database with protein or nucleotide query sequences. It uses RPS-BLAST, a variant of PSI-BLAST, to quickly scan a set of pre-calculated position-specific scoring matrices (PSSMs) with a protein query. The results of CD-Search are presented as an annotation of protein domains on the user query sequence and can be visualized as domain multiple sequence alignments with embedded user queries. High-confidence associations between a query sequence and conserved domains are shown as specific hits. The CD-Search Help provides additional details, including information about running CD-Search locally.
Batch CD-Search serves as both a web application and a script interface for a conserved domain search on multiple protein sequences, accepting up to 100000 proteins in a single job. It enables you to view a graphical display of the concise or full search result for any individual protein from your input list, or to download the results for the complete set of proteins. The Batch CD-Search Help provides additional details.
CDART: Domain Architectures
Conserved Domain Architecture Retrieval Tool (CDART) performs similarity searches of the Entrez Protein database based on domain architecture, defined as the sequential order of conserved domains in protein queries. CDART finds protein similarities across significant evolutionary distances using sensitive domain profiles rather than direct sequence similarity. Proteins similar to the query are grouped and scored by architecture. You can search CDART directly with a query protein sequence or, if a sequence of interest is already in the Entrez Protein database, simply retrieve the record, open its Links menu and select Domain Relatives to see the precalculated CDART results. Relying on domain profiles allows CDART to be fast and, because it relies on annotated functional domains, informative.
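The notion of a domain architecture as the sequential order of domains can be sketched as follows. This is a hedged illustration in Python, not CDART's implementation: the input format (start position plus domain name), the function names and the example proteins are all hypothetical, and no scoring scheme is attempted because the text does not describe one.

```python
from collections import defaultdict


def architecture(domain_hits):
    """Order domain names by start position along the sequence.

    `domain_hits` is a list of (start_position, domain_name) pairs.
    """
    return tuple(name for _, name in sorted(domain_hits))


def group_by_architecture(proteins):
    """Map each architecture to the proteins that share it."""
    groups = defaultdict(list)
    for protein_id, hits in proteins.items():
        groups[architecture(hits)].append(protein_id)
    return dict(groups)


# Hypothetical proteins with invented domain positions.
example_proteins = {
    "P1": [(120, "kinase"), (10, "SH3")],
    "P2": [(5, "SH3"), (200, "kinase")],
    "P3": [(1, "kinase")],
}

for arch, members in group_by_architecture(example_proteins).items():
    print(" - ".join(arch), members)
```

Here P1 and P2 fall into the same group because both carry an SH3 domain followed by a kinase domain, regardless of the exact positions or of how similar the sequences themselves are.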
CDTree
CDTree is a helper application for your web browser that allows you to interactively view and examine conserved domain hierarchies curated at NCBI. CDTree works with Cn3D as its alignment viewer/editor; it is used in the CDD curation process and is both a classification and research tool for functional annotation and the study of protein and protein domain families.
Content
CDD content includes NCBI manually curated domain models and domain models imported from a number of external source databases (Pfam, SMART, COG, PRK, TIGRFAMs). What is unique about NCBI-curated domains is that they use 3D-structure information to explicitly define domain boundaries, align blocks, amend alignment details, and provide insights into sequence/structure/function relationships. Manually curated models are organized hierarchically if they describe domain families that are clearly related by common descent. To provide a non-redundant view of the data, CDD clusters similar domain models from various sources into superfamilies.
Searching the database
The collection is also part of NCBI's Entrez query and retrieval system, crosslinked to numerous other resources. CDD provides annotation of domain footprints and conserved functional sites on protein sequences. Precalculated domain annotation can be retrieved for protein sequences tracked in NCBI's Entrez system, and CDD's collection of models can be queried with novel protein sequences via the CD-Search service, or via the Batch CD-Search service, which allows the computation and download of annotation for large sets of protein queries. CDD also contains data from additional research projects, such as KOGs (a eukaryotic counterpart to COGs) and the Library of Ancient Domains (LOAD). Accessions that start with "cl" are for superfamily cluster records and can contain domain models from one or more source databases. When searching CDD, it is possible to limit search results to domains from any given source database by using the Database search field.
Phylogenetic organization: Based on evidence from sequence comparison, NCBI Conserved Domain curators attempt to organize related domain models into phylogenetic family hierarchies.
Links to electronic literature resources: NCBI curated domains also provide links to citations in PubMed and NCBI Bookshelf that discuss the domain. These references are selected by curators and, whenever possible, include articles that provide evidence for the biological function of the domain and/or discuss the evolution and classification of a domain family.
PFAM
Pfam 27.0 (March 2013, 14831 families)
Proteins are generally composed of one or more functional regions, commonly termed domains. The presence of different domains in varying combinations in different proteins gives rise to the diverse repertoire of proteins found in nature. Identifying the domains present in a protein can provide insights into the function of that protein.
The Pfam database is a large collection of protein domain families. Each family is represented by multiple sequence alignments and hidden Markov models (HMMs).
There are two levels of quality to Pfam families: Pfam-A and Pfam-B. Pfam-A entries are derived from the underlying sequence database, known as Pfamseq, which is built from the most recent release of UniProtKB at a given time-point. Each Pfam-A family consists of a curated seed alignment containing a small set of representative members of the family, profile hidden Markov models (profile HMMs) built from the seed alignment, and an automatically generated full alignment, which contains all detectable protein sequences belonging to the family, as defined by profile HMM searches of primary sequence databases.
Pfam-B families are un-annotated and of lower quality, as they are generated automatically from the non-redundant clusters of the latest ADDA release. Although of lower quality, Pfam-B families can be useful for identifying functionally conserved regions when no Pfam-A entries are found.
Pfam entries are classified in one of four ways:
Family: A collection of related protein regions
Domain: A structural unit
Repeat: A short unit which is unstable in isolation but forms a stable structure when multiple copies are present
Motif: A short unit found outside globular domains
Related Pfam entries are grouped together into clans; the relationship may be defined by similarity of sequence, structure or profile-HMM.
Pfam also generates higher-level groupings of related families, known as clans. A clan is a collection of Pfam-A entries which are related by similarity of sequence, structure or profile-HMM.
You can find data in Pfam in various ways:
Sequence search: Analyze your protein sequence for Pfam matches
View a Pfam family: View Pfam family annotation and alignments
View a clan: See groups of related families
View a sequence: Look at the domain organisation of a protein sequence
View a structure: Find the domains on a PDB structure
Keyword search: Query Pfam by keywords
Jump to: Enter any type of accession or ID to jump to the page for a Pfam family or clan, UniProt sequence, PDB structure, etc.
Browse Pfam: You can use the browse links to find lists of families, clans or proteomes which begin with a chosen letter (or number). You can also see a list of Pfam families which are new to this release, or the list of the twenty largest families in terms of number of sequences.
Glossary of terms used in Pfam
These are some of the commonly used terms in the Pfam website.
Alignment coordinates
HMMER3 reports two sets of domain coordinates for each profile HMM match. The envelope coordinates delineate the region on the sequence where the match has been probabilistically determined to lie, whereas the alignment coordinates delineate the region over which HMMER is confident that the alignment of the sequence to the profile HMM is correct. Our full alignments contain the envelope coordinates from HMMER3.
Architecture
The collection of domains that are present on a protein
Clan
A collection of related Pfam entries. The relationship may be defined by similarity of sequence, structure or profile-HMM.
Domain
A structural unit
Domain score
The score of a single domain aligned to an HMM. Note that, for HMMER2, if there was more than one domain, the sequence score was the sum of all the domain scores for that Pfam entry. This is not quite true for HMMER3.
DUF
Domain of unknown function
Envelope coordinates
See Alignment coordinates
Family
A collection of related protein regions
Full alignment
An alignment of the set of related sequences which score higher than the manually set threshold values for the HMMs of a particular Pfam entry.
Gathering threshold (GA)
Also called the gathering cut-off, this value is the search threshold used to build the full alignment. The gathering threshold is assigned by a curator when the family is built. The GA is the minimum score a sequence must attain in order to belong to the full alignment of a Pfam entry. For each Pfam HMM we have two GA cutoff values: a sequence cutoff and a domain cutoff.
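The way two cutoffs can act together may be easier to see in code. The Python sketch below assumes, as an interpretation rather than a statement of Pfam's pipeline, that a sequence must reach the sequence cutoff before any of its domains are considered, and that each domain is then checked against the domain cutoff. The function name, parameter names and scores are invented for illustration.

```python
def domains_in_full_alignment(sequence_score, domain_scores, sequence_ga, domain_ga):
    """Return the domain scores that pass both gathering cutoffs.

    Assumption: the sequence-level cutoff is applied first, then each
    domain is tested against the domain-level cutoff.
    """
    if sequence_score < sequence_ga:
        return []  # the whole sequence is rejected
    return [score for score in domain_scores if score >= domain_ga]


# Invented bit scores for a family with GA cutoffs of 27.0 (sequence) and 20.0 (domain).
print(domains_in_full_alignment(30.0, [25.0, 8.0], 27.0, 20.0))
print(domains_in_full_alignment(20.0, [25.0], 27.0, 20.0))
```

In the first call only the stronger of the two domains is kept; in the second the sequence as a whole falls below its cutoff, so nothing is kept even though the single domain score would otherwise pass.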
HMMER
The suite of programs that Pfam uses to build and search HMMs Since Pfamrelease 240 we have used HMMER version 3 to make Pfam Seethe HMMER site for more information
Hidden Markov model (HMM)
An HMM is a probabilistic model. In Pfam we use HMMs to transform the information contained within a multiple sequence alignment into a position-specific scoring system. We search our HMMs against the UniProt protein database to find homologous sequences.
HMMER3
The suite of programs that Pfam uses to build and search HMMs. See the HMMER site for more information.
iPfam
A resource that describes domain-domain interactions that are observed in PDB entries. Where two or more Pfam domains occur in a single structure, it analyses them to see if they are close enough to form an interaction. If they are close enough, it calculates the bonds forming the interaction.
Metaseq
A collection of sequences derived from various metagenomics datasets
Motif
A short unit found outside globular domains
Noise cutoff (NC)
The bit scores of the highest scoring match not in the full alignment
Pfam-A
An HMM-based, hand-curated Pfam entry which is built using a small number of representative sequences. We manually set a threshold value for each HMM and search our models against the UniProt database. All of the sequences which score above the threshold for a Pfam entry are included in the entry's full alignment.
Pfam-B
An automatically generated alignment which is formed by taking a cluster of sequences from the ADDA database and removing Pfam-A residues from them. Since Pfam-B families are automatically generated, we recommend that you verify that the sequences in a Pfam-B family are related using other methods, such as BLAST. For Pfam 24.0 we have made HMMs for the first (and therefore largest) 20,000 Pfam-B families. Users can search their sequences against the Pfam-B HMMs, in addition to the Pfam-A HMMs, when performing both single-sequence searches and batch searches on the website.
Posterior probability
HMMER3 reports a posterior probability for each residue that matches a match or insert state in the profile HMM. A high posterior probability shows that the alignment of the amino acid to the match/insert state is likely to be correct, whereas a low posterior probability indicates that there is alignment uncertainty. This is indicated on a scale from 10, the highest certainty, down to 1, complete uncertainty. Within Pfam we display this information as a heat map view, where green residues indicate high posterior probability and red ones indicate a lower posterior probability.
Repeat
A short unit which is unstable in isolation but forms a stable structure when
multiple copies are present
Seed alignment
An alignment of a set of representative sequences for a Pfam entry. We use this alignment to construct the HMMs for the Pfam entry.
Sequence score
The total score of a sequence aligned to an HMM. If there is more than one domain, the sequence score is the sum of all the domain scores for that Pfam entry. If there is only a single domain, the sequence score and the domain score for the protein will be identical. We use the sequence score to determine whether a sequence belongs to the full alignment of a particular Pfam entry.
Trusted cutoff (TC)
The bit scores of the lowest scoring match in the full alignment
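The trusted and noise cutoffs can be read off a list of match scores once the gathering threshold is known. The following is a minimal sketch under the assumption that a single list of bit scores and a single threshold are enough to illustrate the idea; the function name and the example scores are hypothetical.

```python
def trusted_and_noise_cutoffs(scores, ga):
    """Split bit scores at the gathering threshold and report (TC, NC).

    TC: lowest score among matches in the full alignment (score >= ga).
    NC: highest score among matches left out of it (score < ga).
    Either value is None when the corresponding group is empty.
    """
    included = [s for s in scores if s >= ga]
    excluded = [s for s in scores if s < ga]
    tc = min(included) if included else None
    nc = max(excluded) if excluded else None
    return tc, nc


# Hypothetical bit scores for matches to one HMM, with GA = 25.0.
print(trusted_and_noise_cutoffs([50.2, 31.0, 25.4, 19.8, 7.1], ga=25.0))
```

By construction the gathering threshold always lies between the noise cutoff and the trusted cutoff.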
SMART (a Simple Modular Architecture Research Tool) allows the identification and annotation of genetically mobile domains and the analysis of domain architectures. More than 500 domain families found in signalling, extracellular and chromatin-associated proteins are detectable. These domains are extensively annotated with respect to phyletic distributions, functional class, tertiary structures and functionally important residues. Each domain found in a non-redundant protein database, as well as search parameters and taxonomic information, are stored in a relational database system. User interfaces to this database allow searches for proteins containing specific combinations of domains in defined taxa.
Features
You can use SMART in two different modes: normal or genomic. The main difference is in the underlying protein database used. In Normal SMART, the database contains Swiss-Prot, SP-TrEMBL and stable Ensembl proteomes. In Genomic SMART, only the proteomes of completely sequenced genomes are used: Ensembl for metazoans and Swiss-Prot for the rest.
The protein database in Normal SMART has significant redundancy, even though identical proteins are removed. If you use SMART to explore domain architectures, or want to find exact domain counts in various genomes, consider switching to Genomic mode. The numbers in the domain annotation pages will be more accurate, and there will not be many protein fragments corresponding to the same gene in the architecture query results. Remember, though, that you are then exploring a limited set of genomes.
Different color schemes are used to easily identify the mode we are in
Normal mode Genomic mode
Alignment
Representation of a prediction of the amino acids in tertiary structures of homologues that overlay in three dimensions. Alignments held by SMART are mostly based on published observations (see domain annotations for details), but are updated and edited manually.
Alignment block
Ungapped alignments that usually represent a single secondary structure
Bits scores
Alignment scores are reported by HMMer and BLAST as bits scores. The likelihood that the query sequence is a bona fide homologue of the database sequence is compared to the likelihood that the sequence was instead generated by a random model. Taking the logarithm (to base 2) of this likelihood ratio gives the bits score.
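The definition above amounts to a base-2 log-odds ratio. A minimal sketch, assuming the two likelihoods are already available as plain numbers (real tools work with log-probabilities to avoid underflow); the function name and the example values are hypothetical.

```python
import math


def bits_score(likelihood_homologue, likelihood_random):
    """Bits score: log2 of P(sequence | homology model) / P(sequence | random model)."""
    return math.log2(likelihood_homologue / likelihood_random)


# A sequence that is 8 times more likely under the homology model scores 3 bits.
print(bits_score(8e-20, 1e-20))
```

Each additional bit means the sequence is twice as likely under the homology model as before, relative to the random model.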
BLAST Basic local alignment search tool
An excellent database searching tool developed at the National Center for Biotechnology Information (NCBI) ([1] [2] [3] [4]). SMART uses NCBI-BLAST for detection of outlier homologues and homologues of known structure. WU-BLAST is used for nrdb searches with user-supplied sequences.
Cellular Role
Chromatin-associated: Chromatin is the tangled fibrous complex of DNA and protein within a eukaryotic nucleus.
Interaction (with the environment): Molecules that sense cellular environmental change, such as osmolarity, light flux, acidity, ion concentration, etc.
Metabolic: Enzymes that catalyze reactions in living cells that transform organic molecules.
Replication: The process of making an identical copy of a section of duplex (double-stranded) DNA, using existing DNA as a template for the synthesis of new DNA strands.
Signalling: Proteins that participate in a pathway that is initiated by a molecular stimulus and terminated by a cellular response.
Transport: Transporters (permeases) are proteins that straddle the cell membrane and carry specific nutrients, ions, etc. across the membrane.
Translation: The process in which the genetic code carried by messenger RNA directs the synthesis of proteins from amino acids.
Transcription: The synthesis of an RNA copy from a sequence of DNA (a gene); the first step in gene expression.
Coiled coils
Intimately-associated bundles of long alpha-helices ([1] [2] [3]). Coiled coils are detected in SMART using the method of Lupas et al. ([4], COILS home). Coiled coil predictions are indicated on the second line in SMART's graphical output.
Domain
Conserved structural entities with distinctive secondary structure content and a hydrophobic core. In small disulphide-rich and Zn2+-binding or Ca2+-binding domains, the hydrophobic core may be provided by cystines and metal ions, respectively. Homologous domains with common functions usually show sequence similarities.
Domain composition
Proteins with the same domain composition have at least one copy of each of the domains of the query.
Domain organisation
Proteins having all the domains of the query in the same order (additional domains are allowed).
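The difference between domain composition and domain organisation can be made concrete with two small checks. This is only a sketch: it assumes a protein is represented as a list of domain names in N- to C-terminal order, and the function names are hypothetical rather than part of SMART.

```python
def same_composition(query, target):
    """True if the target has at least one copy of every domain in the query."""
    return set(query) <= set(target)


def same_organisation(query, target):
    """True if the query domains occur in the target in the same order.

    Additional domains in the target are allowed, so this is a
    subsequence test rather than an equality test.
    """
    remaining = iter(target)
    # "domain in remaining" advances the iterator past the first match,
    # so later query domains can only match later target domains.
    return all(domain in remaining for domain in query)


query = ["SH3", "SH2", "TyrKc"]
print(same_composition(query, ["SH2", "SH3", "TyrKc"]))         # True
print(same_organisation(query, ["SH2", "SH3", "TyrKc"]))        # False
print(same_organisation(query, ["SH3", "PH", "SH2", "TyrKc"]))  # True
```

A protein that matches the organisation test always matches the composition test, but not the other way round.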
Entrez
A WWW-based system that allows easy retrieval of sequence, structure, molecular biology and literature data (Entrez). SMART's domain annotation pages contain links to the Entrez system, thereby providing extensive literature, structure and sequence information.
E-value
This represents the number of sequences with a score greater than or equal to X expected absolutely by chance. The E-value connects the score (X) of an alignment between a user-supplied sequence and a database sequence, generated by any algorithm, with how many alignments with similar or greater scores would be expected from a search of a random sequence database of equivalent size. Since version 2.0, E-values are calculated using Hidden Markov Models, leading to more accurate estimates than before.
Extracellular Domains
Domain families that are most prevalent in proteins outside of the cytoplasm and the nucleus
Gap
A position in an alignment that represents a deletion within one sequence relative to another. Gap penalties are requirements for alignment algorithms in order to reduce excessively-gapped regions. Gaps in alignments represent insertions that usually occur in protruding loops or beta-bulges within protein structures.
Genomic database
Protein database used in SMART's Genomic mode. It contains data from completely sequenced genomes only. Ensembl data is used for Metazoan genomes and Swiss-Prot for others. A complete list of genomes in the database is available.
HMM Hidden Markov model
HMMs are statistical models of the sequence consensus of a homologous family (see the docu). A particular class of HMMs has been shown to be equivalent to generalised profiles (8867839). Applications of HMMs to sequence analysis are nicely provided by HMMer and SAM.
HMM consensus
The HMM consensus is a one-line summary of the corresponding HMM. The amino acid shown for the consensus is the highest probability amino acid at that position according to the HMM. Capital letters mean highly conserved residues (probability > 0.5 for protein models) (modified from the HMMer User's Guide).
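The consensus rule described above can be sketched in a few lines. The example assumes that the emission probabilities of each match state are available as a dictionary from amino-acid letter to probability; this data layout, the function name and the numbers are hypothetical, and HMMer itself derives the line from its own model files.

```python
def hmm_consensus(match_emissions, threshold=0.5):
    """Build a one-line consensus from per-position emission probabilities.

    The most probable residue is shown at each position, in upper case
    when its probability exceeds the threshold and in lower case otherwise.
    """
    consensus = []
    for emissions in match_emissions:
        residue, prob = max(emissions.items(), key=lambda item: item[1])
        consensus.append(residue.upper() if prob > threshold else residue.lower())
    return "".join(consensus)


# Three hypothetical match states: two well conserved, one variable.
model = [
    {"A": 0.70, "G": 0.30},
    {"L": 0.40, "I": 0.35, "V": 0.25},
    {"K": 0.90, "R": 0.10},
]
print(hmm_consensus(model))
```

The lower-case letter in the middle of the output marks the position where no residue reaches a probability above 0.5.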
HMMer
The HMMer package ([1] [2]) provides multiple alignment and database searching capabilities. There are several programs in the package (see the docu), including one (hmmfs) that searches databases for non-overlapping LOCAL similarities (i.e. that match across at least part of the HMM) and another (hmmls) that searches databases for non-overlapping GLOBAL similarities (i.e. that match across the full HMM). These correspond approximately to profile-based searches using negative and positive profiles, respectively (see WiseTools). Database searches using hmmls or hmmfs provide alignment scores as bits scores.
Homology
Evolutionary descent from a common ancestor due to gene duplication
Intracellular Domains
Domain families that are most prevalent in proteins within the cytoplasm
Localisation
Numbers of domains that are thought, from SwissProt annotations, to be present in different cellular compartments (cytoplasm, extracellular space, nucleus and membrane-associated) are shown in annotation pages.
Motif
Sequence motifs are short conserved regions of polypeptides. Sets of sequence motifs need not necessarily represent homologues.
NRDB non-redundant database
A database that contains no identical pairs of sequences. It can contain multiple sequences originating from the same gene (fragments, alternative splicing products). SMART in Standard mode uses such a database.
ORF
Open reading frame
Outlier homologues
These are often difficult to detect using HMM methodology. A complementary approach to their detection is to query a database of sequences taken from multiple sequence alignments using BLAST. Selecting this option will also activate searches against sequence databases derived from proteins of known structure. A simple BLAST search of the PDB is performed, together with a search of RPS_Blast profiles derived from SCOP. These profiles were kindly provided by Steffen Schmidt (see Schmidt et al., J Chem Inf Comput Sci 2002 (42) 405-7).
P-value
This represents the probability that, given a database of a particular size, random sequences score higher than a value X. P-values are generated by the BLAST algorithm that has been integrated into SMART.
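P-values and E-values describe the same chance expectation in two ways. The sketch below is not taken from SMART; it relies on the common assumption that the number of chance hits follows a Poisson distribution with mean E, so that the probability of at least one chance hit is 1 - exp(-E). The function name is hypothetical.

```python
import math


def p_value_from_e_value(e_value):
    """Convert an expected number of chance hits into a probability.

    Assumption: chance hits are Poisson distributed with mean e_value,
    so P(at least one chance hit) = 1 - exp(-e_value).
    """
    # -expm1(-x) equals 1 - exp(-x) but stays accurate for very small x.
    return -math.expm1(-e_value)


for e in (0.001, 0.1, 1.0, 10.0):
    print(e, round(p_value_from_e_value(e), 4))
```

For small values the two numbers are nearly identical, which is why they are often used interchangeably for significant hits; they diverge once the E-value approaches 1.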
PDB protein data bank
PDB is an archive of experimentally-determined three-dimensional structures (Brookhaven Nat. Labs or EBI). Domain families represented in SMART and in the PDB are annotated as being of known structure; links are provided in SMART to the PDB via PDBsum and MMDB. PDBsum links can be used to access a variety of sequence-based and structure-based tools, whereas MMDB provides access to literature information and structure similarities.
PFAM
Pfam is a database of protein domain families represented as (i) multiple alignments and (ii) HMM-profiles ([1] [2]). Pfam WWW servers allow comparison of user-supplied sequences with the Pfam database (Sanger Center and Washington Univ.). SMART contains a facility to search the Pfam collection using HMMer.
Prokaryotic domains
SMART now also searches for domains found in two-component regulatory systems. These can be found mainly in prokaryotes, but a few were also found in eukaryotes like yeast and plants.
Profile
A profile is a table of position-specific scores and gap penalties, representing a homologous family, that may be used to search sequence databases (Ref. [1] [2] [3]). In CLUSTAL-W-derived profiles, those sequences that are more distantly related are assigned higher weights ([4] [5] [6]). Issues in profile-based database searching are discussed in Bork & Gibson (1996) [7].
ProfileScan
An excellent WWW server that allows a user to compare a protein or DNA sequence against a database of profiles (located at the ISREC).
PROSITE
This is a dictionary of protein sites and motif patterns. Some SMART domain annotations contain links to PROSITE.
Schnipsel database domain sequence database
Schnipsel is a German word meaning snippet or fragment. The schnipsel database consists of the sequences of all domains found with SMART in NRDB. Outliers of a family often cannot be detected by a profile, yet are detectable by pairwise similarity to one or more established members of a sequence family. So searching against the schnipsel database gives complementary information to the profile searches.
Searched domains
In the first version of SMART only eukaryotic signalling domains could be searched. In 1998 we extended this set with prokaryotic signalling and extracellular domains. In the input page you can choose whether you want to search only cytoplasmic (prokaryotic + eukaryotic), extracellular or all domains.
Secondary Literature
The secondary literature is derived by the following procedure: for each of the hand-selected papers referenced by a domain, 100 neighbouring papers are retrieved using Medline. If one of these neighbouring papers is referenced from more than two original papers, it is included in the secondary literature list.
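The selection rule above is essentially a counting exercise. A minimal sketch, assuming the neighbouring papers have already been retrieved and are given as a mapping from each hand-selected paper to its list of neighbours; the function name and the identifiers in the example are made up for illustration.

```python
from collections import Counter


def secondary_literature(neighbours_by_paper, min_sources=3):
    """Select neighbouring papers shared by several hand-selected papers.

    A neighbour is kept if more than two original papers point to it,
    i.e. at least min_sources = 3 of them.
    """
    counts = Counter()
    for neighbours in neighbours_by_paper.values():
        counts.update(set(neighbours))  # count each original paper only once
    return sorted(paper for paper, n in counts.items() if n >= min_sources)


# Hypothetical neighbour lists for four hand-selected papers.
neighbours = {
    "paper1": ["a", "b", "c"],
    "paper2": ["a", "b"],
    "paper3": ["a", "c", "c"],
    "paper4": ["b", "d"],
}
print(secondary_literature(neighbours))
```

Here papers "a" and "b" are each neighbours of three hand-selected papers and are kept, while "c" and "d" are not.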
Seed Alignment
Alignment that contains only one of each pair of homologues that are represented in a CLUSTALW-derived phylogenetic tree linked by a branch of length less than a distance of 0.2 (see the related article).
SEG
A program of Wootton & Federhen [1] that detects regions of the query sequence that have low compositional complexity [2].
Sequence ID or ACC
Sequence identifiers or accession codes may be entered via the SMART homepage to initiate a query. You can use either Uniprot or Ensembl sequence identifiers.
Signalling Domains
The original set of domains used in SMART were collected as those that satisfied one or both of two criteria:
1. cytoplasmic domains that possess kinase, phosphatase, ubiquitin ligase or phospholipase enzymatic activities, or those that stimulate GTPase-activation or guanine nucleotide exchange
2. cytoplasmic domains that occur in at least two proteins with different domain organisations, of which one also contains a domain that satisfies criterion 1
These domains mostly mediate or regulate the transduction of an extracellular signal towards the nucleus, resulting in the initiation of a cellular response. More recently, prokaryotic two-component signalling domains have been added to the SMART set.
SignalP
This program predicts the presence and location of signal peptide cleavage sites in amino acid sequences (SignalP home page).
SMART Simple Modular Architecture Research Tool
Our main goal in providing this tool is to allow automatic identification and annotation of domains in user-supplied protein sequences.
Species
Numbers of domains present in a variety of selected taxa (animal, archaea, bacteria, fungi, plants and protozoa) are shown in annotation pages.
SwissProt
The SwissProt database is an extensively annotated and non-redundant collection of protein sequences. SwissProt annotations have been mined for SMART-derived annotations of alignments.
TMHMM2
This program predicts the location and topology of transmembrane helices (TMHMM2)
Thresholds
For each of the domains found by SMART, a combination of thresholds is used to distinguish between true and false hits. The different thresholds are described in the SMART paper.
WiseTools
A package that is based on database searches using profiles. Profiles may be generated using PairWise and then compared with sequence databases using SearchWise. Scores are generated for alignments that match the whole of the profile (using a positive profile) or else that match at least part of the profile (using a negative profile). Only the top-scoring optimal alignment of each
sequence is reported; hence SMART relies on re-iterating the search for new repeats until none are reported that score above threshold. Score thresholds have been set manually that are considered to represent a score just above the top-scoring true negative. Additional thresholds have been estimated for domains that are repeated in single polypeptides. A more recent package allows comparison of DNA sequences at the level of their conceptual translations, regardless of sequence error and introns (see Wise2).
What's new
Changes from version 6.0 to 7.0
Full text search engine
Perform a full text search of SMART and Pfam domain annotations, plus the complete protein descriptions for Uniprot and Ensembl proteins.
metaSMART
Explore and compare domain architectures in various publicly available metagenomics datasets.
iTOL export and visualization
Domain architecture analysis results can be exported and visualized in interactive Tree Of Life. The new option can be found in the protein list function select list.
User interface cleanup
Various small changes to the UI, resulting in faster and easier navigation.
Changes from version 5.1 to 6.0
Metabolic pathways information
SMART domains and proteins available in the genomic mode now have basic metabolic pathways information. It is generated by mapping our genomic mode protein database to the KEGG orthologous groups.
Updated genomic mode protein database
Greatly expanded genomic mode database now includes sequences from 630 completely sequenced species.
Changes from version 5.0 to 5.1
SMART webservice
You can access SMART using our webservice. Check the WSDL file for details. The webservice is still under active development and only works for sequences and IDs which are in our database (complete Uniref100 and stable Ensembl genomes).
SMART DAS server
SMART DAS server is available at URL http://smart.embl.de/smart/das. It provides all the protein features from our database (SMART domains, signal peptides, transmembrane regions and coiled coils) for all Uniref100 and Ensembl proteins.
If you need help with these services, or have questions/feedback, please contact us.
Changes from version 4.1 to 5.0
New protein database in Normal mode
SMART now uses Uniprot as the main source of protein sequences. All Ensembl proteomes (except pre-releases) are also included. To lower redundancy in the database, the following procedure is used:
o only one copy of 100% identical proteins is kept (different IDs are still available)
o each species' proteins are separated
o CD-HIT clustering with 96% identity cutoff is performed on each species separately
o longest member of each protein cluster is used as the representative
o only representative cluster members and single proteins (i.e. proteins which are not members of any clusters) are used in all domain architecture queries and for domain counts in the annotation pages
Even though the number of proteins in the database is almost doubled (current version has around 29 million proteins), the redundancy should be minimal.
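Two steps of the redundancy procedure are easy to sketch without running CD-HIT itself: collapsing identical sequences while keeping every ID, and choosing the longest member of each cluster as its representative. The code below assumes that the clusters have already been computed elsewhere (for example parsed from clustering output); the function names and the toy sequences are hypothetical.

```python
from collections import defaultdict


def collapse_identical(proteins):
    """Keep one copy of each identical sequence while remembering every ID.

    proteins: iterable of (protein_id, sequence) pairs.
    Returns a dict mapping each unique sequence to the list of its IDs.
    """
    ids_by_sequence = defaultdict(list)
    for protein_id, sequence in proteins:
        ids_by_sequence[sequence].append(protein_id)
    return dict(ids_by_sequence)


def pick_representatives(clusters):
    """Return the longest sequence of each cluster as its representative.

    A protein that belongs to no cluster is passed as a single-member
    list and therefore represents itself.
    """
    return [max(cluster, key=len) for cluster in clusters]


print(collapse_identical([("A1", "MKV"), ("A2", "MKV"), ("B1", "MKVL")]))
print(pick_representatives([["MKV", "MKVLA"], ["GG"]]))
```

Only the representatives returned by the second step would then be used for domain architecture queries and domain counts.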
Domain architecture invention dating
As a further step from the single domain to the understanding of multi-domain proteins, SMART now predicts the taxonomic class where the concept of a protein, that is its domain architecture, was invented. The domain architecture is defined as the linear order of all SMART domains in the protein sequence. To derive the point of its invention, all proteins with the same domain architecture are mapped onto NCBI's taxonomy. The last common ancestor of all organisms containing at least one protein with the domain architecture is defined as the point of its origin.
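Finding the last common ancestor reduces to finding the longest shared prefix of the organisms' lineages. The sketch below assumes each lineage is a list of taxon names ordered from the root of the taxonomy down to the organism; the function name and the shortened example lineages are illustrative only.

```python
def last_common_ancestor(lineages):
    """Return the deepest taxon shared by all lineages, or None.

    Each lineage is a list of taxon names ordered from the root of the
    taxonomy down to the organism.
    """
    ancestor = None
    for taxa in zip(*lineages):  # walk down all lineages in parallel
        if len(set(taxa)) != 1:
            break
        ancestor = taxa[0]
    return ancestor


human = ["cellular organisms", "Eukaryota", "Metazoa", "Chordata", "Homo sapiens"]
fly = ["cellular organisms", "Eukaryota", "Metazoa", "Arthropoda", "Drosophila melanogaster"]
yeast = ["cellular organisms", "Eukaryota", "Fungi", "Saccharomyces cerevisiae"]

print(last_common_ancestor([human, fly]))
print(last_common_ancestor([human, fly, yeast]))
```

An architecture found only in human and fly would be dated to Metazoa in this toy example, whereas adding a yeast protein pushes the point of origin back to Eukaryota.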
Changes from version 4.0 to 4.1
Two modes of operation Normal and Genomic
For more details visit the change mode page
Intrinsic protein disorder prediction
You can now include DisEMBL prediction of protein disorder in your search parameters. The results are displayed as blue regions in protein schematics. DisEMBL's HOTLOOPS and REM465 methods are currently used. Visit the DisEMBL page for more information on the method.
Catalytic activity check
SMART now includes data on specific requirements for catalytic activity for some of our domains (50 catalytic domains at the moment; full list here). If the required amino acids are not present in the predicted domain, it will be marked as Inactive. The domain annotation page will show you the details on which amino acids are missing and the links to relevant literature (Pubmed link).
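The catalytic activity check can be pictured as a lookup of required residues at given positions. This sketch assumes the requirements are expressed as 1-based positions within the predicted domain sequence; in practice the positions would have to be mapped through the domain alignment. The function name, the sequence and the required residues are invented for illustration.

```python
def check_catalytic_activity(domain_sequence, required_residues):
    """Report which required residues are missing from a predicted domain.

    required_residues maps a 1-based position within the domain to the
    amino acid expected there. Returns ("Active" or "Inactive", missing).
    """
    missing = {
        pos: aa
        for pos, aa in required_residues.items()
        if pos > len(domain_sequence) or domain_sequence[pos - 1] != aa
    }
    status = "Inactive" if missing else "Active"
    return status, missing


required = {3: "D", 5: "H"}  # hypothetical catalytic residues
print(check_catalytic_activity("MKDLHRA", required))
print(check_catalytic_activity("MKALHRA", required))
```

The second call reports the domain as Inactive and names the missing residue, which mirrors the kind of detail shown on the domain annotation page.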
Taxonomic trees
Architecture query results are now displayed as simple taxonomic trees. In addition to individual proteins, you can select any taxonomic node (or multiple nodes) and display all the proteins in those nodes. The Evolution section of domain annotation pages is now also represented as a taxonomic tree. The basic tree contains only several hand-picked representative species, with a link to the full tree.
User interface redesigned
SMART has been completely rewritten and all pages conform to the XHTML 1.0 Strict and CSS Level 2 standards. We recommend a modern standards-compliant browser for the best experience. Mozilla Firefox and Opera are our favorites.
Changes from version 3.5 to 4.0
Intron positions are shown in protein schematics
For proteins that match any of the Ensembl predictions, SMART will show intron positions as vertical coloured lines in graphical representations (see example). This information is retrieved from a pre-calculated mapping of Ensembl gene structures to protein sequences.
The vertical line at the end of the protein is not an actual intron, but a mark to show that intron mapping was performed. If that is the only line, there are no introns annotated. If there is no line at all, there is no data available in Ensembl for that particular sequence.
You can switch off intron display on your SMART preferences page
Alternative splicing information
Since SMART now incorporates Ensembl genomes, the Additional information page shows a list of alternative splices of the gene encoding the analyzed protein (if there are any). It is possible to either display SMART protein annotation for any of the alternative splices or get a graphical multiple sequence alignment of all of them.
Orthology information
There are 2 separate sets of orthologs for each Ensembl protein: 1:1 reciprocal best matches in other genomes, and orthologous groups with reciprocal best hits from all genomes analyzed (i.e. each of these proteins has exactly one ortholog in all 6 genomes). This data is displayed on the Additional information page.
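A reciprocal best match between two genomes is a pair of proteins that pick each other as best hit. A minimal sketch, assuming the best hits in each direction have already been computed and are given as dictionaries; the function name and the protein identifiers are hypothetical.

```python
def reciprocal_best_matches(best_a_to_b, best_b_to_a):
    """Return 1:1 pairs where each protein is the other's best match.

    best_a_to_b maps a protein in genome A to its best-scoring match in
    genome B; best_b_to_a holds the best matches in the other direction.
    """
    return sorted(
        (a, b) for a, b in best_a_to_b.items() if best_b_to_a.get(b) == a
    )


a_to_b = {"a1": "b1", "a2": "b2", "a3": "b2"}
b_to_a = {"b1": "a1", "b2": "a3"}
print(reciprocal_best_matches(a_to_b, b_to_a))
```

Protein a2 is dropped because its best match b2 prefers a3, so the relationship is not reciprocal.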
Graphical multiple sequence alignments
Orthologous groups and different alternative splices can be displayed as graphical multiple sequence alignments. Proteins are aligned using ClustalW. Domains, intrinsic features and introns are mapped onto the alignment with their positions adjusted according to gaps (black boxes).
Changes from version 3.4 to 3.5
Features of all Ensembl genomes are stored in SMART
You can use standard Ensembl protein identifiers (for example ENSP00000264122) and sequences in all queries.
Batch access
Quickly display results for hundreds of sequences. Currently limited to 500 sequences per access, 2000 per day.
Smarter handling of identical sequences
If there are multiple IDs associated with a particular sequence, an extra table will be displayed showing all of them.
Changes from version 3.3 to 3.4
Search structure based profiles using RPS-Blast
Clicking on the search schnipsel and structures checkbox will now also initiate a search of profiles based on SCOP domain families using RPS-Blast. These profiles were kindly provided by Steffen Schmidt (see Schmidt et al., J Chem Inf Comput Sci 2002 (42) 405-7).
Improved Architecture analysis
In addition to standard Domain selection querying, it is now possible to do queries based on GO (Gene ontology, click here for more info) terms associated with domains. In the first step you get a list of domains matching the GO terms entered. After selecting the domains of interest from the list, proteins containing those domains are displayed. Use the Taxonomic selection box to limit the results based on taxonomic ranges.
Pfam domains are stored in the database
SMART database now contains precomputed results for all Pfam domains. To use Pfam domains in the architecture queries, prepend the domain name with Pfam (for example TyrKc AND PfamFz AND TRANS).
Try our SMART Toolbar for Mozilla web browser. Click here for more info.
Changes from version 3.2 to 3.3
Fantastic new protein picture generator
Proteins are now displayed as dynamically generated PNG images. This means that you can download the entire protein representation as a single image. You have our permission to do this and use these diagrams in any way that you like; do acknowledge us though. All domain bubbles have been script-generated using The Gimp and its Perl-Fu extension. The script is GPL'd and is available for download.
Fancy colour alignments
All alignments we generate are coloured in using Leo Goodstadt's excellent CHROMA program. CHROMA is available from here and it's great. You may experience some problems if you're using a klunky browser. This will be fixed when you change your browser.
Excellent transmembrane domains
These are now calculated using the fine TMHMM2 program (the main web site is here), kindly provided by Anders Krogh and co-workers. You can read about the method here.
Selective SMART
We now store intrinsic features, such as transmembrane domains and signal peptides, in our database. This means that these can be queried for (e.g. SIGNAL AND TyrKc). The feature names are: signal peptide, SIGNAL; transmembrane domain, TRANS; coiled coil, COIL.
Technical changes
SMART code has been modified to run under the Apache mod_perl module. SMART is now using Apache::DBI for persistent database connections. The database engine has been updated to PostgreSQL 7.1. These changes resulted in significant speed improvements.
Bug fixes as usual
The show_many_proteins script now uses the POST method, so there is no longer a limit on the number of proteins you can display. Fixed a display problem with proteins having thousands of different representations (if you try to display those, make sure you have a good browser (like Mozilla or Opera) and a bunch of RAM).
Changes from version 3.1 to 3.2
Start page
There is an indicator of the current database status in the header now. If the database is down or is being updated, you'll know immediately.
Literature
Changed literature identifiers to PMID. New secondary literature generator that parses all neighbouring papers, not just the first 100.
Numerous small bug fixes and improvements.
Changes from version 3.0 to 3.1
Startup page
The start page now includes selective SMART and allows searching for keywords in the annotation of domains.
Schnipsel Blast
The results of a schnipsel Blast search are included in the bubble diagram. Additionally, PDB is searched.
Taxonomic breakdown
When selecting multiple proteins, e.g. via selective SMART or the annotation pages, an overview of the taxonomy of all species (tax break) is offered.
Links
Links don't contain version numbers. This allows stable links from external sources.
Selective SMART
Allows searching for multiple copies of domains and is case insensitive. You can now search for e.g. Sh3 AND sH3 AND sh3.
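The behaviour described for selective SMART queries (case-insensitive names, repeated names meaning multiple copies) can be mimicked with a small matcher. This is only a sketch of the idea, not SMART's query engine: it handles AND-only queries, and the function name and the example feature lists are hypothetical.

```python
from collections import Counter


def matches_query(query, protein_features):
    """Check a protein against an AND-only architecture query.

    Names are compared case-insensitively, and repeating a name in the
    query asks for that many copies of the domain or feature.
    """
    wanted = Counter(term.strip().lower() for term in query.split(" AND "))
    present = Counter(name.lower() for name in protein_features)
    return all(present[name] >= copies for name, copies in wanted.items())


print(matches_query("Sh3 AND sH3 AND sh3", ["SH3", "SH2", "SH3", "SH3"]))  # True
print(matches_query("Sh3 AND sH3 AND sh3", ["SH3", "SH2", "SH3"]))         # False
print(matches_query("SIGNAL AND TyrKc", ["SIGNAL", "TRANS", "TyrKc"]))     # True
```

The first query asks for three copies of the SH3 domain, so a protein with only two copies does not match.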
Annotation
You can align your query sequence to the SMART alignment using hmmalign.
Update of underlying database
SMART now uses PostgreSQL 6.5.2.
Changes from version 2.0 to 3.0
Digest output
SMART now only produces a single diagram representing a best interpretation of all the annotation that has been performed. A comprehensive summary of the results is also provided in table format.
selective SMART
Selective SMART allows you to look for proteins with combinations of specific domains in different species or taxonomic ranges.
alert SMART
The SMART database gets updated about once a week. If you are interested in specific domains, or combinations of them, in specific taxonomic ranges, you can use the SMART alerting service. This provides the identities of newly-deposited proteins that match your query.
Domain queries
You can ask for proteins having the same domain order/composition as your query protein.
SMARTed Genomes
You can get the result of SMARTing the genomes of Caenorhabditis elegans and Saccharomyces cerevisiae via the annotation pages.
Faster PFAM searches
The PFAM searches now run on a PVM cluster.
Changes from version 1.03 to 2.0
Domain coverage
The original set of signalling domains has now been extended to include extracellular domains
Default search method is now HMMer
The previously-used underlying methodology of SMART (i.e. SWise) tended to over-extend gaps, leading to problems in defining domain borders. Although a SWise search of the SMART database remains possible (using the Wise searching SMART database option available from the Home Page), the default searching method is now HMMer2. This now allows improved statistical estimates (Expectation- or E-values) of the significance of a domain hit.
Complete rewrite of the annotation pages
As with the previous version (1.03), annotation pages provide information concerning domain functions and include hyperlinks to PubMed. However, we now offer automatically-derived data showing the taxonomic range and the predicted cellular localisation of proteins containing the domain in question.
Literature database
In addition to the manually derived literature sources given in version 1.03, we now provide automatically-derived secondary literature from the annotation pages. These are extracted from PubMed and are cross-linked to additional abstracts.
The CATH domain structure database: new protocols and classification levels give a more comprehensive resource for exploring evolution
Lesley H Greene, Tony E Lewis, Sarah Addou, Alison Cuff, Tim Dallman, Mark Dibley, Oliver Redfern, Frances Pearl, Rekha Nambudiry, Adam Reid, Ian Sillitoe, Corin Yeats, Janet M Thornton and Christine A Orengo
WHAT'S NEW
We report the latest release (version 3.0) of the CATH protein domain database (http://www.cathdb.info). There has been a 20% increase in the number of structural domains classified in CATH, up to 86 151 domains. Release 3.0 comprises 1110 fold groups and 2147 homologous superfamilies. To cope with the increases in diverse structural homologues being determined by the structural genomics initiatives, more sensitive methods have been developed for identifying boundaries in multi-domain proteins and for recognising homologues. The CATH classification update is now being driven by an integrated pipeline that links these automated procedures with validation steps that have been made easier by the provision of information-rich web pages summarising comparison scores and relevant links to external sites for each domain being classified. An analysis of the population of domains in the CATH hierarchy and several domain characteristics are presented for version 3.0. We also report an update of the CATH Dictionary of homologous structures (CATH-DHS), which now contains multiple structural alignments, consensus information and functional annotations for 1459 well populated superfamilies in CATH. CATH is directly linked to the Gene3D database, which is a projection of CATH structural data onto ~2 million sequences in completed genomes and UniProt.
GENERAL INTRODUCTION
The number of new structures being deposited in the Protein Data Bank (PDB) continues to grow at a considerable rate. In addition, structures being targeted by worldwide structural genomics initiatives are more likely to be novel or only very remotely related to domains previously classified. Only 2% of structures currently solved by conventional crystallography or NMR are likely to adopt novel folds (see Figures 1 and 2). A higher proportion of new folds is expected to be solved by structural genomics. Although the influx of more diverse structures and subsequent analysis will inform our understanding of how domains evolve, it has resulted in increasing lags between the numbers of structures being deposited and classified. In response to this situation, we have significantly improved our automated and manual protocols for domain boundary assignment and homologue recognition.
Figure 1. Annual decrease in the percentage of new structures classified in CATH which are observed to possess a novel fold. The raw data for years 1972–2005 was fit to a single exponential equation by nonlinear regression using Sigma Plot (SPSS Version
Figure 2. Annual proportion of protein structures deposited in the PDB which are classified in CATH, rejected or pending classification. The colour scheme reflects different categories of PDB chains. Black: not accepted by the CATH criteria. Red: unprocessed chains
Significant changes have been implemented in the CATH classification protocol to achieve a more highly automated system. First, a seamless flow of structures between the constituent programs has been achieved by building a pipeline which integrates web services for each major comparison stage in the classification (see Figure 3). Secondly, completely automatic decisions are now being made for new protein chains with close relatives already assigned in the CATH database. There are two situations that preclude the CATH update process from being fully automated. We rely on expert manual curation for particularly challenging protein domain boundary assignments (DomChop stage) and also for classifications of remote folds and homologues (HomCheck stage). These two manual stages will remain an integral part of the system (Figure 3).
Figure 3. Flow diagram of the CATH classification pipeline. This schematic illustrates the processes involved in classifying newly determined structures in CATH. The CATH update protocol workflow from new chain to assigned domain is split into two main processes
In this paper we report our ongoing development of the automated procedures. These critical new features should better enable CATH to keep pace with the PDB (3) and facilitate its development. Key statistics on the domain structure populations and characteristics are also presented.
A REVISED CATH CLASSIFICATION HIERARCHY: CATHSOLID
In order to provide information on the sequence diversity between superfamily members, we have introduced additional levels into the CATH hierarchy. The CATH hierarchical classification scheme now consists of nine levels. Class is derived from secondary structure content, and Architecture describes the gross orientation of secondary structures, independent of connectivity. The Topology level clusters structures into fold groups according to their topological connections and numbers of secondary structures. The Homologous superfamilies cluster proteins with highly similar structures, sequences and/or functions (4,5). The new extension of the CATH classification system now includes five 'SOLID' sequence levels. S, O, L and I further divide domains within the H-level using multi-linkage clustering based on similarities in sequence identity (35, 60, 95 and 100%) (see Table 1). The D-level acts as a counter within the I-level and is appended to the classification hierarchy to ensure that every domain in CATH has a unique CATHSOLID identification code (see Table 1). Specific details on the nature of the SOLID levels can be found in the 'General Information' section of the CATH website, http://www.cathdb.info. CATH only includes experimentally determined protein structures with a resolution of 4 Å or better, 40 residues in length or longer, and having 70% or more side chains resolved.
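As an illustration only, the nine-level code can be treated as nine numeric fields, one per level. The sketch below assumes a dot-separated identifier (the separator, the example code and all function names are assumptions, not taken from the CATH website) and also encodes the three inclusion criteria stated above.

```python
# Level names in hierarchy order: C, A, T, H followed by the five SOLID levels.
LEVELS = ("C", "A", "T", "H", "S", "O", "L", "I", "D")


def parse_cathsolid(code: str) -> dict:
    """Split a dot-separated nine-part identifier into named integer levels."""
    parts = code.split(".")
    if len(parts) != len(LEVELS):
        raise ValueError(f"expected {len(LEVELS)} levels, got {len(parts)}")
    return {level: int(part) for level, part in zip(LEVELS, parts)}


def superfamily(code: str) -> str:
    """Return the first four levels (C.A.T.H), i.e. the homologous superfamily."""
    return ".".join(code.split(".")[:4])


def meets_inclusion_criteria(resolution_angstrom: float, length: int,
                             fraction_side_chains_resolved: float) -> bool:
    """Criteria quoted in the text: 4 A or better, 40+ residues, 70%+ side chains."""
    return (resolution_angstrom <= 4.0
            and length >= 40
            and fraction_side_chains_resolved >= 0.70)


# Hypothetical identifier, used purely as an example.
example = "3.40.50.200.1.1.1.1.1"
levels = parse_cathsolid(example)
```

Two domains in the same superfamily share their first four fields; the deeper they agree in the S, O, L and I fields, the higher the sequence identity cluster they share.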
Table 1. CATH version 3.0 statistics
DOMAIN BOUNDARY ASSIGNMENTS
We have further improved our automated domain boundary prediction method, CATHEDRAL (4). This is used to search a newly determined multi-domain structure against a library of representative structures from different fold groups in the CATH database to recognise constituent domains. CATHEDRAL performs an initial rapid secondary structure comparison between structures, using graph theory to identify putative fold matches, which are then more carefully aligned using a slower, more accurate dynamic programming method. A new scoring scheme has been implemented which combines information on the class of the domains being compared, their sizes, similarity in the structural environments and the number of equivalent residues. A support vector machine is used to combine the different scores and select the best fold match for each putative domain in a new multi-domain protein.
Benchmarking against a set of 964 'difficult' multi-domain chains, whose 1593 constituent domains were remotely related to folds in CATH (<35% sequence identity) and originated from 245 distinct fold groups and 462 superfamilies, showed that 90% of domains within these chains could be assigned to the correct fold group, and for 78% of them the domain boundaries were within ±15 residues of boundaries assigned by careful manual validation. Larger variations in domain boundaries are often due to the fact that in many families significant structural variation can occur during evolution, so that distant relatives vary considerably in size. If no close relative has been classified in CATH, it is likely that the only CATHEDRAL match will be to a relative with significant structural embellishments, thus making it harder to determine the correct boundaries.
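The boundary criterion used in this benchmark (agreement within 15 residues of the manually validated boundaries) can be sketched as follows. This is a simplified illustration, not the benchmark code: it assumes each domain is a single contiguous segment given as a (start, end) pair, whereas real domains may be discontinuous, and the function name is hypothetical.

```python
def boundaries_agree(predicted, reference, tolerance=15):
    """Return True if every predicted domain start and end lies within
    `tolerance` residues of the corresponding reference boundary.

    Both arguments are lists of (start, end) residue numbers, one pair per domain.
    """
    if len(predicted) != len(reference):
        return False  # a different number of domains cannot agree
    pairs = zip(sorted(predicted), sorted(reference))
    return all(
        abs(p_start - r_start) <= tolerance and abs(p_end - r_end) <= tolerance
        for (p_start, p_end), (r_start, r_end) in pairs
    )


# Made-up two-domain chain: boundary shifted by 10 residues, then by 20.
close_call = boundaries_agree([(1, 120), (121, 250)], [(1, 110), (111, 250)])
far_call = boundaries_agree([(1, 120), (121, 250)], [(1, 100), (101, 250)])
```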
Since domain boundary assignment of remote homologues is one of the most time-consuming stages in the classification, we combine multiple sources of information for each new structure on a web page to guide manual curation. Pages display scores from a range of algorithms, which include structure-based methods [CATHEDRAL (4), SSAP (6,7), DETECTIVE (8), PUU (9), DOMAK (10)], sequence-based methods such as hidden Markov models (HMMs) (11), and relevant literature. These pages are now viewable for information on putative boundaries for new multi-domain structures currently
being classified in CATH (e.g. http://www.cathdb.info/cgi-bin/cath/Chain.pl?chain_id=2g3aA).
For protein chains which are closely related to chains that are already chopped in CATH, an automated protocol has been developed (ChopClose). ChopClose identifies any previously chopped chains that have sufficiently high sequence identity and overlap with the query chain. Using SSAP (6), the query is aligned against each of these chains in turn, and in each case the domain boundaries are inherited across the alignment. The process of inheriting the boundaries often requires some adjustments to be made to account for insertions, deletions or unresolved residues. If the inheritance from one of the chains meets various criteria (SSAP score ≥80, sequence identity >80%, RMSD ≤6.0 Å, longest end extension ≤10 residues, etc.), then the resulting boundaries are used to chop the chain automatically (AutoChop). For cases where ChopClose's best result does not meet all the criteria for automatic chopping, it is provided as support information for a manual domain boundary assignment. Refer to Figure 3 for the location of AutoChop/ChopClose within the CATH update protocol.
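The AutoChop decision described above amounts to a set of threshold checks on each inherited chopping. A minimal sketch follows, assuming only the four criteria the text lists (the text ends the list with "etc.", so the real protocol applies more), reading the RMSD limit as 6.0 Å, and picking the highest SSAP score as the "best" result, which is an assumption. All names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Inheritance:
    """Quality measures for boundaries inherited from one chopped relative."""
    ssap_score: float
    sequence_identity: float      # percent
    rmsd: float                   # angstroms
    longest_end_extension: int    # residues


def can_autochop(r: Inheritance) -> bool:
    """Apply the four thresholds named in the text (the full list is longer)."""
    return (r.ssap_score >= 80
            and r.sequence_identity > 80
            and r.rmsd <= 6.0
            and r.longest_end_extension <= 10)


def decide(results):
    """Return ('AutoChop', best) if any result passes, else ('manual', best or None)."""
    passing = [r for r in results if can_autochop(r)]
    if passing:
        return "AutoChop", max(passing, key=lambda r: r.ssap_score)
    if results:
        # Best failing result is kept as support information for the curator.
        return "manual", max(results, key=lambda r: r.ssap_score)
    return "manual", None


good = decide([Inheritance(85.0, 92.0, 2.1, 3)])
poor = decide([Inheritance(85.0, 92.0, 7.5, 3)])
```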
NEW HOMOLOGUE RECOGNITION METHODS
We have assessed a number of HMM-based protocols for improving homologue recognition. A new protocol (Samosa), exploiting models built using multiple structure alignments to improve accuracy, gives some improvements in sensitivity (4–5%). However, a protocol exploiting an 8-fold expanded HMM library based on sequence relatives of structural domains gives an increase of nearly 10% in sensitivity (12). In addition, HMM–HMM-based approaches have been implemented using the PRC protocol of Madera and co-workers (http://supfam.org/PRC). These allow recognition of extremely remote homologues, some of which are not easily detected by the structure comparison methods (discussed further below). HMM-based database scans developed for the CATH classification protocol are collectively referred to as HMMscan below.
For some very remotely related homologues, confidence in an assignment can be improved by combining information from multiple prediction methods. We have investigated the benefits of using machine learning methods to do this automatically. A neural network was trained using a dataset of 14 000 diverse homologues (<35% sequence identity) and 14 000 non-homologous pairs, with data from different homologue comparison methods including structure comparison (CATHEDRAL, SSAP), sequence comparison (HMM–HMM) and information on functional similarity. The latter was obtained by comparing EC classification codes between close relatives of the distant homologues and using a semantic similarity scoring
scheme for comparing GO terms based on a method developed by Lord et al. (13). On a separate validation set of 14 000 homologous pairs and 14 000 non-homologous pairs, 97% of the homologues can be recognised at an error rate of <4%.
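A figure such as "97% recognised at an error rate below 4%" can be read off a validation set by sweeping a score threshold. The sketch below is one way to do that, under the assumption that "error rate" means the fraction of non-homologous pairs scoring at or above the threshold; the text does not define the term, and the scores shown are made up.

```python
def sensitivity_at_error_rate(homolog_scores, nonhomolog_scores, max_error_rate):
    """Find the most permissive threshold whose error rate stays below the limit.

    Returns (threshold, sensitivity), or None if no threshold qualifies.
    Error rate here is the fraction of non-homologous pairs accepted.
    """
    best = None
    thresholds = sorted(set(homolog_scores) | set(nonhomolog_scores), reverse=True)
    for threshold in thresholds:
        false_pos = sum(s >= threshold for s in nonhomolog_scores)
        if false_pos / len(nonhomolog_scores) >= max_error_rate:
            break  # lowering the threshold further can only add errors
        true_pos = sum(s >= threshold for s in homolog_scores)
        best = (threshold, true_pos / len(homolog_scores))
    return best


# Toy classifier outputs for four homologous and four non-homologous pairs.
result = sensitivity_at_error_rate(
    homolog_scores=[0.9, 0.8, 0.7, 0.2],
    nonhomolog_scores=[0.6, 0.3, 0.1, 0.05],
    max_error_rate=0.04,
)
```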
NEW UPDATE PROTOCOL
Automatic methods
Previously, CATH data was generated using a group of independent programs and flat files. Over the past two years we have developed an update protocol for CATH that is driven by a suite of programs with a central library and a PostgreSQL database system. A classification pipeline has been established which links, in a completely automated fashion, the different programs that analyse the sequences and structures of both protein chains and domains. The CATH update protocol can essentially be divided into two parts: domain boundary assignment and domain homology classification (see Figure 3). The aim of the protocol is to minimise manual assignment and provide as much support as possible when manual validation is necessary.
Processing in both parts of the classification protocol is similar, requiring related meta-data and the triggering of the same automated algorithms. Methods include pairwise sequence similarity comparisons and scans by other homologue detection or fold recognition algorithms, such as HMMscan and CATHEDRAL, that provide data for either manual or automated assignment. Many of the automated steps in the protocol have been established as web services, and the pipeline integrates automated steps together with 'holding stages' in which domains are held prior to processing and await the completion of manual validation of predictions (see below).
Web pages to support manual validation
For each manual stage in the classification (domain boundary assignment, DomChop, and homologue recognition, HomCheck) we have developed a suite of web pages bringing together all available meta-data from prediction algorithms (e.g. DBS, CATHEDRAL, HMMscan for DomChop; CATHEDRAL, HMMscan for HomCheck) and information from the literature and from other family classifications with relevant data (e.g. Pfam). For each protein or domain shown on the pages, information on the statistical significance of matches is presented. The web pages will shortly be made viewable and will provide interim data on protein chains and domains not fully classified in CATH for biologists interested in any entries pending classification.
OVERVIEW OF THE CURRENT RELEASE (VERSION 3.0)
Assigning domain boundaries and relationships between protein structures is computationally challenging. Since the last CATH release (version 2.6), the number of domains in the CATH database has increased by 20% in version 3.0 and now totals 86 151. This is a more than 10-fold increase in the number of domains classified in CATH since its creation. Improvements in automation, and also in the web-based resources used to aid manual validation, have allowed us to increase the proportion of hard-to-classify structures processed in CATH, and this is reflected in a significant increase in new folds in the database, which now holds more than 1000. The detailed breakdown of numbers of domains in the nine CATH levels is given in Table 1. We conducted an analysis of the domains in version 3.0 and have derived statistics for several fundamental features.
Percentage of new topologies
An analysis of the percentage of new folds arising from the early 1970s to the present is shown in Figure 1. The number of new folds has been decreasing over time with respect to the number of new structures being deposited, and it can be seen that currently approximately 2% of new structures classified in CATH are observed to be novel folds. For comparison, the number of domain structures solved over time is also graphically represented in Figure 1.
Number of domains within a protein chain
Integral to the construction of the CATH database is designating domain boundaries. We conducted an analysis of the number of chains versus the number of domains in a chain. It is interesting to note that 64% of all protein structures currently solved and classified in CATH are single-domain chains (data not shown). The next most prevalent are two-domain chains (27%), and following this we find that the number of chains containing three or more domains rapidly decreases. The average size of the single-domain chains is 159 residues.
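The kind of tally behind these percentages is straightforward to reproduce once each chain has a domain count. A small sketch, using made-up chain identifiers and counts rather than CATH data:

```python
from collections import Counter


def domain_count_distribution(domains_per_chain):
    """Map number-of-domains -> percentage of chains with that many domains."""
    counts = Counter(domains_per_chain.values())
    total = sum(counts.values())
    return {n: 100.0 * c / total for n, c in sorted(counts.items())}


# Hypothetical chains: three single-domain chains and one two-domain chain.
chains = {"1abcA": 1, "2defA": 1, "3ghiA": 2, "4jklA": 1}
distribution = domain_count_distribution(chains)
```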
The CATH Dictionary of Homologous Superfamilies (CATH-DHS)
The CATH-DHS has also been recently updated. Data on structural similarity and superfamily variability is presented as a significant update to the Dictionary of Homologous Superfamilies (DHS) web resource (14). The DHS also provides functional annotations of domains within each H-level (superfamily) in CATH v 2.5.1.
For each superfamily, pair-wise structural similarity scores between relatives, measured by SSAP, are presented. The DHS now contains 3307 multiple structural alignments for 1459 superfamilies. For each superfamily, multiple alignments are generated for all the relatives and also for subgroups of structurally similar relatives and sequence-similar relatives. Alignments are performed using the residue-based CORA algorithm (15) and presented both as CORAPLOTS (14) and in the form of a 2DSEC diagram (16), alongside co-ordinate data of the superposed structures in PDB format. Sequence representations of the alignments are available to download in FASTA format. In the CORAPLOT images of the multiple alignment, residues in each domain are coloured according to ligand binding and residue type. EquivSEC plots are also shown that describe the variability in orientation and packing between equivalent secondary structures (16).
To identify sequence relatives for CATH superfamilies, sequences from UniProt (17) were scanned against HMMs of all CATH domains (12). Homologous sequences were identified as those hits with an E-value <0.01 and a 60% residue overlap with the CATH domain. This protocol recognised over one million domain sequences in UniProt which could be integrated in the CATH-DHS. The harvested sequences in each superfamily were compared against other relatives by BLAST (18) to determine the pair-wise sequence identity, and then clustered at appropriate levels of sequence identity (35 and 95%) using multi-linkage clustering. Information and links to other functional databases [ENZYME (19), GO (Gene Ontology Consortium, 2000), KEGG (20), COG (21), SWISSPROT (22)] are also included by BLASTing the sequences from each superfamily against sequences provided by these resources. Only hits with 95% sequence identity and an 80% residue overlap were used to annotate sequences.
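The two hit filters in this paragraph can be written as simple predicates. The sketch below reads the thresholds from the text as an E-value below 0.01 with at least 60% overlap for homologue harvesting, and at least 95% identity with at least 80% overlap for annotation transfer. Defining overlap as aligned residues divided by domain (or target) length is an assumption, as are the function names, so the thresholds are exposed as parameters.

```python
def is_homologous_hit(e_value, aligned_residues, domain_length,
                      max_e_value=0.01, min_overlap=0.60):
    """HMM hit filter: significant E-value and sufficient coverage of the domain."""
    overlap = aligned_residues / domain_length
    return e_value < max_e_value and overlap >= min_overlap


def can_transfer_annotation(percent_identity, aligned_residues, target_length,
                            min_identity=95.0, min_overlap=0.80):
    """BLAST hit filter used before copying a functional annotation."""
    overlap = aligned_residues / target_length
    return percent_identity >= min_identity and overlap >= min_overlap


# Made-up hits against a 150-residue domain.
accepted = is_homologous_hit(e_value=1e-5, aligned_residues=120, domain_length=150)
too_short = is_homologous_hit(e_value=1e-5, aligned_residues=60, domain_length=150)
```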
Recent analysis of structural and functional divergence in highly populated CATH superfamilies (>5 structural relatives with <35% sequence identity) has been undertaken using data from the DHS. The 2DSEC algorithm was used to analyse multiple structural alignments of families and identify highly conserved structural cores and secondary structure embellishments or decorations to the common core. In some large superfamilies, extensive embellishments were observed outside the core, and although these secondary structure insertions were frequently discontinuous in the protein chain, they were often co-located in 3D space (16). In many cases, manual inspection revealed that the embellishment had aggregated to form a larger structural feature that was modifying the active site of the domain or creating new surfaces for domain or protein interactions. Data collected in the DHS clearly show a relationship between structural divergence within a superfamily, sequence divergence of this superfamily amongst predicted domains in the genomes, and the number of distinct functional groups that can be identified for the superfamily (see Figure 4).
Figure 4. Relationship between sequence variability, structural variability and functional diversity in CATH superfamilies. Structural variation in a CATH superfamily as measured by the number of diverse structural subgroups (SSAP score <80 between groups)
LATERAL LINKS ACROSS THE CATH HIERARCHY TO CAPTURE EVOLUTIONARY DIVERGENCE AND EXPLORE THE STRUCTURAL CONTINUUM
Our analysis of structural divergence in CATH superfamilies (16) has revealed families where significant changes in the structures had occurred. In some cases 5-fold differences in the sizes of domains were identified, and sometimes it was apparent that the 'folds' of these very diverse relatives had effectively changed. Therefore, in these superfamilies more than one fold group can be identified, effectively breaking the hierarchical nature of the CATH classification, which implies that each relative within a CATH homologous superfamily should belong to the same CAT fold group.
In addition, an 'all versus all' HMM–HMM scan between all superfamily representatives revealed several cases of extremely remote homologues which had been classified into separate superfamilies and yet match with significant E-values. Structure comparison had failed to detect the relationship between these superfamilies because the structural divergence of the relatives was so extreme, sometimes constituting a change in architecture as well as fold group. In these cases, homology was only suggested by the HMM-based scans and then manually validated by considering functional information and detailed evidence from the literature. In order to capture information on these distant homologies, links have been created between the superfamilies, both on our web pages and in the CATH database. The data can now be found as a link from the CATH homepage (http://www.cathdb.info).
In the near future we also plan to provide web pages presenting cases of significant structural overlaps between superfamilies or fold groups. For these cases we are not currently able to find any additional evidence to support a distant evolutionary relationship, and these examples highlight the recurrence of large structural motifs between some folds and the existence of a structural continuum in some regions of fold space.
FEATURES
The CATH database can be accessed at http://www.cathdb.info. The web interface may be browsed or alternatively searched with PDB codes or CATH domain identifiers. There is also a facility for keyword searches. With the version 3.0 release we now make the raw and processed data files available, which include for example CATH domain PDB files, sequences and dssp files; they can be accessed through the CATH database main page. The Gene3D resource can be accessed through the CATH database or directly at http://www.cathdb.info/Gene3D. The DHS can be accessed through the CATH database or directly at http://www.cathdb.info/bsm/dhs.
Structural similarity is assessed using an automatic method (SSAP) (3,4), which scores 100 for identical proteins and generally returns scores above 80 for homologous proteins. More distantly related folds generally give scores above 70 (Topology or fold level), though in the absence of any sequence or functional similarity this may simply represent examples of convergent evolution, reinforcing the hypothesis that there exists a limited number of folds in nature (5,6).
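These score bands can be summarised in a small lookup. The sketch below only restates the rough guide given above (the text says "generally", so the cut-offs are indicative rather than strict rules); the function name and the label strings are hypothetical.

```python
def interpret_ssap(score: float) -> str:
    """Translate an SSAP score (0-100) into the rough bands described in the text."""
    if not 0 <= score <= 100:
        raise ValueError("SSAP scores run from 0 to 100")
    if score == 100:
        return "identical"
    if score > 80:
        return "likely homologous"
    if score > 70:
        # Without sequence or functional similarity this may be convergence.
        return "similar fold"
    return "no clear structural similarity"


labels = [interpret_ssap(s) for s in (100, 86.5, 74, 52)]
```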
Abstract
We report the latest release (version 1.4) of the CATH protein domains database (http://www.biochem.ucl.ac.uk/bsm/cath). This is a hierarchical classification of 13 359 protein domain structures into evolutionary families and structural groupings. We currently identify 827 homologous families in which the proteins have both structural similarity and sequence and/or functional similarity. These can be further clustered into 593 fold groups and 32 distinct architectures. Using our structural classification and associated data on protein functions stored in the database (EC identifiers, SWISS-PROT keywords and information from the Enzyme database and literature), we have been able to analyse the correlation between the 3D structure and function. More than 96% of folds in the PDB are associated with a single homologous family. However, within the superfolds, three or more different functions are observed. Considering enzyme functions, more than 95% of clearly homologous families exhibit either single or closely related functions, as demonstrated by the EC identifiers of their relatives. Our analysis supports the view that determining structures, for example as part of a 'structural genomics' initiative, will make a major contribution to interpreting genome data.
Introduction FOR CATH
The CATH classification of protein domain structures was established in 1993 (1) as a hierarchical clustering of protein domain structures into evolutionary families and structural groupings, depending on sequence and structure similarity. There are four major levels, corresponding to protein class, architecture, topology or fold, and homologous family (Fig. 1). Since 1995, information about these structural groups and protein families has been accessible over the Web (http://www.biochem.ucl.ac.uk/bsm/cath), together with summary information about each individual protein structure (PDBsum) (2).
CATH consists of both phylogenetic and phenetic descriptors for protein domain relationships. At the lowest levels in the hierarchy, proteins are grouped into evolutionary families
(homologous families) for having either significant sequence similarity (≥35% identity) or high structural similarity and some sequence similarity (≥20% identity).
The CATH database of protein structures contains approximately 18 000 domains organized according to their (C)lass, (A)rchitecture, (T)opology and (H)omologous superfamily. Relationships between evolutionarily related structures (homologues) within the database have been used to test the sensitivity of various sequence search methods in order to identify relatives in GenBank and other sequence databases. Subsequent application of the most sensitive and efficient algorithms, gapped BLAST and the profile-based method Position-Specific Iterated Basic Local Alignment Search Tool (PSI-BLAST), could be used to assign structural data to between 22 and 36% of microbial genomes in order to improve functional annotation and enhance understanding of biological mechanism. However, on a cautionary note, an analysis of functional conservation within fold groups and homologous superfamilies in the CATH database revealed that whilst function was conserved in nearly 55% of enzyme families, function had diverged considerably in some highly populated families. In these families, functional properties should be inherited far more cautiously, and the probable effects of substitutions in key functional residues must be carefully assessed.
Figure 1
Schematic representation of the (C)lass, (A)rchitecture and (T)opology/fold levels in the CATH database.
Figure 2
Snapshot of a web page showing data available in the CATH dictionary of homologous superfamilies for the subtilisin family (CATH id 3.40.50.200). Tables display the PDB codes for non-identical relatives in the family, together with EC identifier codes and information about the enzyme reactions. The multiple structural alignment shown has been coloured according to secondary structure assignments (red for helix, blue for strands).
The Architecture level in CATH groups proteins whose folds have similar 3D arrangements of secondary structures (e.g. barrel, sandwich or propellor), regardless of their connectivity, whilst the top level, Class, simply reflects the proportion of α-helix or β-strand secondary structures. Three major classes are recognised, mainly-α, mainly-β and α-β, since analysis revealed considerable overlap between the α+β and alternating α/β classes originally described by Levitt and Chothia (7).
Before classification, multidomain proteins are first separated into their constituent folds using a consensus method which seeks agreement between three independent algorithms (8). Whilst the protocol for updating CATH is largely automatic (9), several stages require manual validation, in particular establishing domain boundaries in proteins for which no consensus could be reached, and checking the relationships of very distant homologues and proteins having borderline fold similarity. Although there are plans to assign the more regular architectures automatically, all architecture groupings are currently assigned manually.
A homologous family Dictionary is now available within CATH which contains functional data, where available, for each protein within a homologous family. This includes EC identifiers, SWISS-PROT keywords and information from the Enzyme database or the literature (Fig. 2). Multiple structure-based alignments are also available, coloured according to secondary structure assignments or residue properties, and there are schematic plots showing domain representations annotated by protein-ligand interactions (DOMPLOTS) (A.E. Todd, C.A. Orengo and J.M. Thornton, submitted to Protein Engng). The topology of each domain is illustrated by schematic TOPS diagrams (http://www3.ebi.ac.uk/tops; 10).
Figure 3
CATH wheel plot showing the population of homologous families in different fold groups, architectures and classes. The wheel is coloured according to protein class (red, mainly-α; green, mainly-β; yellow, αβ; blue, few secondary structures). The size of the outer wheel represents the number of homologous families in CATH, whilst each band in the outer wheel corresponds to a single fold family. The size of each fold 'band' therefore reflects the number of homologous families having that fold. It can be seen that most fold families contain a single homologous family. The superfold families are shown as paler bands containing many homologous families. The inner wheel shows the population of homologous families in the different architectures.
We have also recently set up a Web Server (11) which enables the user to scan the CATH database with a newly determined protein structure and identify possible fold similarities or evolutionary relationships. There are also plans to incorporate sequence searches (using BLAST or PSI-BLAST) (12) to identify a probable fold for a new sequence.
The latest release of CATH (version 1.4, April 1998) contains 9342 protein chains from the PDB (13), which divide into 13 359 domain folds. Currently 32 different architectures are recognised. Since the last release, three new architectures have been described, including the five-bladed α-β propellor. Grouping proteins on the basis of sequence, structure and functional similarity gives 827 evolutionary homologous families (H-level), whilst recognising more distant structural similarity with no accompanying sequence or function similarity gives rise to 593 different fold groups (T-level).
The population of the different levels in the CATH hierarchy is illustrated by the CATH wheel shown in Figure 3. It can be seen that several highly populated fold families, which we describe as superfolds (6) as they support a diverse range of sequences and more than three different functions, still account for nearly 30% of non-homologous structures.
Implications for Structural Genomics
As the sequence databases grow rapidly, the need to interpret these sequences and assign functions to specific genes becomes increasingly important. Many techniques exist for matching protein sequences and thereby inheriting functional information. However, for very distant homologues there is often no
detectable sequence similarity despite conservation of 3D structure and function. For these cases, evolutionary relationships, and thereby functions, can only be assigned by comparing the structures. Therefore a number of structural genomics initiatives are being proposed (14) which aim to identify all the folds in nature, with the ultimate goal of being able to predict the function of a new protein from its known or probable structure. The important questions to ask are: how many more folds do we need to determine before we have the complete set, and how confident can we be in assigning function between proteins having similar structures?
In the current genomes, on average only 30–46% of sequences can be assigned to a structural family by recognising sequence similarity to a protein of known structure (15,16). With only ~600 unique structures currently in the PDB compared with ~20 000 sequence families, it is clear that we still need to determine many more structures if we are to understand biology at the molecular level. However, analysis of recently deposited structural data is very revealing. Figure 4a illustrates the distribution of 2159 new structural domains classified in the 10 months from June 1997 to March 1998. A large proportion of these (79%) were clearly homologous (≥30% identity) to proteins of known structure.
Of the remaining 443 structures (Fig. 4b), corresponding to 'new' sequences, we found only 8% were
novel folds, the remainder resembling a previously determined structure. Many of these, 199 (45%),
could be identified as clear homologues by having significant structure and sequence similarity (SSAP
≥80 and ≥20% sequence identity). A further 169 (38%) were probable homologues as, although the sequence identity was below 20%, they had functional similarity and/or gave significant scores using
sequence search methods designed to detect very distant homologues (PSI-BLAST) (12). There
remained a further 40 (9%) proteins which were analogous, i.e. they had the same fold as a previous
entry but neither the sequence nor the function gave definite evidence of a common ancestor.
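The breakdown of the 443 new-sequence structures can be checked with a few lines of arithmetic. The sketch below uses only the counts quoted in the paragraph above; the number of novel folds is not stated as a count in the text, so it is inferred here as the remainder, which is an assumption.

```python
# Counts quoted in the text for the 443 'new-sequence' structures (Fig. 4b).
total_new = 443
counts = {
    "clear homologues": 199,
    "probable homologues": 169,
    "analogues": 40,
}
# Assumption: the text gives novel folds only as a percentage, so the
# count is inferred as whatever is left over.
counts["novel folds (inferred)"] = total_new - sum(counts.values())

# Percentage of the 443 structures in each category, rounded to integers.
percentages = {name: round(100 * n / total_new) for name, n in counts.items()}

for name, n in counts.items():
    print(f"{name:24s} {n:4d}  {percentages[name]:3d}%")

# Clear plus probable homologues together.
homologue_share = percentages["clear homologues"] + percentages["probable homologues"]
print("assignable as homologues:", homologue_share, "%")
```

The rounded shares reproduce the percentages quoted in the paragraph, and the two homologue categories together account for the share of novel sequences that could be assigned as homologues.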
At the homologous superfamily level in CATH, a more detailed analysis of enzyme functions showed
that the majority of homologous enzyme families in CATH (>90%) contained proteins for which the
first three EC identifiers were the same. Considering those families where homologues have
significant sequence identity (≥20%) after structural alignment, 95% were found to have a single EC
identifier, whilst for families where proteins have more than 30% sequence similarity we observed
that 98% had a single EC code.
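The test "the first three EC identifiers are the same" is easy to state in code. The following is a minimal sketch, not the procedure used in the CATH analysis; the function names and the two example families are hypothetical and serve only to illustrate the comparison.

```python
def ec_prefix(ec_number: str, levels: int = 3) -> tuple:
    """Return the first `levels` fields of an EC number such as '3.1.1.7'."""
    return tuple(ec_number.split(".")[:levels])


def share_ec_prefix(ec_numbers, levels: int = 3) -> bool:
    """True if every EC number in the list agrees on the first `levels` fields."""
    return len({ec_prefix(ec, levels) for ec in ec_numbers}) == 1


# Hypothetical family: members differ only in the fourth (serial) field.
family_a = ["3.1.1.7", "3.1.1.8"]
# Hypothetical family: members already differ at the third field.
family_b = ["3.1.1.7", "3.1.3.1"]

print(share_ec_prefix(family_a))            # same first three identifiers
print(share_ec_prefix(family_b))            # different at the third level
print(share_ec_prefix(family_a, levels=4))  # not a single full EC code
```

Setting `levels=4` corresponds to asking whether a family has a single EC code, the stricter criterion mentioned at the end of the paragraph.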
Although assigning function on the basis of homology is common practice, it is clear that some
caution should be exercised, particularly where there is little or no sequence similarity. There are also
some clear examples where homologues with significant sequence similarity perform different
functions. The role of 'gene recruitment' is especially clear in the eye lens proteins, which function as
enzymes in other cellular environments but which are used as structural proteins in this context (17).
The extent of such 'gene recruitment' and context-sensitive function is really not known at this time.
For enzymes it is clear that catalytic function can change and evolve, usually to act on a different but
related substrate. Similarly, within the lipocalin family (CATH id 2.40.130.10) several proteins are found with very similar structures which bind different fatty acids in the same region at the base of
the β-barrel (e.g. retinol, bilin, biotin).
Nearly half of the homologous families where two or more different EC numbers were observed
belong to the superfolds. This suggests that if a new protein is assigned to a superfold family, more
caution should be used when inheriting functional information, as there appears to be greater tolerance to changes in sequence and ultimately function for these families. However, it is interesting to note
that many of these were TIM barrel or Rossmann folds. These are superfolds in which the substrate or
ligand commonly binds in the same place. This is in the base of the β-barrel for the TIMs and at the
crossover of the polypeptide chain for the doubly wound Rossmann structures.
Assignment of Function Through Structure
One of the reasons for determining structures is to derive more information to facilitate the assignment of function. From our analysis of proteins in CATH we suggest that structural data can help to assign
function in several ways:
i. The structural data allow recognition of more distant homologues compared with sequence data; in our analysis 83% of structures with novel sequences could be assigned as homologues in this way (note that such assignment of function is again subject to the caveats
imposed by 'gene recruitment' discussed above).
ii. The structural data allow detailed inspection of the functional site, to suggest if and
how the function may have evolved. For example, if an enzyme has evolved to act on a
different substrate, the binding site may reveal or at least suggest possible changes in the
substrate.
iii. For the superfolds, similarity of structure does not necessarily mean similarity of
function. However, the active site/binding sites are often conserved, e.g. in the TIM barrel or
Rossmann fold structures the ligand always binds at the same end of the barrel or sheet.
iv. Some methods have already been developed, and will increasingly be the focus of attention over the next few years, which aim to predict function ab initio from structure. For
example, enzymes can often be identified by the presence of a major cleft, which also locates
the active site (18). Similarly, critical surface patches which are used for molecular
recognition in binding other proteins or ligands may be identified using knowledge-based approaches (19,20).
In summary, extrapolating the data from Figure 4 to a new genome, we can expect that of the 54–70%
of sequences which currently have no obvious sequence matches in the PDB, we will find nearly 80–90% to be homologous to a known family using the structural data alone. For the singlet folds this
will almost certainly reveal some clues to the function. For the superfolds, some folds will reveal
information on the functional class (e.g. enzyme for TIM barrels) or the location of the active site, if
not the specific function. Only 10–20% will be expected to be 'novel' folds. For these the ab
initio methods referred to above may provide some clues to guide experiments. Therefore it is clear
that determining structures, as part of a 'structural genomics' initiative for example, will make a
major contribution to interpreting genome data.
JAIL
Just Another Interface Library
Interfaces of macromolecules are a valuable basis to analyse the process of molecular recognition. JAIL classifies not only the interfaces between domain architectures but also those between protein chains and those between proteins and nucleic acids.
[Figure: Gt-alpha/Gi-alpha chimera (PDB-ID 1GOT) and the interfaces of 1GOT]
Interacting proteins are difficult to crystallize and are rarely present within the Protein Data Bank.
Nevertheless, it is essential to analyse the interacting parts of the proteins to understand the
process of protein-protein docking.
To overcome this problem we have built up the JAIL database. Since interacting domains
exhibit structural features similar to those of proteins, all known interfaces between interacting
domains of the SCOP database were extracted and classified in JAIL.
Only a part of all protein structures is included in SCOP. In particular, new PDB entries are
not yet annotated. To overcome this problem, all interfaces between protein
chains were additionally calculated and included in the database. This type of interface also comprises
the interacting parts of the assumed biological units. The last important type of interface
provided here is composed of the interacting parts between proteins and nucleic acids.
Overall, the data set consists of about 180,000 interfaces.
JAIL is a convenient tool to browse through the interface library and to analyze single
interfaces. However, more general questions require large-scale analysis. For this purpose
a detailed form enables the compiling of comprehensive non-redundant data sets for
download.
How is an interface defined?
A complete residue is part of an interface if at least one atom of the amino acid is located
within a range of 4.5 Ångström of any atom of the interacting domain or chain. One part of
an interface must consist of at least 5 C-alpha atoms in the case of protein chains. In the case of nucleic acids the relevant atoms are calculated on the basis of the phosphorus atoms
of the RNA/DNA backbone.
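The residue-level definition above can be sketched in a few lines. This is an illustrative brute-force version, not JAIL's actual implementation; the data layout (a dictionary from residue id to a list of atom coordinates), the function names and the toy coordinates are all assumptions. It treats "at least 5 C-alpha atoms" as "at least 5 residues", since each amino acid contributes one C-alpha.

```python
import math

CUTOFF = 4.5  # distance cutoff in Angstrom, as quoted in the text


def interface_residues(chain_a, chain_b, cutoff=CUTOFF):
    """Residue ids of chain_a having at least one atom within `cutoff` of any atom of chain_b.

    Each chain is a dict: residue id -> list of (x, y, z) atom coordinates.
    """
    atoms_b = [atom for atoms in chain_b.values() for atom in atoms]
    selected = []
    for res_id, atoms in chain_a.items():
        # The whole residue is selected as soon as one of its atoms is close enough.
        if any(math.dist(a, b) <= cutoff for a in atoms for b in atoms_b):
            selected.append(res_id)
    return selected


def is_valid_side(residue_ids, min_residues=5):
    """One side of an interface needs at least `min_residues` residues (one C-alpha each)."""
    return len(residue_ids) >= min_residues


# Hypothetical toy chains with one atom per residue, placed on two parallel lines 4.0 A apart.
chain_a = {i: [(3.8 * i, 0.0, 0.0)] for i in range(1, 8)}  # residues 1..7
chain_b = {i: [(3.8 * i, 4.0, 0.0)] for i in range(1, 6)}  # residues 1..5

side_a = interface_residues(chain_a, chain_b)
side_b = interface_residues(chain_b, chain_a)
print(side_a, side_b, is_valid_side(side_a) and is_valid_side(side_b))
```

The all-against-all distance loop is quadratic in the number of atoms; for real structures a spatial index would be the usual choice, but the selection rule stays the same.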
What are biounits?
The primary coordinate file deposited in the PDB generally contains one asymmetric unit.
The asymmetric unit is the smallest portion of a crystal structure to which crystallographic
symmetry can be applied to generate one unit cell. The biological molecule (biounit) is believed
to be the functional unit of the protein. Frequently those units can be assumed or calculated when additional information is available. The biological units of many proteins are deposited
in a separate section of the PDB database and can be used for interface calculations.
In which way are redundant interfaces excluded in the download section?
Redundancy is excluded in two different ways: by structure and by sequence. The
sequence clustering is based on the CD-HIT program. The structural clustering is defined by
the protein families and superfamilies of the SCOP classification. The database classifies
proteins by domain architecture.
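To make the idea of sequence-based redundancy removal concrete, here is a simplified greedy clustering sketch in the spirit of CD-HIT: sequences are processed from longest to shortest, and a sequence becomes a new representative only if it is below the identity threshold to every existing representative. This is not the CD-HIT program itself. The identity measure is a crude ungapped, position-by-position comparison chosen for brevity, and the function names and toy sequences are assumptions.

```python
def identity(a: str, b: str) -> float:
    """Crude ungapped identity: matching positions divided by the shorter length."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / n


def greedy_representatives(sequences, threshold=0.95):
    """Keep one representative per cluster, processing the longest sequences first."""
    representatives = []
    for seq in sorted(sequences, key=len, reverse=True):
        # A sequence joins an existing cluster if it reaches the threshold
        # against any representative; otherwise it founds a new cluster.
        if all(identity(seq, rep) < threshold for rep in representatives):
            representatives.append(seq)
    return representatives


# Hypothetical toy sequences of equal length.
seqs = ["MKTAYIAKQR", "MKTAYIAKQK", "MKTAYLLKQR", "GSHMLEDPVD"]

strict = greedy_representatives(seqs, threshold=0.95)
loose = greedy_representatives(seqs, threshold=0.50)
print(len(strict), len(loose))
```

With a lower identity threshold, fewer representatives survive and those that remain are more dissimilar from one another.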
Which settings in the download section are best for my own research?
The selection of the datasets depends on the type of interactions (protein-protein or protein-nucleic acid)
and the level of diversity that is desired. A maximal sequence identity of 50%
results in a higher diversity than a setting of 95%. The default settings include interfaces of
domain-domain interactions as well as interfaces between interacting chains. All interfaces of
chains that were already treated by the SCOP domain interfaces are excluded by default.
This procedure results in a high number of interfaces that are still diverse enough for
statistical analysis.
What is meant by 'show conservation' in Jmol?
The conservation of protein sequences is defined by the mutation rates at each amino acid
position. For JAIL this information was retrieved from ConSurf. ConSurf is a derived
database merging structural and sequence information.
Database scheme
Search for:
PDB-Id, e.g. 1aay
SCOP-Id, e.g. d1az0a_
EC-Number, e.g. 311
Accession number, e.g. P03697
Protein name, e.g. capsid protein
Search in the following interface types:
Domain/Domain (SCOP)
Chain/Chain
Protein/Nucleic
Biounit/Biounit
None
A full-text search and a SCOP text search by keyword are also available.
Interfaces can be restricted by location: intra, inter, or don't care.
MMDB
Experimentally resolved structures of proteins, RNA and DNA, derived from the Protein Data Bank (PDB), with value-added features such as explicit chemical graphs, computationally
identified 3D domains (compact substructures) that are used to identify similar 3D structures, as well as links to literature, similar sequences, information about chemicals bound to the structures, and more. These connections make it possible, for example, to find 3D structures for homologs of a protein sequence of interest, then interactively view the sequence-structure relationships, active sites, bound chemicals, journal articles and more.
Three-dimensional structures are now known within many protein families, and it is
quite likely, in searching a sequence database, that one will encounter a homolog
with known structure. The goal of Entrez's 3D-structure database is to make this
information, and the functional annotation it can provide, easily accessible to molecular biologists. To this end Entrez's search engine provides three powerful
features: (i) Sequence and structure neighbors: one may select all sequences similar
to one of interest, for example, and link to any known 3D structures. (ii) Links
between databases: one may search by term matching in MEDLINE, for example, and
link to 3D structures reported in these articles. (iii) Sequence and structure
visualization: identifying a homolog with known structure, one may view molecular-graphic
and alignment displays to infer approximate 3D structure. In this article we
focus on two features of Entrez's Molecular Modeling Database (MMDB) not
described previously: links from individual biopolymer chains within 3D structures to a systematic taxonomy of organisms represented in molecular databases, and
links from individual chains (and compact 3D domains within them) to structure
neighbors, that is, other chains (and 3D domains) with similar 3D structure. MMDB may be
accessed at http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Structure
SUPERFAMILY is a database of structural and functional annotation for all
proteins and genomes.[1][2][3][4][5]
The SUPERFAMILY annotation is based on a collection of hidden Markov models which
represent structural protein domains at the SCOP superfamily level.[6]
A superfamily groups
together domains which have an evolutionary relationship. The annotation is produced by
scanning protein sequences from completely sequenced genomes against the hidden Markov
models.
For each protein you can:
Submit sequences for SCOP classification
View domain organisation, sequence alignments and protein sequence details
For each genome you can:
Examine superfamily assignments, phylogenetic trees, domain organisation lists and
networks
Check for over- and under-represented superfamilies within a genome
For each superfamily you can:
Inspect SCOP classification, functional annotation, Gene Ontology annotation, InterPro
abstract and genome assignments
Explore taxonomic distribution of a superfamily across the tree of life
All annotation, models and the database dump are freely available for download to everyone.
Purpose
SUPERFAMILY classifies amino acid sequences into known structural domains, especially
into SCOP superfamilies. The superfamilies are groups of proteins which have structural
evidence to support a common evolutionary ancestor but may not have detectable
sequence homology.
Major Features
Sequence search: Submit your protein or DNA sequence for SCOP superfamily and family level classification.
Keyword search: Search for superfamily, family or species names plus sequence, SCOP, PDB or hidden Markov model IDs.
Domain assignments: Domain assignments, alignments and architectures for completely sequenced eukaryotic and prokaryotic organisms, plus sequence collections.
Comparative genomics tools: Browse unusual (over- and under-represented) superfamilies and families, adjacent domain pair lists and graphs, unique domain pairs, domain combinations, domain architecture co-occurrence networks and domain distribution across taxonomic kingdoms for each organism.
Genome statistics: For each genome: number of sequences, number of sequences with assignment, percentage of sequences with assignment, percentage total sequence coverage, number of domains assigned, number of superfamilies assigned, number of families assigned, average superfamily size, percentage produced by duplication, average sequence length, average length matched, number of domain pairs and number of unique domain architectures.
Gene Ontology: Domain-centric Gene Ontology (GO), automatically annotated, by Hai Fang.
Phenotype Ontology: Domain-centric phenotype/anatomy ontology including Disease Ontology, Human Phenotype, Mouse Phenotype, Worm Phenotype, Yeast Phenotype, Fly Phenotype, Fly Anatomy, Zebrafish Anatomy, Xenopus Anatomy, Arabidopsis Plant.
Superfamily annotation: InterPro abstracts for 1052 superfamilies and Gene Ontology (GO) annotation for 763 superfamilies.
Functional annotation: Functional annotation of SCOP 1.73 superfamilies, by Christine Vogel.
Phylogenetic trees: Trees are generated using heuristic parsimony methods and are based on protein domain architecture data for all genomes in SUPERFAMILY. Genome combinations or specific clades can be displayed as individual trees.
Similar domain architectures: Find the 10 domain architectures which are most similar to a domain architecture of interest.
Hidden Markov models: Produce SCOP domain assignments for your sequences using the SUPERFAMILY models. HMM visualisation by Martin Madera, e.g. model 0045110.
Profile comparison: Find remote domain matches when the HMM search fails to find a significant match. Profile comparison (PRC) for aligning and scoring two profile hidden Markov models, by Martin Madera.
Web services: Distributed Annotation Server and linking to SUPERFAMILY.
Downloads: Sequences, assignments, models, MySQL database and scripts, updated weekly.
The SUPERFAMILY database provides protein domain assignments, at the SCOP superfamily
level, for the predicted protein sequences in over 400 completed genomes. A superfamily groups
together domains of different families which have a common evolutionary ancestor based on
structural, functional and sequence data. SUPERFAMILY domain assignments are generated
using an expert curated set of profile hidden Markov models. All models and structural
assignments are available for browsing and download from http://supfam.org. The web interface
includes services such as domain architectures and alignment details for all protein assignments,
searchable domain combinations, domain occurrence network visualization, detection of over- or
under-represented superfamilies for a given genome by comparison with other genomes,
assignment of manually submitted sequences and keyword searches. In this update we describe
the SUPERFAMILY database and outline two major developments: (i) incorporation of family
level assignments and (ii) a superfamily-level functional annotation. The SUPERFAMILY
database can be used for general protein evolution and superfamily-specific studies, genomic
annotation, and structural genomics target suggestion and assessment.
The goal of the NCBI conserved domain curation project is to provide database users with
insights into how patterns of residue conservation and divergence in a family relate to functional
properties, and to provide useful links to more detailed information that may help to understand
those sequence/structure/function relationships. To do this, CDD curators include the following
types of information in order to supplement and enrich the traditional multiple sequence
alignments that form the foundation of domain models: 3-dimensional structures and conserved core motifs, conserved features/sites, phylogenetic organization, and links to electronic literature
resources.
CDD
Conserved Domain Database (CDD)
CDD is a protein annotation resource that consists of a collection of well-annotated multiple sequence alignment models for ancient domains and full-length proteins. These are available as position-specific score matrices (PSSMs) for fast
identification of conserved domains in protein sequences via RPS-BLAST. CDD content includes NCBI manually curated domains, which use 3D-structure information to explicitly define domain boundaries and provide insights into sequence/structure/function relationships, as well as domain models imported from a number of external source databases (Pfam, SMART, COG, PRK, TIGRFAM).
CD-Search and Batch CD-Search
CD-Search is NCBI's interface to searching the Conserved
Domain Database with protein or nucleotide query sequences. It uses RPS-BLAST, a variant of PSI-BLAST, to
quickly scan a set of pre-calculated position-specific scoring matrices (PSSMs) with a protein query. The results of CD-Search are presented as an annotation of protein domains on the user query sequence and can be visualized as domain multiple sequence alignments with embedded user queries. High-confidence associations between a query sequence and conserved domains are shown as specific hits. The CD-Search Help provides additional details, including
information about running CD-Search locally.
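As a rough illustration of what scoring a query against a position-specific scoring matrix means, the sketch below slides a tiny matrix along a sequence and reports the best-scoring window. It is a toy, ungapped scan and not the RPS-BLAST algorithm; the matrix values, the default score for unlisted residues and the function names are all assumptions made for the example.

```python
def score_window(pssm, window, default=-1):
    """Sum the position-specific scores of the residues in `window`.

    `pssm` is a list of columns; each column maps a residue to its score.
    Residues not listed in a column receive `default` (an assumed penalty).
    """
    return sum(column.get(res, default) for column, res in zip(pssm, window))


def best_hit(pssm, sequence):
    """Slide the matrix along the sequence; return (best score, 0-based start).

    Assumes the sequence is at least as long as the matrix.
    """
    width = len(pssm)
    hits = [
        (score_window(pssm, sequence[i:i + width]), i)
        for i in range(len(sequence) - width + 1)
    ]
    return max(hits, key=lambda hit: hit[0])


# Hypothetical three-column matrix: each position prefers different residues.
toy_pssm = [
    {"G": 5, "A": 1},
    {"K": 4, "R": 3},
    {"S": 6, "T": 5},
]
query = "MAGKSLLGRT"

print(best_hit(toy_pssm, query))
```

The point of the position-specific scores is that the same residue can be rewarded at one column and penalised at another, which a single substitution matrix cannot express.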
Batch CD-Search serves as both a web application and a script interface for a conserved domain search on multiple protein sequences, accepting up to 100,000 proteins in a single job. It enables you to view a graphical display of the concise or full search result for any individual
protein from your input list, or to download the results for the complete set of proteins. The Batch CD-Search Help provides additional details.
CDART: Domain Architectures
Conserved Domain Architecture Retrieval Tool (CDART) performs similarity searches of the Entrez Protein database based on domain architecture, defined as the sequential order of conserved domains in protein
queries. CDART finds protein similarities across significant evolutionary distances using sensitive domain profiles rather than direct sequence similarity. Proteins similar to
the query are grouped and scored by architecture. You can search CDART directly with a query protein sequence or, if a sequence of interest is already in the Entrez Protein database, simply retrieve the record, open its Links
menu and select Domain Relatives to see the precalculated CDART results. Relying on domain profiles allows CDART to be fast and, because it relies on annotated functional domains, informative.
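Since a domain architecture is defined above as the sequential order of conserved domains, grouping proteins by architecture amounts to grouping them by an ordered tuple of domain names. The sketch below shows only that grouping step; it does not reproduce CDART's scoring, and the protein names and domain lists are hypothetical.

```python
from collections import defaultdict


def group_by_architecture(proteins):
    """Map each architecture (ordered tuple of domain names) to the proteins that have it."""
    groups = defaultdict(list)
    for name, domains in proteins.items():
        # Order matters: the same domains in a different order form a different architecture.
        groups[tuple(domains)].append(name)
    return dict(groups)


# Hypothetical proteins with domains listed from N- to C-terminus.
proteins = {
    "protA": ["SH3", "SH2", "Kinase"],
    "protB": ["SH3", "SH2", "Kinase"],
    "protC": ["SH2", "SH3", "Kinase"],
    "protD": ["Kinase"],
}

groups = group_by_architecture(proteins)
for architecture, members in groups.items():
    print(" - ".join(architecture), "->", members)
```

Using a tuple rather than a set keeps the order of the domains, so protA and protC end up in different groups even though they contain the same three domains.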
CDTree: CDTree is a helper application for your web browser that allows you to interactively view and examine conserved domain hierarchies curated at NCBI. CDTree works with Cn3D as its alignment viewer/editor; it is used in the CDD curation process and is both a classification and research tool for functional annotation and the study of
protein and protein domain families.
Content
CDD content includes NCBI manually curated domain models and domain models imported from
a number of external source databases (Pfam, SMART, COG, PRK, TIGRFAMs). What is unique
about NCBI-curated domains is that they use 3D-structure information to explicitly define domain
boundaries, align blocks, amend alignment details, and provide insights into
sequence/structure/function relationships. Manually curated models are organized hierarchically if
they describe domain families that are clearly related by common descent. To provide a non-redundant
view of the data, CDD clusters similar domain models from various sources into
superfamilies.
Searching the database
The collection is also part of NCBI's Entrez query and retrieval system, crosslinked to numerous other resources. CDD provides annotation of domain footprints and conserved functional sites on protein sequences. Precalculated domain annotation can be retrieved for protein sequences tracked in NCBI's Entrez system, and CDD's collection of models can be queried with novel protein sequences via the CD-Search service or the Batch CD-Search service of the United States National Center for Biotechnology Information; the latter allows the computation and download of
annotation for large sets of protein queries. CDD also contains data from additional research projects, such as KOGs (a
eukaryotic counterpart to COGs) and the Library of Ancient Domains (LOAD). Accessions that start with "cl" are for superfamily cluster records and can contain domain models from one or more source databases. When searching CDD, it is possible to limit search results to domains from any given source database by using the Database Search Field.
Phylogenetic organization: Based on evidence from sequence comparison, NCBI
Conserved Domain Curators attempt to organize related domain models into
phylogenetic family hierarchies.
Links to electronic literature resources: NCBI curated domains also provide links to citations in PubMed and NCBI Bookshelf that discuss the domain. These references are selected by curators and, whenever possible, include articles that provide evidence for the biological function of the domain and/or discuss the evolution and classification of a domain family.
PFAM
Pfam 27.0 (Mar 2013, 14831 families)
Proteins are generally composed of one or more functional regions, commonly termed domains. The presence of different domains in varying combinations in different proteins gives rise to the diverse repertoire of proteins found in nature. Identifying the domains present in a protein can provide insights into the function of that protein.
The Pfam database is a large collection of protein domain families. Each family is represented by multiple sequence alignments and hidden Markov models (HMMs).
There are two levels of quality to Pfam families: Pfam-A and Pfam-B. Pfam-A entries are derived from the underlying sequence database, known as Pfamseq, which is built from the most recent release of UniProtKB at a given time-point. Each Pfam-A family consists of a curated seed alignment containing a small set of representative members of the family, profile hidden Markov models (profile HMMs) built from the seed alignment, and an automatically generated full alignment, which contains all detectable protein sequences belonging to the family, as defined by profile HMM searches of primary sequence databases.
Pfam-B families are un-annotated and of lower quality, as they are generated automatically from the non-redundant clusters of the latest ADDA release. Although of lower quality, Pfam-B families can be useful for identifying functionally conserved regions when no Pfam-A entries are found.
Pfam entries are classified in one of four ways:
Family: A collection of related protein regions
Domain: A structural unit
Repeat: A short unit which is unstable in isolation but forms a stable structure when multiple copies are present
Motif: A short unit found outside globular domains
Related Pfam entries are grouped together into clans; the relationship may be defined by similarity of
sequence, structure or profile-HMM.
Pfam also generates higher-level groupings of related families, known
as clans. A clan is a collection of Pfam-A entries which are related by
similarity of sequence, structure or profile-HMM.
You can find data in Pfam in various ways:
Analyze your protein sequence for Pfam matches
View Pfam family annotation and alignments
See groups of related families
Look at the domain organisation of a protein sequence
Find the domains on a PDB structure
Query Pfam by keywords
Enter any type of accession or ID to jump to the page for a Pfam family or clan,
UniProt sequence, PDB structure, etc.
Browse Pfam
You can browse lists of families, clans or proteomes which begin with a chosen letter (or number). You can
also see a list of Pfam families which are new to this release, or the list of the twenty largest families in terms of number of sequences.
Glossary of terms used in Pfam
These are some of the commonly used terms in the Pfam website.
Alignment coordinates
HMMER3 reports two sets of domain coordinates for each profile HMM match. The envelope coordinates delineate the region on the sequence where the match has been probabilistically determined to lie, whereas the alignment coordinates delineate the region over which HMMER is confident that the alignment of the sequence to the profile HMM is correct. Our full alignments contain the envelope coordinates from HMMER3.
Architecture
The collection of domains that are present on a protein.
Clan
A collection of related Pfam entries. The relationship may be defined by similarity of sequence, structure or profile-HMM.
Domain
A structural unit.
Domain score
The score of a single domain aligned to an HMM. Note that, for HMMER2, if
there was more than one domain, the sequence score was the sum of all the domain scores for that Pfam entry. This is not quite true for HMMER3.
DUF
Domain of unknown function.
Envelope coordinates
See Alignment coordinates.
Family
A collection of related protein regions.
Full alignment
An alignment of the set of related sequences which score higher than the manually set threshold values for the HMMs of a particular Pfam entry.
Gathering threshold (GA)
Also called the gathering cut-off, this value is the search threshold used to build the full alignment. The gathering threshold is assigned by a curator when the family is built. The GA is the minimum score a sequence must attain in order to belong to the full alignment of a Pfam entry. For each Pfam HMM we have two GA cutoff values, a sequence cutoff and a domain cutoff.
HMMER
The suite of programs that Pfam uses to build and search HMMs. Since Pfam release 24.0 we have used HMMER version 3 to make Pfam. See the HMMER site for more information.
Hidden Markov model (HMM)
An HMM is a probabilistic model. In Pfam we use HMMs to transform the information contained within a multiple sequence alignment into a position-specific scoring system. We search our HMMs against the UniProt protein database to find homologous sequences.
HMMER3
The suite of programs that Pfam uses to build and search HMMs. See the HMMER site for more information.
iPfam
A resource that describes domain-domain interactions that are observed in PDB entries. Where two or more Pfam domains occur in a single structure, it analyses them to see if they are close enough to form an interaction. If they are close enough, it calculates the bonds forming the interaction.
Metaseq
A collection of sequences derived from various metagenomics datasets.
Motif
A short unit found outside globular domains.
Noise cutoff (NC)
The bit score of the highest scoring match not in the full alignment.
Pfam-A
An HMM-based, hand-curated Pfam entry which is built using a small number of
representative sequences. We manually set a threshold value for each HMM and search our models against the UniProt database. All of the sequences which score above the threshold for a Pfam entry are included in the entry's full alignment.
Pfam-B
An automatically generated alignment which is formed by taking a cluster of sequences from the ADDA database and removing Pfam-A residues from them. Since Pfam-B families are automatically generated, we recommend that you verify that the sequences in a Pfam-B family are related using other methods, such as BLAST. For Pfam 24.0 we have made HMMs for the first (and therefore largest) 20,000 Pfam-B families. Users can search their sequences against the Pfam-B HMMs, in addition to the Pfam-A HMMs, when performing both single-sequence searches and batch searches on the website.
Posterior probability
HMMER3 reports a posterior probability for each residue that matches amatch or insert state in the profile HMM A high posterior probability showsthat the alignment of the amino acid to the matchinsert state is likely to becorrect whereas a low posterior probability indicates that there is alignmentuncertainty This is indicated on a scale with being 10 the highestcertainty down to 1 being complete uncertainty Within Pfam we display thisinformation as a heat map view where green residues indicate high posteriorprobability and red ones indicate a lower posterior probability
Repeat
A short unit which is unstable in isolation but forms a stable structure when
multiple copies are present
Seed alignment
An alignment of a set of representative sequences for a Pfam entry We usethis alignment to construct the HMMs for the Pfam entry
Sequence score
The total score of a sequence aligned to a HMM If there is more than onedomain the sequence score is the sum of all the domain scores for that Pfamentry If there is only a single domain the sequence and the domains scorefor the protein will be identical We use the sequence score to determinewhether a sequence belongs to the full alignment of a particular Pfam entry
Trusted cutoff (TC)
The bit score of the lowest scoring match in the full alignment.
SMART (a Simple Modular Architecture Research Tool) allows the identification and annotation of genetically mobile domains and the analysis of domain architectures. More than 500 domain families found in signalling, extracellular and chromatin-associated proteins are detectable. These domains are extensively annotated with respect to phyletic distributions, functional class, tertiary structures and functionally important residues. Each domain found in a non-redundant protein database, as well as search parameters and taxonomic information, are stored in a relational database system. User interfaces to this database allow searches for proteins containing specific combinations of domains in defined taxa.
Features
You can use SMART in two different modes: normal or genomic. The main difference is in the underlying protein database used. In Normal SMART, the database contains Swiss-Prot, SP-TrEMBL and stable Ensembl proteomes. In Genomic SMART, only the proteomes of completely sequenced genomes are used: Ensembl for metazoans and Swiss-Prot for the rest.
The protein database in Normal SMART has significant redundancy, even though identical proteins are removed. If you use SMART to explore domain architectures, or want to find exact domain counts in various genomes, consider switching to Genomic mode. The numbers in the domain annotation pages will be more accurate, and there will not be many protein fragments corresponding to the same gene in the architecture query results. Remember, though, that you are then exploring a limited set of genomes.
Different color schemes are used to easily identify the mode we are in.
Normal mode Genomic mode
Alignment
Representation of a prediction of the amino acids in tertiary structures of homologues that overlay in three dimensions. Alignments held by SMART are mostly based on published observations (see domain annotations for details), but are updated and edited manually.
Alignment block
Ungapped alignments that usually represent a single secondary structure
Bits scores
Alignment scores are reported by HMMer and BLAST as bits scores. The likelihood that the query sequence is a bona fide homologue of the database sequence is compared to the likelihood that the sequence was instead generated by a random model. Taking the logarithm (to base 2) of this likelihood ratio gives the bits score.
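The definition above amounts to a one-line formula: the bits score is the base-2 logarithm of the ratio between the two likelihoods. The sketch below only illustrates that arithmetic; the function and parameter names are hypothetical, and real tools compute the two likelihoods from their own models.

```python
import math


def bits_score(likelihood_homologue, likelihood_random):
    """Base-2 log of the likelihood ratio (homologue model vs. random model)."""
    return math.log2(likelihood_homologue / likelihood_random)


# A sequence that is 4 times more likely under the homologue model
# than under the random model scores 2 bits.
print(bits_score(0.5, 0.125))
```

Each additional bit therefore corresponds to a doubling of the likelihood ratio, and a negative score means the random model explains the sequence better.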
BLAST Basic local alignment search tool
An excellent database searching tool developed at the National Center for Biotechnology Information (NCBI) ([1] [2] [3] [4]). SMART uses NCBI-BLAST for detection of outlier homologues and homologues of known structure. WU-BLAST is used for nrdb searches with user-supplied sequences.
Cellular Role
Chromatin-associated: Chromatin is the tangled fibrous complex of DNA and protein within a eukaryotic nucleus.
Interaction (with the environment): Molecules that sense cellular environmental change, such as osmolarity, light flux, acidity, ion concentration, etc.
Metabolic: Enzymes that catalyze reactions in living cells that transform organic molecules.
Replication: The process of making an identical copy of a section of duplex (double-stranded) DNA, using existing DNA as a template for the synthesis of new DNA strands.
Signalling: Proteins that participate in a pathway that is initiated by a molecular stimulus and terminated by a cellular response.
Transport: Transporters (permeases) are proteins that straddle the cell membrane and carry specific nutrients, ions, etc. across the membrane.
Translation: The process in which the genetic code carried by messenger RNA directs the synthesis of proteins from amino acids.
Transcription: The synthesis of an RNA copy from a sequence of DNA (a gene); the first step in gene expression.
Coiled coils
Intimately-associated bundles of long alpha-helices ([1] [2] [3]). Coiled coils are detected in SMART using the method of Lupas et al. ([4], COILS home). Coiled coil predictions are indicated on the second line in SMART's graphical output.
Domain
Conserved structural entities with distinctive secondary structure content and a hydrophobic core. In small disulphide-rich and Zn2+-binding or Ca2+-binding domains, the hydrophobic core may be provided by cystines and metal ions, respectively. Homologous domains with common functions usually show sequence similarities.
Domain composition
Proteins with the same domain composition have at least one copy of each of the domains of the query.
Domain organisation
Proteins having all the domains of the query, in the same order (additional domains are allowed).
Entrez
A WWW-based system that allows easy retrieval of sequence, structure, molecular biology and literature data (Entrez). SMART's domain annotation pages contain links to the Entrez system, thereby providing extensive literature, structure and sequence information.
E-value
This represents the number of sequences with a score greater than or equal to X expected absolutely by chance. The E-value connects the score (X) of an alignment between a user-supplied sequence and a database sequence, generated by any algorithm, with how many alignments with similar or greater scores would be expected from a search of a random sequence database of equivalent size. Since version 2.0, E-values are calculated using Hidden Markov Models, leading to more accurate estimates than before.
Extracellular Domains
Domain families that are most prevalent in proteins outside of the cytoplasm and the nucleus
Gap
A position in an alignment that represents a deletion within one sequence relative to another. Gap penalties are requirements for alignment algorithms in order to reduce excessively-gapped regions. Gaps in alignments represent insertions that usually occur in protruding loops or beta-bulges within protein structures.
Genomic database
Protein database used in SMART's Genomic mode. It contains data from completely sequenced genomes only. Ensembl data is used for Metazoan genomes and Swiss-Prot for others. A complete list of genomes in the database is available.
HMM Hidden Markov model
HMMs are statistical models of the sequence consensus of a homologous family (see the documentation). A particular class of HMMs has been shown to be equivalent to generalised profiles (8867839). Applications of HMMs to sequence analysis are provided by HMMer and SAM.
HMM consensus
The HMM consensus is a one-line summary of the corresponding HMM. The amino acid shown for the consensus is the highest probability amino acid at that position according to the HMM. Capital letters mean highly conserved residues (probability > 0.5 for protein models) (modified from the HMMer User's Guide).
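The rule described above can be sketched in a few lines: pick the most probable amino acid at each position and capitalise it only when its probability exceeds 0.5. This is an illustrative sketch, not HMMer's implementation; representing the model as one dictionary of emission probabilities per match position is an assumption made for the example.

```python
def hmm_consensus(emissions, threshold=0.5):
    """Build a one-line consensus from per-position emission probabilities.

    emissions: list of dicts mapping an amino acid letter to its probability
    at that match position (a simplified, hypothetical representation).
    """
    letters = []
    for column in emissions:
        # Highest-probability amino acid at this position.
        residue, probability = max(column.items(), key=lambda item: item[1])
        # Capital letter only for highly conserved positions.
        if probability > threshold:
            letters.append(residue.upper())
        else:
            letters.append(residue.lower())
    return "".join(letters)


model = [
    {"A": 0.7, "G": 0.3},
    {"L": 0.4, "I": 0.35, "V": 0.25},
    {"K": 0.9, "R": 0.1},
]
print(hmm_consensus(model))
```

The second position is written in lower case because no single amino acid reaches the 0.5 threshold there.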
HMMer
The HMMer package ([1] [2]) provides multiple alignment and database searching capabilities. There are several programs in the package (see the documentation), including one (hmmfs) that searches databases for non-overlapping LOCAL similarities (i.e. that match across at least part of the HMM) and another (hmmls) that searches databases for non-overlapping GLOBAL similarities (i.e. that match across the full HMM). These correspond approximately to profile-based searches using negative and positive profiles, respectively (see WiseTools). Database searches using hmmls or hmmfs provide alignment scores as bits scores.
Homology
Evolutionary descent from a common ancestor due to gene duplication
Intracellular Domains
Domain families that are most prevalent in proteins within the cytoplasm
Localisation
Numbers of domains that are thought, from SwissProt annotations, to be present in different cellular compartments (cytoplasm, extracellular space, nucleus and membrane-associated) are shown in annotation pages.
Motif
Sequence motifs are short conserved regions of polypeptides. Sets of sequence motifs need not necessarily represent homologues.
NRDB non-redundant database
A database that contains no identical pairs of sequences. It can contain multiple sequences originating from the same gene (fragments, alternative splicing products). SMART in Standard mode uses such a database.
ORF
Open reading frame
Outlier homologues
These are often difficult to detect using HMM methodology. A complementary approach to their detection is to query a database of sequences taken from multiple sequence alignments using BLAST.
Selecting this option will also activate searches against sequence databases derived from proteins of known structure. A simple BLAST search of the PDB is performed, together with a search of RPS_Blast profiles derived from SCOP. These profiles were kindly provided by Steffen Schmidt (see Schmidt et al., J Chem Inf Comput Sci 2002 (42) 405-7).
P-value
This represents a probability that, given a database of a particular size, random sequences score higher than a value X. P-values are generated by the BLAST algorithm that has been integrated into SMART.
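P-values and E-values describe the same chance expectation from two angles. The notes do not give a conversion, so the sketch below relies on an assumption taken from standard BLAST statistics: if chance hits scoring at least X follow a Poisson distribution with mean E, the probability of seeing at least one such hit is P = 1 - exp(-E). The function name is hypothetical.

```python
import math


def p_value_from_e_value(e_value):
    """Probability of at least one chance hit, assuming a Poisson model.

    Assumption (not stated in these notes): the number of chance hits with
    score >= X is Poisson-distributed with mean equal to the E-value.
    """
    # expm1 keeps precision when the E-value is very small.
    return -math.expm1(-e_value)


for e in (0.001, 1.0, 10.0):
    print(e, p_value_from_e_value(e))
```

Under this assumption the two values are nearly identical for small E, and they diverge once E approaches 1, because a probability cannot exceed 1 while an expected count can.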
PDB protein data bank
PDB is an archive of experimentally-determined three-dimensional structures (Brookhaven National Labs or EBI). Domain families represented in SMART and in the PDB are annotated as being of known structure; links are provided in SMART to the PDB via PDBsum and MMDB. PDBsum links can be used to access a variety of sequence-based and structure-based tools, whereas MMDB provides access to literature information and structure similarities.
PFAM
Pfam is a database of protein domain families represented as (i) multiple alignments and (ii) HMM-profiles ([1] [2]). Pfam WWW servers allow comparison of user-supplied sequences with the Pfam database (Sanger Center and Washington Univ.). SMART contains a facility to search the Pfam collection using HMMer.
Prokaryotic domains
SMART now also searches for domains found in two-component regulatory systems. These can be found mainly in prokaryotes, but a few were also found in eukaryotes like yeast and plants.
Profile
A profile is a table of position-specific scores and gap penalties, representing a homologous family, that may be used to search sequence databases (Ref. [1] [2] [3]). In CLUSTAL-W-derived profiles, those sequences that are more distantly related are assigned higher weights ([4] [5] [6]). Issues in profile-based database searching are discussed in Bork & Gibson (1996) [7].
ProfileScan
An excellent WWW server that allows a user to compare a protein or DNA sequence against a database of profiles (located at the ISREC).
PROSITE
This is a dictionary of protein sites and motif patterns. Some SMART domain annotations contain links to PROSITE.
Schnipsel database domain sequence database
Schnipsel is a German word meaning snippet or fragment. The schnipsel database consists of the sequences of all domains found with SMART in NRDB. Outliers of a family often cannot be detected by a profile, yet are detectable by pairwise similarity to one or more established members of a sequence family. So searching against the schnipsel database gives complementary information to the profile searches.
Searched domains
In the first version of SMART, only eukaryotic signalling domains could be searched. In 1998 we extended this set with prokaryotic signalling and extracellular domains. In the input page you can choose whether you only want to search cytoplasmic (prokaryotic + eukaryotic), extracellular or all domains.
Secondary Literature
The secondary literature is derived by the following procedure: for each of the hand-selected papers referenced by a domain, 100 neighbouring papers are retrieved using Medline. If one of these neighbouring papers is referenced from more than two original papers, it is included in the secondary literature list.
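The selection rule above is a simple counting procedure and can be sketched as follows. This is an illustration only: the Medline neighbour lookup is replaced by a ready-made dictionary, and the function name and paper identifiers are hypothetical.

```python
from collections import Counter


def secondary_literature(neighbours_by_paper, min_sources=3):
    """Select neighbouring papers linked from more than two original papers.

    neighbours_by_paper: dict mapping each hand-selected paper to the list of
    its neighbouring papers (in SMART, up to 100 retrieved via Medline).
    """
    counts = Counter()
    for neighbours in neighbours_by_paper.values():
        # Count each neighbour once per original paper.
        counts.update(set(neighbours))
    # "More than two" original papers means at least three.
    return sorted(paper for paper, n in counts.items() if n >= min_sources)


neighbours = {
    "paper_A": ["x", "y"],
    "paper_B": ["x", "z"],
    "paper_C": ["x", "y"],
}
print(secondary_literature(neighbours))
```

Here only "x" is a neighbour of three hand-selected papers, so it is the only entry in the secondary literature list; "y" is linked from two papers and is left out.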
Seed Alignment
Alignment that contains only one of each pair of homologues that are represented in a CLUSTALW-derived phylogenetic tree linked by a branch of length less than a distance of 0.2 (see the related article).
SEG
A program of Wootton & Federhen [1] that detects regions of the query sequence that have low compositional complexity [2].
Sequence ID or ACC
Sequence identifiers or accession codes may be entered via the SMART homepage to initiate a query. You can use either Uniprot or Ensembl sequence identifiers.
Signalling Domains
The original set of domains used in SMART were collected as those that satisfied one or both of two criteria:
1. cytoplasmic domains that possess kinase, phosphatase, ubiquitin ligase or phospholipase enzymatic activities, or those that stimulate GTPase-activation or guanine nucleotide exchange
2. cytoplasmic domains that occur in at least two proteins with different domain organisations, of which one also contains a domain that satisfies criterion 1
These domains mostly mediate or regulate the transduction of an extracellular signal towards the nucleus, resulting in the initiation of a cellular response. More recently, prokaryotic two-component signalling domains have been added to the SMART set.
SignalP
This program predicts the presence and location of signal peptide cleavage sites in amino acid sequences (SignalP home page).
SMART Simple Modular Architecture Research Tool
Our main goal in providing this tool is to allow automatic identification and annotation of domains in user-supplied protein sequences.
Species
Numbers of domains present in a variety of selected taxa (animal, archaea, bacteria, fungi, plants and protozoa) are shown in annotation pages.
SwissProt
The SwissProt database is an extensively annotated and non-redundant collection of protein sequences. SwissProt annotations have been mined for SMART-derived annotations of alignments.
TMHMM2
This program predicts the location and topology of transmembrane helices (TMHMM2)
Thresholds
For each of the domains found by SMART, a combination of thresholds is used to distinguish between true and false hits. The different thresholds are described in the SMART paper.
WiseTools
A package that is based on database searches using profiles. Profiles may be generated using PairWise and then compared with sequence databases using SearchWise. Scores are generated for alignments that match the whole of the profile (using a positive profile) or else that match at least part of the profile (using a negative profile). Only the top-scoring optimal alignment of each sequence is reported; hence SMART relies on re-iterating the search for new repeats until none are reported that score above threshold. Score thresholds have been set manually that are considered to represent a score just above the top-scoring true negative. Additional thresholds have been estimated for domains that are repeated in single polypeptides. A more recent package allows comparison of DNA sequences at the level of their conceptual translations, regardless of sequence error and introns (see Wise2).
What's new
Changes from version 6.0 to 7.0
Full text search engine
Perform a full text search of SMART and Pfam domain annotations, plus the complete protein descriptions for Uniprot and Ensembl proteins.
metaSMART
Explore and compare domain architectures in various publicly available metagenomics datasets.
iTOL export and visualization
Domain architecture analysis results can be exported and visualized in interactive Tree Of Life. The new option can be found in the protein list function select list.
User interface cleanup
Various small changes to the UI, resulting in faster and easier navigation.
Changes from version 5.1 to 6.0
Metabolic pathways information
SMART domains and proteins available in the genomic mode now have basic metabolic pathways information. It is generated by mapping our genomic mode protein database to the KEGG orthologous groups.
Updated genomic mode protein database
The greatly expanded genomic mode database now includes sequences from 630 completely sequenced species.
Changes from version 5.0 to 5.1
SMART webservice
You can access SMART using our webservice. Check the WSDL file for details. The webservice is still under active development and only works for sequences and IDs which are in our database (complete Uniref100 and stable Ensembl genomes).
SMART DAS server
SMART DAS server is available at URL http://smart.embl.de/smart/das. It provides all the protein features from our database (SMART domains, signal peptides, transmembrane regions and coiled coils) for all Uniref100 and Ensembl proteins.
If you need help with these services, or have questions/feedback, please contact us.
Changes from version 4.1 to 5.0
New protein database in Normal mode
SMART now uses Uniprot as the main source of protein sequences. All Ensembl proteomes (except pre-releases) are also included. To lower redundancy in the database, the following procedure is used:
o only one copy of 100% identical proteins is kept (different IDs are still available)
o each species' proteins are separated
o CD-HIT clustering with 96% identity cutoff is performed on each species separately
o the longest member of each protein cluster is used as the representative
o only representative cluster members and single proteins (i.e. proteins which are not members of any clusters) are used in all domain architecture queries and for domain counts in the annotation pages
Even though the number of proteins in the database has almost doubled (the current version has around 2.9 million proteins), the redundancy should be minimal.
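The redundancy-reduction steps listed above can be sketched as a small pipeline. This is a hedged illustration, not SMART's code: the per-species clustering (CD-HIT at 96% identity in SMART) is not run here, so the sketch takes precomputed cluster labels as input, and all names and identifiers are hypothetical.

```python
from collections import defaultdict


def pick_representatives(proteins, cluster_of):
    """Reduce redundancy following the listed steps.

    proteins: iterable of (protein_id, species, sequence) tuples.
    cluster_of: dict protein_id -> cluster label from a per-species
        clustering run; proteins without a label count as single proteins.
    Returns (representative IDs, IDs grouped by identical sequence).
    """
    # Step 1: keep one copy of identical sequences, but remember every ID.
    ids_by_sequence = defaultdict(list)
    unique = {}
    for protein_id, species, sequence in proteins:
        ids_by_sequence[sequence].append(protein_id)
        unique.setdefault(sequence, (protein_id, species, sequence))

    # Steps 2-4: per species and cluster, keep the longest member.
    best = {}
    for protein_id, species, sequence in unique.values():
        label = cluster_of.get(protein_id, ("single", protein_id))
        key = (species, label)
        if key not in best or len(sequence) > len(best[key][2]):
            best[key] = (protein_id, species, sequence)

    # Step 5: representatives and single proteins are what queries would use.
    representatives = sorted(entry[0] for entry in best.values())
    return representatives, dict(ids_by_sequence)


proteins = [
    ("P1", "human", "MKT"),
    ("P2", "human", "MKT"),      # identical to P1
    ("P3", "human", "MKTAA"),    # same cluster as P1, but longer
    ("P4", "yeast", "MAAAA"),
    ("P5", "human", "GGG"),
]
clusters = {"P1": "c1", "P3": "c1"}
reps, aliases = pick_representatives(proteins, clusters)
print(reps)
print(aliases["MKT"])
```

Keeping the alias table alongside the representatives mirrors the note that different IDs remain available even after identical sequences are collapsed.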
Domain architecture invention dating
As a further step from the single domain to the understanding of multi-domain proteins, SMART now predicts the taxonomic class where the concept of a protein, that is its domain architecture, was invented. The domain architecture is defined as the linear order of all SMART domains in the protein sequence. To derive the point of its invention, all proteins with the same domain architecture are mapped onto NCBI's taxonomy. The last common ancestor of all organisms containing at least one protein with the domain architecture is defined as the point of its origin.
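The last-common-ancestor step can be illustrated with a short sketch. It assumes each organism is given as a root-to-leaf list of taxon names, which is a simplification of the NCBI taxonomy chosen for the example; the function name and the lineages are hypothetical.

```python
def last_common_ancestor(lineages):
    """Return the deepest taxon shared by all root-to-leaf lineages."""
    if not lineages:
        return None
    ancestor = None
    # Walk down from the root while every lineage agrees on the taxon.
    for level in zip(*lineages):
        if len(set(level)) != 1:
            break
        ancestor = level[0]
    return ancestor


# Organisms that contain a protein with the same domain architecture.
lineages = [
    ["cellular organisms", "Eukaryota", "Metazoa", "Chordata", "Homo sapiens"],
    ["cellular organisms", "Eukaryota", "Metazoa", "Arthropoda", "Drosophila melanogaster"],
    ["cellular organisms", "Eukaryota", "Fungi", "Saccharomyces cerevisiae"],
]
print(last_common_ancestor(lineages))
```

Adding an organism from a more distant branch can only move the inferred point of origin towards the root, never away from it.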
Changes from version 4.0 to 4.1
Two modes of operation: Normal and Genomic
For more details, visit the change mode page.
Intrinsic protein disorder prediction
You can now include DisEMBL prediction of protein disorder in your search parameters. The results are displayed as blue regions in protein schematics. DisEMBL's HOTLOOPS and REM465 methods are currently used. Visit the DisEMBL page for more information on the method.
Catalytic activity check
SMART now includes data on specific requirements for catalytic activity for some of our domains (50 catalytic domains at the moment; full list here). If the required amino acids are not present in the predicted domain, it will be marked as 'Inactive'. The domain annotation page will show you the details on which amino acids are missing, and the links to relevant literature (Pubmed link).
Taxonomic trees
Architecture query results are now displayed as simple taxonomic trees. In addition to individual proteins, you can select any taxonomic node (or multiple nodes) and display all the proteins in those nodes. The Evolution section of domain annotation pages is now also represented as a taxonomic tree. The basic tree contains only several hand-picked representative species, with a link to the full tree.
User interface redesigned
SMART has been completely rewritten and all pages conform to the XHTML 1.0 Strict and CSS Level 2 standards. We recommend a modern standards-compliant browser for the best experience. Mozilla Firefox and Opera are our favorites.
Changes from version 3.5 to 4.0
Intron positions are shown in protein schematics
For proteins that match any of the Ensembl predictions, SMART will show intron positions as vertical coloured lines in graphical representations (see example). This information is retrieved from a pre-calculated mapping of Ensembl gene structures to protein sequences.
The vertical line at the end of the protein is not an actual intron, but a mark to show that intron mapping was performed. If that is the only line, there are no introns annotated. If there is no line at all, there is no data available in Ensembl for that particular sequence.
You can switch off intron display on your SMART preferences page.
Alternative splicing information
Since SMART now incorporates Ensembl genomes, the Additional information page shows a list of alternative splices of the gene encoding the analyzed protein (if there are any). It is possible to either display SMART protein annotation for any of the alternative splices, or get a graphical multiple sequence alignment of all of them.
Orthology information
There are 2 separate sets of orthologs for each Ensembl protein: 1:1 reciprocal best matches in other genomes, and orthologous groups with reciprocal best hits from all genomes analyzed (i.e. each of these proteins has exactly one ortholog in all 6 genomes). This data is displayed on the Additional information page.
Graphical multiple sequence alignments
Orthologous groups and different alternative splices can be displayed as graphical multiple sequence alignments. Proteins are aligned using ClustalW. Domains, intrinsic features and introns are mapped onto the alignment, with their positions adjusted according to gaps (black boxes).
Changes from version 3.4 to 3.5
Features of all Ensembl genomes are stored in SMART
You can use standard Ensembl protein identifiers (for example ENSP00000264122) and sequences in all queries.
Batch access
Quickly display results for hundreds of sequences. Currently limited to 500 sequences per access, 2000 per day.
Smarter handling of identical sequences
If there are multiple IDs associated with a particular sequence, an extra table will be displayed showing all of them.
Changes from version 3.3 to 3.4
Search structure based profiles using RPS-Blast
Clicking on the 'search schnipsel and structures' checkbox will now also initiate a search of profiles based on SCOP domain families using RPS-Blast. These profiles were kindly provided by Steffen Schmidt (see Schmidt et al., J Chem Inf Comput Sci 2002 (42) 405-7).
Improved Architecture analysis
In addition to standard Domain selection querying, it is now possible to do queries based on GO (Gene Ontology, click here for more info) terms associated with domains. In the first step, you get a list of domains matching the GO terms entered. After selecting the domains of interest from the list, proteins containing those domains are displayed. Use the Taxonomic selection box to limit the results based on taxonomic ranges.
Pfam domains are stored in the database
SMART database now contains precomputed results for all Pfam domains. To use Pfam domains in the architecture queries, prepend the domain name with 'Pfam:' (for example: TyrKc AND Pfam:Fz AND TRANS).
Try our SMART Toolbar for Mozilla web browser. Click here for more info.
Changes from version 3.2 to 3.3
Fantastic new protein picture generator
Proteins are now displayed as dynamically generated PNG images. This means that you can download the entire protein representation as a single image. You have our permission to do this and use these diagrams in any way that you like; do acknowledge us, though. All domain bubbles have been script-generated using The Gimp and its Perl-Fu extension. The script is GPL'd and is available for download.
Fancy colour alignments
All alignments we generate are coloured in using Leo Goodstadt's excellent CHROMA program. CHROMA is available from here, and it's great. You may experience some problems if you're using a klunky browser. This will be fixed when you change your browser.
Excellent transmembrane domains
These are now calculated using the fine TMHMM2 program (the main web site is here), kindly provided by Anders Krogh and co-workers. You can read about the method here.
Selective SMART
We now store intrinsic features such as transmembrane domains and signal peptides in our database. This means that these can be queried for (e.g. SIGNAL AND TyrKc). The feature names are: signal peptide, SIGNAL; transmembrane domain, TRANS; coiled coil, COIL.
Technical changes
SMART code has been modified to run under the Apache mod_perl module. SMART is now using Apache::DBI for persistent database connections. The database engine has been updated to PostgreSQL 7.1. These changes resulted in significant speed improvements.
Bug fixes as usual
The show_many_proteins script now uses the POST method, so there is no longer a limit on the number of proteins you can display. Fixed a display problem with proteins having thousands of different representations (if you try to display those, make sure you have a good browser (like Mozilla or Opera) and a bunch of RAM).
Changes from version 3.1 to 3.2
Start page
There is an indicator of the current database status in the header now. If the database is down or is being updated, you'll know immediately.
Literature
Changed literature identifiers to PMID. New secondary literature generator that parses all neighbouring papers, not just the first 100.
Numerous small bug fixes and improvements
Changes from version 3.0 to 3.1
Startup page
The start page now includes selective SMART and allows searching for keywords in the annotation of domains.
Schnipsel Blast
The results of a schnipsel Blast search are included in the bubble diagram. Additionally, PDB is searched.
Taxonomic breakdown
When selecting multiple proteins, e.g. via selective SMART or the annotation pages, an overview of the taxonomy of all species (tax break) is offered.
Links
Links don't contain version numbers. This allows stable links from external sources.
Selective SMART
Allows searching for multiple copies of domains and is case insensitive. You can now search for e.g. Sh3 AND sH3 AND sh3.
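The behaviour described for selective SMART queries (AND-combined feature names, case-insensitive matching, and repeated names meaning multiple copies) can be sketched as below. This is only an illustration of the matching logic under those stated rules; it is not SMART's query engine, it handles nothing but AND, and the function name is hypothetical.

```python
import re
from collections import Counter


def matches_query(query, features):
    """Check whether a protein's features satisfy an AND-only query.

    query: string such as "SIGNAL AND TyrKc" or "Sh3 AND sH3 AND sh3".
    features: list of domain/feature names found on the protein,
        one entry per copy.
    """
    # Repeating a name in the query asks for that many copies.
    terms = re.split(r"\s+AND\s+", query.strip())
    required = Counter(term.lower() for term in terms)
    # Matching ignores case, so SH3, Sh3 and sh3 are the same domain.
    present = Counter(name.lower() for name in features)
    return all(present[name] >= copies for name, copies in required.items())


print(matches_query("Sh3 AND sH3 AND sh3", ["SH3", "SH2", "SH3", "SH3"]))
print(matches_query("SIGNAL AND TyrKc", ["TyrKc", "TRANS"]))
```

Counting names rather than testing simple membership is what lets the same query syntax express "at least three SH3 domains".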
Annotation
You can align your query sequence to the SMART alignment using hmmalign.
Update of underlying database
SMART now uses PostgreSQL 6.5.2.
Changes from version 2.0 to 3.0
Digest output
SMART now only produces a single diagram representing a best interpretation of all the annotation that has been performed. A comprehensive summary of the results is also provided in table format.
selective SMART
Selective SMART allows you to look for proteins with combinations of specific domains in different species or taxonomic ranges.
alert SMART
The SMART database gets updated about once a week. If you are interested in specific domains, or combinations of them, in specific taxonomic ranges, you can use the SMART alerting service. This provides the identities of newly-deposited proteins that match your query.
Domain queries
You can ask for proteins having the same domain order or composition as your query protein.
SMARTed Genomes
You can get the result of SMARTing the genomes of Caenorhabditis elegans and Saccharomyces cerevisiae via the annotation pages.
Faster PFAM searches
The PFAM searches now run on a PVM cluster.
Changes from version 1.03 to 2.0
Domain coverage
The original set of signalling domains has now been extended to include extracellular domains.
Default search method is now HMMer
The previously-used underlying methodology of SMART (i.e. SWise) tended to over-extend gaps, leading to problems in defining domain borders. Although a SWise search of the SMART database remains possible (using the 'Wise searching SMART database' option available from the Home Page), the default searching method is now HMMer2. This now allows improved statistical estimates (Expectation- or E-values) of the significance of a domain hit.
Complete rewrite of the annotation pages
As with the previous version (1.03), annotation pages provide information concerning domain functions and include hyperlinks to PubMed. However, we now offer automatically-derived data showing the taxonomic range and the predicted cellular localisation of proteins containing the domain in question.
Literature database
In addition to the manually derived literature sources given in version 1.03, we now provide automatically-derived secondary literature from the annotation pages. These are extracted from PubMed and are cross-linked to additional abstracts.
The CATH domain structure database: new protocols and classification levels give a more comprehensive resource for exploring evolution
Lesley H Greene, Tony E Lewis, Sarah Addou, Alison Cuff, Tim Dallman, Mark Dibley, Oliver Redfern, Frances Pearl, Rekha Nambudiry, Adam Reid, Ian Sillitoe, Corin Yeats, Janet M Thornton and Christine A Orengo
WHAT'S NEW
We report the latest release (version 3.0) of the CATH protein domain database (http://www.cathdb.info). There has been a 20% increase in the number of structural domains classified in CATH, up to 86 151 domains. Release 3.0 comprises 1110 fold groups and 2147 homologous superfamilies. To cope with the increases in diverse structural homologues being determined by the structural genomics initiatives, more sensitive methods have been developed for identifying boundaries in multi-domain proteins and for recognising homologues. The CATH classification update is now being driven by an integrated pipeline that links these automated procedures with validation steps, which have been made easier by the provision of information-rich web pages summarising comparison scores and relevant links to external sites for each domain being classified. An analysis of the population of domains in the CATH hierarchy and several domain characteristics are presented for version 3.0. We also report an update of the CATH Dictionary of homologous structures (CATH-DHS), which now contains multiple structural alignments, consensus information and functional annotations for 1459 well populated superfamilies in CATH. CATH is directly linked to the Gene3D database, which is a projection of CATH structural data onto ~2 million sequences in completed genomes and UniProt.
GENERAL INTRODUCTION
The number of new structures being deposited in the Protein Data Bank (PDB) continues to grow at a considerable rate. In addition, structures being targeted by world-wide structural genomics initiatives are more likely to be novel or only very remotely related to domains previously classified. Only 2% of structures currently solved by conventional crystallography or NMR are likely to adopt novel folds (see Figures 1 and 2). A higher proportion of new folds are expected to be solved by structural genomics structures. Although the influx of more diverse structures and subsequent analysis will inform our understanding of how domains evolve, it has resulted in increasing lags between the numbers of structures being deposited and classified. In response to this situation, we have significantly improved our automated and manual protocols for domain boundary assignment and homologue recognition.
Figure 1: Annual decrease in the percentage of new structures classified in CATH which are observed to possess a novel fold. The raw data for years 1972-2005 was fit to a single exponential equation by nonlinear regression using Sigma Plot (SPSS Version
Figure 2: Annual proportion of protein structures deposited in the PDB which are classified in CATH, rejected or pending classification. The colour scheme reflects different categories of PDB chains. Black: not accepted by the CATH criteria. Red: unprocessed chains
Significant changes have been implemented in the CATH classification protocol to achieve a more highly automated system. A seamless flow of structures between the constituent programs has been achieved by building a pipeline which integrates web services for each major comparison stage in the classification (see Figure 3). Secondly, completely automatic decisions are now being made for new protein chains with close relatives already assigned in the CATH database. There are two situations that preclude the CATH update process from being fully automated. We rely on expert manual curation for particularly challenging protein domain boundary assignments (DomChop stage) and also for classifications of remote folds and homologues (HomCheck stage). These two manual stages will remain an integral part of the system (Figure 3).
Figure 3 Flow diagram of the CATH classification pipeline. This schematic illustrates the processes involved in classifying newly determined structures in CATH. The CATH update protocol workflow from new chain to assigned domain is split into two main processes
In this paper we report our ongoing development of the automated procedures. These critical new features should better enable CATH to keep pace with the PDB (3) and facilitate its development. Key statistics on the domain structure populations and characteristics are also presented.
A REVISED CATH CLASSIFICATION HIERARCHY: CATHSOLID
In order to provide information on the sequence diversity between superfamily members, we have introduced additional levels into the CATH hierarchy. The CATH hierarchical classification scheme now consists of nine levels. Class is derived from secondary structure content, and Architecture describes the gross orientation of secondary structures independent of connectivity. The Topology level clusters structures into fold groups according to their topological connections and numbers of secondary structures. The Homologous superfamilies cluster proteins with highly similar structures, sequences and/or functions (4,5). The new extension of the CATH classification system now includes five 'SOLID' sequence levels. S, O, L and I further divide domains within the H-level using multi-linkage clustering based on similarities in sequence identity (35, 60, 95 and 100%) (see Table 1). The D-level acts as a counter within the I-level and is appended to the classification hierarchy to ensure that every domain in CATH has a unique CATHsolid identification code (see Table 1). Specific details on the nature of the SOLID-levels can be found in the 'General Information' section of the CATH website, http://www.cathdb.info. CATH only includes experimentally determined protein structures with a 4 Å resolution or better, 40 residues in length or longer, and having 70% or more side chains resolved.
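The inclusion criteria and the nine-level identifier described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class and function names are hypothetical, the thresholds are the ones read from the paragraph (4 Å, 40 residues, 70% of side chains), and the dot-separated nine-field code format is assumed rather than taken from the CATH documentation.

```python
from dataclasses import dataclass

# Level letters of the CATHSOLID hierarchy, in order.
LEVELS = ("C", "A", "T", "H", "S", "O", "L", "I", "D")


@dataclass
class Structure:
    """Hypothetical summary of a candidate structure."""
    resolution_angstrom: float            # lower values mean better resolution
    length_residues: int
    fraction_side_chains_resolved: float  # between 0.0 and 1.0


def passes_inclusion_criteria(s: Structure) -> bool:
    """Apply the three inclusion thresholds quoted in the text."""
    return (
        s.resolution_angstrom <= 4.0
        and s.length_residues >= 40
        and s.fraction_side_chains_resolved >= 0.70
    )


def parse_cathsolid(code: str) -> dict:
    """Split an assumed dot-separated nine-field code into named levels."""
    parts = code.split(".")
    if len(parts) != len(LEVELS):
        raise ValueError(f"expected {len(LEVELS)} fields, got {len(parts)}")
    return dict(zip(LEVELS, (int(p) for p in parts)))


if __name__ == "__main__":
    print(passes_inclusion_criteria(Structure(2.1, 150, 0.95)))
    print(parse_cathsolid("3.40.50.200.1.1.1.1.1"))
```

The first four fields correspond to the Class, Architecture, Topology and Homologous superfamily levels; the remaining five are the sequence-based SOLID levels, with D acting as the counter that makes each code unique.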
Table 1 CATH version 3.0 statistics
DOMAIN BOUNDARY ASSIGNMENTS
We have further improved our automated domain boundary prediction method, CATHEDRAL (4). This is used to search a newly determined multi-domain structure against a library of representative structures from different fold groups in the CATH database to recognise constituent domains. CATHEDRAL performs an initial rapid secondary structure comparison between structures, using graph theory to identify putative fold matches, which are then more carefully aligned using a slower, more accurate dynamic programming method. A new scoring scheme has been implemented which combines information on the class of the domains being compared, their sizes, similarity in the structural environments and the number of equivalent residues. A support vector machine is used to combine the different scores and select the best fold match for each putative domain in a new multi-domain protein.
Benchmarking against a set of 964 'difficult' multi-domain chains, whose 1593 constituent domains were remotely related to folds in CATH (<35% sequence identity) and originated from 245 distinct fold groups and 462 superfamilies, showed that 90% of domains within these chains could be assigned to the correct fold group, and for 78% of them the domain boundaries were within ±15 residues of boundaries assigned by careful manual validation. Larger variations in domain boundaries are often due to the fact that in many families significant structural variation can occur during evolution, so that distant relatives vary considerably in size. If no close relative has been classified in CATH, it is likely that the only CATHEDRAL match will be to a relative with significant structural embellishments, thus making it harder to determine the correct boundaries.
Since domain boundary assignment of remote homologues is one of the most time-consuming stages in the classification, we combine multiple sources of information for each new structure on a web page to guide manual curation. Pages display scores from a range of algorithms, which include structure based methods (CATHEDRAL (4), SSAP (6,7), DETECTIVE (8), PUU (9), DOMAK (10)), sequence based methods such as hidden Markov models (HMMs) (11), and relevant literature. These pages are now viewable for information on putative boundaries for new multi-domain structures currently being classified in CATH (e.g. http://www.cathdb.info/cgi-bin/cath/Chain.pl?chain_id=2g3aA).
For protein chains which are closely related to chains that are already chopped in CATH, an automated protocol has been developed (ChopClose). ChopClose identifies any previously chopped chains that have sufficiently high sequence identity and overlap with the query chain. Using SSAP (6), the query is aligned against each of these chains in turn, and in each case the domain boundaries are inherited across the alignment. The process of inheriting the boundaries often requires some adjustments to be made to account for insertions, deletions or unresolved residues. If the inheritance from one of the chains meets various criteria (SSAP score ≥80, sequence identity >80%, RMSD ≤6.0 Å, longest end extension ≤10 residues, etc.) then the resulting boundaries are used to chop the chain automatically (AutoChop). For cases where ChopClose's best result does not meet all the criteria for automatic chopping, it is provided as support information for a manual domain boundary assignment.
Refer to Figure 3 for the location of AutoChop/ChopClose within the CATH update protocol.
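The decision between automatic and manual chopping can be illustrated with a small Python sketch. It is not the ChopClose implementation: the names are hypothetical, only the four criteria listed in the text are checked (the text ends the list with "etc", so the real protocol applies further criteria), and the thresholds are the ones as read from the paragraph above.

```python
from dataclasses import dataclass


@dataclass
class InheritedChopping:
    """Hypothetical summary of boundaries inherited from one chopped chain."""
    ssap_score: float
    sequence_identity_pct: float
    rmsd_angstrom: float
    longest_end_extension: int  # residues


def meets_autochop_criteria(r: InheritedChopping) -> bool:
    """Check only the four thresholds listed explicitly in the text."""
    return (
        r.ssap_score >= 80
        and r.sequence_identity_pct > 80
        and r.rmsd_angstrom <= 6.0
        and r.longest_end_extension <= 10
    )


def route(results: list) -> str:
    """Return 'AutoChop' if any inherited chopping passes, else 'manual'."""
    if any(meets_autochop_criteria(r) for r in results):
        return "AutoChop"
    return "manual"


if __name__ == "__main__":
    good = InheritedChopping(92.0, 97.5, 1.2, 3)
    weak = InheritedChopping(78.0, 85.0, 4.0, 5)
    print(route([weak, good]))
    print(route([weak]))
```

When no candidate passes, the best result would still be shown to the curator as supporting information, as the text describes.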
NEW HOMOLOGUE RECOGNITION METHODS
We have assessed a number of HMM based protocols for improving homologue recognition. A new protocol (Samosa), exploiting models built using multiple structure alignments to improve accuracy, gives some improvements in sensitivity (4–5%). However, a protocol exploiting an 8-fold expanded HMM library based on sequence relatives of structural domains gives an increase of nearly 10% in sensitivity (12). In addition, HMM–HMM-based approaches have been implemented using the PRC protocol of Madera and co-workers (http://supfam.org/PRC). These allow recognition of extremely remote homologues, some of which are not easily detected by the structure comparison methods (discussed further below). HMM based database scans developed for the CATH classification protocol are collectively referred to as HMMscan below.
For some very remotely related homologues, confidence in an assignment can be improved by combining information from multiple prediction methods. We have investigated the benefits of using machine learning methods to do this automatically. A neural network was trained using a dataset of 14 000 diverse homologues (<35% sequence identity) and 14 000 non-homologous pairs, with data from different homologue comparison methods including structure comparison (CATHEDRAL, SSAP), sequence comparison (HMM–HMM) and information on functional similarity. The latter was obtained by comparing EC classification codes between close relatives of the distant homologues and using a semantic similarity scoring scheme for comparing GO terms, based on a method developed by Lord et al. (13). On a separate validation set of 14 000 homologous pairs and 14 000 non-homologous pairs, 97% of the homologues can be recognised at an error rate of <4%.
NEW UPDATE PROTOCOL
Automatic methods
Previously, CATH data was generated using a group of independent
programs and flat files. Over the past two years we have developed an
update protocol for CATH that is driven by a suite of programs with a
central library and a PostgreSQL database system. A classification
pipeline has been established which links, in a completely automated
fashion, the different programs that analyse the sequences and structures
of both protein chains and domains. The CATH update protocol can
essentially be divided into two parts: domain boundary assignment and
domain homology classification (see Figure 3). The aim of the protocol is
to minimise manual assignment and provide as much support as
possible when manual validation is necessary.
Processing of both parts of the classification protocol is similar, requiring related meta-data and the triggering of the same automated algorithms. Methods include pairwise sequence similarity comparisons and scans by other homologue detection or fold recognition algorithms, such as HMMscan and CATHEDRAL, that provide data for either manual or automated assignment. Many of the automated steps in the protocol have been established as a web service, and the pipeline integrates automated steps together with 'holding stages' in which domains are held prior to processing and await the completion of manual validation of predictions (see below).
Web pages to support manual validation
For each manual stage in the classification (domain boundary assignment, DomChop, and homologue recognition, HomCheck) we have developed a suite of web pages bringing together all available meta-data from prediction algorithms (e.g. DBS, CATHEDRAL and HMMscan for DomChop; CATHEDRAL and HMMscan for HomCheck), information from the literature, and information from other family classifications with relevant data (e.g. Pfam). For each protein or domain shown on the pages, information on the statistical significance of matches is presented. The web pages will shortly be made viewable and will provide interim data on protein chains and domains not fully classified in CATH, for biologists interested in any entries pending classification.
OVERVIEW OF THE CURRENT RELEASE (VERSION 3.0)
Assigning domain boundaries and relationships between protein structures is computationally challenging. Since the last CATH release (version 2.6), the number of domains in the CATH database has increased by 20% in version 3.0 and now totals 86 151. This is a more than 10-fold increase in the number of domains classified in CATH since its creation. Improvements in automation, and also in the web based resources used to aid manual validation, have allowed us to increase the proportion of hard-to-classify structures processed in CATH, and this is reflected in a significant increase in the proportion of new folds in the database (now more than 1000). The detailed breakdown of numbers of domains in the nine CATH levels is given in Table 1. We conducted an analysis of the domains in version 3.0 and have derived statistics for several fundamental features.
Percentage of new topologies
An analysis of the percentage of new folds arising from the early 1970s to the present is shown in Figure 1. The number of new folds has been decreasing over time with respect to the number of new structures being deposited, and it can be seen that currently approximately 2% of new structures classified in CATH are observed to be novel folds. For comparison, the number of domain structures solved over time is also graphically represented in Figure 1.
Number of domains within a protein chain
Integral to the construction of the CATH database is designating domain boundaries. We conducted an analysis of the number of chains versus the number of domains in a chain. It is interesting to note that 64% of all protein structures currently solved and classified in CATH are single domain chains (data not shown). The next most prevalent are two domain chains (27%), and following this we find that the number of chains containing three or more domains rapidly decreases. The average size of the single domain chains is 159 residues.
The CATH Dictionary of Homologous Superfamilies (CATH-DHS)
The CATH-DHS has also been recently updated. Data on structural similarity and superfamily variability is presented as a significant update to the Dictionary of Homologous Superfamilies (DHS) web-resource (14). The DHS also provides functional annotations of domains within each H-level (superfamily) in CATH v2.5.1.
For each superfamily, pair-wise structural similarity scores between relatives, measured by SSAP, are presented. The DHS now contains 3307 multiple structural alignments for 1459 superfamilies. For each superfamily, multiple alignments are generated for all the relatives and also for subgroups of structurally similar relatives and sequence similar relatives. Alignments are performed using the residue-based CORA algorithm (15) and presented both as CORAPLOTS (14) and in the form of a 2DSEC diagram (16), alongside co-ordinate data of the superposed structures in PDB format. Sequence representations of the alignments are available to download in FASTA format. In the CORAPLOT images of the multiple alignment, residues in each domain are coloured according to ligand binding and residue type. EquivSEC plots are also shown that describe the variability in orientation and packing between equivalent secondary structures (16).
To identify sequence relatives for CATH superfamilies, sequences from UniProt (17) were scanned against HMMs of all CATH domains (12). Homologous sequences were identified as those hits with an E-value < 0.01 and a 60% residue overlap with the CATH domain. This protocol recognised over one million domain sequences in UniProt which could be integrated in the CATH-DHS. The harvested sequences in each superfamily were compared against other relatives by BLAST (18) to determine the pair-wise sequence identity and then clustered at appropriate levels of sequence identity (35% and 95%) using multi-linkage clustering. Information and links to other functional databases, ENZYME (19), GO (Gene Ontology Consortium 2000), KEGG (20), COG (21) and SWISSPROT (22), are also included by BLASTing the sequences from each superfamily against sequences provided by these resources. Only hits with 95% sequence identity and an 80% residue overlap were used to annotate sequences.
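The hit-selection step described above amounts to a two-condition filter. A minimal Python sketch follows, assuming hypothetical field names and taking the thresholds as we read them from the text (E-value below 0.01, at least 60% of the CATH domain covered); both are exposed as parameters because the extracted text is ambiguous about the exact values.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    """Hypothetical record for one HMM match of a sequence to a CATH domain."""
    sequence_id: str
    e_value: float
    aligned_residues: int  # residues of the domain covered by the match
    domain_length: int     # total residues in the CATH domain


def is_homologous_hit(hit: Hit, max_e_value: float = 0.01,
                      min_overlap: float = 0.60) -> bool:
    """Keep a hit only if it is both significant and covers enough of the domain."""
    overlap = hit.aligned_residues / hit.domain_length
    return hit.e_value < max_e_value and overlap >= min_overlap


def select_homologues(hits: list) -> list:
    """Return the identifiers of hits passing both thresholds."""
    return [h.sequence_id for h in hits if is_homologous_hit(h)]


if __name__ == "__main__":
    hits = [
        Hit("seqA", 1e-20, 110, 120),  # significant, good coverage
        Hit("seqB", 1e-20, 50, 120),   # significant, poor coverage
        Hit("seqC", 0.5, 118, 120),    # good coverage, not significant
    ]
    print(select_homologues(hits))
```

The same pattern, with 95% identity and 80% overlap as the thresholds, would describe the stricter filter used for transferring functional annotations.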
Recent analysis of structural and functional divergence in highly populated CATH superfamilies (>5 structural relatives with <35% sequence identity) has been undertaken using data from the DHS. The 2DSEC algorithm was used to analyse multiple structural alignments of families and identify highly conserved structural cores and secondary structure embellishments, or decorations, to the common core. In some large superfamilies extensive embellishments were observed outside the core, and although these secondary structure insertions were frequently discontinuous in the protein chain, they were often co-located in 3D space (16). In many cases manual inspection revealed that the embellishment had aggregated to form a larger structural feature that was modifying the active site of the domain or creating new surfaces for domain or protein interactions. Data collected in the DHS clearly shows a relationship between structural divergence within a superfamily, sequence divergence of this superfamily amongst predicted domains in the genomes, and the number of distinct functional groups that can be identified for the superfamily (see Figure 4).
Figure 4 Relationship between sequence variability, structural variability and functional diversity in CATH superfamilies. Structural variation in a CATH superfamily as measured by the number of diverse structural subgroups (SSAP score <80 between groups)
LATERAL LINKS ACROSS THE CATH HIERARCHY TO CAPTURE EVOLUTIONARY DIVERGENCE AND EXPLORE THE STRUCTURAL CONTINUUM
Our analysis of structural divergence in CATH superfamilies (16) has revealed families where significant changes in the structures had occurred. In some cases 5-fold differences in the sizes of domains were identified, and sometimes it was apparent that the 'folds' of these very diverse relatives had effectively changed. Therefore in these superfamilies more than one fold group can be identified, effectively breaking the hierarchical nature of the CATH classification, which implies that each relative within a CATH homologous superfamily should belong to the same CAT fold group.
In addition, an 'all versus all' HMM–HMM scan between all superfamily representatives revealed several cases of extremely remote homologues which had been classified into separate superfamilies and yet match with significant E-values. Structure comparison had failed to detect the relationship between these superfamilies because the structural divergence of the relatives was so extreme, sometimes constituting a change in architecture as well as fold group. In these cases homology was only suggested by the HMM-based scans and then manually validated by considering functional information and detailed evidence from the literature. In order to capture information on these distant homologies, links have been created between the superfamilies, both on our web pages and in the CATH database. The data can now be found as a link from the CATH homepage (http://www.cathdb.info).
In the near future we also plan to provide web pages presenting cases of significant structural overlaps between superfamilies or fold groups. For these cases we are not currently able to find any additional evidence to support a distant evolutionary relationship, and these examples highlight the recurrence of large structural motifs between some folds and the existence of a structural continuum in some regions of fold space.
FEATURES
The CATH database can be accessed at http://www.cathdb.info. The web interface may be browsed or alternatively searched with PDB codes or CATH domain identifiers. There is also a facility for keyword searches. With the version 3.0 release we now make the raw and processed data files available, which include, for example, CATH domain PDB files, sequences and dssp files, and they can be accessed through the CATH database main page. The Gene3D resource can be accessed through the CATH database or directly at http://www.cathdb.info/Gene3D. The DHS can be accessed through the CATH database or directly at http://www.cathdb.info/bsm/dhs.
Structural similarity is assessed using an automatic method (SSAP) (3,4) which scores 100 for
identical proteins and generally returns scores above 80 for homologous proteins. More distantly
related folds generally give scores above 70 (Topology or fold level), though in the absence of any
sequence or functional similarity this may simply represent examples of convergent evolution, reinforcing the hypothesis that there exists a limited number of folds in nature (5,6).
Abstract
We report the latest release (version 1.4) of the CATH protein domains database
(http://www.biochem.ucl.ac.uk/bsm/cath). This is a hierarchical classification of 13 359 protein domain structures into evolutionary families and structural groupings. We currently identify 827
homologous families in which the proteins have both structural similarity and sequence and/or
functional similarity. These can be further clustered into 593 fold groups and 32 distinct architectures.
Using our structural classification and associated data on protein functions stored in the database (EC
identifiers, SWISS-PROT keywords and information from the Enzyme database and literature), we have been able to analyse the correlation between the 3D structure and function. More than 96% of
folds in the PDB are associated with a single homologous family. However, within the superfolds,
three or more different functions are observed. Considering enzyme functions, more than 95% of clearly homologous families exhibit either single or closely related functions, as demonstrated by the
EC identifiers of their relatives. Our analysis supports the view that determining structures, for
example as part of a 'structural genomics' initiative, will make a major contribution to interpreting genome data.
Introduction FOR CATH
The CATH classification of protein domain structures was established in 1993 (1) as a
hierarchical clustering of protein domain structures into evolutionary families and structural
groupings, depending on sequence and structure similarity. There are four major levels,
corresponding to protein class, architecture, topology or fold, and homologous family (Fig. 1).
Since 1995, information about these structural groups and protein families has been accessible
over the Web (http://www.biochem.ucl.ac.uk/bsm/cath), together with summary information
about each individual protein structure (PDBsum) (2).
CATH consists of both phylogenetic and phenetic descriptors for protein domain relationships.
At the lowest levels in the hierarchy, proteins are grouped into evolutionary families
(Homologous families) for having either significant sequence similarity (≥35% identity) or high
structural similarity and some sequence similarity (≥20% identity).
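The grouping rule for homologous families can be written as a single boolean test. The Python sketch below is illustrative only: the function name is hypothetical, and the text does not define 'high structural similarity' at this point, so an SSAP score of at least 80 is assumed here, borrowing the threshold the notes quote elsewhere for clear homologues.

```python
def same_homologous_family(seq_identity_pct: float, ssap_score: float,
                           high_ssap: float = 80.0) -> bool:
    """Return True if two domains satisfy the stated grouping rule.

    Rule: sequence identity of at least 35%, OR high structural similarity
    combined with at least 20% sequence identity. The SSAP cut-off used for
    'high structural similarity' is an assumption of this sketch.
    """
    significant_sequence = seq_identity_pct >= 35.0
    structure_plus_some_sequence = (
        ssap_score >= high_ssap and seq_identity_pct >= 20.0
    )
    return significant_sequence or structure_plus_some_sequence


if __name__ == "__main__":
    print(same_homologous_family(40.0, 65.0))  # sequence alone is enough
    print(same_homologous_family(25.0, 85.0))  # structure plus some sequence
    print(same_homologous_family(25.0, 70.0))  # neither condition met
    print(same_homologous_family(15.0, 90.0))  # structure without sequence support
```

The last case shows why structural similarity alone places two domains in the same fold group rather than the same homologous family.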
The CATH database of protein structures contains approximately 18000 domains
organized according to their (C)lass, (A)rchitecture, (T)opology and (H)omologous
superfamily. Relationships between evolutionarily related structures (homologues) within the
database have been used to test the sensitivity of various sequence search methods in order to
identify relatives in Genbank and other sequence databases. Subsequent application of the most
sensitive and efficient algorithms, gapped BLAST and the profile based method Position Specific
Iterated Basic Local Alignment Search Tool (PSI-BLAST), could be used to assign structural data to
between 22% and 36% of microbial genomes in order to improve functional annotation and
enhance understanding of biological mechanism. However, on a cautionary note, an analysis of
functional conservation within fold groups and homologous superfamilies in the CATH database
revealed that whilst function was conserved in nearly 55% of enzyme families, function had
diverged considerably in some highly populated families. In these families functional properties
should be inherited far more cautiously, and the probable effects of substitutions in key functional
residues must be carefully assessed.
Figure 1
Schematic representation of the (C)lass, (A)rchitecture and (T)opology/fold levels in the CATH
database.
Figure 2
Snapshot of a web page showing data available in the CATH dictionary of homologous superfamilies
for the subtilisin family (CATH id 3.40.50.200). Tables display the PDB codes for non-identical relatives in the family, together with EC identifier codes and information about the enzyme reactions.
The multiple structural alignment shown has been coloured according to secondary structure
assignments (red for helix, blue for strands).
The Architecture level in CATH groups proteins whose folds have similar 3D arrangements of secondary structures (e.g. barrel, sandwich or propeller) regardless of their connectivity, whilst the
top level, Class, simply reflects the proportion of α-helix or β-strand secondary structures. Three
major classes are recognised, mainly-α, mainly-β and α-β, since analysis revealed considerable overlap between the α+β and alternating α/β classes originally described by Levitt and Chothia (7).
Before classification, multidomain proteins are first separated into their constituent folds using a
consensus method which seeks agreement between three independent algorithms (8). Whilst the protocol for updating CATH is largely automatic (9), several stages require manual validation, in
particular establishing domain boundaries in proteins for which no consensus could be reached and checking the relationships of very distant homologues and proteins having borderline fold similarity.
Although there are plans to assign the more regular architectures automatically, all architecture
groupings are currently assigned manually.
A homologous family Dictionary is now available within CATH which contains functional data, where available, for each protein within a homologous family. This includes EC identifiers, SWISS-PROT keywords and information from the Enzyme database or the literature (Fig. 2). Multiple
structure based alignments are also available, coloured according to secondary structure assignments
or residue properties, and there are schematic plots showing domain representations annotated by protein-ligand interactions (DOMPLOTS) (A.E. Todd, C.A. Orengo and J.M. Thornton, submitted
to Protein Engng). The topology of each domain is illustrated by schematic TOPS diagrams
(http://www3.ebi.ac.uk/tops; 10).
Figure 3
CATH wheel plot showing the population of homologous families in different fold groups, architectures and classes. The wheel is coloured according to protein class (red: mainly-α; green:
mainly-β; yellow: αβ; blue: few secondary structures). The size of the outer wheel represents the
number of homologous families in CATH, whilst each band in the outer wheel corresponds to a single
fold family. The size of each fold 'band' therefore reflects the number of homologous families having
that fold. It can be seen that most fold families contain a single homologous family. The superfold families are shown as paler bands containing many homologous families. The inner wheel shows the
population of homologous families in the different architectures.
We have also recently set up a Web Server (11) which enables the user to scan the CATH database with a newly determined protein structure and identify possible fold similarities or evolutionary relationships. There are also plans to incorporate sequence searches (using BLAST or PSI-BLAST)
(12) to identify a probable fold for a new sequence.
The latest release of CATH (version 1.4, April 1998) contains 9342 protein chains from the PDB (13)
which divide into 13 359 domain folds. Currently 32 different architectures are recognised. Since the
last release, three new architectures have been described, including the five-bladed α-β propeller. Grouping proteins on the basis of sequence, structure and functional similarity gives 827 evolutionary
homologous families (H-level), whilst recognising more distant structural similarity with no
accompanying sequence or function similarity gives rise to 593 different fold groups (T-level).
The population of the different levels in the CATH hierarchy is illustrated by the CATH wheel shown
in Figure 3. It can be seen that several highly populated fold families, which we describe as superfolds (6) as they support a diverse range of sequences and more than three different functions, still account
for nearly 30% of non-homologous structures.
Implications for Structural Genomics
As the sequence databases grow rapidly, the need to interpret these sequences and assign functions to
specific genes becomes increasingly important. Many techniques exist for matching protein sequences and thereby inheriting functional information. However, for very distant homologues there is often no
detectable sequence similarity despite conservation of 3D structure and function. For these cases
evolutionary relationships, and thereby functions, can only be assigned by comparing the structures.
Therefore a number of structural genomics initiatives are being proposed (14) which aim to identify
all the folds in nature, with the ultimate goal of being able to predict the function of a new protein from
its known or probable structure. The important questions to ask are: how many more folds do we need
to determine before we have the complete set, and how confident can we be in assigning function between proteins having similar structures?
In the current genomes, on average only 30–46% of sequences can be assigned to a structural family
by recognising sequence similarity to a protein of known structure (15,16). With only ~600 unique
structures currently in the PDB compared with ~20 000 sequence families, it is clear that we still
need to determine many more structures if we are to understand biology at the molecular level.
However, analysis of recently deposited structural data is very revealing. Figure 4a illustrates the
distribution of 2159 new structural domains classified in the 10 months from June 1997 to March
1998. A large proportion of these (79%) were clearly homologous (≥30% identity) to proteins of known structure.
Of the remaining 443 structures (Fig. 4b), corresponding to 'new' sequences, we found only 8% were
novel folds, the remainder resembling a previously determined structure. Many of these, 199 (45%),
could be identified as clear homologues by having significant structure and sequence similarity (SSAP
≥80 and ≥20% sequence identity). A further 169 (38%) were probable homologues as, although the sequence identity was below 20%, they had functional similarity and/or gave significant scores using
sequence search methods designed to detect very distant homologues (PSI-BLAST) (12). There
remained a further 40 (9%) proteins which were analogous, i.e. they had the same fold as a previous
entry but neither the sequence nor the function gave definite evidence of a common ancestor.
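The percentages quoted above can be checked with simple arithmetic. The Python sketch below uses only the counts given in the text (2159 new domains, 443 with 'new' sequences, and 199, 169 and 40 in the three homologue or analogue categories); the number of novel folds is not stated as a count in the text, so it is derived here as the remainder.

```python
TOTAL_NEW_DOMAINS = 2159   # classified June 1997 to March 1998
NEW_SEQUENCE_DOMAINS = 443  # not clearly homologous by sequence

counts = {
    "clear homologue": 199,
    "probable homologue": 169,
    "analogue": 40,
}
# Derived, not quoted in the text: whatever is left must be novel folds.
counts["novel fold"] = NEW_SEQUENCE_DOMAINS - sum(counts.values())


def pct(part: int, whole: int) -> int:
    """Percentage rounded to the nearest whole number."""
    return round(100 * part / whole)


percentages = {name: pct(n, NEW_SEQUENCE_DOMAINS) for name, n in counts.items()}
clearly_homologous = TOTAL_NEW_DOMAINS - NEW_SEQUENCE_DOMAINS

if __name__ == "__main__":
    for name, value in percentages.items():
        print(f"{name}: {counts[name]} domains, {value}%")
    print(f"clearly homologous: {clearly_homologous} domains, "
          f"{pct(clearly_homologous, TOTAL_NEW_DOMAINS)}%")
```

The derived remainder of 35 novel folds out of 443 rounds to the 8% quoted, and the 1716 clearly homologous domains out of 2159 round to the 79% quoted, so the figures in the passage are mutually consistent.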
At the homologous superfamily level in CATH, a more detailed analysis of enzyme functions showed
that the majority of homologous enzyme families in CATH (>90%) contained proteins for which the
first three EC identifiers were the same. Considering those families where homologues have
significant sequence identity (≥20%) after structural alignment, 95% were found to have a single EC
identifier, whilst for families where proteins have more than 30% sequence similarity we observed
that 98% had a single EC code.
Although assigning function on the basis of homology is common practice, it is clear that some
caution should be exercised, particularly where there is little or no sequence similarity. There are also
some clear examples where homologues with significant sequence similarity perform different
functions. The role of 'gene recruitment' is especially clear in the eye lens proteins, which function as
enzymes in other cellular environments but which are used as structural proteins in this context (17).
The extent of such 'gene recruitment' and context-sensitive function is really not known at this time.
For enzymes it is clear that catalytic function can change and evolve, usually to act on a different but
related substrate. Similarly, within the lipocalin family (CATH id 2.40.130.10) several proteins are found with very similar structures which bind different fatty acids in the same region at the base of
the β-barrel (e.g. retinol, bilin, biotin).
Nearly half of the homologous families where two or more different EC numbers were observed
belong to the superfolds. This suggests that if a new protein is assigned to a superfold family, more
caution should be used when inheriting functional information, as there appears to be greater tolerance to changes in sequence and ultimately function for these families. However, it is interesting to note
that many of these were TIM barrel or Rossmann folds. These are superfolds in which the substrate or
ligand commonly binds in the same place. This is in the base of the β-barrel for the TIMs and at the
crossover of the polypeptide chain for the doubly wound Rossmann structures.
Assignment of Function Through Structure
One of the reasons for determining structures is to derive more information to facilitate the assignment of function. From our analysis of proteins in CATH we suggest that structural data can help to assign
function in several ways:
i. The structural data allow recognition of more distant homologues compared with sequence data; in our analysis 83% of structures with novel sequences could be assigned as homologues in this way (note that such assignment of function is again subject to the caveats imposed by 'gene recruitment' discussed above).
ii. The structural data allow detailed inspection of the functional site, to suggest if and how the function may have evolved. For example, if an enzyme has evolved to act on a different substrate, the binding site may reveal or at least suggest possible changes in the substrate.
iii. For the superfolds, similarity of structure does not necessarily mean similarity of function. However, the active site/binding sites are often conserved, e.g. in the TIM barrel or Rossmann fold structures the ligand always binds at the same end of the barrel or sheet.
iv. Some methods have already been developed, and will increasingly be the focus of attention over the next few years, which aim to predict function ab initio from structure. For example, enzymes can often be identified by the presence of a major cleft which also locates the active site (18). Similarly, critical surface patches which are used for molecular recognition in binding other proteins or ligands may be identified using knowledge-based approaches (19,20).
In summary, extrapolating the data from Figure 4 to a new genome, we can expect that, of the 54-70%
of sequences which currently have no obvious sequence matches in the PDB, we will find nearly 80-90% to be homologous to a known family using the structural data alone. For the singlet folds this
will almost certainly reveal some clues to the function. For the superfolds, some folds will reveal
information on the functional class (e.g. enzyme for TIM barrels) or the location of the active site, if
not the specific function. Only 10-20% will be expected to be 'novel' folds. For these, the ab
initio methods referred to above may provide some clues to guide experiments. Therefore it is clear
that determining structures, as part of a 'structural genomics' initiative for example, will make a
major contribution to interpreting genome data.
JAIL
Just Another Interface Library
Interfaces of macromolecules are a valuable basis for analysing the process of molecular recognition. JAIL classifies not only the interfaces between domain architectures but also those between protein chains and those between proteins and nucleic acids.
[Figure: Gt-alpha/Gi-alpha chimera (PDB ID 1GOT) and the interfaces of 1GOT]
Interacting proteins are difficult to crystallize and are rarely present within the Protein Data Bank.
Nevertheless, it is essential to analyse the interacting parts of the proteins to understand the
process of protein-protein docking.
To overcome this problem, we have built up the JAIL database. Since interacting domains
exhibit structural features similar to those of interacting proteins, all known interfaces between interacting
domains of the SCOP database were extracted and classified in JAIL.
Only a part of all protein structures is included in SCOP. In particular, new PDB entries are
not yet annotated. To overcome this problem, all interfaces between protein
chains were additionally calculated and included in the database. This type of interface also comprises
the interacting parts of the assumed biological units. The last important type of interface
provided here is composed of the interacting parts between proteins and nucleic acids.
Overall, the data set consists of about 180000 interfaces.
JAIL is a convenient tool for browsing through the interface library and analysing single
interfaces. However, more general questions require large-scale analysis. For this purpose,
a detailed form enables the compilation of comprehensive non-redundant data sets for
download.
How is an interface defined?
A complete residue is part of an interface if at least one atom of the amino acid is located
within a range of 4.5 Ångström of any atom of the interacting domain or chain. One part of
an interface must consist of at least 5 C-alpha atoms in the case of protein chains. In the
case of nucleic acids, the relevant atoms are calculated on the basis of the phosphorus atoms
of the RNA/DNA backbone.
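The definition above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the cutoff is taken to be 4.5 Å, the "at least 5 C-alpha atoms" rule is read as "at least five residues", and the function names and the dictionary layout (residue id mapped to a list of atom coordinates) are hypothetical, not JAIL's own code or data format.

```python
import math

CUTOFF = 4.5  # Angstrom; assumed value of the distance cutoff


def interface_residues(chain_a, chain_b, cutoff=CUTOFF):
    """Return ids of residues in chain_a that have at least one atom
    within `cutoff` of any atom in chain_b.

    Each chain is a dict: residue id -> list of (x, y, z) atom coordinates.
    """
    atoms_b = [xyz for atoms in chain_b.values() for xyz in atoms]
    hits = []
    for res_id, atoms in chain_a.items():
        # The whole residue counts as soon as one of its atoms is close enough.
        if any(math.dist(a, b) <= cutoff for a in atoms for b in atoms_b):
            hits.append(res_id)
    return hits


def is_valid_interface_part(residues, min_residues=5):
    """Assumed reading of the size rule: one C-alpha per residue,
    so at least five residues are required."""
    return len(residues) >= min_residues


# Toy coordinates, invented for illustration only.
chain_a = {1: [(0.0, 0.0, 0.0)], 2: [(10.0, 0.0, 0.0)]}
chain_b = {1: [(3.0, 0.0, 0.0)]}
contacts = interface_residues(chain_a, chain_b)
```

The all-against-all distance loop is quadratic in the number of atoms; a real implementation on 180000 interfaces would presumably use a spatial index, but the brute-force form keeps the definition visible.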
What are biounits?
The primary coordinate file deposited in the PDB generally contains one asymmetric unit.
The asymmetric unit is the smallest portion of a crystal structure to which crystallographic
symmetry can be applied to generate one unit cell. The biological molecule (biounit) is believed
to be the functional unit of the protein. Frequently those units can be assumed or calculated
when additional information is available. The biological units of many proteins are deposited
in a separate section of the PDB database and can be used for interface calculations.
In which way are redundant interfaces excluded in the download section?
Redundancy is excluded in two different ways: by structure and by sequence. The
sequence clustering is based on the CD-HIT program. The structural clustering is defined by
the protein families and superfamilies of the SCOP classification. The database classifies
proteins by domain architecture.
Which settings in the download section are best for my own research?
The selection of the datasets depends on the type of interactions (protein-protein or protein-
nucleic acid) and the level of diversity that is desired. A maximum sequence identity of 50%
results in a higher diversity than a setting of 95%. The default settings include interfaces of
domain-domain interactions as well as interfaces between interacting chains. All interfaces of
chains that were already covered by the SCOP domain interfaces are excluded by default.
This procedure results in a high number of interfaces that are still diverse enough for
statistical analysis.
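To see why a lower identity threshold yields a smaller but more diverse set, the filtering idea can be sketched as a greedy pass over the sequences. This is not the CD-HIT algorithm and not JAIL's pipeline: the identity measure below is a crude stand-in built on Python's difflib, and the function names and toy sequences are hypothetical.

```python
from difflib import SequenceMatcher


def approx_identity(a, b):
    """Crude identity estimate: characters in matching blocks divided by
    the length of the shorter sequence. A real tool would use a proper
    sequence alignment instead. Sequences must be non-empty."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / min(len(a), len(b))


def pick_representatives(sequences, max_identity):
    """Greedy filter: walk through the sequences (longest first) and keep
    one only if it is no more than `max_identity` identical to every
    representative kept so far."""
    reps = []
    for seq in sorted(sequences, key=len, reverse=True):
        if all(approx_identity(seq, rep) <= max_identity for rep in reps):
            reps.append(seq)
    return reps


# Toy sequences, invented for illustration only.
seqs = ["MKTAYIAKQR", "MKTAYIAKQK", "GGGGSSSSPP"]
diverse = pick_representatives(seqs, 0.50)   # strict threshold
lenient = pick_representatives(seqs, 0.95)   # permissive threshold
```

With the strict threshold the two near-identical sequences collapse into one representative, while the permissive threshold keeps all three.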
What is meant by "show conservation" in Jmol?
The conservation of protein sequences is defined by the mutation rates at each amino acid
position. For JAIL, this information was retrieved from ConSurf. ConSurf is a derived
database merging structural and sequence information.
Database scheme
Search for:
PDB ID, e.g. 1aay
SCOP ID, e.g. d1az0a_
EC number, e.g. 311
Accession number, e.g. P03697
Protein name, e.g. capsid protein
Search in the following interface types:
Domain/Domain (SCOP)
Chain/Chain
Protein/Nucleic
Biounit/Biounit
None
A full-text keyword search and a SCOP text keyword search are also provided.
Consider only interfaces having the following location: intra, inter or don't care.
MMDB
MMDB contains experimentally resolved structures of proteins, RNA and DNA, derived from the Protein Data Bank (PDB), with value-added features such as explicit chemical graphs, computationally
identified 3D domains (compact substructures) that are used to identify similar 3D structures, as well as links to literature, similar sequences, information about chemicals bound to the structures, and more. These connections make it possible, for example, to find 3D structures for homologs of a protein sequence of interest and then interactively view the sequence-structure relationships, active sites, bound chemicals, journal articles and more.
Three-dimensional structures are now known within many protein families, and it is
quite likely, in searching a sequence database, that one will encounter a homolog
with known structure. The goal of Entrez's 3D-structure database is to make this
information, and the functional annotation it can provide, easily accessible to
molecular biologists. To this end, Entrez's search engine provides three powerful
features: (i) Sequence and structure neighbors: one may select all sequences similar
to one of interest, for example, and link to any known 3D structures. (ii) Links
between databases: one may search by term matching in MEDLINE, for example, and
link to 3D structures reported in these articles. (iii) Sequence and structure
visualization: identifying a homolog with known structure, one may view molecular-
graphic and alignment displays to infer approximate 3D structure. In this article we
focus on two features of Entrez's Molecular Modeling Database (MMDB) not
described previously: links from individual biopolymer chains within 3D structures
to a systematic taxonomy of organisms represented in molecular databases, and
links from individual chains (and compact 3D domains within them) to structure
neighbors, i.e. other chains (and 3D domains) with similar 3D structure. MMDB may be
accessed at http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Structure
SUPERFAMILY is a database of structural and functional annotation for all
proteins and genomes.[1][2][3][4][5]
The SUPERFAMILY annotation is based on a collection of hidden Markov models which
represent structural protein domains at the SCOP superfamily level.[6] A superfamily groups
together domains which have an evolutionary relationship. The annotation is produced by
scanning protein sequences from completely sequenced genomes against the hidden Markov
models.
For each protein you can:
Submit sequences for SCOP classification.
View domain organisation, sequence alignments and protein sequence details.
For each genome you can:
Examine superfamily assignments, phylogenetic trees, domain organisation lists and
networks.
Check for over- and under-represented superfamilies within a genome.
For each superfamily you can:
Inspect SCOP classification, functional annotation, Gene Ontology annotation, InterPro
abstract and genome assignments.
Explore the taxonomic distribution of a superfamily across the tree of life.
All annotation, models and the database dump are freely available for download to everyone.
Purpose
SUPERFAMILY classifies amino acid sequences into known structural domains, especially
into SCOP superfamilies. The superfamilies are groups of proteins which have structural
evidence to support a common evolutionary ancestor but may not have detectable
sequence homology.
Major Features
Sequence search: Submit your protein or DNA sequence for SCOP superfamily and family level classification.
Keyword search: Search for superfamily, family or species names, plus sequence, SCOP, PDB or hidden Markov model IDs.
Domain assignments: Domain assignments, alignments and architectures for completely sequenced eukaryotic and prokaryotic organisms, plus sequence collections.
Comparative genomics tools: Browse unusual (over- and under-represented) superfamilies and families, adjacent domain pair lists and graphs, unique domain pairs, domain combinations, domain architecture co-occurrence networks and domain distribution across taxonomic kingdoms for each organism.
Genome statistics: For each genome: number of sequences, number of sequences with assignment, percentage of sequences with assignment, percentage total sequence coverage, number of domains assigned, number of superfamilies assigned, number of families assigned, average superfamily size, percentage produced by duplication, average sequence length, average length matched, number of domain pairs and number of unique domain architectures.
Gene Ontology: Domain-centric Gene Ontology (GO), automatically annotated, by Hai Fang.
Phenotype Ontology: Domain-centric phenotype/anatomy ontology including Disease Ontology, Human Phenotype, Mouse Phenotype, Worm Phenotype, Yeast Phenotype, Fly Phenotype, Fly Anatomy, Zebrafish Anatomy, Xenopus Anatomy and Arabidopsis Plant.
Superfamily annotation: InterPro abstracts for 1052 superfamilies and Gene Ontology (GO) annotation for 763 superfamilies.
Functional annotation: Functional annotation of SCOP 1.73 superfamilies by Christine Vogel.
Phylogenetic trees: Trees are generated using heuristic parsimony methods and are based on protein domain architecture data for all genomes in SUPERFAMILY. Genome combinations or specific clades can be displayed as individual trees.
Similar domain architectures: Find the 10 domain architectures which are most similar to a domain architecture of interest.
Hidden Markov models: Produce SCOP domain assignments for your sequences using the SUPERFAMILY models. HMM visualisation by Martin Madera, e.g. model 0045110.
Profile comparison: Find remote domain matches when the HMM search fails to find a significant match. Profile comparison (PRC), for aligning and scoring two profile hidden Markov models, by Martin Madera.
Web services: Distributed Annotation Server and linking to SUPERFAMILY.
Downloads: Sequences, assignments, models, MySQL database and scripts, updated weekly.
The SUPERFAMILY database provides protein domain assignments, at the SCOP superfamily
level, for the predicted protein sequences in over 400 completed genomes. A superfamily groups
together domains of different families which have a common evolutionary ancestor based on
structural, functional and sequence data. SUPERFAMILY domain assignments are generated
using an expert-curated set of profile hidden Markov models. All models and structural
assignments are available for browsing and download from http://supfam.org. The web interface
includes services such as domain architectures and alignment details for all protein assignments,
searchable domain combinations, domain occurrence network visualization, detection of over- or
under-represented superfamilies for a given genome by comparison with other genomes,
assignment of manually submitted sequences and keyword searches. In this update we describe
the SUPERFAMILY database and outline two major developments: (i) incorporation of family
level assignments and (ii) a superfamily-level functional annotation. The SUPERFAMILY
database can be used for general protein evolution and superfamily-specific studies, genomic
annotation, and structural genomics target suggestion and assessment.
queries. CDART finds protein similarities across significant evolutionary distances using sensitive domain profiles rather than direct sequence similarity. Proteins similar to
the query are grouped and scored by architecture. You can search CDART directly with a query protein sequence or, if a sequence of interest is already in the Entrez Protein database, simply retrieve the record, open its Links
menu and select Domain Relatives to see the precalculated CDART results. Relying on domain profiles allows CDART to be fast and, because it relies on annotated functional domains, informative.
CDTree: CDTree is a helper application for your web browser that allows you to interactively view and examine conserved domain hierarchies curated at NCBI. CDTree works with Cn3D as its alignment viewer/editor; it is used in the CDD curation process and is both a classification and research tool for functional annotation and the study of
protein and protein domain families.
Content
CDD content includes NCBI manually curated domain models and domain models imported from
a number of external source databases (Pfam, SMART, COG, PRK, TIGRFAMs). What is unique
about NCBI-curated domains is that they use 3D-structure information to explicitly define domain
boundaries, align blocks, amend alignment details and provide insights into
sequence/structure/function relationships. Manually curated models are organized hierarchically if
they describe domain families that are clearly related by common descent. To provide a non-
redundant view of the data, CDD clusters similar domain models from various sources into
superfamilies.
Searching the database
The collection is also part of NCBI's Entrez query and retrieval system, crosslinked to numerous other resources. CDD provides annotation of domain footprints and conserved functional sites on protein sequences. Precalculated domain annotation can be retrieved for protein sequences tracked in NCBI's Entrez system, and CDD's collection of models can be queried with novel protein sequences via the CD-Search service, or via the Batch CD-Search service, which allows the computation and download of
annotation for large sets of protein queries. CDD also contains data from additional research projects, such as KOGs (a
eukaryotic counterpart to COGs) and the Library of Ancient Domains (LOAD). Accessions that start with "cl" are for superfamily cluster records and can contain domain models from one or more source databases. When searching CDD, it is possible to limit search results to domains from any given source database by using the Database Search Field.
Phylogenetic organization: Based on evidence from sequence comparison, NCBI Conserved Domain Curators attempt to organize related domain models into
phylogenetic family hierarchies.
Links to electronic literature resources: NCBI-curated domains also provide links to citations in PubMed and NCBI Bookshelf that discuss the domain. These references are selected by curators and, whenever possible, include articles that provide evidence for the biological function of the domain and/or discuss the evolution and classification of a domain family.
PFAM
Pfam 27.0 (March 2013, 14831 families)
Proteins are generally comprised of one or more functional regions, commonly termed domains. The presence of different domains in varying combinations in different proteins gives rise to the diverse repertoire of proteins found in nature. Identifying the domains present in a protein can provide insights into the function of that protein.
The Pfam database is a large collection of protein domain families. Each family is represented by multiple sequence alignments and hidden Markov models (HMMs).
There are two levels of quality to Pfam families: Pfam-A and Pfam-B. Pfam-A entries are derived from the underlying sequence database, known as Pfamseq, which is built from the most recent release of UniProtKB at a given time-point. Each Pfam-A family consists of a curated seed alignment containing a small set of representative members of the family, profile hidden Markov models (profile HMMs) built from the seed alignment, and an automatically generated full alignment, which contains all detectable protein sequences belonging to the family, as defined by profile HMM searches of primary sequence databases.
Pfam-B families are un-annotated and of lower quality, as they are generated automatically from the non-redundant clusters of the latest ADDA release. Although of lower quality, Pfam-B families can be useful for identifying functionally conserved regions when no Pfam-A entries are found.
Pfam entries are classified in one of four ways:
Family: A collection of related protein regions.
Domain: A structural unit.
Repeat: A short unit which is unstable in isolation but forms a stable structure when multiple copies are present.
Motif: A short unit found outside globular domains.
Related Pfam entries are grouped together into clans; the relationship may be defined by similarity of
sequence, structure or profile-HMM.
Pfam also generates higher-level groupings of related families, known
as clans. A clan is a collection of Pfam-A entries which are related by
similarity of sequence, structure or profile-HMM.
You can find data in Pfam in various ways:
Analyze your protein sequence for Pfam matches.
View Pfam family annotation and alignments.
See groups of related families.
Look at the domain organisation of a protein sequence.
Find the domains on a PDB structure.
Query Pfam by keywords.
You can also enter any type of accession or ID to jump to the page for a Pfam family or clan,
UniProt sequence, PDB structure, etc.
Browse Pfam
You can browse lists of families, clans or proteomes which begin with a chosen letter (or number). You can
also see a list of Pfam families which are new to this release, or the list of the twenty largest families in terms of number of sequences.
Glossary of terms used in Pfam
These are some of the commonly used terms in the Pfam website
Alignment coordinates
HMMER3 reports two sets of domain coordinates for each profile HMM match. The envelope coordinates delineate the region on the sequence where the match has been probabilistically determined to lie, whereas the alignment coordinates delineate the region over which HMMER is confident that the alignment of the sequence to the profile HMM is correct. Our full alignments contain the envelope coordinates from HMMER3.
Architecture
The collection of domains that are present on a protein
Clan
A collection of related Pfam entries. The relationship may be defined by similarity of sequence, structure or profile-HMM.
Domain
A structural unit
Domain score
The score of a single domain aligned to an HMM. Note that, for HMMER2, if
there was more than one domain, the sequence score was the sum of all the domain scores for that Pfam entry. This is not quite true for HMMER3.
DUF
Domain of unknown function
Envelope coordinates
See Alignment coordinates
Family
A collection of related protein regions
Full alignment
An alignment of the set of related sequences which score higher than the manually set threshold values for the HMMs of a particular Pfam entry.
Gathering threshold (GA)
Also called the gathering cut-off, this value is the search threshold used to build the full alignment. The gathering threshold is assigned by a curator when the family is built. The GA is the minimum score a sequence must attain in order to belong to the full alignment of a Pfam entry. For each Pfam HMM we have two GA cutoff values: a sequence cutoff and a domain cutoff.
HMMER
The suite of programs that Pfam uses to build and search HMMs. Since Pfam release 24.0 we have used HMMER version 3 to make Pfam. See the HMMER site for more information.
Hidden Markov model (HMM)
An HMM is a probabilistic model. In Pfam we use HMMs to transform the information contained within a multiple sequence alignment into a position-specific scoring system. We search our HMMs against the UniProt protein database to find homologous sequences.
HMMER3
The suite of programs that Pfam uses to build and search HMMs. See the HMMER site for more information.
iPfam
A resource that describes domain-domain interactions that are observed in PDB entries. Where two or more Pfam domains occur in a single structure, it analyses them to see if they are close enough to form an interaction. If they are close enough, it calculates the bonds forming the interaction.
Metaseq
A collection of sequences derived from various metagenomics datasets.
Motif
A short unit found outside globular domains.
Noise cutoff (NC)
The bit score of the highest scoring match not in the full alignment.
Pfam-A
An HMM-based, hand-curated Pfam entry which is built using a small number of
representative sequences. We manually set a threshold value for each HMM and search our models against the UniProt database. All of the sequences which score above the threshold for a Pfam entry are included in the entry's full alignment.
Pfam-B
An automatically generated alignment which is formed by taking a cluster of sequences from the ADDA database and removing Pfam-A residues from them. Since Pfam-B families are automatically generated, we recommend that you verify that the sequences in a Pfam-B family are related using other methods, such as BLAST. For Pfam 24.0 we have made HMMs for the first (and therefore largest) 20000 Pfam-B families. Users can search their sequences against the Pfam-B HMMs in addition to the Pfam-A HMMs when performing both single-sequence searches and batch searches on the website.
Posterior probability
HMMER3 reports a posterior probability for each residue that matches a match or insert state in the profile HMM. A high posterior probability shows that the alignment of the amino acid to the match/insert state is likely to be correct, whereas a low posterior probability indicates that there is alignment uncertainty. This is indicated on a scale from 10 (the highest certainty) down to 1 (complete uncertainty). Within Pfam we display this information as a heat map view, where green residues indicate high posterior probability and red ones indicate a lower posterior probability.
Repeat
A short unit which is unstable in isolation but forms a stable structure when
multiple copies are present.
Seed alignment
An alignment of a set of representative sequences for a Pfam entry. We use this alignment to construct the HMMs for the Pfam entry.
Sequence score
The total score of a sequence aligned to an HMM. If there is more than one domain, the sequence score is the sum of all the domain scores for that Pfam entry. If there is only a single domain, the sequence and the domain score for the protein will be identical. We use the sequence score to determine whether a sequence belongs to the full alignment of a particular Pfam entry.
Trusted cutoff (TC)
The bit score of the lowest scoring match in the full alignment.
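The relationship between the gathering threshold, the trusted cutoff and the noise cutoff defined in this glossary can be illustrated with a small sketch. It simplifies matters by using a single score per sequence and a single threshold, whereas Pfam keeps separate sequence and domain cutoffs; the function name and the scores are invented for illustration.

```python
def summarise_cutoffs(scores, gathering_threshold):
    """Split bit scores at the gathering threshold (GA) and derive the
    trusted cutoff (TC) and noise cutoff (NC) as defined in the glossary."""
    # Matches scoring at least GA belong to the full alignment.
    members = [s for s in scores if s >= gathering_threshold]
    non_members = [s for s in scores if s < gathering_threshold]
    # TC: lowest-scoring match inside the full alignment.
    trusted_cutoff = min(members) if members else None
    # NC: highest-scoring match outside the full alignment.
    noise_cutoff = max(non_members) if non_members else None
    return members, trusted_cutoff, noise_cutoff


# Hypothetical bit scores for matches to one Pfam HMM.
scores = [85.2, 40.1, 25.0, 19.7, 12.3]
members, tc, nc = summarise_cutoffs(scores, gathering_threshold=25.0)
```

By construction the three values are ordered NC < GA <= TC, which is why the gap between NC and TC gives a feel for how cleanly a family separates from background matches.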
SMART (a Simple Modular Architecture Research Tool) allows the identification and annotation of genetically mobile domains and the analysis of domain architectures. More than 500 domain families found in signalling, extracellular and chromatin-associated proteins are detectable. These domains are extensively annotated with respect to phyletic distributions, functional class, tertiary structures and functionally important residues. Each domain found in a non-redundant protein database, as well as search parameters and
taxonomic information, is stored in a relational database system. User interfaces to this database allow searches for proteins containing specific combinations of domains in defined taxa.
Features
You can use SMART in two different modes: normal or genomic. The main difference is in the underlying
protein database used. In Normal SMART, the database contains Swiss-Prot, SP-TrEMBL and stable
Ensembl proteomes. In Genomic SMART, only the proteomes of completely sequenced genomes are used:
Ensembl for metazoans and Swiss-Prot for the rest.
The protein database in Normal SMART has significant redundancy, even though identical proteins are
removed. If you use SMART to explore domain architectures, or want to find exact domain counts in various
genomes, consider switching to Genomic mode. The numbers in the domain annotation pages will be more
accurate, and there will not be many protein fragments corresponding to the same gene in the architecture
query results. Remember, though, that you are then exploring a limited set of genomes.
Different color schemes are used to easily identify the mode in use.
Alignment
Representation of a prediction of the amino acids in tertiary structures of homologues that overlay in three dimensions. Alignments held by SMART are mostly based on published observations (see domain annotations for details), but are updated and edited manually.
Alignment block
Ungapped alignments that usually represent a single secondary structure.
Bits scores
Alignment scores are reported by HMMer and BLAST as bits scores. The likelihood that the query sequence is a bona fide homologue of the database sequence is compared to the likelihood that the sequence was instead generated by a random model. Taking the logarithm (to base 2) of this likelihood ratio gives the bits score.
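As a worked illustration of the bits score definition, the snippet below takes two likelihoods and returns the base-2 logarithm of their ratio. The function name and the probabilities are made up for the example; real tools compute these likelihoods from the model and the sequence.

```python
import math


def bits_score(p_model, p_random):
    """Log-odds score in bits: log2 of the likelihood under the homology
    model divided by the likelihood under the random model."""
    return math.log2(p_model / p_random)


# Invented likelihoods: the sequence is 2**20 times more probable
# under the homology model than under the random model.
score = bits_score(2.0 ** -10, 2.0 ** -30)
```

Each additional bit therefore corresponds to a doubling of the likelihood ratio in favour of homology, and a negative score means the random model explains the sequence better.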
BLAST: Basic local alignment search tool
An excellent database searching tool developed at the National Center for Biotechnology Information (NCBI) ([1] [2] [3] [4]). SMART uses NCBI-BLAST for detection of outlier homologues and homologues of known structure. WU-BLAST is used for nrdb searches with user-supplied sequences.
Cellular Role
Chromatin-associated: Chromatin is the tangled fibrous complex of DNA and protein within a eukaryotic nucleus.
Interaction (with the environment): Molecules that sense cellular environmental change, such as osmolarity, light flux, acidity, ion concentration, etc.
Metabolic: Enzymes that catalyze reactions in living cells that transform organic molecules.
Replication: The process of making an identical copy of a section of duplex (double-stranded) DNA, using existing DNA as a template for the synthesis of new DNA strands.
Signalling: Proteins that participate in a pathway that is initiated by a molecular stimulus and terminated by a cellular response.
Transport: Transporters (permeases) are proteins that straddle the cell membrane and carry specific nutrients, ions, etc. across the membrane.
Translation: The process in which the genetic code carried by messenger RNA directs the synthesis of proteins from amino acids.
Transcription: The synthesis of an RNA copy from a sequence of DNA (a gene); the first step in gene expression.
Coiled coils
Intimately associated bundles of long alpha-helices ([1] [2] [3]). Coiled coils are detected in SMART using the method of Lupas et al. ([4], COILS home). Coiled coil predictions are indicated on the second line in SMART's graphical output.
Domain
Conserved structural entities with distinctive secondary structure content and a hydrophobic core. In small disulphide-rich and Zn2+-binding or Ca2+-binding domains, the hydrophobic core may be
provided by cystines and metal ions, respectively. Homologous domains with common functions
usually show sequence similarities.
Domain composition
Proteins with the same domain composition have at least one copy of each of the domains of the query.
Domain organisation
Proteins having all the domains of the query in the same order (additional domains are allowed).
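The difference between the two query types can be made concrete with a short sketch. It assumes that "same order, additional domains allowed" means the query domains appear as a (not necessarily contiguous) subsequence of the target's domains; that reading, the function names and the example architectures are illustrative assumptions, not SMART's implementation.

```python
def same_composition(query, target):
    """True if the target has at least one copy of every query domain,
    regardless of order."""
    return set(query) <= set(target)


def same_organisation(query, target):
    """True if the query domains occur in the target in the same order;
    extra domains in between or at the ends are allowed."""
    remaining = iter(target)
    # `domain in remaining` advances the iterator until the domain is
    # found, so later query domains can only match further along.
    return all(domain in remaining for domain in query)


# Illustrative domain architectures (N- to C-terminus).
query = ["SH3", "SH2", "TyrKc"]
with_extra = ["PH", "SH3", "SH2", "TyrKc"]
reordered = ["SH2", "SH3", "TyrKc"]
```

Here `with_extra` satisfies both tests, while `reordered` has the same composition as the query but a different organisation.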
Entrez
A WWW-based system that allows easy retrieval of sequence, structure, molecular biology and literature data (Entrez). SMART's domain annotation pages contain links to the Entrez system, thereby providing extensive literature, structure and sequence information.
E-value
This represents the number of sequences with a score greater than or equal to X expected absolutely by chance. The E-value connects the score (X) of an alignment between a user-supplied sequence and a database sequence, generated by any algorithm, with how many alignments with similar or greater scores would be expected from a search of a random sequence database of equivalent size. Since version 2.0, E-values are calculated using Hidden Markov Models, leading to more accurate estimates than before.
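To make the dependence on database size concrete, the sketch below uses the Karlin-Altschul relation quoted in the BLAST statistics documentation, E = m * n * 2^(-S'), where S' is the bit score, m the query length and n the total database length. This is offered only as a familiar example of how a score is turned into an E-value; it is not claimed to be the calculation SMART performs with its HMMs, and the function name and numbers are invented.

```python
def expected_hits(bit_score, query_length, database_length):
    """Expected number of chance alignments scoring at least `bit_score`
    in a search space of query_length * database_length positions."""
    return query_length * database_length * 2.0 ** (-bit_score)


# Invented search: a 32-residue query against 32768 database residues.
e_at_20_bits = expected_hits(20, 32, 32768)
e_at_30_bits = expected_hits(30, 32, 32768)
```

Under this relation the same bit score becomes less significant as the database grows, and every extra 10 bits lowers the E-value by roughly a factor of a thousand.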
Extracellular Domains
Domain families that are most prevalent in proteins outside of the cytoplasm and the nucleus.
Gap
A position in an alignment that represents a deletion within one sequence relative to another. Gap penalties are requirements for alignment algorithms in order to reduce excessively gapped regions. Gaps in alignments represent insertions that usually occur in protruding loops or beta-bulges within protein structures.
Genomic database
Protein database used in SMART's Genomic mode. It contains data from completely sequenced genomes only. Ensembl data is used for Metazoan genomes and Swiss-Prot for others. A complete list of genomes in the database is available.
HMM Hidden Markov model
HMMs are statistical models of the sequence consensus of a homologous family (see the docu). A particular class of HMMs has been shown to be equivalent to generalised profiles (8867839).
Applications of HMMs to sequence analysis are nicely provided by HMMer and SAM
HMM consensus
The HMM consensus is a one-line summary of the corresponding HMM. The amino acid shown for the consensus is the highest probability amino acid at that position according to the HMM. Capital letters mean highly conserved residues (probability > 0.5 for protein models) (modified from the HMMer User's Guide).
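The rule described above can be sketched in a few lines of Python. The input format (one dictionary of residue probabilities per match position) and the function name are assumptions made for illustration; HMMer's own data structures are different.

```python
def hmm_consensus(emissions, threshold=0.5):
    """Build a one-line consensus from per-position emission probabilities.

    emissions: list of dicts mapping residue letter -> probability,
    one dict per match position (a hypothetical, simplified input format).
    """
    letters = []
    for column in emissions:
        # Pick the most probable residue at this position.
        residue, probability = max(column.items(), key=lambda item: item[1])
        # Upper case marks highly conserved positions (probability > threshold).
        letters.append(residue.upper() if probability > threshold else residue.lower())
    return "".join(letters)


example = [
    {"A": 0.7, "G": 0.3},
    {"L": 0.4, "I": 0.35, "V": 0.25},
]
print(hmm_consensus(example))
```

In this example the first position is conserved enough to be capitalised, while the second is shown in lower case.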
HMMer
The HMMer package ([1] [2]) provides multiple alignment and database searching capabilities. There are several programs in the package (see the docu), including one (hmmfs) that searches databases for non-overlapping LOCAL similarities (i.e. that match across at least part of the HMM) and another (hmmls) that searches databases for non-overlapping GLOBAL similarities (i.e. that match across the full HMM). These correspond approximately to profile-based searches using negative and positive profiles, respectively (see WiseTools). Database searches using hmmls or hmmfs provide alignment scores as bit scores.
Homology
Evolutionary descent from a common ancestor due to gene duplication
Intracellular Domains
Domain families that are most prevalent in proteins within the cytoplasm
Localisation
Numbers of domains that are thought from SwissProt annotations to be present in different cellular compartments (cytoplasm, extracellular space, nucleus and membrane-associated) are shown in annotation pages.
Motif
Sequence motifs are short conserved regions of polypeptides. Sets of sequence motifs need not necessarily represent homologues.
NRDB non-redundant database
A database that contains no identical pairs of sequences. It can contain multiple sequences originating from the same gene (fragments, alternative splicing products). SMART in Standard mode uses such a database.
ORF
Open reading frame
Outlier homologues
These are often difficult to detect using HMM methodology. A complementary approach to their detection is to query a database of sequences taken from multiple sequence alignments using BLAST. Selecting this option will also activate searches against sequence databases derived from proteins of known structure. A simple BLAST search of the PDB is performed, together with a search of RPS-Blast profiles derived from SCOP. These profiles were kindly provided by Steffen Schmidt (see Schmidt et al., J Chem Inf Comput Sci 2002 (42) 405-7).
P-value
This represents the probability that, given a database of a particular size, random sequences score higher than a value X. P-values are generated by the BLAST algorithm that has been integrated into SMART.
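The glossary defines the E-value and the P-value separately and does not state how they relate. In standard BLAST statistics the number of chance hits is treated as Poisson distributed, which gives P = 1 - exp(-E). The sketch below applies that relation; it is offered as an assumption about the underlying statistics, not as a description of SMART's code.

```python
import math


def p_from_e(e_value):
    """P-value of at least one chance hit, given E expected chance hits.

    Uses the Poisson relation P = 1 - exp(-E) (assumed, not stated in the source).
    """
    return -math.expm1(-e_value)


def e_from_p(p_value):
    """Inverse relation: E = -ln(1 - P), valid for 0 <= P < 1."""
    return -math.log1p(-p_value)


for e in (0.001, 0.1, 1.0, 10.0):
    print(e, p_from_e(e))
```

For small values the two numbers are nearly identical, so the distinction only matters for weak hits, where E can exceed 1 while P stays below 1.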
PDB protein data bank
PDB is an archive of experimentally-determined three-dimensional structures (Brookhaven Natl. Labs or EBI). Domain families represented in SMART and in the PDB are annotated as being of known structure; links are provided in SMART to the PDB via PDBsum and MMDB. PDBsum links can be used to access a variety of sequence-based and structure-based tools, whereas MMDB provides access to literature information and structure similarities.
PFAM
Pfam is a database of protein domain families represented as (i) multiple alignments and (ii) HMM-profiles ([1] [2]). Pfam WWW servers allow comparison of user-supplied sequences with the Pfam database (Sanger Center and Washington Univ). SMART contains a facility to search the Pfam collection using HMMer.
Prokaryotic domains
SMART now also searches for domains found in two-component regulatory systems. These can be found mainly in prokaryotes, but a few were also found in eukaryotes like yeast and plants.
Profile
A profile is a table of position-specific scores and gap penalties, representing a homologous family, that may be used to search sequence databases (Ref [1] [2] [3]). In CLUSTAL-W-derived profiles, those sequences that are more distantly related are assigned higher weights ([4] [5] [6]). Issues in profile-based database searching are discussed in Bork & Gibson (1996) [7].
ProfileScan
An excellent WWW server that allows a user to compare a protein or DNA sequence against a database of profiles (located at the ISREC).
PROSITE
This is a dictionary of protein sites and motif patterns. Some SMART domain annotations contain links to PROSITE.
Schnipsel database domain sequence database
Schnipsel is a German word meaning snippet or fragment. The schnipsel database consists of the sequences of all domains found with SMART in NRDB. Outliers of a family often cannot be detected by a profile, yet are detectable by pairwise similarity to one or more established members of a sequence family. So searching against the schnipsel database gives complementary information to the profile searches.
Searched domains
In the first version of SMART only eukaryotic signalling domains could be searched. In 1998 we extended this set with prokaryotic signalling and extracellular domains. On the input page you can choose whether you want to search only cytoplasmic (prokaryotic + eukaryotic), only extracellular, or all domains.
Secondary Literature
The secondary literature is derived by the following procedure. For each of the hand-selected papers referenced by a domain, 100 neighbouring papers are retrieved using Medline. If one of these neighbouring papers is referenced from more than two original papers, it is included in the secondary literature list.
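The selection rule can be sketched as a simple counting step. The sketch assumes the Medline neighbour lists have already been retrieved and are passed in as a dictionary; the function name, the input format and the identifiers in the example are hypothetical.

```python
from collections import Counter


def secondary_literature(neighbours_by_paper, max_neighbours=100, min_sources=3):
    """Return neighbour papers shared by more than two original papers.

    neighbours_by_paper: dict mapping each hand-selected paper to its list
    of neighbouring papers (assumed to be pre-fetched, best match first).
    """
    counts = Counter()
    for neighbours in neighbours_by_paper.values():
        # Count each neighbour once per original paper, top 100 only.
        for paper in set(neighbours[:max_neighbours]):
            counts[paper] += 1
    # "More than two" original papers means at least three.
    return sorted(paper for paper, n in counts.items() if n >= min_sources)


example = {
    "orig1": ["n1", "n2", "n3"],
    "orig2": ["n2", "n3"],
    "orig3": ["n3", "n2", "n4"],
}
print(secondary_literature(example))
```

Version 3.2 of SMART later replaced the fixed limit of 100 with a generator that parses all neighbouring papers, which in this sketch corresponds to raising `max_neighbours`.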
Seed Alignment
Alignment that contains only one of each pair of homologues that are represented in a CLUSTALW-derived phylogenetic tree linked by a branch of length less than a distance of 0.2 (see the related article).
SEG
A program of Wootton & Federhen [1] that detects regions of the query sequence that have low compositional complexity [2].
Sequence ID or ACC
Sequence identifiers or accession codes may be entered via the SMART homepage to initiate a query. You can use either Uniprot or Ensembl sequence identifiers.
Signalling Domains
The original set of domains used in SMART were collected as those that satisfied one or both of two criteria:
1. cytoplasmic domains that possess kinase, phosphatase, ubiquitin ligase or phospholipase enzymatic activities, or those that stimulate GTPase-activation or guanine nucleotide exchange
2. cytoplasmic domains that occur in at least two proteins with different domain organisations, of which one also contains a domain that satisfies criterion 1
These domains mostly mediate or regulate the transduction of an extracellular signal towards the nucleus, resulting in the initiation of a cellular response. More recently, prokaryotic two-component signalling domains have been added to the SMART set.
SignalP
This program predicts the presence and location of signal peptide cleavage sites in amino acid sequences (SignalP home page).
SMART Simple Modular Architecture Research Tool
Our main goal in providing this tool is to allow automatic identification and annotation of domains in user-supplied protein sequences.
Species
Numbers of domains present in a variety of selected taxa (animal, archaea, bacteria, fungi, plants and protozoa) are shown in annotation pages.
SwissProt
The SwissProt database is an extensively annotated and non-redundant collection of protein sequences. SwissProt annotations have been mined for SMART-derived annotations of alignments.
TMHMM2
This program predicts the location and topology of transmembrane helices (TMHMM2)
Thresholds
For each of the domains found by SMART, a combination of thresholds is used to distinguish between true and false hits. The different thresholds are described in the SMART paper.
WiseTools
A package that is based on database searches using profiles. Profiles may be generated using PairWise and then compared with sequence databases using SearchWise. Scores are generated for alignments that match the whole of the profile (using a positive profile) or else that match at least part of the profile (using a negative profile). Only the top-scoring optimal alignment of each
sequence is reported; hence SMART relies on re-iterating the search for new repeats until none are reported that score above threshold. Score thresholds have been set manually that are considered to represent a score just above the top-scoring true negative. Additional thresholds have been estimated for domains that are repeated in single polypeptides. A more recent package allows comparison of DNA sequences at the level of their conceptual translations, regardless of sequence error and introns (see Wise2).
What's new
Changes from version 6.0 to 7.0
Full text search engine
Perform a full text search of SMART and Pfam domain annotations, plus the complete protein descriptions for Uniprot and Ensembl proteins.
metaSMART
Explore and compare domain architectures in various publicly available metagenomics datasets.
iTOL export and visualization
Domain architecture analysis results can be exported and visualized in interactive Tree Of Life. The new option can be found in the protein list function select list.
User interface cleanup
Various small changes to the UI, resulting in faster and easier navigation.
Changes from version 5.1 to 6.0
Metabolic pathways information
SMART domains and proteins available in the genomic mode now have basic metabolic pathways information. It is generated by mapping our genomic mode protein database to the KEGG orthologous groups.
Updated genomic mode protein database
The greatly expanded genomic mode database now includes sequences from 630 completely sequenced species.
Changes from version 5.0 to 5.1
SMART webservice
You can access SMART using our webservice. Check the WSDL file for details. The webservice is still under active development and only works for sequences and IDs which are in our database (complete Uniref100 and stable Ensembl genomes).
SMART DAS server
The SMART DAS server is available at URL http://smart.embl.de/smart/das. It provides all the protein features from our database (SMART domains, signal peptides, transmembrane regions and coiled coils) for all Uniref100 and Ensembl proteins.
If you need help with these services or have questions or feedback, please contact us.
Changes from version 4.1 to 5.0
New protein database in Normal mode
SMART now uses Uniprot as the main source of protein sequences. All Ensembl proteomes (except pre-releases) are also included. To lower redundancy in the database, the following procedure is used:
o only one copy of 100% identical proteins is kept (different IDs are still available)
o each species' proteins are separated
o CD-HIT clustering with a 96% identity cutoff is performed on each species separately
o the longest member of each protein cluster is used as the representative
o only representative cluster members and single proteins (i.e. proteins which are not members of any clusters) are used in all domain architecture queries and for domain counts in the annotation pages
Even though the number of proteins in the database has almost doubled (the current version has around 2.9 million proteins), the redundancy should be minimal.
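Two steps of this procedure are simple enough to sketch in Python: collapsing 100% identical sequences while keeping all their IDs, and choosing the longest member of each cluster as its representative. The clustering itself is assumed to come from an external CD-HIT run and is passed in as plain lists of IDs; all function names and data layouts here are illustrative assumptions.

```python
from collections import defaultdict


def collapse_identical(records):
    """Group protein IDs by exact sequence.

    records: iterable of (protein_id, sequence) pairs.
    Returns a dict mapping each distinct sequence to all of its IDs,
    so that one copy is kept while the different IDs stay available.
    """
    ids_by_sequence = defaultdict(list)
    for protein_id, sequence in records:
        ids_by_sequence[sequence].append(protein_id)
    return dict(ids_by_sequence)


def pick_representatives(clusters, sequences):
    """Pick the longest member of each cluster as its representative.

    clusters: list of lists of protein IDs (e.g. parsed from CD-HIT output).
    sequences: dict mapping protein ID -> sequence.
    Ties are resolved in favour of the first member listed.
    """
    return [max(cluster, key=lambda pid: len(sequences[pid])) for cluster in clusters]


records = [("P1", "MKT"), ("P2", "MKT"), ("P3", "MKTA")]
print(collapse_identical(records))
print(pick_representatives([["P1", "P3"]], {"P1": "MKT", "P3": "MKTA"}))
```

Proteins that fall into no cluster can be treated as single-member clusters, in which case they are returned as their own representatives.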
Domain architecture invention dating
As a further step from the single domain to the understanding of multi-domain proteins, SMART now predicts the taxonomic class where the concept of a protein, that is its domain architecture, was invented. The domain architecture is defined as the linear order of all SMART domains in the protein sequence. To derive the point of its invention, all proteins with the same domain architecture are mapped onto NCBI's taxonomy. The last common ancestor of all organisms containing at least one protein with the domain architecture is defined as the point of its origin.
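The last-common-ancestor step can be sketched as finding the deepest shared node of several taxonomic lineages. The sketch assumes each organism's lineage is available as a root-first list of taxon names; how SMART actually queries NCBI's taxonomy is not described in the source, and the function name is hypothetical.

```python
def last_common_ancestor(lineages):
    """Return the deepest taxon shared by all lineages, or None.

    lineages: list of root-first lists of taxon names, one per organism
    that contains a protein with the given domain architecture.
    """
    ancestor = None
    # Walk down from the root while every lineage agrees on the taxon.
    for taxa_at_depth in zip(*lineages):
        if len(set(taxa_at_depth)) != 1:
            break
        ancestor = taxa_at_depth[0]
    return ancestor


lineages = [
    ["cellular organisms", "Eukaryota", "Metazoa", "Chordata"],
    ["cellular organisms", "Eukaryota", "Fungi", "Ascomycota"],
]
print(last_common_ancestor(lineages))
```

With one animal and one fungal lineage, the architecture in this example would be dated to the eukaryotic ancestor.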
Changes from version 4.0 to 4.1
Two modes of operation: Normal and Genomic
For more details, visit the change mode page.
Intrinsic protein disorder prediction
You can now include DisEMBL prediction of protein disorder in your search parameters. The results are displayed as blue regions in protein schematics. DisEMBL's HOTLOOPS and REM465 methods are currently used. Visit the DisEMBL page for more information on the method.
Catalytic activity check
SMART now includes data on specific requirements for catalytic activity for some of our domains (50 catalytic domains at the moment). If the required amino acids are not present in the predicted domain, it will be marked as 'Inactive'. The domain annotation page will show you the details on which amino acids are missing and the links to relevant literature (PubMed link).
Taxonomic trees
Architecture query results are now displayed as simple taxonomic trees. In addition to individual proteins, you can select any taxonomic node (or multiple nodes) and display all the proteins in those nodes. The Evolution section of domain annotation pages is now also represented as a taxonomic tree. The basic tree contains only several hand-picked representative species, with a link to the full tree.
User interface redesigned
SMART has been completely rewritten and all pages conform to the XHTML 1.0 Strict and CSS Level 2 standards. We recommend a modern standards-compliant browser for the best experience. Mozilla Firefox and Opera are our favorites.
Changes from version 3.5 to 4.0
Intron positions are shown in protein schematics
For proteins that match any of the Ensembl predictions, SMART will show intron positions as vertical coloured lines in graphical representations (see example). This information is retrieved from a pre-calculated mapping of Ensembl gene structures to protein sequences.
The vertical line at the end of the protein is not an actual intron, but a mark to show that intron mapping was performed. If that is the only line, there are no introns annotated. If there is no line at all, there is no data available in Ensembl for that particular sequence.
You can switch off intron display on your SMART preferences page
Alternative splicing information
Since SMART now incorporates Ensembl genomes, the Additional information page shows a list of alternative splices of the gene encoding the analyzed protein (if there are any). It is possible to either display SMART protein annotation for any of the alternative splices or get a graphical multiple sequence alignment of all of them.
Orthology information
There are 2 separate sets of orthologs for each Ensembl protein: 1:1 reciprocal best matches in other genomes, and orthologous groups with reciprocal best hits from all genomes analyzed (i.e. each of these proteins has exactly one ortholog in all 6 genomes). This data is displayed on the Additional information page.
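A 1:1 reciprocal best match between two genomes can be sketched as follows. The sketch assumes the best hit of every protein in the other genome has already been computed (the source does not say which search method or score is used), and the function name and identifiers in the example are hypothetical.

```python
def reciprocal_best_hits(best_a_to_b, best_b_to_a):
    """Return sorted (a, b) pairs that are each other's best hit.

    best_a_to_b: dict mapping each genome A protein to its best hit in genome B.
    best_b_to_a: dict mapping each genome B protein to its best hit in genome A.
    """
    return sorted(
        (a, b)
        for a, b in best_a_to_b.items()
        # Keep the pair only if the best hit of b points back to a.
        if best_b_to_a.get(b) == a
    )


a_to_b = {"a1": "b1", "a2": "b1", "a3": "b3"}
b_to_a = {"b1": "a1", "b3": "a3", "b4": "a2"}
print(reciprocal_best_hits(a_to_b, b_to_a))
```

Here `a2` has `b1` as its best hit, but `b1` prefers `a1`, so only the pairs `(a1, b1)` and `(a3, b3)` are reciprocal.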
Graphical multiple sequence alignments
Orthologous groups and different alternative splices can be displayed as graphical multiple sequence alignments. Proteins are aligned using ClustalW. Domains, intrinsic features and introns are mapped onto the alignment, with their positions adjusted according to gaps (black boxes).
Changes from version 3.4 to 3.5
Features of all Ensembl genomes are stored in SMART
You can use standard Ensembl protein identifiers (for example ENSP00000264122) and sequences in all queries.
Batch access
Quickly display results for hundreds of sequences. Currently limited to 500 sequences per access, 2000 per day.
Smarter handling of identical sequences
If there are multiple IDs associated with a particular sequence, an extra table will be displayed showing all of them.
Changes from version 3.3 to 3.4
Search structure based profiles using RPS-Blast
Clicking on the search schnipsel and structures checkbox will now also initiate a search of profiles based on SCOP domain families using RPS-Blast. These profiles were kindly provided by Steffen Schmidt (see Schmidt et al., J Chem Inf Comput Sci 2002 (42) 405-7).
Improved Architecture analysis
In addition to standard Domain selection querying, it is now possible to do queries based on GO (Gene Ontology) terms associated with domains. In the first step, you get a list of domains matching the GO terms entered. After selecting the domains of interest from the list, proteins containing those domains are displayed. Use the Taxonomic selection box to limit the results based on taxonomic ranges.
Pfam domains are stored in the database
The SMART database now contains precomputed results for all Pfam domains. To use Pfam domains in the architecture queries, prepend the domain name with Pfam: (for example, TyrKc AND Pfam:Fz AND TRANS).
Try our SMART Toolbar for the Mozilla web browser.
Changes from version 3.2 to 3.3
Fantastic new protein picture generator
Proteins are now displayed as dynamically generated PNG images. This means that you can download the entire protein representation as a single image. You have our permission to do this and use these diagrams in any way that you like; do acknowledge us though. All domain bubbles have been script-generated using The Gimp and its Perl-Fu extension. The script is GPL'd and is available for download.
Fancy colour alignments
All alignments we generate are coloured in using Leo Goodstadt's excellent CHROMA program. CHROMA is freely available, and it's great. You may experience some problems if you're using a klunky browser. This will be fixed when you change your browser.
Excellent transmembrane domains
These are now calculated using the fine TMHMM2 program, kindly provided by Anders Krogh and co-workers.
Selective SMART
We now store intrinsic features such as transmembrane domains and signal peptides in our database. This means that these can be queried for (e.g. SIGNAL AND TyrKc). The feature names are: signal peptide, SIGNAL; transmembrane domain, TRANS; coiled-coil, COIL.
Technical changes
SMART code has been modified to run under the Apache mod_perl module. SMART is now using Apache::DBI for persistent database connections. The database engine has been updated to PostgreSQL 7.1. These changes resulted in significant speed improvements.
Bug fixes, as usual
The show_many_proteins script now uses the POST method, so there is no longer a limit on the number of proteins you can display. Fixed a display problem with proteins having thousands of different representations (if you try to display those, make sure you have a good browser, like Mozilla or Opera, and a bunch of RAM).
Changes from version 3.1 to 3.2
Start page
There is an indicator of the current database status in the header now. If the database is down or is being updated, you'll know immediately.
Literature
Changed literature identifiers to PMID. New secondary literature generator that parses all neighbouring papers, not just the first 100.
Numerous small bug fixes and improvements.
Changes from version 3.0 to 3.1
Startup page
The start page now includes selective SMART and allows searching for keywords in the annotation of domains.
Schnipsel Blast
The results of a schnipsel Blast search are included in the bubble diagram. Additionally, the PDB is searched.
Taxonomic breakdown
When selecting multiple proteins, e.g. via selective SMART or the annotation pages, an overview of the taxonomy of all species (tax break) is offered.
Links
Links don't contain version numbers. This allows stable links from external sources.
Selective SMART
Allows searching for multiple copies of domains and is case insensitive. You can now search for e.g. Sh3 AND sH3 AND sh3.
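The behaviour described for selective SMART queries (terms joined by AND, case-insensitive names, repeated terms meaning multiple copies) can be sketched as below. This is a toy matcher written for illustration; the function names are hypothetical and SMART's real query parser is not documented in the source.

```python
import re
from collections import Counter


def parse_query(query):
    """Split an 'X AND Y AND Z' query into required copy counts per name."""
    terms = re.split(r"\s+AND\s+", query.strip())
    # Names are compared case-insensitively; repeats raise the required count.
    return Counter(term.lower() for term in terms if term)


def matches(query, features):
    """True if the protein's domains/features satisfy every query term."""
    required = parse_query(query)
    present = Counter(feature.lower() for feature in features)
    return all(present[name] >= copies for name, copies in required.items())


print(matches("Sh3 AND sH3 AND sh3", ["SH3", "SH2", "SH3", "SH3"]))
print(matches("Sh3 AND sH3 AND sh3", ["SH3", "SH3"]))
print(matches("SIGNAL AND TyrKc", ["SIGNAL", "TyrKc", "TRANS"]))
```

The first query asks for three copies of the SH3 domain, so a protein with only two copies does not match.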
Annotation
You can align your query sequence to the SMART alignment using hmmalign.
Update of underlying database
SMART now uses PostgreSQL 6.5.2.
Changes from version 2.0 to 3.0
Digest output
SMART now only produces a single diagram representing a best interpretation of all the annotation that has been performed. A comprehensive summary of the results is also provided in table format.
Selective SMART
Selective SMART allows you to look for proteins with combinations of specific domains in different species or taxonomic ranges.
Alert SMART
The SMART database gets updated about once a week. If you are interested in specific domains, or combinations of them, in specific taxonomic ranges, you can use the SMART alerting service. This provides the identities of newly-deposited proteins that match your query.
Domain queries
You can ask for proteins having the same domain order or composition as your query protein.
SMARTed Genomes
You can get the result of SMARTing the genomes of Caenorhabditis elegans and Saccharomyces cerevisiae via the annotation pages.
Faster PFAM searches
The PFAM searches now run on a PVM cluster.
Changes from version 1.03 to 2.0
Domain coverage
The original set of signalling domains has now been extended to include extracellular domains.
Default search method is now HMMer
The previously-used underlying methodology of SMART (i.e. SWise) tended to over-extend gaps, leading to problems in defining domain borders. Although a SWise search of the SMART database remains possible (using the Wise searching SMART database option available from the Home Page), the default searching method is now HMMer2. This now allows improved statistical estimates (Expectation- or E-values) of the significance of a domain hit.
Complete rewrite of the annotation pages
As with the previous version (1.03), annotation pages provide information concerning domain functions and include hyperlinks to PubMed. However, we now offer automatically-derived data showing the taxonomic range and the predicted cellular localisation of proteins containing the domain in question.
Literature database
In addition to the manually derived literature sources given in version 1.03, we now provide automatically-derived secondary literature from the annotation pages. These are extracted from PubMed and are cross-linked to additional abstracts.
The CATH domain structure database: new protocols and classification levels give a more comprehensive resource for exploring evolution
Lesley H Greene, Tony E Lewis, Sarah Addou, Alison Cuff, Tim Dallman, Mark Dibley, Oliver Redfern, Frances Pearl, Rekha Nambudiry, Adam Reid, Ian Sillitoe, Corin Yeats, Janet M Thornton and Christine A Orengo
WHAT'S NEW
We report the latest release (version 3.0) of the CATH protein domain database (http://www.cathdb.info). There has been a 20% increase in the number of structural domains classified in CATH, up to 86 151 domains. Release 3.0 comprises 1110 fold groups and 2147 homologous superfamilies. To cope with the increases in diverse structural homologues being determined by the structural genomics initiatives, more sensitive methods have been developed for identifying boundaries in multi-domain proteins and for recognising homologues. The CATH classification update is now being driven by an integrated pipeline that links these automated procedures with validation steps that have been made easier by the provision of information-rich web pages summarising comparison scores and relevant links to external sites for each domain being classified. An analysis of the population of domains in the CATH hierarchy and several domain characteristics are presented for version 3.0. We also report an update of the CATH Dictionary of homologous structures (CATH-DHS), which now contains multiple structural alignments, consensus information and functional annotations for 1459 well populated superfamilies in CATH. CATH is directly linked to the Gene3D database, which is a projection of CATH structural data onto ~2 million sequences in completed genomes and UniProt.
GENERAL INTRODUCTION
The numbers of new structures being deposited in the Protein Data Bank (PDB) continue to grow at a considerable rate. In addition, structures being targeted by world-wide structural genomics initiatives are more likely to be novel or only very remotely related to domains previously classified. Only 2% of structures currently solved by conventional crystallography or NMR are likely to adopt novel folds (see Figures 1 and 2). A higher proportion of new folds are expected to be solved by structural genomics structures. Although the influx of more diverse structures and subsequent analysis will inform our understanding of how domains evolve, it has resulted in increasing lags between the numbers of structures being deposited and classified. In response to this situation, we have significantly improved our automated and manual protocols for domain boundary assignment and homologue recognition.
Figure 1. Annual decrease in the percentage of new structures classified in CATH which are observed to possess a novel fold. The raw data for years 1972–2005 was fit to a single exponential equation by nonlinear regression using Sigma Plot (SPSS Version
Figure 2. Annual proportion of protein structures deposited in the PDB which are classified in CATH, rejected or pending classification. The colour scheme reflects different categories of PDB chains. Black: not accepted by the CATH criteria. Red: unprocessed chains
Significant changes have been implemented in the CATH classification protocol to achieve a more highly automated system. A seamless flow of structures between the constituent programs has been achieved by building a pipeline which integrates web services for each major comparison stage in the classification (see Figure 3). Secondly, completely automatic decisions are now being made for new protein chains with close relatives already assigned in the CATH database. There are two situations that preclude the CATH update process from being fully automated. We rely on expert manual curation for particularly challenging protein domain boundary assignments (DomChop stage) and also for classifications of remote folds and homologues (HomCheck stage). These two manual stages will remain an integral part of the system (Figure 3).
Figure 3. Flow diagram of the CATH classification pipeline. This schematic illustrates the processes involved in classifying newly determined structures in CATH. The CATH update protocol workflow from new chain to assigned domain is split into two main processes
In this paper we report our ongoing development of the automated procedures. These critical new features should better enable CATH to keep pace with the PDB (3) and facilitate its development. Key statistics on the domain structure populations and characteristics are also presented.
A REVISED CATH CLASSIFICATION HIERARCHY: CATHSOLID
In order to provide information on the sequence diversity between superfamily members, we have introduced additional levels into the CATH hierarchy. The CATH hierarchical classification scheme now consists of nine levels. Class is derived from secondary structure content and Architecture describes the gross orientation of secondary structures, independent of connectivity. The Topology level clusters structures into fold groups according to their topological connections and numbers of secondary structures. The Homologous superfamilies cluster proteins with highly similar structures, sequences and/or functions (4,5). The new extension of the CATH classification system now includes five 'SOLID' sequence levels. S, O, L and I further divide domains within the H-level using multi-linkage clustering based on similarities in sequence identity (35, 60, 95 and 100%) (see Table 1). The D-level acts as a counter within the I-level and is appended to the classification hierarchy to ensure that every domain in CATH has a unique CATHsolid identification code (see Table 1). Specific details on the nature of the SOLID levels can be found in the 'General Information' section of the CATH website, http://www.cathdb.info. CATH only includes experimentally determined protein structures with a 4 Å resolution or better, 40 residues in length or longer, and having 70% or more side chains resolved.
Table 1. CATH version 3.0 statistics
DOMAIN BOUNDARY ASSIGNMENTS
We have further improved our automated domain boundary prediction method, CATHEDRAL (4). This is used to search a newly determined multi-domain structure against a library of representative structures from different fold groups in the CATH database to recognise constituent domains. CATHEDRAL performs an initial rapid secondary structure comparison between structures, using graph theory to identify putative fold matches, which are then more carefully aligned using a slower, more accurate dynamic programming method. A new scoring scheme has been implemented which combines information on the class of the domains being compared, their sizes, similarity in the structural environments and the number of equivalent residues. A support vector machine is used to combine the different scores and select the best fold match for each putative domain in a new multi-domain protein.
Benchmarking against a set of 964 'difficult' multi-domain chains, whose 1593 constituent domains were remotely related to folds in CATH (<35% sequence identity) and originated from 245 distinct fold groups and 462 superfamilies, showed that 90% of domains within these chains could be assigned to the correct fold group, and for 78% of them the domain boundaries were within ±15 residues of boundaries assigned by careful manual validation. Larger variations in domain boundaries are often due to the fact that in many families significant structural variation can occur during evolution, so that distant relatives vary considerably in size. If no close relative has been classified in CATH, it is likely that the only CATHEDRAL match will be to a relative with significant structural embellishments, thus making it harder to determine the correct boundaries.
Since domain boundary assignment of remote homologues is one of the most time-consuming stages in the classification, we combine multiple information for each new structure on a web page to guide manual curation. Pages display scores from a range of algorithms, which include structure-based methods, CATHEDRAL (4), SSAP (6,7), DETECTIVE (8), PUU (9), DOMAK (10), sequence-based methods such as hidden Markov Models (HMMs) (11), and relevant literature. These pages are now viewable for information on putative boundaries for new multi-domain structures currently
being classified in CATH (e.g. http://www.cathdb.info/cgi-bin/cath/Chain.pl?chain_id=2g3aA).
For protein chains which are closely related to chains that are already chopped in CATH, an automated protocol has been developed (ChopClose). ChopClose identifies any previously chopped chains that have sufficiently high sequence identity and overlap with the query chain. Using SSAP (6), the query is aligned against each of these chains in turn, and in each case the domain boundaries are inherited across the alignment. The process of inheriting the boundaries often requires some adjustments to be made to account for insertions, deletions or unresolved residues. If the inheritance from one of the chains meets various criteria (SSAP score ≥80, sequence identity >80%, RMSD ≤6.0 Å, longest end extension ≤10 residues, etc.), then the resulting boundaries are used to chop the chain automatically (AutoChop). For cases where ChopClose's best result does not meet all the criteria for automatic chopping, it is provided as support information for a manual domain boundary assignment. Refer to Figure 3 for the location of AutoChop/ChopClose within the CATH update protocol.
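The automatic-chopping decision can be illustrated with the four thresholds that the text names. The paper's list ends with "etc.", so the real protocol applies further criteria that are not reproduced here; the function and parameter names are hypothetical.

```python
def meets_autochop_criteria(ssap_score, sequence_identity_pct, rmsd_angstrom, longest_end_extension):
    """Apply the four thresholds named in the text (a partial set).

    Returns True if boundaries inherited from a previously chopped chain
    would qualify for automatic chopping under these thresholds alone.
    """
    return (
        ssap_score >= 80
        and sequence_identity_pct > 80
        and rmsd_angstrom <= 6.0
        and longest_end_extension <= 10
    )


def route_chain(inheritance_results):
    """Return 'AutoChop' if any inherited result passes, else 'manual'.

    inheritance_results: list of (ssap, identity_pct, rmsd, end_extension) tuples,
    one per previously chopped chain the query was aligned against.
    """
    if any(meets_autochop_criteria(*result) for result in inheritance_results):
        return "AutoChop"
    return "manual"


print(route_chain([(92.0, 97.5, 1.2, 3)]))
print(route_chain([(85.0, 75.0, 2.0, 4), (78.0, 90.0, 1.5, 2)]))
```

In the second call neither candidate passes every threshold, so the best ChopClose result would only serve as support information for manual curation.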
NEW HOMOLOGUE RECOGNITION METHODS
We have assessed a number of HMM-based protocols for improving homologue recognition. A new protocol (Samosa), exploiting models built using multiple structure alignments to improve accuracy, gives some improvements in sensitivity (4–5%). However, a protocol exploiting an 8-fold expanded HMM library based on sequence relatives of structural domains gives an increase of nearly 10% in sensitivity (12). In addition, HMM–HMM-based approaches have been implemented using the PRC protocol of Madera and co-workers (http://supfam.org/PRC). These allow recognition of extremely remote homologues, some of which are not easily detected by the structure comparison methods (discussed further below). HMM-based database scans developed for the CATH classification protocol are collectively referred to as HMMscan below.
For some very remotely related homologues, confidence in an assignment can be improved by combining information from multiple prediction methods. We have investigated the benefits of using machine learning methods to do this automatically. A neural network was trained using a dataset of 14 000 diverse homologues (<35% sequence identity) and 14 000 non-homologous pairs, with data from different homologue comparison methods including structure comparison (CATHEDRAL, SSAP), sequence comparison (HMM–HMM) and information on functional similarity. The latter was obtained by comparing EC classification codes between close relatives of the distant homologues and using a semantic similarity scoring
7272019 Essential Info Notes-1
httpslidepdfcomreaderfullessential-info-notes-1 3957
scheme for comparing GO terms based on a method developed by Lord et al (13)On a separate validation set of 14 000 homologous pairs and 14 000 non-homologous pairs 97 of the homologues can be recognised at an error rate of lt4
NEW UPDATE PROTOCOL
Automatic methods
Previously, CATH data was generated using a group of independent programs and flat files. Over the past two years we have developed an update protocol for CATH that is driven by a suite of programs with a central library and a PostgreSQL database system. A classification pipeline has been established which links, in a completely automated fashion, the different programs that analyse the sequences and structures of both protein chains and domains. The CATH update protocol can essentially be divided into two parts: domain boundary assignment and domain homology classification (see Figure 3). The aim of the protocol is to minimise manual assignment and to provide as much support as possible when manual validation is necessary.
Processing is similar for both parts of the classification protocol, requiring related meta-data and triggering the same automated algorithms. Methods include pairwise sequence similarity comparisons and scans by other homologue detection or fold recognition algorithms, such as HMMscan and CATHEDRAL, that provide data for either manual or automated assignment. Many of the automated steps in the protocol have been established as a web service, and the pipeline integrates automated steps with 'holding stages' in which domains are held prior to processing and await the completion of manual validation of predictions (see below).
Web pages to support manual validation
For each manual stage in the classification (domain boundary assignment, DomChop, and homologue recognition, HomCheck) we have developed a suite of web pages bringing together all available meta-data from prediction algorithms (e.g. DBS, CATHEDRAL and HMMscan for DomChop; CATHEDRAL and HMMscan for HomCheck), information from the literature, and information from other family classifications with relevant data (e.g. Pfam). For each protein or domain shown on the pages, information on the statistical significance of matches is presented. The web pages will shortly be made viewable and will provide interim data on protein chains and domains not yet fully classified in CATH, for biologists interested in any entries pending classification.
OVERVIEW OF THE CURRENT RELEASE (VERSION 3.0)
Assigning domain boundaries and relationships between protein structures is computationally challenging. Since the last CATH release (version 2.6), the number of domains in the CATH database has increased by 20% and now totals 86 151 in version 3.0. This is a more than 10-fold increase in the number of domains classified in CATH since its creation. Improvements in automation, and in the web-based resources used to aid manual validation, have allowed us to increase the proportion of hard-to-classify structures processed in CATH. This is reflected in a significant increase in new folds in the database, which now number more than 1000. The detailed breakdown of the numbers of domains in the nine CATH levels is given in Table 1. We conducted an analysis of the domains in version 3.0 and have derived statistics for several fundamental features.
Percentage of new topologies
An analysis of the percentage of new folds arising from the early 1970s to the present is shown in Figure 1. The number of new folds has been decreasing over time relative to the number of new structures being deposited, and currently approximately 2% of new structures classified in CATH are observed to be novel folds. For comparison, the number of domain structures solved over time is also shown in Figure 1.
Number of domains within a protein chain
Designating domain boundaries is integral to the construction of the CATH database. We analysed the number of chains against the number of domains in a chain. It is interesting to note that 64% of all protein structures currently solved and classified in CATH are single-domain chains (data not shown). The next most prevalent are two-domain chains (27%), and the number of chains containing three or more domains decreases rapidly thereafter. The average size of the single-domain chains is 159 residues.
The CATH Dictionary of Homologous Superfamilies (CATH-DHS)
The CATH-DHS has also been recently updated. Data on structural similarity and superfamily variability are presented as a significant update to the Dictionary of Homologous Superfamilies (DHS) web resource (14). The DHS also provides functional annotations of domains within each H-level (superfamily) in CATH v2.5.1.
For each superfamily, pair-wise structural similarity scores between relatives, measured by SSAP, are presented. The DHS now contains 3307 multiple structural alignments for 1459 superfamilies. For each superfamily, multiple alignments are generated for all the relatives and also for subgroups of structurally similar relatives and sequence-similar relatives. Alignments are performed using the residue-based CORA algorithm (15) and presented both as CORAPLOTS (14) and in the form of a 2DSEC diagram (16), alongside co-ordinate data of the superposed structures in PDB format. Sequence representations of the alignments are available to download in FASTA format. In the CORAPLOT images of the multiple alignment, residues in each domain are coloured according to ligand binding and residue type. EquivSEC plots are also shown that describe the variability in orientation and packing between equivalent secondary structures (16).
To identify sequence relatives for CATH superfamilies, sequences from UniProt (17) were scanned against HMMs of all CATH domains (12). Homologous sequences were identified as those hits with an E-value <0.01 and a 60% residue overlap with the CATH domain. This protocol recognised over one million domain sequences in UniProt which could be integrated in the CATH-DHS. The harvested sequences in each superfamily were compared against other relatives by BLAST (18) to determine the pair-wise sequence identity, and then clustered at appropriate levels of sequence identity (35% and 95%) using multi-linkage clustering. Information and links to other functional databases, ENZYME (19), GO (Gene Ontology Consortium, 2000), KEGG (20), COG (21) and SWISSPROT (22), are also included by BLASTing the sequences from each superfamily against sequences provided by these resources. Only hits with 95% sequence identity and an 80% residue overlap were used to annotate sequences.
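The hit filter used to harvest UniProt relatives can be illustrated with a short sketch. The record layout and names below are hypothetical, and the way overlap is computed (aligned domain residues divided by domain length) is an assumption, since the notes give only the two cutoffs.

```python
from typing import NamedTuple

E_VALUE_CUTOFF = 0.01         # hits must score below this value
MIN_OVERLAP_FRACTION = 0.60   # fraction of the CATH domain that is covered


class Hit(NamedTuple):
    """Hypothetical record for one HMM match of a sequence to a CATH domain."""
    sequence_id: str
    e_value: float
    aligned_residues: int   # domain residues covered by the match
    domain_length: int      # full length of the CATH domain


def is_homologue(hit: Hit) -> bool:
    """Apply the E-value and overlap cutoffs quoted in the text."""
    overlap = hit.aligned_residues / hit.domain_length  # assumed definition
    return hit.e_value < E_VALUE_CUTOFF and overlap >= MIN_OVERLAP_FRACTION


def harvest(hits: list[Hit]) -> list[str]:
    """Return the identifiers of sequences accepted as homologues."""
    return [h.sequence_id for h in hits if is_homologue(h)]
```

Requiring overlap as well as a significant E-value guards against short, locally strong matches that cover only a fragment of the domain.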
A recent analysis of structural and functional divergence in highly populated CATH superfamilies (>5 structural relatives with <35% sequence identity) has been undertaken using data from the DHS. The 2DSEC algorithm was used to analyse multiple structural alignments of families and to identify highly conserved structural cores and secondary structure embellishments, or decorations, to the common core. In some large superfamilies, extensive embellishments were observed outside the core, and although these secondary structure insertions were frequently discontinuous in the protein chain, they were often co-located in 3D space (16). In many cases, manual inspection revealed that the embellishments had aggregated to form a larger structural feature that modified the active site of the domain or created new surfaces for domain or protein interactions. Data collected in the DHS clearly show a relationship between structural divergence within a superfamily, sequence divergence of the superfamily amongst predicted domains in the genomes, and the number of distinct functional groups that can be identified for the superfamily (see Figure 4).
Figure 4. Relationship between sequence variability, structural variability and functional diversity in CATH superfamilies. Structural variation in a CATH superfamily, as measured by the number of diverse structural subgroups (SSAP score <80 between groups).
LATERAL LINKS ACROSS THE CATH HIERARCHY TO CAPTURE EVOLUTIONARY DIVERGENCE AND EXPLORE THE STRUCTURAL CONTINUUM
Our analysis of structural divergence in CATH superfamilies (16) has revealed families where significant changes in the structures have occurred. In some cases 5-fold differences in the sizes of domains were identified, and sometimes it was apparent that the 'folds' of these very diverse relatives had effectively changed. In these superfamilies, therefore, more than one fold group can be identified, effectively breaking the hierarchical nature of the CATH classification, which implies that each relative within a CATH homologous superfamily should belong to the same CAT fold group.
In addition, an 'all versus all' HMM–HMM scan between all superfamily representatives revealed several cases of extremely remote homologues which had been classified into separate superfamilies and yet match with significant E-values. Structure comparison had failed to detect the relationship between these superfamilies because the structural divergence of the relatives was so extreme, sometimes constituting a change in architecture as well as fold group. In these cases homology was only suggested by the HMM-based scans and then manually validated by considering functional information and detailed evidence from the literature. To capture information on these distant homologies, links have been created between the superfamilies, both on our web pages and in the CATH database. The data can now be found as a link from the CATH homepage (http://www.cathdb.info).
In the near future we also plan to provide web pages presenting cases of significant structural overlaps between superfamilies or fold groups. For these cases we are not currently able to find any additional evidence to support a distant evolutionary relationship, and these examples highlight the recurrence of large structural motifs between some folds and the existence of a structural continuum in some regions of fold space.
FEATURES
The CATH database can be accessed at http://www.cathdb.info. The web interface may be browsed, or alternatively searched with PDB codes or CATH domain identifiers. There is also a facility for keyword searches. With the version 3.0 release we now make the raw and processed data files available (for example CATH domain PDB files, sequences and dssp files); they can be accessed through the CATH database main page. The Gene3D resource can be accessed through the CATH database or directly at http://www.cathdb.info/Gene3D. The DHS can be accessed through the CATH database or directly at http://www.cathdb.info/bsm/dhs.
Structural similarity is assessed using an automatic method (SSAP) (3,4), which scores 100 for identical proteins and generally returns scores above 80 for homologous proteins. More distantly related folds generally give scores above 70 (Topology or fold level), though in the absence of any sequence or functional similarity this may simply represent examples of convergent evolution, reinforcing the hypothesis that there exists a limited number of folds in nature (5,6).
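The score bands described above can be written down as a small lookup. This is a rough reading aid only: the function name and the wording of the labels are invented here, and the text says these bands hold "generally", so a score on its own is not proof of homology.

```python
def interpret_ssap(score: float) -> str:
    """Map an SSAP score (0-100) to the rough bands quoted in the text."""
    if not 0.0 <= score <= 100.0:
        raise ValueError("SSAP scores lie between 0 and 100")
    if score == 100.0:
        return "identical structures"
    if score > 80.0:
        return "typical of homologous proteins"
    if score > 70.0:
        return "typical of a shared fold (topology level)"
    return "no clear structural relationship"
```

A score between 70 and 80 without supporting sequence or functional similarity may reflect convergence on a common fold rather than common ancestry, which is why the label for that band mentions only the fold.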
Abstract
We report the latest release (version 1.4) of the CATH protein domains database (http://www.biochem.ucl.ac.uk/bsm/cath). This is a hierarchical classification of 13 359 protein domain structures into evolutionary families and structural groupings. We currently identify 827 homologous families in which the proteins have both structural similarity and sequence and/or functional similarity. These can be further clustered into 593 fold groups and 32 distinct architectures. Using our structural classification and associated data on protein functions stored in the database (EC identifiers, SWISS-PROT keywords and information from the Enzyme database and literature), we have been able to analyse the correlation between 3D structure and function. More than 96% of folds in the PDB are associated with a single homologous family. However, within the superfolds, three or more different functions are observed. Considering enzyme functions, more than 95% of clearly homologous families exhibit either single or closely related functions, as demonstrated by the EC identifiers of their relatives. Our analysis supports the view that determining structures, for example as part of a 'structural genomics' initiative, will make a major contribution to interpreting genome data.
Introduction FOR CATH
The CATH classification of protein domain structures was established in 1993 (1) as a hierarchical clustering of protein domain structures into evolutionary families and structural groupings, depending on sequence and structure similarity. There are four major levels, corresponding to protein class, architecture, topology or fold, and homologous family (Fig. 1). Since 1995, information about these structural groups and protein families has been accessible over the Web (http://www.biochem.ucl.ac.uk/bsm/cath), together with summary information about each individual protein structure (PDBsum) (2).
CATH consists of both phylogenetic and phenetic descriptors for protein domain relationships. At the lowest levels in the hierarchy, proteins are grouped into evolutionary families (homologous families) for having either significant sequence similarity (≥35% identity) or high structural similarity and some sequence similarity (≥20% identity).
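The two routes into a homologous family can be expressed as a single predicate. This is a sketch under one stated assumption: the text does not define "high structural similarity" at this point, so an SSAP score of 80 is used as a default because that figure is quoted for homologues elsewhere in these notes. The function and parameter names are hypothetical.

```python
def same_homologous_family(
    seq_identity_pct: float,
    ssap_score: float,
    high_ssap: float = 80.0,  # assumed meaning of "high structural similarity"
) -> bool:
    """Return True if a pair of domains meets either grouping criterion."""
    # Route 1: significant sequence similarity on its own.
    if seq_identity_pct >= 35.0:
        return True
    # Route 2: high structural similarity plus some sequence similarity.
    return ssap_score >= high_ssap and seq_identity_pct >= 20.0
```

For example, a pair at 25% identity is grouped only if the structures also superpose well, whereas a pair at 40% identity is grouped regardless of the structural score.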
The CATH database of protein structures contains approximately 18 000 domains organized according to their (C)lass, (A)rchitecture, (T)opology and (H)omologous superfamily. Relationships between evolutionarily related structures (homologues) within the database have been used to test the sensitivity of various sequence search methods for identifying relatives in GenBank and other sequence databases. Subsequent application of the most sensitive and efficient algorithms, gapped BLAST and the profile-based method Position-Specific Iterated Basic Local Alignment Search Tool (PSI-BLAST), could be used to assign structural data to between 22% and 36% of microbial genomes, in order to improve functional annotation and enhance understanding of biological mechanism. However, on a cautionary note, an analysis of functional conservation within fold groups and homologous superfamilies in the CATH database revealed that, whilst function was conserved in nearly 55% of enzyme families, function had diverged considerably in some highly populated families. In these families, functional properties should be inherited far more cautiously, and the probable effects of substitutions in key functional residues must be carefully assessed.
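Because the four levels are nested, a CATH code can be handled as a four-part key. The sketch below assumes the dotted C.A.T.H form (for example 3.40.50.200); the type and function names are hypothetical and are not part of any CATH software.

```python
from typing import NamedTuple


class CathCode(NamedTuple):
    """The four major levels of a CATH classification code."""
    class_: int
    architecture: int
    topology: int
    homologous_superfamily: int


def parse_cath_code(code: str) -> CathCode:
    """Parse a dotted code such as '3.40.50.200' into its four levels."""
    parts = code.strip().split(".")
    if len(parts) != 4 or not all(p.isdigit() for p in parts):
        raise ValueError(f"expected four dot-separated integers, got {code!r}")
    return CathCode(*(int(p) for p in parts))


def deepest_shared_level(a: CathCode, b: CathCode) -> str:
    """Name the deepest level at which two codes still agree."""
    names = ["none", "class", "architecture", "topology", "homologous superfamily"]
    depth = 0
    for x, y in zip(a, b):
        if x != y:
            break
        depth += 1
    return names[depth]
```

Two domains whose codes agree down to the topology level share a fold but are not claimed to be homologous; only agreement at the fourth level carries that claim.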
Figure 1
Schematic representation of the (C)lass, (A)rchitecture and (T)opology/fold levels in the CATH database.
Figure 2
Snapshot of a web page showing data available in the CATH dictionary of homologous superfamilies for the subtilisin family (CATH id 3.40.50.200). Tables display the PDB codes for non-identical relatives in the family, together with EC identifier codes and information about the enzyme reactions. The multiple structural alignment shown has been coloured according to secondary structure assignments (red for helix, blue for strands).
The Architecture level in CATH groups proteins whose folds have similar 3D arrangements of secondary structures (e.g. barrel, sandwich or propeller), regardless of their connectivity, whilst the top level, Class, simply reflects the proportion of α-helix or β-strand secondary structures. Three major classes are recognised, mainly-α, mainly-β and α-β, since analysis revealed considerable overlap between the α+β and alternating α/β classes originally described by Levitt and Chothia (7).
Before classification, multidomain proteins are first separated into their constituent folds using a consensus method which seeks agreement between three independent algorithms (8). Whilst the protocol for updating CATH is largely automatic (9), several stages require manual validation, in particular establishing domain boundaries in proteins for which no consensus could be reached, and checking the relationships of very distant homologues and proteins having borderline fold similarity. Although there are plans to assign the more regular architectures automatically, all architecture groupings are currently assigned manually.
A homologous family Dictionary is now available within CATH, which contains functional data, where available, for each protein within a homologous family. This includes EC identifiers, SWISS-PROT keywords and information from the Enzyme database or the literature (Fig. 2). Multiple structure-based alignments are also available, coloured according to secondary structure assignments or residue properties, and there are schematic plots showing domain representations annotated by protein-ligand interactions (DOMPLOTS) (A.E. Todd, C.A. Orengo and J.M. Thornton, submitted to Protein Engng). The topology of each domain is illustrated by schematic TOPS diagrams (http://www3.ebi.ac.uk/tops; 10).
Figure 3
CATH wheel plot showing the population of homologous families in different fold groups, architectures and classes. The wheel is coloured according to protein class (red, mainly-α; green, mainly-β; yellow, α-β; blue, few secondary structures). The size of the outer wheel represents the number of homologous families in CATH, whilst each band in the outer wheel corresponds to a single fold family. The size of each fold 'band' therefore reflects the number of homologous families having that fold. It can be seen that most fold families contain a single homologous family. The superfold families are shown as paler bands containing many homologous families. The inner wheel shows the population of homologous families in the different architectures.
We have also recently set up a Web Server (11) which enables the user to scan the CATH database with a newly determined protein structure and identify possible fold similarities or evolutionary relationships. There are also plans to incorporate sequence searches (using BLAST or PSI-BLAST) (12) to identify a probable fold for a new sequence.
The latest release of CATH (version 1.4, April 1998) contains 9342 protein chains from the PDB (13), which divide into 13 359 domain folds. Currently 32 different architectures are recognised. Since the last release, three new architectures have been described, including the five-bladed α-β propeller. Grouping proteins on the basis of sequence, structure and functional similarity gives 827 evolutionary homologous families (H-level), whilst recognising more distant structural similarity with no accompanying sequence or function similarity gives rise to 593 different fold groups (T-level).
The population of the different levels in the CATH hierarchy is illustrated by the CATH wheel shown in Figure 3. It can be seen that several highly populated fold families, which we describe as superfolds (6) as they support a diverse range of sequences and more than three different functions, still account for nearly 30% of non-homologous structures.
Implications for Structural Genomics
As the sequence databases grow rapidly, the need to interpret these sequences and assign functions to specific genes becomes increasingly important. Many techniques exist for matching protein sequences and thereby inheriting functional information. However, for very distant homologues there is often no detectable sequence similarity despite conservation of 3D structure and function. For these cases, evolutionary relationships, and thereby functions, can only be assigned by comparing the structures. Therefore a number of structural genomics initiatives are being proposed (14) which aim to identify all the folds in nature, with the ultimate goal of being able to predict the function of a new protein from its known or probable structure. The important questions to ask are: how many more folds do we need to determine before we have the complete set, and how confident can we be in assigning function between proteins having similar structures?
In the current genomes, on average only 30–46% of sequences can be assigned to a structural family by recognising sequence similarity to a protein of known structure (15,16). With only ∼600 unique structures currently in the PDB, compared with ∼20 000 sequence families, it is clear that we still need to determine many more structures if we are to understand biology at the molecular level. However, analysis of recently deposited structural data is very revealing. Figure 4a illustrates the distribution of 2159 new structural domains classified in the 10 months from June 1997 to March 1998. A large proportion of these (79%) were clearly homologous (≥30% identity) to proteins of known structure.
Of the remaining 443 structures (Fig. 4b), corresponding to 'new' sequences, we found only 8% were novel folds, the remainder resembling a previously determined structure. Many of these, 199 (45%), could be identified as clear homologues by having significant structure and sequence similarity (SSAP ≥80 and ≥20% sequence identity). A further 169 (38%) were probable homologues: although the sequence identity was below 20%, they had functional similarity and/or gave significant scores using sequence search methods designed to detect very distant homologues (PSI-BLAST) (12). There remained a further 40 (9%) proteins which were analogous, i.e. they had the same fold as a previous entry but neither the sequence nor the function gave definite evidence of a common ancestor.
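The percentages quoted for the 443 structures can be checked with a few lines of arithmetic. The counts are taken from the text; the number of novel folds is not stated explicitly, so it is obtained here by subtraction, on the assumption that the four categories cover all 443 structures.

```python
TOTAL_NEW_SEQUENCES = 443

# Counts stated in the text for Figure 4b.
STATED_COUNTS = {
    "clear homologues": 199,
    "probable homologues": 169,
    "analogues": 40,
}


def breakdown() -> dict[str, tuple[int, float]]:
    """Return {category: (count, percentage of 443)} for all categories."""
    counts = dict(STATED_COUNTS)
    # Assumption: whatever is left over is the set of novel folds.
    counts["novel folds"] = TOTAL_NEW_SEQUENCES - sum(STATED_COUNTS.values())
    return {
        name: (n, round(100.0 * n / TOTAL_NEW_SEQUENCES, 1))
        for name, n in counts.items()
    }


if __name__ == "__main__":
    for name, (n, pct) in breakdown().items():
        print(f"{name}: {n} ({pct}%)")
```

Under that assumption the remainder is 35 structures, about 8% of 443, which agrees with the quoted figure, and the clear and probable homologues together make up 368 of 443, about 83%.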
At the homologous superfamily level in CATH, a more detailed analysis of enzyme functions showed that the majority of homologous enzyme families in CATH (>90%) contained proteins for which the first three EC identifiers were the same. Considering those families where homologues have significant sequence identity (≥20%) after structural alignment, 95% were found to have a single EC identifier, whilst for families where proteins have more than 30% sequence similarity we observed that 98% had a single EC code.
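The test of whether a family shares "the first three EC identifiers" amounts to comparing EC number prefixes. The helper names below are hypothetical, and the sketch assumes complete four-field EC numbers with no partial entries.

```python
def ec_prefix(ec_number: str, levels: int = 3) -> tuple[str, ...]:
    """Return the first `levels` fields of a four-field EC number."""
    fields = ec_number.strip().split(".")
    if len(fields) != 4:
        raise ValueError(f"expected a four-field EC number, got {ec_number!r}")
    return tuple(fields[:levels])


def family_shares_ec(ec_numbers: list[str], levels: int = 3) -> bool:
    """True if every relative in the family has the same EC prefix."""
    if not ec_numbers:
        raise ValueError("at least one EC number is required")
    return len({ec_prefix(ec, levels) for ec in ec_numbers}) == 1
```

With levels=3, trypsin (3.4.21.4) and chymotrypsin (3.4.21.1) count as closely related functions, while with levels=4 they would be treated as having distinct EC codes.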
Although assigning function on the basis of homology is common practice, it is clear that some caution should be exercised, particularly where there is little or no sequence similarity. There are also some clear examples where homologues with significant sequence similarity perform different functions. The role of 'gene recruitment' is especially clear in the eye lens proteins, which function as enzymes in other cellular environments but are used as structural proteins in this context (17). The extent of such 'gene recruitment' and context-sensitive function is not known at this time. For enzymes it is clear that catalytic function can change and evolve, usually to act on a different but related substrate. Similarly, within the lipocalin family (CATH id 2.40.130.10) several proteins are found with very similar structures which bind different ligands in the same region at the base of the β-barrel (e.g. retinol, bilin, biotin).
Nearly half of the homologous families where two or more different EC numbers were observed belong to the superfolds. This suggests that if a new protein is assigned to a superfold family, more caution should be used when inheriting functional information, as there appears to be greater tolerance to changes in sequence, and ultimately function, for these families. However, it is interesting to note that many of these were TIM barrel or Rossmann folds. These are superfolds in which the substrate or ligand commonly binds in the same place: at the base of the β-barrel for the TIMs, and at the crossover of the polypeptide chain for the doubly wound Rossmann structures.
Assignment of Function Through Structure
One of the reasons for determining structures is to derive more information to facilitate the assignment of function. From our analysis of proteins in CATH, we suggest that structural data can help to assign function in several ways:
i. The structural data allow recognition of more distant homologues compared with sequence data. In our analysis, 83% of structures with novel sequences could be assigned as homologues in this way (note that such assignment of function is again subject to the caveats imposed by 'gene recruitment' discussed above).
ii. The structural data allow detailed inspection of the functional site, to suggest if and how the function may have evolved. For example, if an enzyme has evolved to act on a different substrate, the binding site may reveal, or at least suggest, possible changes in the substrate.
iii. For the superfolds, similarity of structure does not necessarily mean similarity of function. However, the active site/binding sites are often conserved, e.g. in the TIM barrel or Rossmann fold structures the ligand always binds at the same end of the barrel or sheet.
iv. Some methods have already been developed, and will increasingly be the focus of attention over the next few years, which aim to predict function ab initio from structure. For example, enzymes can often be identified by the presence of a major cleft, which also locates the active site (18). Similarly, critical surface patches which are used for molecular recognition in binding other proteins or ligands may be identified using knowledge-based approaches (19,20).
In summary, extrapolating the data from Figure 4 to a new genome, we can expect that, of the 54–70% of sequences which currently have no obvious sequence matches in the PDB, we will find nearly 80–90% to be homologous to a known family using the structural data alone. For the singlet folds this will almost certainly reveal some clues to the function. For the superfolds, some folds will reveal information on the functional class (e.g. enzyme for TIM barrels) or the location of the active site, if not the specific function. Only 10–20% will be expected to be 'novel' folds. For these, the ab initio methods referred to above may provide some clues to guide experiments. Therefore it is clear that determining structures, as part of a 'structural genomics' initiative for example, will make a major contribution to interpreting genome data.
Jail
Just another interface library
Interfaces of macromolecules are a valuable basis for analysing the process of molecular recognition. JAIL classifies not only the interfaces between domain architectures but also those between protein chains and those between proteins and nucleic acids.
Gt-alpha/Gi-alpha chimera (PDB-ID 1GOT). Interfaces of 1GOT.
Interacting proteins are difficult to crystallize and are rarely present within the Protein Data Bank. Nevertheless, it is essential to analyse the interacting parts of the proteins to understand the process of protein-protein docking.
To overcome this problem we have built the JAIL database. Since interacting domains exhibit structural features similar to those of proteins, all known interfaces between interacting domains of the SCOP database were extracted and classified in JAIL.
Only a part of all protein structures is included in SCOP; in particular, new PDB entries are not yet annotated. To overcome this problem, all interfaces between protein chains were additionally calculated and included in the database. This type of interface also comprises the interacting parts of the assumed biological units. The last important type of interface provided here is composed of the interacting parts between proteins and nucleic acids. Overall, the data set consists of about 180 000 interfaces.
JAIL is a convenient tool for browsing the interface library and analysing single interfaces. However, more general questions require large-scale analysis. For this purpose, a detailed form enables the compilation of comprehensive non-redundant data sets for download.
How is an interface defined?
A complete residue is part of an interface if at least one atom of the amino acid is located within 4.5 Ångström of any atom of the interacting domain or chain. In the case of protein chains, each part of an interface must consist of at least 5 C-alpha atoms. In the case of nucleic acids, the relevant atoms are calculated on the basis of the phosphorus atoms of the RNA/DNA backbone.
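The distance rule for protein interfaces can be illustrated with a brute-force sketch. The Atom record and function names are hypothetical and this is not the JAIL implementation; the minimum of 5 C-alpha atoms is interpreted here as at least 5 interface residues on a side, which is an assumption.

```python
import math
from typing import NamedTuple

CUTOFF_ANGSTROM = 4.5   # distance rule quoted in the text
MIN_RESIDUES = 5        # assumed reading of "at least 5 C-alpha atoms"


class Atom(NamedTuple):
    """Hypothetical minimal atom record: owning residue plus coordinates."""
    residue_id: int
    x: float
    y: float
    z: float


def interface_residues(side: list[Atom], partner: list[Atom],
                       cutoff: float = CUTOFF_ANGSTROM) -> set[int]:
    """Residues of `side` with at least one atom within `cutoff` of `partner`."""
    found: set[int] = set()
    for a in side:
        if a.residue_id in found:
            continue  # the whole residue already counts as interface
        for b in partner:
            if math.dist((a.x, a.y, a.z), (b.x, b.y, b.z)) <= cutoff:
                found.add(a.residue_id)
                break
    return found


def is_valid_protein_side(side: list[Atom], partner: list[Atom]) -> bool:
    """Check the minimum-size rule for one side of a protein interface."""
    return len(interface_residues(side, partner)) >= MIN_RESIDUES
```

The double loop compares every atom pair, which is fine for illustration; a real calculation over about 180 000 interfaces would normally use a spatial index such as a cell grid or k-d tree.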
What are biounits?
The primary coordinate file deposited in the PDB generally contains one asymmetric unit. The asymmetric unit is the smallest portion of a crystal structure to which crystallographic symmetry can be applied to generate one unit cell. The biological molecule (biounit) is believed to be the functional unit of the protein. Frequently these units can be assumed or calculated when additional information is available. The biological units of many proteins are deposited in a separate section of the PDB database and can be used for interface calculations.
How are redundant interfaces excluded in the download section?
Redundancy is excluded in two different ways: by structure and by sequence. The sequence clustering is based on the CD-HIT program. The structural clustering is defined by the protein families and superfamilies of the SCOP classification. The database classifies proteins by domain architecture.
Which settings in the download section are best for my own research?
The selection of the datasets depends on the type of interactions (protein-protein or protein-nucleic acid) and the level of diversity that is desired. A maximum sequence identity of 50% results in higher diversity than a setting of 95%. The default settings include interfaces of domain-domain interactions as well as interfaces between interacting chains. All interfaces of chains that are already covered by the SCOP domain interfaces are excluded by default. This procedure results in a high number of interfaces that are still diverse enough for statistical analysis.
What is meant by 'show conservation' in Jmol?
The conservation of protein sequences is defined by the mutation rates at each amino acid position. For JAIL this information was retrieved from ConSurf. ConSurf is a derived database merging structural and sequence information.
Database scheme
Search for:
PDB-Id, e.g. 1aay
SCOP-Id, e.g. d1az0a_
EC-Number, e.g. 311
Accession number, e.g. P03697
Protein name, e.g. capsid protein
Search in the following interface types:
Domain-Domain (SCOP)
Chain-Chain
Protein-Nucleic
Biounit-Biounit
None
A full-text keyword search and a SCOP text search are also available.
Consider only interfaces having the following location: intra, inter, don't care
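The search form accepts several kinds of identifier. A minimal sketch of how a query string could be sorted into those kinds is given here. The regular expressions are assumptions based on the usual shape of PDB, SCOP and UniProt identifiers, not JAIL's actual rules, and classify_query is a hypothetical helper name. EC numbers are left out because the example in these notes is garbled.

```python
import re

# Illustrative patterns only (assumptions, not taken from JAIL):
# a PDB id is a digit followed by three alphanumerics, a SCOP id is
# "d" + PDB id + chain character + domain character, and the accession
# pattern follows the published UniProt accession format.
PATTERNS = [
    ("SCOP-Id", re.compile(r"^d[0-9][a-z0-9]{3}[a-z0-9_.][a-z0-9_]$")),
    ("PDB-Id", re.compile(r"^[0-9][A-Za-z0-9]{3}$")),
    ("Accession number", re.compile(
        r"^(?:[OPQ][0-9][A-Z0-9]{3}[0-9]"
        r"|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2})$")),
]


def classify_query(query: str) -> str:
    """Return the search field a query most plausibly belongs to."""
    text = query.strip()
    for field, pattern in PATTERNS:
        if pattern.match(text):
            return field
    # Anything that is not a recognised identifier is treated as a name.
    return "Protein name"


for example in ["1aay", "d1az0a_", "P03697", "capsid protein"]:
    print(example, "->", classify_query(example))
```

The examples listed in the form (1aay, d1az0a_, P03697, capsid protein) each fall into a different field under these patterns.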
MMDB
Experimentally resolved structures of proteins, RNA and DNA, derived from the Protein Data Bank (PDB), with value-added features such as explicit chemical graphs, computationally identified 3D domains (compact substructures) that are used to identify similar 3D structures, as well as links to literature, similar sequences, information about chemicals bound to the structures, and more. These connections make it possible, for example, to find 3D structures for homologs of a protein sequence of interest, then interactively view the sequence-structure relationships, active sites, bound chemicals, journal articles, and more.
Three-dimensional structures are now known within many protein families, and it is quite likely, in searching a sequence database, that one will encounter a homolog with known structure. The goal of Entrez's 3D-structure database is to make this information, and the functional annotation it can provide, easily accessible to molecular biologists. To this end, Entrez's search engine provides three powerful features: (i) Sequence and structure neighbors: one may select all sequences similar to one of interest, for example, and link to any known 3D structures. (ii) Links between databases: one may search by term matching in MEDLINE, for example, and link to 3D structures reported in these articles. (iii) Sequence and structure visualization: identifying a homolog with known structure, one may view molecular-graphic and alignment displays to infer approximate 3D structure. In this article we focus on two features of Entrez's Molecular Modeling Database (MMDB) not described previously: links from individual biopolymer chains within 3D structures to a systematic taxonomy of organisms represented in molecular databases, and links from individual chains (and compact 3D domains within them) to structure neighbors, other chains (and 3D domains) with similar 3D structure. MMDB may be accessed at http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Structure
SUPERFAMILY is a database of structural and functional annotation for all proteins and genomes.[1][2][3][4][5]
The SUPERFAMILY annotation is based on a collection of hidden Markov models, which represent structural protein domains at the SCOP superfamily level.[6] A superfamily groups together domains which have an evolutionary relationship. The annotation is produced by scanning protein sequences from completely sequenced genomes against the hidden Markov models.
For each protein you can
Submit sequences for SCOP classification
View domain organisation sequence alignments and protein sequence details
For each genome you can
Examine superfamily assignments phylogenetic trees domain organisation lists and
networks
Check for over- and under-represented superfamilies within a genome
For each superfamily you can
Inspect SCOP classification, functional annotation, Gene Ontology annotation, InterPro abstract and genome assignments
Explore taxonomic distribution of a superfamily across the tree of life
All annotation models and the database dump are freely available for download to everyone
Purpose
SUPERFAMILY classifies amino acid sequences into known structural domains, especially into SCOP superfamilies. The superfamilies are groups of proteins which have structural evidence to support a common evolutionary ancestor but may not have detectable sequence homology.
Major Features
Sequence search: Submit your protein or DNA sequence for SCOP superfamily and family level classification.
Keyword search: Search for superfamily, family or species names, plus sequence, SCOP, PDB or hidden Markov model IDs.
Domain assignments: Domain assignments, alignments and architectures for completely sequenced eukaryotic and prokaryotic organisms, plus sequence collections.
Comparative genomics tools: Browse unusual (over- and under-represented) superfamilies and families, adjacent domain pair lists and graphs, unique domain pairs, domain combinations, domain architecture co-occurrence networks and domain distribution across taxonomic kingdoms for each organism.
Genome statistics: For each genome: number of sequences, number of sequences with assignment, percentage of sequences with assignment, percentage total sequence coverage, number of domains assigned, number of superfamilies assigned, number of families assigned, average superfamily size, percentage produced by duplication, average sequence length, average length matched, number of domain pairs and number of unique domain architectures.
Gene Ontology: Domain-centric Gene Ontology (GO), automatically annotated by Hai Fang.
Phenotype Ontology: Domain-centric phenotype/anatomy ontology, including Disease Ontology, Human Phenotype, Mouse Phenotype, Worm Phenotype, Yeast Phenotype, Fly Phenotype, Fly Anatomy, Zebrafish Anatomy, Xenopus Anatomy and Arabidopsis Plant.
Superfamily annotation: InterPro abstracts for 1052 superfamilies and Gene Ontology (GO) annotation for 763 superfamilies.
Functional annotation: Functional annotation of SCOP 1.73 superfamilies by Christine Vogel.
Phylogenetic trees: Trees are generated using heuristic parsimony methods and are based on protein domain architecture data for all genomes in SUPERFAMILY. Genome combinations or specific clades can be displayed as individual trees.
Similar domain architectures: Find the 10 domain architectures which are most similar to a domain architecture of interest.
Hidden Markov models: Produce SCOP domain assignments for your sequences using the SUPERFAMILY models. HMM visualisation by Martin Madera, e.g. model 0045110.
Profile comparison: Find remote domain matches when the HMM search fails to find a significant match. Profile comparison (PRC) for aligning and scoring two profile hidden Markov models, by Martin Madera.
Web services: Distributed Annotation Server and linking to SUPERFAMILY.
Downloads: Sequences, assignments, models, MySQL database and scripts, updated weekly.
The SUPERFAMILY database provides protein domain assignments, at the SCOP superfamily level, for the predicted protein sequences in over 400 completed genomes. A superfamily groups together domains of different families which have a common evolutionary ancestor, based on structural, functional and sequence data. SUPERFAMILY domain assignments are generated using an expert curated set of profile hidden Markov models. All models and structural assignments are available for browsing and download from http://supfam.org. The web interface includes services such as domain architectures and alignment details for all protein assignments, searchable domain combinations, domain occurrence network visualization, detection of over- or under-represented superfamilies for a given genome by comparison with other genomes, assignment of manually submitted sequences and keyword searches. In this update we describe the SUPERFAMILY database and outline two major developments: (i) incorporation of family level assignments and (ii) a superfamily-level functional annotation. The SUPERFAMILY database can be used for general protein evolution and superfamily-specific studies, genomic annotation, and structural genomics target suggestion and assessment.
Conserved Domain curators attempt to organize related domain models into phylogenetic family hierarchies.
Links to electronic literature resources: NCBI curated domains also provide links to citations in PubMed and NCBI Bookshelf that discuss the domain. These references are selected by curators and, whenever possible, include articles that provide evidence for the biological function of the domain and/or discuss the evolution and classification of a domain family. It is also possible to limit CDD search results to domain models from any given source database by using the Database Search Field.
PFAM
Pfam 27.0 (March 2013, 14831 families)
Proteins are generally comprised of one or more functional regions, commonly termed domains. The presence of different domains in varying combinations in different proteins gives rise to the diverse repertoire of proteins found in nature. Identifying the domains present in a protein can provide insights into the function of that protein.
The Pfam database is a large collection of protein domain families. Each family is represented by multiple sequence alignments and hidden Markov models (HMMs).
There are two levels of quality to Pfam families: Pfam-A and Pfam-B. Pfam-A entries are derived from the underlying sequence database, known as Pfamseq, which is built from the most recent release of UniProtKB at a given time-point. Each Pfam-A family consists of a curated seed alignment containing a small set of representative members of the family, profile hidden Markov models (profile HMMs) built from the seed alignment, and an automatically generated full alignment, which contains all detectable protein sequences belonging to the family, as defined by profile HMM searches of primary sequence databases.
Pfam-B families are un-annotated and of lower quality, as they are generated automatically from the non-redundant clusters of the latest ADDA release. Although of lower quality, Pfam-B families can be useful for identifying functionally conserved regions when no Pfam-A entries are found.
Pfam entries are classified in one of four ways:
Family: A collection of related protein regions.
Domain: A structural unit.
Repeat: A short unit which is unstable in isolation but forms a stable structure when multiple copies are present.
Motif: A short unit found outside globular domains.
Pfam also generates higher-level groupings of related families, known as clans. A clan is a collection of Pfam-A entries which are related by similarity of sequence, structure or profile-HMM.
You can find data in Pfam in various ways:
Sequence search: Analyze your protein sequence for Pfam matches.
View a Pfam family: View Pfam family annotation and alignments.
View a clan: See groups of related families.
View a sequence: Look at the domain organisation of a protein sequence.
View a structure: Find the domains on a PDB structure.
Keyword search: Query Pfam by keywords.
Jump to: Enter any type of accession or ID to jump to the page for a Pfam family or clan, UniProt sequence, PDB structure, etc.
Browse Pfam
You can use the browse links to find lists of families, clans or proteomes which begin with the chosen letter (or number). You can also see a list of Pfam families which are new to this release, or the list of the twenty largest families in terms of number of sequences.
Glossary of terms used in Pfam
These are some of the commonly used terms in the Pfam website
Alignment coordinates
HMMER3 reports two sets of domain coordinates for each profile HMM match. The envelope coordinates delineate the region on the sequence where the match has been probabilistically determined to lie, whereas the alignment coordinates delineate the region over which HMMER is confident that the alignment of the sequence to the profile HMM is correct. Our full alignments contain the envelope coordinates from HMMER3.
Architecture
The collection of domains that are present on a protein
Clan
A collection of related Pfam entries. The relationship may be defined by similarity of sequence, structure or profile-HMM.
Domain
A structural unit
Domain score
The score of a single domain aligned to an HMM. Note that, for HMMER2, if there was more than one domain, the sequence score was the sum of all the domain scores for that Pfam entry. This is not quite true for HMMER3.
DUF
Domain of unknown function
Envelope coordinates
See Alignment coordinates
Family
A collection of related protein regions
Full alignment
An alignment of the set of related sequences which score higher than the manually set threshold values for the HMMs of a particular Pfam entry.
Gathering threshold (GA)
Also called the gathering cut-off, this value is the search threshold used to build the full alignment. The gathering threshold is assigned by a curator when the family is built. The GA is the minimum score a sequence must attain in order to belong to the full alignment of a Pfam entry. For each Pfam HMM we have two GA cutoff values, a sequence cutoff and a domain cutoff.
HMMER
The suite of programs that Pfam uses to build and search HMMs. Since Pfam release 24.0 we have used HMMER version 3 to make Pfam. See the HMMER site for more information.
Hidden Markov model (HMM)
A HMM is a probabilistic model. In Pfam we use HMMs to transform the information contained within a multiple sequence alignment into a position-specific scoring system. We search our HMMs against the UniProt protein database to find homologous sequences.
HMMER3
The suite of programs that Pfam uses to build and search HMMs. See the HMMER site for more information.
iPfam
A resource that describes domain-domain interactions that are observed in PDB entries. Where two or more Pfam domains occur in a single structure, it analyses them to see if they are close enough to form an interaction. If they are close enough, it calculates the bonds forming the interaction.
Metaseq
A collection of sequences derived from various metagenomics datasets
Motif
A short unit found outside globular domains
Noise cutoff (NC)
The bit scores of the highest scoring match not in the full alignment
Pfam-A
A HMM-based, hand-curated Pfam entry which is built using a small number of representative sequences. We manually set a threshold value for each HMM and search our models against the UniProt database. All of the sequences which score above the threshold for a Pfam entry are included in the entry's full alignment.
Pfam-B
An automatically generated alignment which is formed by taking a cluster of sequences from the ADDA database and removing Pfam-A residues from them. Since Pfam-B families are automatically generated, we recommend that you verify that the sequences in a Pfam-B family are related using other methods, such as BLAST. For Pfam 24.0 we have made HMMs for the first (and therefore largest) 20000 Pfam-B families. Users can search their sequences against the Pfam-B HMMs, in addition to the Pfam-A HMMs, when performing both single-sequence searches and batch searches on the website.
Posterior probability
HMMER3 reports a posterior probability for each residue that matches a match or insert state in the profile HMM. A high posterior probability shows that the alignment of the amino acid to the match/insert state is likely to be correct, whereas a low posterior probability indicates that there is alignment uncertainty. This is indicated on a scale with * being 10, the highest certainty, down to 1 being complete uncertainty. Within Pfam we display this information as a heat map view, where green residues indicate high posterior probability and red ones indicate a lower posterior probability.
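As an illustration of how such a per-residue confidence line could be read, here is a minimal sketch. It assumes the convention described in the HMMER3 user guide, where each digit d covers roughly (10d - 5)% to (10d + 5)%, 0 covers 0 to 5% and * covers 95 to 100%. The function names and the 75% cut-off are hypothetical choices for the example.

```python
def pp_lower_bound_percent(symbol: str) -> int:
    """Lower bound (in percent) of the posterior probability for one symbol."""
    if symbol == "*":
        return 95
    if len(symbol) == 1 and symbol in "0123456789":
        digit = int(symbol)
        # '0' covers 0-5%; every other digit d covers (10d-5)% to (10d+5)%.
        return 0 if digit == 0 else 10 * digit - 5
    raise ValueError(f"not a posterior probability symbol: {symbol!r}")


def uncertain_columns(pp_line: str, min_percent: int = 75) -> list[int]:
    """Indices of aligned residues whose confidence is below min_percent."""
    return [
        index
        for index, symbol in enumerate(pp_line)
        if symbol != "."  # '.' marks a column without an aligned residue
        and pp_lower_bound_percent(symbol) < min_percent
    ]


print(uncertain_columns("79**8.42*"))
```

In a heat map view like the one Pfam describes, the columns returned here would be the ones drawn towards the red end of the scale.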
Repeat
A short unit which is unstable in isolation but forms a stable structure when
multiple copies are present
Seed alignment
An alignment of a set of representative sequences for a Pfam entry. We use this alignment to construct the HMMs for the Pfam entry.
Sequence score
The total score of a sequence aligned to a HMM. If there is more than one domain, the sequence score is the sum of all the domain scores for that Pfam entry. If there is only a single domain, the sequence and the domain scores for the protein will be identical. We use the sequence score to determine whether a sequence belongs to the full alignment of a particular Pfam entry.
Trusted cutoff (TC)
The bit scores of the lowest scoring match in the full alignment
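The way the cut-offs in this glossary fit together can be sketched in a few lines. This is a simplified illustration, not Pfam's pipeline: it uses the HMMER2-style rule that the sequence score is the sum of the domain scores (the glossary notes this is not quite true for HMMER3), and the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class GatheringThreshold:
    sequence_cutoff: float  # minimum total score for the sequence
    domain_cutoff: float    # minimum score for each reported domain


def sequence_score(domain_scores: list[float]) -> float:
    """HMMER2-style sequence score: the sum of all domain scores."""
    return sum(domain_scores)


def passing_domains(domain_scores: list[float],
                    ga: GatheringThreshold) -> list[float]:
    """Domain scores that would enter the full alignment under the GA."""
    if sequence_score(domain_scores) < ga.sequence_cutoff:
        return []
    return [score for score in domain_scores if score >= ga.domain_cutoff]


def trusted_and_noise_cutoffs(scores: list[float], ga: float):
    """TC: lowest score in the full alignment. NC: highest score outside it."""
    included = [s for s in scores if s >= ga]
    excluded = [s for s in scores if s < ga]
    tc = min(included) if included else None
    nc = max(excluded) if excluded else None
    return tc, nc


ga = GatheringThreshold(sequence_cutoff=20.0, domain_cutoff=8.0)
print(passing_domains([12.0, 9.5, 3.0], ga))
print(trusted_and_noise_cutoffs([30.1, 25.0, 21.5, 19.9, 5.0], ga=21.0))
```

By construction the gathering threshold always lies between the noise cutoff and the trusted cutoff.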
SMART (a Simple Modular Architecture Research Tool) allows the identification and annotation of genetically mobile domains and the analysis of domain architectures. More than 500 domain families found in signalling, extracellular and chromatin-associated proteins are detectable. These domains are extensively annotated with respect to phyletic distributions, functional class, tertiary structures and functionally important residues. Each domain found in a non-redundant protein database, as well as search parameters and taxonomic information, are stored in a relational database system. User interfaces to this database allow searches for proteins containing specific combinations of domains in defined taxa.
Features
You can use SMART in two different modes: normal or genomic. The main difference is in the underlying protein database used. In Normal SMART, the database contains Swiss-Prot, SP-TrEMBL and stable Ensembl proteomes. In Genomic SMART, only the proteomes of completely sequenced genomes are used: Ensembl for metazoans and Swiss-Prot for the rest.
The protein database in Normal SMART has significant redundancy, even though identical proteins are removed. If you use SMART to explore domain architectures, or want to find exact domain counts in various genomes, consider switching to Genomic mode. The numbers in the domain annotation pages will be more accurate, and there will not be many protein fragments corresponding to the same gene in the architecture query results. Remember, though, that we are then exploring a limited set of genomes.
Different color schemes are used to easily identify the mode we are in.
Alignment
Representation of a prediction of the amino acids in tertiary structures of homologues that overlay in three dimensions. Alignments held by SMART are mostly based on published observations (see domain annotations for details), but are updated and edited manually.
Alignment block
Ungapped alignments that usually represent a single secondary structure
Bits scores
Alignment scores are reported by HMMer and BLAST as bits scores. The likelihood that the query sequence is a bona fide homologue of the database sequence is compared to the likelihood that the sequence was instead generated by a random model. Taking the logarithm (to base 2) of this likelihood ratio gives the bits score.
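The definition above amounts to a one-line calculation. The sketch below only illustrates the arithmetic. The two probabilities are made-up inputs, and the function name is hypothetical.

```python
import math


def bits_score(p_homologue_model: float, p_random_model: float) -> float:
    """Base-2 logarithm of the likelihood ratio of the two models."""
    return math.log2(p_homologue_model / p_random_model)


# A sequence that is 1024 times more likely under the family model than
# under the random model scores 10 bits; equal likelihoods score 0 bits.
print(bits_score(2 ** -10, 2 ** -20))
print(bits_score(0.001, 0.001))
```

A negative bits score means the random model explains the sequence better than the family model does.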
BLAST Basic local alignment search tool
An excellent database searching tool developed at the National Center for Biotechnology Information (NCBI) ([1] [2] [3] [4]). SMART uses NCBI-BLAST for detection of outlier homologues and homologues of known structure. WU-BLAST is used for nrdb searches with user supplied sequences.
Cellular Role
Chromatin-associated: Chromatin is the tangled fibrous complex of DNA and protein within a eukaryotic nucleus.
Interaction (with the environment): Molecules that sense cellular environmental change, such as osmolarity, light flux, acidity, ion concentration, etc.
Metabolic: Enzymes that catalyze reactions in living cells that transform organic molecules.
Replication: The process of making an identical copy of a section of duplex (double-stranded) DNA, using existing DNA as a template for the synthesis of new DNA strands.
Signalling: Proteins that participate in a pathway that is initiated by a molecular stimulus and terminated by a cellular response.
Transport: Transporters (permeases) are proteins that straddle the cell membrane and carry specific nutrients, ions, etc. across the membrane.
Translation: The process in which the genetic code carried by messenger RNA directs the synthesis of proteins from amino acids.
Transcription: The synthesis of an RNA copy from a sequence of DNA (a gene); the first step in gene expression.
Coiled coils
Intimately-associated bundles of long alpha-helices ([1] [2] [3]). Coiled coils are detected in SMART using the method of Lupas et al ([4], COILS home). Coiled coil predictions are indicated on the second line in SMART's graphical output.
Domain
Conserved structural entities with distinctive secondary structure content and a hydrophobic core. In small disulphide-rich and Zn2+-binding or Ca2+-binding domains, the hydrophobic core may be provided by cystines and metal ions, respectively. Homologous domains with common functions usually show sequence similarities.
Domain composition
Proteins with the same domain composition have at least one copy of each of the domains of the query.
Domain organisation
Proteins having all the domains as the query in the same order (Additional domains are allowed)
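The difference between the two definitions can be made concrete with a small sketch. It assumes a protein is given as the list of its domain names in sequence order, and it reads "same order, additional domains allowed" as an ordered subsequence test. That reading, the function names and the example domain names are assumptions for illustration.

```python
def same_composition(query: list[str], protein: list[str]) -> bool:
    """At least one copy of each query domain, in any order."""
    return set(query) <= set(protein)


def same_organisation(query: list[str], protein: list[str]) -> bool:
    """All query domains in the same order; extra domains are allowed."""
    remaining = iter(protein)
    # 'domain in remaining' consumes the iterator up to the first match,
    # so every later query domain must be found further along the protein.
    return all(domain in remaining for domain in query)


query = ["SH3", "SH2", "TyrKc"]
print(same_organisation(query, ["SH3", "PH", "SH2", "TyrKc"]))
print(same_composition(query, ["SH2", "SH3", "TyrKc"]))
print(same_organisation(query, ["SH2", "SH3", "TyrKc"]))
```

Every protein with the same organisation also has the same composition, but not the other way round.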
Entrez
A WWW-based system that allows easy retrieval of sequence, structure, molecular biology and literature data (Entrez). SMART's domain annotation pages contain links to the Entrez system, thereby providing extensive literature, structure and sequence information.
E-value
This represents the number of sequences with a score greater than or equal to X expected absolutely by chance. The E-value connects the score (X) of an alignment between a user-supplied sequence and a database sequence, generated by any algorithm, with how many alignments with similar or greater scores would be expected from a search of a random sequence database of equivalent size. Since version 2.0, E-values are calculated using Hidden Markov Models, leading to more accurate estimates than before.
Extracellular Domains
Domain families that are most prevalent in proteins outside of the cytoplasm and the nucleus.
Gap
A position in an alignment that represents a deletion within one sequence relative to another. Gap penalties are requirements for alignment algorithms in order to reduce excessively-gapped regions. Gaps in alignments represent insertions that usually occur in protruding loops or beta-bulges within protein structures.
Genomic database
Protein database used in SMART's Genomic mode. It contains data from completely sequenced genomes only. Ensembl data is used for Metazoan genomes and Swiss-Prot for others. A complete list of genomes in the database is available.
HMM Hidden Markov model
HMMs are statistical models of the sequence consensus of a homologous family (see the docu). A particular class of HMMs has been shown to be equivalent to generalised profiles (8867839). Applications of HMMs to sequence analysis are nicely provided by HMMer and SAM.
HMM consensus
The HMM consensus is a one-line summary of the corresponding HMM. The amino acid shown for the consensus is the highest probability amino acid at that position according to the HMM. Capital letters mean highly conserved residues (probability > 0.5 for protein models) (modified from the HMMer User's Guide).
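The consensus rule described above can be sketched as follows. The per-position emission probabilities are invented toy values, hmm_consensus is a hypothetical name, and writing the less conserved positions in lower case is an assumption about the display, since the text only states what capital letters mean.

```python
def hmm_consensus(emissions: list[dict[str, float]],
                  threshold: float = 0.5) -> str:
    """One-line consensus from per-position emission probabilities."""
    letters = []
    for probabilities in emissions:
        # Pick the most probable amino acid at this position.
        residue, probability = max(probabilities.items(),
                                   key=lambda item: item[1])
        # Upper case marks a highly conserved position (probability > 0.5).
        letters.append(residue.upper() if probability > threshold
                       else residue.lower())
    return "".join(letters)


toy_model = [
    {"A": 0.70, "G": 0.30},
    {"L": 0.40, "I": 0.35, "V": 0.25},
    {"K": 0.90, "R": 0.10},
]
print(hmm_consensus(toy_model))
```

Only the first and third toy positions pass the 0.5 threshold, so only they are capitalised.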
HMMer
The HMMer package ([1] [2]) provides multiple alignment and database searching capabilities. There are several programs in the package (see the docu), including one (hmmfs) that searches databases for non-overlapping LOCAL similarities (i.e. that match across at least part of the HMM) and another (hmmls) that searches databases for non-overlapping GLOBAL similarities (i.e. that match across the full HMM). These correspond approximately to profile-based searches using negative and positive profiles, respectively (see WiseTools). Database searches using hmmls or hmmfs provide alignment scores as bits scores.
Homology
Evolutionary descent from a common ancestor due to gene duplication
Intracellular Domains
Domain families that are most prevalent in proteins within the cytoplasm
Localisation
Numbers of domains that are thought, from SwissProt annotations, to be present in different cellular compartments (cytoplasm, extracellular space, nucleus and membrane-associated) are shown in annotation pages.
Motif
Sequence motifs are short conserved regions of polypeptides. Sets of sequence motifs need not necessarily represent homologues.
NRDB non-redundant database
A database that contains no identical pairs of sequences. It can contain multiple sequences originating from the same gene (fragments, alternative splicing products). SMART in Standard mode uses such a database.
ORF
Open reading frame
Outlier homologues
These are often difficult to detect using HMM methodology. A complementary approach to their detection is to query a database of sequences taken from multiple sequence alignments using BLAST.
Selecting this option will also activate searches against sequence databases derived from proteins of known structure. A simple BLAST search of the PDB is performed together with a search of RPS_Blast profiles derived from SCOP. These profiles were kindly provided by Steffen Schmidt (see Schmidt et al., J Chem Inf Comput Sci 2002 (42) 405-7).
P-value
This represents a probability that, given a database of a particular size, random sequences score higher than a value X. P-values are generated by the BLAST algorithm that has been integrated into SMART.
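E-values and P-values are two views of the same chance expectation. The sketch below uses the standard relation from BLAST statistics, P = 1 - exp(-E), which assumes that the number of chance hits is Poisson distributed. That relation is not stated in these notes and is brought in here as an assumption, and the function names are hypothetical.

```python
import math


def evalue_to_pvalue(e_value: float) -> float:
    """Probability of at least one chance hit, given E expected chance hits."""
    # -expm1(-E) equals 1 - exp(-E) but stays accurate for very small E.
    return -math.expm1(-e_value)


def pvalue_to_evalue(p_value: float) -> float:
    """Inverse relation: E = -ln(1 - P)."""
    return -math.log1p(-p_value)


for e in (0.001, 0.01, 1.0, 5.0):
    print(e, evalue_to_pvalue(e))
```

For small values the two numbers are nearly identical, while a large E-value corresponds to a P-value that approaches 1.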
PDB protein data bank
PDB is an archive of experimentally-determined three-dimensional structures (Brookhaven Natl. Labs or EBI). Domain families represented in SMART and in the PDB are annotated as being of known structure; links are provided in SMART to the PDB via PDBsum and MMDB. PDBsum links can be used to access a variety of sequence-based and structure-based tools, whereas MMDB provides access to literature information and structure similarities.
PFAM
Pfam is a database of protein domain families represented as (i) multiple alignments and (ii) HMM-profiles ([1] [2]). Pfam WWW servers allow comparison of user-supplied sequences with the Pfam database (Sanger Center and Washington Univ).
SMART contains a facility to search the Pfam collection using HMMer.
Prokaryotic domains
SMART now also searches for domains found in two component regulatory systems. These can be found mainly in Prokaryotes, but a few were also found in eukaryotes like yeast and plants.
Profile
A profile is a table of position-specific scores and gap penalties, representing a homologous family, that may be used to search sequence databases (Ref [1] [2] [3]).
In CLUSTAL-W-derived profiles, those sequences that are more distantly related are assigned higher weights ([4] [5] [6]). Issues in profile-based database searching are discussed in Bork & Gibson (1996) [7].
ProfileScan
An excellent WWW server that allows a user to compare a protein or DNA sequence against a database of profiles (located at the ISREC).
PROSITE
This is a dictionary of protein sites and motif patterns. Some SMART domain annotations contain links to PROSITE.
Schnipsel database domain sequence database
Schnipsel is a German word meaning snippet or fragment. The schnipsel database consists of the sequences of all domains found with SMART in NRDB. Outliers of a family often cannot be detected by a profile, yet are detectable by pairwise similarity to one or more established members of a sequence family. So searching against the schnipsel database gives complementary information to the profile searches.
Searched domains
In the first version of SMART, only eukaryotic signalling domains could be searched. In 1998 we extended this set by prokaryotic signalling and extracellular domains. In the input page you can choose whether you only want to search cytoplasmic (prokaryotic + eukaryotic), extracellular or all domains.
Secondary Literature
The secondary literature is derived by the following procedure: for each of the hand selected papers referenced by a domain, 100 neighbouring papers are retrieved using Medline. If one of these neighbouring papers is referenced from more than two original papers, it is included into the secondary literature list.
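The secondary literature procedure can be sketched as a simple counting step. The neighbours_of callable stands in for the Medline neighbour lookup and is purely hypothetical, as are the paper identifiers in the example. Only the counting rule (a neighbour shared by more than two original papers) comes from the text.

```python
from collections import Counter
from typing import Callable


def secondary_literature(primary_papers: list[str],
                         neighbours_of: Callable[[str], list[str]],
                         n_neighbours: int = 100,
                         min_sources: int = 3) -> list[str]:
    """Neighbouring papers shared by more than two hand selected papers."""
    counts: Counter[str] = Counter()
    for paper in primary_papers:
        # Count each neighbour once per original paper.
        for neighbour in set(neighbours_of(paper)[:n_neighbours]):
            counts[neighbour] += 1
    # "More than two original papers" means at least three.
    return sorted(p for p, count in counts.items() if count >= min_sources)


# Toy neighbour table used in place of a real literature lookup.
toy_neighbours = {
    "paper1": ["a", "b", "c"],
    "paper2": ["a", "b"],
    "paper3": ["a", "d"],
}
print(secondary_literature(list(toy_neighbours), toy_neighbours.__getitem__))
```

In the toy table only paper "a" neighbours three original papers, so it is the only one that would be listed.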
Seed Alignment
Alignment that contains only one of each pair of homologues that are represented in a CLUSTALW-derived phylogenetic tree linked by a branch of length less than a distance of 0.2 (see the related article).
SEG
A program of Wootton & Federhen [1] that detects regions of the query sequence that have low compositional complexity [2].
Sequence ID or ACC
Sequence identifiers or accession codes may be entered via the SMART homepage to initiate a query. You can use either Uniprot or Ensembl sequence identifiers.
Signalling Domains
The original set of domains used in SMART were collected as those that satisfied one or both of two criteria:
1. cytoplasmic domains that possess kinase, phosphatase, ubiquitin ligase or phospholipase enzymatic activities, or those that stimulate GTPase-activation or guanine nucleotide exchange
2. cytoplasmic domains that occur in at least two proteins with different domain organisations, of which one also contains a domain that satisfies criterion 1
These domains mostly mediate or regulate the transduction of an extracellular signal towards the nucleus, resulting in the initiation of a cellular response. More recently, prokaryotic two-component signalling domains have been added to the SMART set.
SignalP
This program predicts the presence and location of signal peptide cleavage sites in amino acid sequences (SignalP home page).
SMART Simple Modular Architecture Research Tool
Our main goal in providing this tool is to allow automatic identification and annotation of domains in user-supplied protein sequences.
Species
Numbers of domains present in a variety of selected taxa (animal, archaea, bacteria, fungi, plants and protozoa) are shown in annotation pages.
SwissProt
The SwissProt database is an extensively annotated and non-redundant collection of protein sequences. SwissProt annotations have been mined for SMART-derived annotations of alignments.
TMHMM2
This program predicts the location and topology of transmembrane helices (TMHMM2).
Thresholds
For each of the domains found by SMART, a combination of thresholds is used to distinguish between true and false hits. The different thresholds are described in the SMART paper.
WiseTools
A package that is based on database searches using profiles. Profiles may be generated using PairWise and then compared with sequence databases using SearchWise. Scores are generated for alignments that match the whole of the profile (using a positive profile) or else that match at least part of the profile (using a negative profile). Only the top-scoring optimal alignment of each sequence is reported; hence SMART relies on re-iterating the search for new repeats until none are reported that score above threshold. Score thresholds have been set manually that are considered to represent a score just above the top-scoring true negative. Additional thresholds have been estimated for domains that are repeated in single polypeptides. A more recent package allows comparison of DNA sequences at the level of their conceptual translations, regardless of sequence error and introns (see Wise2).
Whatrsquos new
Changes from version 60 to 70
Full text search engineperform a full text search of SMART and Pfam domain annotations p lus the completeprotein descriptions for Uniprot and Ensembl proteins
metaSMART Explore and compare domain architectures in various publicly available metagenomicsdatasets
iTOL export and visualizationDomain architecture analysis results can be exported and visualized in interactive Tree Of
Life New option can be found in the protein list function select list User interface cleanup
Various small changes to the UI resulting in faster and easier navigation
Changes from version 5.1 to 6.0
Metabolic pathways information
SMART domains and proteins available in the genomic mode now have basic metabolic pathways information. It is generated by mapping our genomic mode protein database to the KEGG orthologous groups.
Updated genomic mode protein database
The greatly expanded genomic mode database now includes sequences from 630 completely sequenced species.
Changes from version 5.0 to 5.1
SMART webservice
You can access SMART using our webservice. Check the WSDL file for details. The webservice is still under active development and only works for sequences and IDs which are in our database (complete Uniref100 and stable Ensembl genomes).
SMART DAS server
The SMART DAS server is available at the URL http://smart.embl.de/smart/das. It provides all the protein features from our database (SMART domains, signal peptides, transmembrane regions and coiled coils) for all Uniref100 and Ensembl proteins.
If you need help with these services, or have questions or feedback, please contact us.
Changes from version 4.1 to 5.0
New protein database in Normal mode
SMART now uses Uniprot as the main source of protein sequences. All Ensembl proteomes (except pre-releases) are also included. To lower redundancy in the database, the following procedure is used:
o only one copy of 100% identical proteins is kept (different IDs are still available)
o each species' proteins are separated
o CD-HIT clustering with a 96% identity cutoff is performed on each species separately
o the longest member of each protein cluster is used as the representative
o only representative cluster members and single proteins (i.e. proteins which are not members of any clusters) are used in all domain architecture queries and for domain counts in the annotation pages
Even though the number of proteins in the database has almost doubled (the current version has around 2.9 million proteins), the redundancy should be minimal.
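The redundancy procedure above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the clustering step itself is left to CD-HIT, and the function names, the input shapes and the toy sequences are hypothetical, not part of SMART.

```python
def collapse_identical(proteins):
    """Keep one copy of each 100% identical sequence, remembering every ID.

    proteins: iterable of (protein_id, sequence) pairs (hypothetical input shape).
    Returns a dict mapping each distinct sequence to the list of its IDs.
    """
    ids_by_sequence = {}
    for protein_id, sequence in proteins:
        ids_by_sequence.setdefault(sequence, []).append(protein_id)
    return ids_by_sequence


def pick_representatives(clusters, sequences):
    """Return the longest member of each cluster as its representative.

    clusters: list of lists of protein IDs, assumed to be already-parsed
              per-species clustering output (for example from CD-HIT).
    sequences: dict mapping protein ID to sequence.
    Singletons are represented by themselves.
    """
    return [max(cluster, key=lambda pid: len(sequences[pid])) for cluster in clusters]


# Toy data, invented for illustration only.
collapsed = collapse_identical([("A", "MKT"), ("B", "MKT"), ("C", "MKTLL")])
reps = pick_representatives(
    [["A", "C"], ["D"]],
    {"A": "MKT", "C": "MKTLL", "D": "MA"},
)
```

Only the IDs returned by pick_representatives would then feed domain architecture queries and domain counts.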
Domain architecture invention dating
As a further step from the single domain to the understanding of multi-domain proteins, SMART now predicts the taxonomic class where the concept of a protein, that is its domain architecture, was invented. The domain architecture is defined as the linear order of all SMART domains in the protein sequence. To derive the point of its invention, all proteins with the same domain architecture are mapped onto NCBI's taxonomy. The last common ancestor of all organisms containing at least one protein with the domain architecture is defined as the point of its origin.
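A minimal sketch of the last-common-ancestor step, assuming each organism is already available as a root-to-leaf list of taxon names (the lineages below are simplified and illustrative; SMART's own implementation is not described in the text):

```python
def last_common_ancestor(lineages):
    """Return the deepest taxon shared by all lineages, or None.

    lineages: list of root-to-leaf lists of taxon names, one per organism
              that contains a protein with the architecture of interest.
    """
    if not lineages:
        return None
    common = None
    # zip stops at the shortest lineage, so we never read past a leaf.
    for taxa_at_depth in zip(*lineages):
        if all(taxon == taxa_at_depth[0] for taxon in taxa_at_depth):
            common = taxa_at_depth[0]
        else:
            break
    return common


# Simplified, illustrative lineages.
origin = last_common_ancestor([
    ["cellular organisms", "Eukaryota", "Metazoa", "Chordata"],
    ["cellular organisms", "Eukaryota", "Metazoa", "Arthropoda"],
    ["cellular organisms", "Eukaryota", "Fungi"],
])
```

In this toy case the architecture would be dated to Eukaryota, the deepest node that contains all three organisms.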
Changes from version 4.0 to 4.1
Two modes of operation: Normal and Genomic
For more details visit the change mode page.
Intrinsic protein disorder prediction
You can now include DisEMBL prediction of protein disorder in your search parameters. The results are displayed as blue regions in protein schematics. DisEMBL's HOTLOOPS and REM465 methods are currently used. Visit the DisEMBL page for more information on the method.
Catalytic activity check
SMART now includes data on specific requirements for catalytic activity for some of our domains (50 catalytic domains at the moment). If the required amino acids are not present in the predicted domain, it will be marked as Inactive. The domain annotation page will show you the details on which amino acids are missing and links to the relevant literature.
Taxonomic trees
Architecture query results are now displayed as simple taxonomic trees. In addition to individual proteins, you can select any taxonomic node (or multiple nodes) and display all the proteins in those nodes. The Evolution section of domain annotation pages is now also represented as a taxonomic tree. The basic tree contains only several hand-picked representative species, with a link to the full tree.
User interface redesigned
SMART has been completely rewritten and all pages conform to the XHTML 1.0 Strict and CSS Level 2 standards. We recommend a modern standards-compliant browser for the best experience. Mozilla Firefox and Opera are our favorites.
Changes from version 3.5 to 4.0
Intron positions are shown in protein schematics
For proteins that match any of the Ensembl predictions, SMART will show intron positions as vertical coloured lines in graphical representations. This information is retrieved from a pre-calculated mapping of Ensembl gene structures to protein sequences.
The vertical line at the end of the protein is not an actual intron, but a mark to show that intron mapping was performed. If that is the only line, there are no introns annotated. If there is no line at all, there is no data available in Ensembl for that particular sequence.
You can switch off intron display on your SMART preferences page.
Alternative splicing information
Since SMART now incorporates Ensembl genomes, the Additional information page shows a list of alternative splices of the gene encoding the analyzed protein (if there are any). It is possible either to display the SMART protein annotation for any of the alternative splices or to get a graphical multiple sequence alignment of all of them.
Orthology information
There are 2 separate sets of orthologs for each Ensembl protein: 1:1 reciprocal best matches in other genomes, and orthologous groups with reciprocal best hits from all genomes analyzed (i.e. each of these proteins has exactly one ortholog in all 6 genomes). This data is displayed on the Additional information page.
Graphical multiple sequence alignments
Orthologous groups and different alternative splices can be displayed as graphical multiple sequence alignments. Proteins are aligned using ClustalW. Domains, intrinsic features and introns are mapped onto the alignment, with their positions adjusted according to gaps (black boxes).
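Adjusting feature positions according to gaps amounts to translating residue numbers in the ungapped protein into column numbers in the alignment. The following is a small sketch of that idea, assuming "-" as the gap character and 1-based, inclusive residue coordinates; the function names and the example row are invented for illustration.

```python
def residue_columns(aligned_sequence, gap="-"):
    """List the 0-based alignment column occupied by each residue, in order."""
    return [col for col, char in enumerate(aligned_sequence) if char != gap]


def map_feature(aligned_sequence, start, end):
    """Map a feature given in 1-based residue positions onto alignment columns.

    Returns (first_column, last_column), both 0-based and inclusive.
    """
    columns = residue_columns(aligned_sequence)
    return columns[start - 1], columns[end - 1]


# Illustrative aligned row: residues M K T L L A with three gap columns.
row = "MK--TLL-A"
domain_span = map_feature(row, 2, 4)   # residues K..L of the ungapped protein
```

A domain drawn over the alignment would then stretch across the gap columns it spans, which is why mapped features can look longer than in the single-protein schematic.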
Changes from version 3.4 to 3.5
Features of all Ensembl genomes are stored in SMART
You can use standard Ensembl protein identifiers (for example ENSP00000264122) and sequences in all queries.
Batch access
Quickly display results for hundreds of sequences. Currently limited to 500 sequences per access, 2000 per day.
Smarter handling of identical sequences
If there are multiple IDs associated with a particular sequence, an extra table will be displayed showing all of them.
Changes from version 3.3 to 3.4
Search structure based profiles using RPS-Blast
Clicking on the "search schnipsel and structures" checkbox will now also initiate a search of profiles based on scop domain families using RPS-Blast. These profiles were kindly provided by Steffen Schmidt (see Schmidt et al., J Chem Inf Comput Sci 2002 (42) 405-7).
Improved Architecture analysis
In addition to standard Domain selection querying, it is now possible to do queries based on GO (Gene Ontology) terms associated with domains. In the first step you get a list of domains matching the GO terms entered. After selecting the domains of interest from the list, proteins containing those domains are displayed. Use the Taxonomic selection box to limit the results based on taxonomic ranges.
Pfam domains are stored in the database
The SMART database now contains precomputed results for all Pfam domains. To use Pfam domains in the architecture queries, prepend the domain name with Pfam: (for example: TyrKc AND Pfam:Fz AND TRANS).
Try our SMART Toolbar for the Mozilla web browser.
Changes from version 3.2 to 3.3
Fantastic new protein picture generator
Proteins are now displayed as dynamically generated PNG images. This means that you can download the entire protein representation as a single image. You have our permission to do this and use these diagrams in any way that you like; do acknowledge us, though. All domain bubbles have been script-generated using The Gimp and its Perl-Fu extension. The script is GPL'd and is available for download.
Fancy colour alignments
All alignments we generate are coloured in using Leo Goodstadt's excellent CHROMA program. CHROMA is available for download, and it's great. You may experience some problems if you're using a klunky browser. This will be fixed when you change your browser.
Excellent transmembrane domains
These are now calculated using the fine TMHMM2 program, kindly provided by Anders Krogh and co-workers.
Selective SMART
We now store intrinsic features such as transmembrane domains and signal peptides in our database. This means that these can be queried for (e.g. SIGNAL AND TyrKc). The feature names are: signal peptide, SIGNAL; transmembrane domain, TRANS; coiled coil, COIL.
Technical changes
SMART code has been modified to run under the Apache mod_perl module. SMART is now using Apache::DBI for persistent database connections. The database engine has been updated to PostgreSQL 7.1. These changes resulted in significant speed improvements.
Bug fixes, as usual
The show_many_proteins script now uses the POST method, so there is no longer a limit on the number of proteins you can display. Fixed a display problem with proteins having thousands of different representations (if you try to display those, make sure you have a good browser, like Mozilla or Opera, and a bunch of RAM).
Changes from version 3.1 to 3.2
Start page
There is now an indicator of the current database status in the header. If the database is down or is being updated, you'll know immediately.
Literature
Changed literature identifiers to PMID. New secondary literature generator that parses all neighbouring papers, not just the first 100.
Numerous small bug fixes and improvements.
Changes from version 3.0 to 3.1
Startup page
The start page now includes selective SMART and allows searching for keywords in the annotation of domains.
Schnipsel Blast
The results of a schnipsel Blast search are included in the bubble diagram. Additionally, PDB is searched.
Taxonomic breakdown
When selecting multiple proteins, e.g. via selective SMART or the annotation pages, an overview of the taxonomy of all species (tax break) is offered.
Links
Links don't contain version numbers. This allows stable links from external sources.
Selective SMART
Allows searching for multiple copies of domains and is case insensitive. You can now search for, e.g., Sh3 AND sH3 AND sh3.
Annotation
You can align your query sequence to the SMART alignment using hmmalign.
Update of underlying database
SMART now uses PostgreSQL 6.5.2.
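The selective SMART behaviour described above (case-insensitive domain names, with a repeated name requiring that many copies) can be illustrated with a short sketch. It handles AND-only queries and is not SMART's actual query parser; the function name and the example architectures are hypothetical.

```python
from collections import Counter


def matches_query(query, architecture):
    """Check an AND-only, case-insensitive architecture query.

    query: string such as "Sh3 AND sH3 AND sh3"; repeating a name asks for
           at least that many copies of the domain.
    architecture: list of domain and feature names found in one protein.
    """
    wanted = Counter(term.strip().lower() for term in query.split(" AND "))
    present = Counter(name.lower() for name in architecture)
    return all(present[name] >= copies for name, copies in wanted.items())


three_sh3 = matches_query("Sh3 AND sH3 AND sh3", ["SH3", "SH2", "SH3", "SH3"])
two_sh3 = matches_query("Sh3 AND sH3 AND sh3", ["SH3", "SH3"])
signal_kinase = matches_query("SIGNAL AND TyrKc", ["SIGNAL", "TRANS", "TyrKc"])
```

Here the first and third queries match, while the second fails because the protein carries only two SH3 copies.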
Changes from version 2.0 to 3.0
Digest output
SMART now only produces a single diagram representing a best interpretation of all the annotation that has been performed. A comprehensive summary of the results is also provided in table format.
Selective SMART
Selective SMART allows you to look for proteins with combinations of specific domains in different species or taxonomic ranges.
Alert SMART
The SMART database gets updated about once a week. If you are interested in specific domains, or combinations of them, in specific taxonomic ranges, you can use the SMART alerting service. This provides the identities of newly-deposited proteins that match your query.
Domain queries
You can ask for proteins having the same domain order or composition as your query protein.
SMARTed Genomes
You can get the result of SMARTing the genomes of Caenorhabditis elegans and Saccharomyces cerevisiae via the annotation pages.
Faster PFAM searches
The PFAM searches now run on a PVM cluster.
Changes from version 1.03 to 2.0
Domain coverage
The original set of signalling domains has now been extended to include extracellular domains.
Default search method is now HMMer
The previously-used underlying methodology of SMART (i.e. SWise) tended to over-extend gaps, leading to problems in defining domain borders. Although a SWise search of the SMART database remains possible (using the "Wise searching SMART database" option available from the Home Page), the default searching method is now HMMer2. This now allows improved statistical estimates (Expectation- or E-values) of the significance of a domain hit.
Complete rewrite of the annotation pages
As with the previous version (1.03), annotation pages provide information concerning domain functions and include hyperlinks to PubMed. However, we now offer automatically-derived data showing the taxonomic range and the predicted cellular localisation of proteins containing the domain in question.
Literature database
In addition to the manually derived literature sources given in version 1.03, we now provide automatically-derived secondary literature from the annotation pages. These are extracted from PubMed and are cross-linked to additional abstracts.
The CATH domain structure database: new protocols and classification levels give a more comprehensive resource for exploring evolution
Lesley H Greene, Tony E Lewis, Sarah Addou, Alison Cuff, Tim Dallman, Mark Dibley, Oliver Redfern, Frances Pearl, Rekha Nambudiry, Adam Reid, Ian Sillitoe, Corin Yeats, Janet M Thornton and Christine A Orengo
WHAT'S NEW
We report the latest release (version 3.0) of the CATH protein domain database (http://www.cathdb.info). There has been a 20% increase in the number of structural domains classified in CATH, up to 86 151 domains. Release 3.0 comprises 1110 fold groups and 2147 homologous superfamilies. To cope with the increases in diverse structural homologues being determined by the structural genomics initiatives, more sensitive methods have been developed for identifying boundaries in multi-domain proteins and for recognising homologues. The CATH classification update is now being driven by an integrated pipeline that links these automated procedures with validation steps that have been made easier by the provision of information-rich web pages summarising comparison scores and relevant links to external sites for each domain being classified. An analysis of the population of domains in the CATH hierarchy and several domain characteristics are presented for version 3.0. We also report an update of the CATH Dictionary of homologous structures (CATH-DHS), which now contains multiple structural alignments, consensus information and functional annotations for 1459 well populated superfamilies in CATH. CATH is directly linked to the Gene3D database, which is a projection of CATH structural data onto ∼2 million sequences in completed genomes and UniProt.
GENERAL INTRODUCTION
The number of new structures being deposited in the Protein Data Bank (PDB) continues to grow at a considerable rate. In addition, structures being targeted by worldwide structural genomics initiatives are more likely to be novel or only very remotely related to domains previously classified. Only 2% of structures currently solved by conventional crystallography or NMR are likely to adopt novel folds (see Figures 1 and 2). A higher proportion of new folds is expected among the structures solved by structural genomics. Although the influx of more diverse structures and subsequent analysis will inform our understanding of how domains evolve, it has resulted in increasing lags between the numbers of structures being deposited and classified. In response to this situation we have significantly improved our automated and manual protocols for domain boundary assignment and homologue recognition.
Figure 1. Annual decrease in the percentage of new structures classified in CATH which are observed to possess a novel fold. The raw data for years 1972–2005 was fit to a single exponential equation by nonlinear regression using Sigma Plot (SPSS Version
Figure 2. Annual proportion of protein structures deposited in the PDB which are classified in CATH, rejected or pending classification. The colour scheme reflects different categories of PDB chains. Black: not accepted by the CATH criteria. Red: unprocessed chains
Significant changes have been implemented in the CATH classification protocol to achieve a more highly automated system. A seamless flow of structures between the constituent programs has been achieved by building a pipeline which integrates web services for each major comparison stage in the classification (see Figure 3). Secondly, completely automatic decisions are now being made for new protein chains with close relatives already assigned in the CATH database. There are two situations that preclude the CATH update process from being fully automated. We rely on expert manual curation for particularly challenging protein domain boundary assignments (DomChop stage) and also for classifications of remote folds and homologues (HomCheck stage). These two manual stages will remain an integral part of the system (Figure 3).
Figure 3. Flow diagram of the CATH classification pipeline. This schematic illustrates the processes involved in classifying newly determined structures in CATH. The CATH update protocol workflow, from new chain to assigned domain, is split into two main processes
In this paper we report our ongoing development of the automated procedures. These critical new features should better enable CATH to keep pace with the PDB (3) and facilitate its development. Key statistics on the domain structure populations and characteristics are also presented.
A REVISED CATH CLASSIFICATION HIERARCHY: CATHSOLID
In order to provide information on the sequence diversity between superfamily members, we have introduced additional levels into the CATH hierarchy. The CATH hierarchical classification scheme now consists of nine levels. Class is derived from secondary structure content, and Architecture describes the gross orientation of secondary structures, independent of connectivity. The Topology level clusters structures into fold groups according to their topological connections and numbers of secondary structures. The Homologous superfamilies cluster proteins with highly similar structures, sequences and/or functions (4,5). The new extension of the CATH classification system now includes five 'SOLID' sequence levels. The S, O, L and I levels further divide domains within the H-level using multi-linkage clustering based on similarities in sequence identity (35%, 60%, 95% and 100%) (see Table 1). The D-level acts as a counter within the I-level and is appended to the classification hierarchy to ensure that every domain in CATH has a unique CATHsolid identification code (see Table 1). Specific details on the nature of the SOLID-levels can be found in the 'General Information' section of the CATH website (http://www.cathdb.info). CATH only includes experimentally determined protein structures with a resolution of 4 Å or better, 40 residues in length or longer, and having 70% or more side chains resolved.
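The numeric cut-offs in this section can be written down as a small sketch. The inclusion test follows the three criteria just quoted; the level lookup simply reports which of the S, O, L and I identity thresholds a single pair of domains satisfies, which is a simplification of multi-linkage clustering (where every pair in a cluster must meet the threshold). All names are hypothetical, and the handling of structures without a resolution value, such as NMR entries, is not covered.

```python
# Sequence identity thresholds (percent) for the S, O, L and I levels.
SOLI_THRESHOLDS = {"S": 35, "O": 60, "L": 95, "I": 100}


def meets_cath_criteria(resolution_angstrom, n_residues, fraction_side_chains):
    """Apply the three inclusion criteria quoted in the text."""
    return (
        resolution_angstrom <= 4.0
        and n_residues >= 40
        and fraction_side_chains >= 0.70
    )


def levels_satisfied(percent_identity):
    """List the sequence levels whose threshold a pair of domains reaches."""
    return [
        level
        for level, threshold in SOLI_THRESHOLDS.items()
        if percent_identity >= threshold
    ]


accepted = meets_cath_criteria(2.1, 159, 0.98)
rejected = meets_cath_criteria(4.5, 159, 0.98)
pair_levels = levels_satisfied(97.0)
```

Two domains that are 97% identical could therefore share S-, O- and L-level clusters but not an I-level cluster.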
Table 1. CATH version 3.0 statistics
DOMAIN BOUNDARY ASSIGNMENTS
We have further improved our automated domain boundary prediction method, CATHEDRAL (4). This is used to search a newly determined multi-domain structure against a library of representative structures from different fold groups in the CATH database to recognise constituent domains. CATHEDRAL performs an initial rapid secondary structure comparison between structures, using graph theory to identify putative fold matches, which are then more carefully aligned using a slower, more accurate dynamic programming method. A new scoring scheme has been implemented which combines information on the class of the domains being compared, their sizes, similarity in the structural environments and the number of equivalent residues. A support vector machine is used to combine the different scores and select the best fold match for each putative domain in a new multi-domain protein.
Benchmarking against a set of 964 'difficult' multi-domain chains, whose 1593 constituent domains were remotely related to folds in CATH (<35% sequence identity) and originated from 245 distinct fold groups and 462 superfamilies, showed that 90% of domains within these chains could be assigned to the correct fold group, and for 78% of them the domain boundaries were within ±15 residues of boundaries assigned by careful manual validation. Larger variations in domain boundaries are often due to the fact that in many families significant structural variation can occur during evolution, so that distant relatives vary considerably in size. If no close relative has been classified in CATH, it is likely that the only CATHEDRAL match will be to a relative with significant structural embellishments, thus making it harder to determine the correct boundaries.
Since domain boundary assignment of remote homologues is one of the most time-consuming stages in the classification, we combine multiple sources of information for each new structure on a web page to guide manual curation. Pages display scores from a range of algorithms, which include structure based methods, CATHEDRAL (4), SSAP (6,7), DETECTIVE (8), PUU (9) and DOMAK (10); sequence based methods such as hidden Markov Models (HMMs) (11); and relevant literature. These pages are now viewable for information on putative boundaries for new multi-domain structures currently being classified in CATH (e.g. http://www.cathdb.info/cgi-bin/cath/Chain.pl?chain_id=2g3aA).
For protein chains which are closely related to chains that are already chopped in CATH, an automated protocol has been developed (ChopClose). ChopClose identifies any previously chopped chains that have sufficiently high sequence identity and overlap with the query chain. Using SSAP (6), the query is aligned against each of these chains in turn, and in each case the domain boundaries are inherited across the alignment. The process of inheriting the boundaries often requires some adjustments to be made to account for insertions, deletions or unresolved residues. If the inheritance from one of the chains meets various criteria (SSAP score ≥80, sequence identity >80%, RMSD ≤6.0 Å, longest end extension ≤10 residues, etc.), then the resulting boundaries are used to chop the chain automatically (AutoChop). For cases where ChopClose's best result does not meet all the criteria for automatic chopping, it is provided as support information for a manual domain boundary assignment. Refer to Figure 3 for the location of AutoChop/ChopClose within the CATH update protocol.
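The AutoChop decision can be sketched as a threshold check over the candidate inheritances. Only the four criteria named in the text are encoded; the text's "etc." indicates there are further criteria that are not listed, so this is an incomplete illustration. The class, the field names and the choice of the highest SSAP score as the "best" fallback are assumptions.

```python
from dataclasses import dataclass


@dataclass
class Inheritance:
    """Result of inheriting boundaries from one previously chopped chain."""
    source_chain: str
    ssap_score: float
    sequence_identity: float      # percent
    rmsd: float                   # in angstroms
    longest_end_extension: int    # in residues


def passes_autochop(result):
    """Check the four criteria named in the text (others exist but are unlisted)."""
    return (
        result.ssap_score >= 80
        and result.sequence_identity > 80
        and result.rmsd <= 6.0
        and result.longest_end_extension <= 10
    )


def decide(results):
    """Return ("AutoChop", result) or ("manual", best_available_result)."""
    for result in results:
        if passes_autochop(result):
            return "AutoChop", result
    # Assumption: rank fallbacks by SSAP score to pick the support information.
    best = max(results, key=lambda r: r.ssap_score, default=None)
    return "manual", best


# Illustrative candidates with made-up chain names and scores.
candidates = [
    Inheritance("chainA", 78.0, 92.0, 2.0, 3),
    Inheritance("chainB", 91.5, 88.0, 1.4, 6),
]
decision, chosen = decide(candidates)
```

With these invented values the second candidate passes all four checks, so the chain would be chopped automatically using boundaries inherited from it.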
NEW HOMOLOGUE RECOGNITION METHODS
We have assessed a number of HMM based protocols for improving homologue recognition. A new protocol (Samosa), exploiting models built using multiple structure alignments to improve accuracy, gives some improvements in sensitivity (4–5%). However, a protocol exploiting an 8-fold expanded HMM library based on sequence relatives of structural domains gives an increase of nearly 10% in sensitivity (12). In addition, HMM–HMM-based approaches have been implemented using the PRC protocol of Madera and co-workers (http://supfam.org/PRC). These allow recognition of extremely remote homologues, some of which are not easily detected by the structure comparison methods (discussed further below). HMM based database scans developed for the CATH classification protocol are collectively referred to as HMMscan below.
For some very remotely related homologues, confidence in an assignment can be improved by combining information from multiple prediction methods. We have investigated the benefits of using machine learning methods to do this automatically. A neural network was trained using a dataset of 14 000 diverse homologues (<35% sequence identity) and 14 000 non-homologous pairs, with data from different homologue comparison methods including structure comparison (CATHEDRAL, SSAP), sequence comparison (HMM–HMM) and information on functional similarity. The latter was obtained by comparing EC classification codes between close relatives of the distant homologues and using a semantic similarity scoring scheme for comparing GO terms, based on a method developed by Lord et al. (13). On a separate validation set of 14 000 homologous pairs and 14 000 non-homologous pairs, 97% of the homologues can be recognised at an error rate of <4%.
NEW UPDATE PROTOCOL
Automatic methods
Previously, CATH data was generated using a group of independent programs and flat files. Over the past two years we have developed an update protocol for CATH that is driven by a suite of programs with a central library and a PostgreSQL database system. A classification pipeline has been established which links, in a completely automated fashion, the different programs that analyse the sequences and structures of both protein chains and domains. The CATH update protocol can essentially be divided into two parts: domain boundary assignment and domain homology classification (see Figure 3). The aim of the protocol is to minimise manual assignment and provide as much support as possible when manual validation is necessary.
Processing is similar for both parts of the classification protocol, requiring related meta-data and the triggering of the same automated algorithms. Methods include pairwise sequence similarity comparisons and scans by other homologue detection or fold recognition algorithms, such as HMMscan and CATHEDRAL, that provide data for either manual or automated assignment. Many of the automated steps in the protocol have been established as a web service, and the pipeline integrates the automated steps together with 'holding stages' in which domains are held prior to processing and await the completion of manual validation of predictions (see below).
Web pages to support manual validation
For each manual stage in the classification (domain boundary assignment, DomChop, and homologue recognition, HomCheck) we have developed a suite of web pages bringing together all available meta-data from prediction algorithms (e.g. DBS, CATHEDRAL and HMMscan for DomChop; CATHEDRAL and HMMscan for HomCheck), information from the literature, and information from other family classifications with relevant data (e.g. Pfam). For each protein or domain shown on the pages, information on the statistical significance of matches is presented. The web pages will shortly be made viewable and will provide interim data on protein chains and domains not fully classified in CATH, for biologists interested in any entries pending classification.
OVERVIEW OF THE CURRENT RELEASE (VERSION 3.0)
Assigning domain boundaries and relationships between protein structures is computationally challenging. Since the last CATH release (version 2.6), the number of domains in the CATH database has increased by 20% in version 3.0 and now totals 86 151. This is a more than 10-fold increase in the number of domains classified in CATH since its creation. Improvements in automation, and also in the web based resources used to aid manual validation, have allowed us to increase the proportion of hard-to-classify structures processed in CATH, and this is reflected in a significant increase in the proportion of new folds in the database, now more than 1000. The detailed breakdown of numbers of domains in the nine CATH levels is given in Table 1. We conducted an analysis of the domains in version 3.0 and have derived statistics for several fundamental features.
Percentage of new topologies
An analysis of the percentage of new folds arising from the early 1970s to the present is shown in Figure 1. The number of new folds has been decreasing over time with respect to the number of new structures being deposited, and it can be seen that currently approximately 2% of new structures classified in CATH are observed to be novel folds. For comparison, the number of domain structures solved over time is also graphically represented in Figure 1.
Number of domains within a protein chain
Integral to the construction of the CATH database is designating domain boundaries. We conducted an analysis of the number of chains versus the number of domains in a chain. It is interesting to note that 64% of all protein structures currently solved and classified in CATH are single domain chains (data not shown). The next most prevalent are two domain chains (27%), and following this we find that the number of chains containing three or more domains rapidly decreases. The average size of the single domain chains is 159 residues.
The CATH Dictionary of Homologous Superfamilies (CATH-DHS)
The CATH-DHS has also been recently updated. Data on structural similarity and superfamily variability is presented as a significant update to the Dictionary of Homologous Superfamilies (DHS) web-resource (14). The DHS also provides functional annotations of domains within each H-level (superfamily) in CATH v 2.5.1.
For each superfamily, pair-wise structural similarity scores between relatives, measured by SSAP, are presented. The DHS now contains 3307 multiple structural alignments for 1459 superfamilies. For each superfamily, multiple alignments are generated for all the relatives and also for subgroups of structurally similar relatives and sequence similar relatives. Alignments are performed using the residue-based CORA algorithm (15) and presented both as CORAPLOTS (14) and in the form of a 2DSEC diagram (16), alongside co-ordinate data of the superposed structures in PDB format. Sequence representations of the alignments are available to download in FASTA format. In the CORAPLOT images of the multiple alignment, residues in each domain are coloured according to ligand binding and residue type. EquivSEC plots are also shown that describe the variability in orientation and packing between equivalent secondary structures (16).
To identify sequence relatives for CATH superfamilies, sequences from UniProt (17) were scanned against HMMs of all CATH domains (12). Homologous sequences were identified as those hits with an E-value < 0.01 and a 60% residue overlap with the CATH domain. This protocol recognised over one million domain sequences in UniProt which could be integrated in the CATH-DHS. The harvested sequences in each superfamily were compared against other relatives by BLAST (18) to determine the pair-wise sequence identity and then clustered at appropriate levels of sequence identity (35% and 95%) using multi-linkage clustering. Information and links to other functional databases, ENZYME (19), GO (Gene Ontology Consortium 2000), KEGG (20), COG (21) and SWISSPROT (22), are also included by BLASTing the sequences from each superfamily against sequences provided by these resources. Only hits with 95% sequence identity and an 80% residue overlap were used to annotate sequences.
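The two filters in this paragraph reduce to simple threshold tests. The sketch below assumes that "residue overlap" means the fraction of the reference (CATH domain or annotated sequence) covered by the match, and that both identity and overlap thresholds are inclusive; the text does not state either point, and the function names are hypothetical.

```python
def is_homologous_hit(e_value, matched_residues, domain_length):
    """HMM hit filter: E-value < 0.01 and at least 60% overlap with the domain."""
    overlap = matched_residues / domain_length
    return e_value < 0.01 and overlap >= 0.60


def can_transfer_annotation(percent_identity, matched_residues, reference_length):
    """BLAST annotation filter: 95% identity and 80% residue overlap."""
    overlap = matched_residues / reference_length
    return percent_identity >= 95 and overlap >= 0.80


hit_kept = is_homologous_hit(1e-5, 120, 159)        # about 75% of the domain covered
hit_dropped = is_homologous_hit(0.5, 150, 159)      # E-value too weak
annotated = can_transfer_annotation(97.0, 140, 159)
```

Note that the annotation filter is much stricter than the homology filter, so many sequences assigned to a superfamily would carry no transferred functional annotation.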
Recent analysis of structural and functional divergence in highly populated CATH superfamilies (>5 structural relatives with <35% sequence identity) has been undertaken using data from the DHS. The 2DSEC algorithm was used to analyse multiple structural alignments of families and identify highly conserved structural cores and secondary structure embellishments, or decorations, to the common core. In some large superfamilies extensive embellishments were observed outside the core, and although these secondary structure insertions were frequently discontinuous in the protein chain, they were often co-located in 3D space (16). In many cases manual inspection revealed that the embellishment had aggregated to form a larger structural feature that was modifying the active site of the domain or creating new surfaces for domain or protein interactions. Data collected in the DHS clearly shows a relationship between structural divergence within a superfamily, sequence divergence of this superfamily amongst predicted domains in the genomes, and the number of distinct functional groups that can be identified for the superfamily (see Figure 4).
Figure 4. Relationship between sequence variability, structural variability and functional diversity in CATH superfamilies. Structural variation in a CATH superfamily as measured by the number of diverse structural subgroups (SSAP score < 80 between groups).
LATERAL LINKS ACROSS THE CATH HIERARCHY TO CAPTURE EVOLUTIONARY DIVERGENCE AND EXPLORE THE STRUCTURAL CONTINUUM
Our analysis of structural divergence in CATH superfamilies (16) has revealed families where significant changes in the structures had occurred. In some cases 5-fold differences in the sizes of domains were identified, and sometimes it was apparent that the 'folds' of these very diverse relatives had effectively changed. Therefore, in these superfamilies more than one fold group can be identified, effectively breaking the hierarchical nature of the CATH classification, which implies that each relative within a CATH homologous superfamily should belong to the same CAT fold group.
In addition, an 'all versus all' HMM-HMM scan between all superfamily representatives revealed several cases of extremely remote homologues which had been classified into separate superfamilies and yet match with significant E-values. Structure comparison had failed to detect the relationship between these superfamilies because the structural divergence of the relatives was so extreme, sometimes constituting a change in architecture as well as fold group. In these cases homology was only suggested by the HMM-based scans and then manually validated by considering functional information and detailed evidence from the literature. In order to capture information on these distant homologies, links have been created between the superfamilies, both on our web pages and in the CATH database. The data can now be found as a link from the CATH homepage (http://www.cathdb.info).
In the near future we also plan to provide web pages presenting cases of significant structural overlaps between superfamilies or fold groups. For these cases we are not currently able to find any additional evidence to support a distant evolutionary relationship, and these examples highlight the recurrence of large structural motifs between some folds and the existence of a structural continuum in some regions of fold space.
FEATURES
The CATH database can be accessed at http://www.cathdb.info. The web interface may be browsed or, alternatively, searched with PDB codes or CATH domain identifiers. There is also a facility for keyword searches. With the version 3.0 release we now make the raw and processed data files available, which include, for example, CATH domain PDB files, sequences and DSSP files, and they can be accessed through the CATH database main page. The Gene3D resource can be accessed through the CATH database or directly at http://www.cathdb.info/Gene3D. The DHS can be accessed through the CATH database or directly at http://www.cathdb.info/bsm/dhs.
Structural similarity is assessed using an automatic method (SSAP) (3,4), which scores 100 for identical proteins and generally returns scores above 80 for homologous proteins. More distantly related folds generally give scores above 70 (Topology or fold level), though in the absence of any sequence or functional similarity this may simply represent examples of convergent evolution, reinforcing the hypothesis that there exists a limited number of folds in nature (5,6).
Abstract
We report the latest release (version 1.4) of the CATH protein domains database (http://www.biochem.ucl.ac.uk/bsm/cath). This is a hierarchical classification of 13 359 protein domain structures into evolutionary families and structural groupings. We currently identify 827 homologous families in which the proteins have both structural similarity and sequence and/or functional similarity. These can be further clustered into 593 fold groups and 32 distinct architectures. Using our structural classification and associated data on protein functions stored in the database (EC identifiers, SWISS-PROT keywords and information from the Enzyme database and literature), we have been able to analyse the correlation between 3D structure and function. More than 96% of folds in the PDB are associated with a single homologous family. However, within the superfolds, three or more different functions are observed. Considering enzyme functions, more than 95% of clearly homologous families exhibit either single or closely related functions, as demonstrated by the EC identifiers of their relatives. Our analysis supports the view that determining structures, for example as part of a 'structural genomics' initiative, will make a major contribution to interpreting genome data.
Introduction FOR CATH
The CATH classification of protein domain structures was established in 1993 (1) as a hierarchical clustering of protein domain structures into evolutionary families and structural groupings, depending on sequence and structure similarity. There are four major levels, corresponding to protein class, architecture, topology or fold, and homologous family (Fig. 1). Since 1995, information about these structural groups and protein families has been accessible over the Web (http://www.biochem.ucl.ac.uk/bsm/cath), together with summary information about each individual protein structure (PDBsum) (2).
CATH consists of both phylogenetic and phenetic descriptors for protein domain relationships. At the lowest levels in the hierarchy, proteins are grouped into evolutionary families (homologous families) for having either significant sequence similarity (≥35% identity) or high structural similarity and some sequence similarity (≥20% identity).
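The grouping rule just described can be written as a small predicate. This is a minimal sketch rather than the CATH protocol: the function name is hypothetical, the bare numbers are assumed to be percentages, and "high structural similarity" is assumed to mean an SSAP score of at least 80, the figure the notes quote elsewhere for homologous proteins.

```python
def is_homologous_pair(seq_identity, ssap_score, ssap_cutoff=80.0):
    """Return True if two domains would fall in the same homologous family.

    seq_identity -- pairwise sequence identity in percent
    ssap_score   -- SSAP structural similarity score (0-100)
    ssap_cutoff  -- assumed threshold for 'high structural similarity'
    """
    # Route 1: significant sequence similarity on its own.
    if seq_identity >= 35.0:
        return True
    # Route 2: high structural similarity plus some sequence similarity.
    return ssap_score >= ssap_cutoff and seq_identity >= 20.0


print(is_homologous_pair(40.0, 50.0))  # sequence identity alone is enough
print(is_homologous_pair(25.0, 85.0))  # structure plus moderate identity
print(is_homologous_pair(10.0, 95.0))  # structure alone is not enough
```

Pairs that fail both routes but still share a fold would be grouped only at the topology level.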
The CATH database of protein structures contains approximately 18000 domains organized according to their (C)lass, (A)rchitecture, (T)opology and (H)omologous superfamily. Relationships between evolutionarily related structures (homologues) within the database have been used to test the sensitivity of various sequence search methods in order to identify relatives in GenBank and other sequence databases. Subsequent application of the most sensitive and efficient algorithms, gapped BLAST and the profile-based method Position-Specific Iterated Basic Local Alignment Search Tool (PSI-BLAST), could be used to assign structural data to between 22% and 36% of microbial genomes in order to improve functional annotation and enhance understanding of biological mechanism. However, on a cautionary note, an analysis of functional conservation within fold groups and homologous superfamilies in the CATH database revealed that whilst function was conserved in nearly 55% of enzyme families, function had diverged considerably in some highly populated families. In these families, functional properties should be inherited far more cautiously, and the probable effects of substitutions in key functional residues must be carefully assessed.
Figure 1. Schematic representation of the (C)lass, (A)rchitecture and (T)opology/fold levels in the CATH database.
Figure 2. Snapshot of a web page showing data available in the CATH dictionary of homologous superfamilies for the subtilisin family (CATH id 3.40.50.200). Tables display the PDB codes for non-identical relatives in the family, together with EC identifier codes and information about the enzyme reactions. The multiple structural alignment shown has been coloured according to secondary structure assignments (red for helix, blue for strands).
The Architecture level in CATH groups proteins whose folds have similar 3D arrangements of secondary structures (e.g. barrel, sandwich or propeller), regardless of their connectivity, whilst the top level, Class, simply reflects the proportion of α-helix or β-strand secondary structures. Three major classes are recognised, mainly-α, mainly-β and α-β, since analysis revealed considerable overlap between the α+β and alternating α/β classes originally described by Levitt and Chothia (7).
Before classification, multidomain proteins are first separated into their constituent folds using a consensus method which seeks agreement between three independent algorithms (8). Whilst the protocol for updating CATH is largely automatic (9), several stages require manual validation, in particular establishing domain boundaries in proteins for which no consensus could be reached and checking the relationships of very distant homologues and proteins having borderline fold similarity. Although there are plans to assign the more regular architectures automatically, all architecture groupings are currently assigned manually.
A homologous family Dictionary is now available within CATH, which contains functional data, where available, for each protein within a homologous family. This includes EC identifiers, SWISS-PROT keywords and information from the Enzyme database or the literature (Fig. 2). Multiple structure-based alignments are also available, coloured according to secondary structure assignments or residue properties, and there are schematic plots showing domain representations annotated by protein-ligand interactions (DOMPLOTS) (A.E. Todd, C.A. Orengo and J.M. Thornton, submitted to Protein Engng). The topology of each domain is illustrated by schematic TOPS diagrams (http://www3.ebi.ac.uk/tops) (10).
Figure 3. CATH wheel plot showing the population of homologous families in different fold groups, architectures and classes. The wheel is coloured according to protein class (red, mainly-α; green, mainly-β; yellow, αβ; blue, few secondary structures). The size of the outer wheel represents the number of homologous families in CATH, whilst each band in the outer wheel corresponds to a single fold family. The size of each fold 'band' therefore reflects the number of homologous families having that fold. It can be seen that most fold families contain a single homologous family. The superfold families are shown as paler bands containing many homologous families. The inner wheel shows the population of homologous families in the different architectures.
We have also recently set up a Web Server (11) which enables the user to scan the CATH database with a newly determined protein structure and identify possible fold similarities or evolutionary relationships. There are also plans to incorporate sequence searches (using BLAST or PSI-BLAST) (12) to identify a probable fold for a new sequence.
The latest release of CATH (version 1.4, April 1998) contains 9342 protein chains from the PDB (13), which divide into 13 359 domain folds. Currently 32 different architectures are recognised. Since the last release, three new architectures have been described, including the five-bladed α-β propeller. Grouping proteins on the basis of sequence, structure and functional similarity gives 827 evolutionary homologous families (H-level), whilst recognising more distant structural similarity with no accompanying sequence or function similarity gives rise to 593 different fold groups (T-level).
The population of the different levels in the CATH hierarchy is illustrated by the CATH wheel shown in Figure 3. It can be seen that several highly populated fold families, which we describe as superfolds (6) as they support a diverse range of sequences and more than three different functions, still account for nearly 30% of non-homologous structures.
Implications for Structural Genomics
As the sequence databases grow rapidly, the need to interpret these sequences and assign functions to specific genes becomes increasingly important. Many techniques exist for matching protein sequences and thereby inheriting functional information. However, for very distant homologues there is often no detectable sequence similarity despite conservation of 3D structure and function. For these cases, evolutionary relationships, and thereby functions, can only be assigned by comparing the structures. Therefore a number of structural genomics initiatives are being proposed (14) which aim to identify all the folds in nature, with the ultimate goal of being able to predict the function of a new protein from its known or probable structure. The important questions to ask are: how many more folds do we need to determine before we have the complete set, and how confident can we be in assigning function between proteins having similar structures?
In the current genomes, on average only 30-46% of sequences can be assigned to a structural family by recognising sequence similarity to a protein of known structure (15,16). With only ~600 unique structures currently in the PDB, compared with ~20 000 sequence families, it is clear that we still need to determine many more structures if we are to understand biology at the molecular level. However, analysis of recently deposited structural data is very revealing. Figure 4a illustrates the distribution of 2159 new structural domains classified in the 10 months from June 1997 to March 1998. A large proportion of these (79%) were clearly homologous (≥30% identity) to proteins of known structure.
Of the remaining 443 structures (Fig. 4b), corresponding to 'new' sequences, we found only 8% were novel folds, the remainder resembling a previously determined structure. Many of these, 199 (45%), could be identified as clear homologues by having significant structure and sequence similarity (SSAP ≥80 and ≥20% sequence identity). A further 169 (38%) were probable homologues: although the sequence identity was below 20%, they had functional similarity and/or gave significant scores using sequence search methods designed to detect very distant homologues (PSI-BLAST) (12). There remained a further 40 (9%) proteins which were analogous, i.e. they had the same fold as a previous entry but neither the sequence nor the function gave definite evidence of a common ancestor.
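The breakdown of the 443 'new sequence' structures can be checked with a little arithmetic. The sketch below assumes the bare numbers in parentheses are percentages of 443; the novel-fold count is derived here as the remainder and is not stated explicitly in the notes.

```python
total = 443  # structures with 'new' sequences (Fig. 4b)

# Counts quoted in the text.
counts = {
    "clear homologues": 199,
    "probable homologues": 169,
    "analogues": 40,
}
# Derived, not quoted: whatever is left must be the novel folds.
counts["novel folds"] = total - sum(counts.values())

# Express each category as a whole-number percentage of the 443 structures.
percentages = {name: round(100 * n / total) for name, n in counts.items()}

for name, n in counts.items():
    print(f"{name}: {n} ({percentages[name]}%)")
```

The derived remainder of 35 structures rounds to 8%, which agrees with the novel-fold figure, and the two homologue categories together give the 83% quoted later for structures assignable as homologues.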
At the homologous superfamily level in CATH, a more detailed analysis of enzyme functions showed that the majority of homologous enzyme families in CATH (>90%) contained proteins for which the first three EC identifiers were the same. Considering those families where homologues have significant sequence identity (≥20%) after structural alignment, 95% were found to have a single EC identifier, whilst for families where proteins have more than 30% sequence similarity we observed that 98% had a single EC code.
Although assigning function on the basis of homology is common practice, it is clear that some caution should be exercised, particularly where there is little or no sequence similarity. There are also some clear examples where homologues with significant sequence similarity perform different functions. The role of 'gene recruitment' is especially clear in the eye lens proteins, which function as enzymes in other cellular environments but which are used as structural proteins in this context (17). The extent of such 'gene recruitment' and context-sensitive function is really not known at this time. For enzymes, it is clear that catalytic function can change and evolve, usually to act on a different but related substrate. Similarly, within the lipocalin family (CATH id 2.40.130.10), several proteins are found with very similar structures which bind different fatty acids in the same region at the base of the β-barrel (e.g. retinol, bilin, biotin).
Nearly half of the homologous families where two or more different EC numbers were observed belong to the superfolds. This suggests that if a new protein is assigned to a superfold family, more caution should be used when inheriting functional information, as there appears to be greater tolerance to changes in sequence, and ultimately function, for these families. However, it is interesting to note that many of these were TIM barrel or Rossmann folds. These are superfolds in which the substrate or ligand commonly binds in the same place: in the base of the β-barrel for the TIMs, and at the crossover of the polypeptide chain for the doubly wound Rossmann structures.
Assignment of Function Through Structure
One of the reasons for determining structures is to derive more information to facilitate the assignment of function. From our analysis of proteins in CATH, we suggest that structural data can help to assign function in several ways:
i. The structural data allow recognition of more distant homologues compared with sequence data. In our analysis, 83% of structures with novel sequences could be assigned as homologues in this way (note that such assignment of function is again subject to the caveats imposed by 'gene recruitment' discussed above).
ii. The structural data allow detailed inspection of the functional site, to suggest if and how the function may have evolved. For example, if an enzyme has evolved to act on a different substrate, the binding site may reveal, or at least suggest, possible changes in the substrate.
iii. For the superfolds, similarity of structure does not necessarily mean similarity of function. However, the active site/binding sites are often conserved, e.g. in the TIM barrel or Rossmann fold structures the ligand always binds at the same end of the barrel or sheet.
iv. Some methods have already been developed, and will increasingly be the focus of attention over the next few years, which aim to predict function ab initio from structure. For example, enzymes can often be identified by the presence of a major cleft, which also locates the active site (18). Similarly, critical surface patches which are used for molecular recognition in binding other proteins or ligands may be identified using knowledge-based approaches (19,20).
In summary, extrapolating the data from Figure 4 to a new genome, we can expect that of the 54-70% of sequences which currently have no obvious sequence matches in the PDB, we will find nearly 80-90% to be homologous to a known family using the structural data alone. For the singlet folds this will almost certainly reveal some clues to the function. For the superfolds, some folds will reveal information on the functional class (e.g. enzyme for TIM barrels) or the location of the active site, if not the specific function. Only 10-20% will be expected to be 'novel' folds. For these, the ab initio methods referred to above may provide some clues to guide experiments. Therefore it is clear that determining structures, as part of a 'structural genomics' initiative for example, will make a major contribution to interpreting genome data.
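To see what this extrapolation means for a whole genome, the quoted ranges can be multiplied through. The sketch below is only illustrative arithmetic: it assumes the bare numbers are percentages, and it pairs the low ends with each other and the high ends with each other, a simplification the notes do not spell out.

```python
# Share of genome sequences with no obvious sequence match in the PDB (percent).
unmatched_low, unmatched_high = 54, 70
# Share of those expected to be homologous to a known family by structure (percent).
homolog_low, homolog_high = 80, 90
# Share of those expected to be 'novel' folds (percent).
novel_low, novel_high = 10, 20

# Convert to percentages of the whole genome.
genome_homolog_low = unmatched_low * homolog_low / 100
genome_homolog_high = unmatched_high * homolog_high / 100
genome_novel_low = unmatched_low * novel_low / 100
genome_novel_high = unmatched_high * novel_high / 100

print(f"rescued by structure: {genome_homolog_low:.1f}-{genome_homolog_high:.1f}% of the genome")
print(f"novel folds:          {genome_novel_low:.1f}-{genome_novel_high:.1f}% of the genome")
```

Under these assumptions, novel folds would amount to roughly 5-14% of all sequences in a genome.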
JAIL
Just Another Interface Library
Interfaces of macromolecules are a valuable basis for analysing the process of molecular recognition. JAIL classifies not only the interfaces between domain architectures but also those between protein chains and those between proteins and nucleic acids.
Figure: Gt-alpha/Gi-alpha chimera (PDB-ID 1GOT) and the interfaces of 1GOT.
Interacting proteins are difficult to crystallize and are rarely present within the Protein Data Bank. Nevertheless, it is essential to analyse the interacting parts of the proteins to understand the process of protein-protein docking.
To overcome this problem we have built the JAIL database. Since interacting domains exhibit structural features similar to those of proteins, all known interfaces between interacting domains of the SCOP database were extracted and classified in JAIL.
Only a part of all protein structures is included in SCOP. In particular, new PDB entries are not yet annotated. To overcome this problem, all interfaces between protein chains were additionally calculated and included in the database. This type of interface also comprises the interacting parts of the assumed biological units. The last important type of interface provided here is composed of the interacting parts between proteins and nucleic acids. Overall, the data set consists of about 180000 interfaces.
JAIL is a convenient tool for browsing the interface library and analysing single interfaces. However, more general questions require large-scale analysis. For this purpose, a detailed form enables the compilation of comprehensive non-redundant data sets for download.
How is an interface defined?
A complete residue is part of an interface if at least one atom of the amino acid is located within a range of 4.5 Angstrom of any atom of the interacting domain or chain. One part of an interface must consist of at least 5 C-alpha atoms in the case of protein chains. In the case of nucleic acids, the relevant atoms are calculated on the basis of the phosphorus atoms of the RNA/DNA backbone.
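The distance criterion above can be sketched in plain Python. This is a hedged illustration, not JAIL's own code: the `Atom` tuple, the function names and the toy coordinates are hypothetical, the cutoff is assumed to be 4.5 Angstrom, and the "at least 5 C-alpha atoms" rule is approximated as at least five interface residues, since each residue carries one C-alpha atom.

```python
import math
from collections import namedtuple

# Hypothetical minimal atom record: owning residue id plus coordinates in Angstrom.
Atom = namedtuple("Atom", ["residue_id", "x", "y", "z"])


def interface_residues(side_a, side_b, cutoff=4.5):
    """Return ids of residues on side_a with any atom within `cutoff` of any atom on side_b."""
    found = set()
    for atom in side_a:
        if atom.residue_id in found:
            continue  # the whole residue already counts as part of the interface
        for other in side_b:
            distance = math.dist((atom.x, atom.y, atom.z), (other.x, other.y, other.z))
            if distance <= cutoff:
                found.add(atom.residue_id)
                break
    return found


def is_valid_interface_side(residue_ids, min_residues=5):
    """Approximate the '5 C-alpha atoms' rule as a minimum residue count."""
    return len(residue_ids) >= min_residues


# Toy coordinates, for illustration only.
chain_a = [Atom(1, 0.0, 0.0, 0.0), Atom(1, 1.5, 0.0, 0.0), Atom(2, 10.0, 0.0, 0.0)]
chain_b = [Atom(101, 4.0, 0.0, 0.0)]
side_a = interface_residues(chain_a, chain_b)
print(side_a, is_valid_interface_side(side_a))
```

The all-against-all loop is quadratic in the number of atoms; for real structures a spatial index such as a grid or k-d tree would normally replace it.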
What are biounits?
The primary coordinate file deposited in the PDB generally contains one asymmetric unit. The asymmetric unit is the smallest portion of a crystal structure to which crystallographic symmetry can be applied to generate one cell. The biological molecule (biounit) is believed to be the functional unit of the protein. Frequently those units can be assumed or calculated when additional information is available. The biological units of many proteins are deposited in a separate section of the PDB database and can be used for interface calculations.
In which way are redundant interfaces excluded in the download section?
Redundancy is excluded in two different ways: by structure and by sequence. The sequence clustering is based on the CD-HIT program. The structural clustering is defined by the protein families and superfamilies of the SCOP classification. The database classifies proteins by domain architecture.
Which settings in the download section are best for my own research?
The selection of the datasets depends on the type of interactions (protein-protein or protein-nucleic acid) and the level of diversity that is desired. A maximum sequence identity of 50% results in a higher diversity than a setting of 95%. The default settings include interfaces of domain-domain interactions as well as interfaces between interacting chains. All interfaces of chains that were already treated by the SCOP domain interfaces are excluded by default. This procedure results in a high number of interfaces that are still diverse enough for statistical analysis.
What is meant by "show conservation" in Jmol?
The conservation of protein sequences is defined by the mutation rates at each amino acid position. For JAIL this information was retrieved from ConSurf. ConSurf is a derived database merging structural and sequence information.
Database scheme
The JAIL search form accepts the following:
PDB-Id, e.g. 1aay
SCOP-Id, e.g. d1az0a_
EC-Number, e.g. 311
Accession number, e.g. P03697
Protein name, e.g. capsid protein
Searches can be restricted to the following interface types: Domain/Domain (SCOP), Chain/Chain, Protein/Nucleic, Biounit/Biounit, or none.
A full-text keyword search and a SCOP text keyword search are also available, and interfaces can be filtered by location (intra, inter, or don't care).
MMDB
Experimentally resolved structures of proteins, RNA and DNA, derived from the Protein Data Bank (PDB), with value-added features such as explicit chemical graphs, computationally identified 3D domains (compact substructures) that are used to identify similar 3D structures, as well as links to literature, similar sequences, information about chemicals bound to the structures, and more. These connections make it possible, for example, to find 3D structures for homologs of a protein sequence of interest, then interactively view the sequence-structure relationships, active sites, bound chemicals, journal articles, and more.
Three-dimensional structures are now known within many protein families, and it is quite likely, in searching a sequence database, that one will encounter a homolog with known structure. The goal of Entrez's 3D-structure database is to make this information, and the functional annotation it can provide, easily accessible to molecular biologists. To this end, Entrez's search engine provides three powerful features: (i) Sequence and structure neighbors: one may select all sequences similar to one of interest, for example, and link to any known 3D structures. (ii) Links between databases: one may search by term matching in MEDLINE, for example, and link to 3D structures reported in these articles. (iii) Sequence and structure visualization: identifying a homolog with known structure, one may view molecular-graphic and alignment displays to infer approximate 3D structure. In this article we focus on two features of Entrez's Molecular Modeling Database (MMDB) not described previously: links from individual biopolymer chains within 3D structures to a systematic taxonomy of organisms represented in molecular databases, and links from individual chains (and compact 3D domains within them) to structure neighbors, other chains (and 3D domains) with similar 3D structure. MMDB may be accessed at http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Structure
SUPERFAMILY is a database of structural and functional annotation for all proteins and genomes [1][2][3][4][5]. The SUPERFAMILY annotation is based on a collection of hidden Markov models which represent structural protein domains at the SCOP superfamily level [6]. A superfamily groups together domains which have an evolutionary relationship. The annotation is produced by scanning protein sequences from completely sequenced genomes against the hidden Markov models.
For each protein you can:
Submit sequences for SCOP classification
View domain organisation, sequence alignments and protein sequence details
For each genome you can:
Examine superfamily assignments, phylogenetic trees, domain organisation lists and networks
Check for over- and under-represented superfamilies within a genome
For each superfamily you can:
Inspect SCOP classification, functional annotation, Gene Ontology annotation, InterPro abstract and genome assignments
Explore taxonomic distribution of a superfamily across the tree of life
All annotation, models and the database dump are freely available for download to everyone.
Purpose
SUPERFAMILY classifies amino acid sequences into known structural domains especially
into SCOP superfamilies The superfamilies are groups of proteins which have structural
evidence to support a common evolutionary ancestor but may not have detectable
sequence homology
Major Features
Sequence search: Submit your protein or DNA sequence for SCOP superfamily and family level classification.
Keyword search: Search for superfamily, family or species names, plus sequence, SCOP, PDB or hidden Markov model IDs.
Domain assignments: Domain assignments, alignments and architectures for completely sequenced eukaryotic and prokaryotic organisms, plus sequence collections.
Comparative genomics tools: Browse unusual (over- and under-represented) superfamilies and families, adjacent domain pair lists and graphs, unique domain pairs, domain combinations, domain architecture co-occurrence networks and domain distribution across taxonomic kingdoms for each organism.
Genome statistics: For each genome: number of sequences, number of sequences with assignment, percentage of sequences with assignment, percentage total sequence coverage, number of domains assigned, number of superfamilies assigned, number of families assigned, average superfamily size, percentage produced by duplication, average sequence length, average length matched, number of domain pairs and number of unique domain architectures.
Gene Ontology: Domain-centric Gene Ontology (GO), automatically annotated, by Hai Fang.
Phenotype Ontology: Domain-centric phenotype/anatomy ontology, including Disease Ontology, Human Phenotype, Mouse Phenotype, Worm Phenotype, Yeast Phenotype, Fly Phenotype, Fly Anatomy, Zebrafish Anatomy, Xenopus Anatomy, Arabidopsis Plant.
Superfamily annotation: InterPro abstracts for 1052 superfamilies and Gene Ontology (GO) annotation for 763 superfamilies.
Functional annotation: Functional annotation of SCOP 1.73 superfamilies by Christine Vogel.
Phylogenetic trees: Trees are generated using heuristic parsimony methods and are based on protein domain architecture data for all genomes in SUPERFAMILY. Genome combinations or specific clades can be displayed as individual trees.
Similar domain architectures: Find the 10 domain architectures which are most similar to a domain architecture of interest.
Hidden Markov models: Produce SCOP domain assignments for your sequences using the SUPERFAMILY models. HMM visualisation by Martin Madera, eg model 0045110.
Profile comparison: Find remote domain matches when the HMM search fails to find a significant match. Profile comparison (PRC), for aligning and scoring two profile hidden Markov models, by Martin Madera.
Web services: Distributed Annotation Server and linking to SUPERFAMILY.
Downloads: Sequences, assignments, models, MySQL database and scripts, updated weekly.
Recent news
3rd June 2013 genetrainer being launched at LeWeb conference
The SUPERFAMILY database provides protein domain assignments, at the SCOP superfamily level, for the predicted protein sequences in over 400 completed genomes. A superfamily groups together domains of different families which have a common evolutionary ancestor based on structural, functional and sequence data. SUPERFAMILY domain assignments are generated using an expert curated set of profile hidden Markov models. All models and structural assignments are available for browsing and download from http://supfam.org. The web interface includes services such as domain architectures and alignment details for all protein assignments, searchable domain combinations, domain occurrence network visualization, detection of over- or under-represented superfamilies for a given genome by comparison with other genomes, assignment of manually submitted sequences and keyword searches. In this update we describe the SUPERFAMILY database and outline two major developments: (i) incorporation of family level assignments and (ii) a superfamily-level functional annotation. The SUPERFAMILY database can be used for general protein evolution and superfamily-specific studies, genomic annotation, and structural genomics target suggestion and assessment.
Pfam entries are classified in one of four ways:
Family: A collection of related protein regions
Domain: A structural unit
Repeat: A short unit which is unstable in isolation but forms a stable structure when multiple copies are present
Motif: A short unit found outside globular domains
Related Pfam entries are grouped together into clans; the relationship may be defined by similarity of sequence, structure or profile-HMM.
Pfam 27.0 (March 2013, 14831 families)
Pfam also generates higher-level groupings of related families, known as clans. A clan is a collection of Pfam-A entries which are related by similarity of sequence, structure or profile-HMM.
YOU CAN FIND DATA IN PFAM IN VARIOUS WAYS
Analyze your protein sequence for Pfam matches View Pfam family annotation and alignments
7272019 Essential Info Notes-1
httpslidepdfcomreaderfullessential-info-notes-1 2057
See groups of related families
Look at the domain organisation of a protein sequence Find the domains on a PDB structure
Query Pfam by keywords
Go Example
Enter any type of accession or ID to jump to the page for a Pfam family or clan
UniProt sequence PDB structure etc
Browse Pfam
You can use the links below to find lists of families, clans or proteomes which begin with the chosen letter (or number). You can
also see a list of Pfam families which are new to this release, or the list of the twenty largest families in terms of number of sequences.
Glossary of terms used in Pfam
These are some of the commonly used terms in the Pfam website
Alignment coordinates
HMMER3 reports two sets of domain coordinates for each profile HMM match. The envelope coordinates delineate the region on the sequence where the match has been probabilistically determined to lie, whereas the alignment coordinates delineate the region over which HMMER is confident that the alignment of the sequence to the profile HMM is correct. Our full alignments contain the envelope coordinates from HMMER3.
Architecture
The collection of domains that are present on a protein
Clan
A collection of related Pfam entries. The relationship may be defined by similarity of sequence, structure or profile-HMM.
Domain
A structural unit
Domain score
The score of a single domain aligned to an HMM. Note that, for HMMER2, if
there was more than one domain, the sequence score was the sum of all the domain scores for that Pfam entry. This is not quite true for HMMER3.
DUF
Domain of unknown function
Envelope coordinates
See Alignment coordinates
Family
A collection of related protein regions
Full alignment
An alignment of the set of related sequences which score higher than the manually set threshold values for the HMMs of a particular Pfam entry.
Gathering threshold (GA)
Also called the gathering cut-off, this value is the search threshold used to build the full alignment. The gathering threshold is assigned by a curator when the family is built. The GA is the minimum score a sequence must attain in order to belong to the full alignment of a Pfam entry. For each Pfam HMM we have two GA cutoff values: a sequence cutoff and a domain cutoff.
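The two gathering cut-offs can be read as a two-step filter. The following is a minimal Python sketch, assuming (the glossary does not say this explicitly) that a domain hit is kept only when the whole-sequence score reaches the sequence cutoff and the domain's own score reaches the domain cutoff. The function name and the argument layout are hypothetical and are not part of Pfam or HMMER.

```python
def passes_gathering(seq_score, domain_scores, ga_seq, ga_dom):
    """Return the domain scores that survive both GA cutoffs (illustrative only)."""
    # Step 1: the sequence as a whole must reach the sequence cutoff.
    if seq_score < ga_seq:
        return []
    # Step 2: each domain must individually reach the domain cutoff.
    return [score for score in domain_scores if score >= ga_dom]


kept = passes_gathering(seq_score=45.0, domain_scores=[30.0, 15.0],
                        ga_seq=25.0, ga_dom=20.0)
print(kept)
```

Keeping the two cutoffs as separate arguments mirrors the glossary's statement that every Pfam HMM carries both values.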
HMMER
The suite of programs that Pfam uses to build and search HMMs. Since Pfam release 24.0 we have used HMMER version 3 to make Pfam. See the HMMER site for more information.
Hidden Markov model (HMM)
An HMM is a probabilistic model. In Pfam we use HMMs to transform the information contained within a multiple sequence alignment into a position-specific scoring system. We search our HMMs against the UniProt protein database to find homologous sequences.
HMMER3
The suite of programs that Pfam uses to build and search HMMs. See the HMMER site for more information.
iPfam
A resource that describes domain-domain interactions that are observed in PDB entries. Where two or more Pfam domains occur in a single structure, it analyses them to see if they are close enough to form an interaction. If they are close enough, it calculates the bonds forming the interaction.
Metaseq
A collection of sequences derived from various metagenomics datasets
Motif
A short unit found outside globular domains
Noise cutoff (NC)
The bit scores of the highest scoring match not in the full alignment
Pfam-A
An HMM-based, hand-curated Pfam entry which is built using a small number of
representative sequences. We manually set a threshold value for each HMM and search our models against the UniProt database. All of the sequences which score above the threshold for a Pfam entry are included in the entry's full alignment.
Pfam-B
An automatically generated alignment which is formed by taking a cluster of sequences from the ADDA database and removing Pfam-A residues from them. Since Pfam-B families are automatically generated, we recommend that you verify that the sequences in a Pfam-B family are related using other methods, such as BLAST. For Pfam 24.0 we have made HMMs for the first (and therefore largest) 20000 Pfam-B families. Users can search their sequences against the Pfam-B HMMs, in addition to the Pfam-A HMMs, when performing both single-sequence searches and batch searches on the website.
Posterior probability
HMMER3 reports a posterior probability for each residue that matches a match or insert state in the profile HMM. A high posterior probability shows that the alignment of the amino acid to the match/insert state is likely to be correct, whereas a low posterior probability indicates that there is alignment uncertainty. This is indicated on a scale with 10 being the highest certainty, down to 1 being complete uncertainty. Within Pfam we display this information as a heat map view, where green residues indicate high posterior probability and red ones indicate a lower posterior probability.
Repeat
A short unit which is unstable in isolation but forms a stable structure when
multiple copies are present
Seed alignment
An alignment of a set of representative sequences for a Pfam entry. We use this alignment to construct the HMMs for the Pfam entry.
Sequence score
The total score of a sequence aligned to an HMM. If there is more than one domain, the sequence score is the sum of all the domain scores for that Pfam entry. If there is only a single domain, the sequence and the domain scores for the protein will be identical. We use the sequence score to determine whether a sequence belongs to the full alignment of a particular Pfam entry.
Trusted cutoff (TC)
The bit scores of the lowest scoring match in the full alignment
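The trusted and noise cutoffs follow directly from the gathering threshold once the match scores are known. A small Python sketch under the assumption that membership of the full alignment is decided by a single score per match compared against one GA value (real Pfam entries carry separate sequence and domain cutoffs); the function name is hypothetical.

```python
def trusted_and_noise_cutoffs(scores, ga):
    """Derive (TC, NC) from match bit scores and a gathering threshold."""
    in_full = [s for s in scores if s >= ga]    # matches in the full alignment
    out_full = [s for s in scores if s < ga]    # matches left out of it
    tc = min(in_full) if in_full else None      # lowest score that made it in
    nc = max(out_full) if out_full else None    # highest score that did not
    return tc, nc


tc, nc = trusted_and_noise_cutoffs([50.2, 31.0, 27.5, 19.8, 12.1], ga=25.0)
print(tc, nc)
```

By construction the gathering threshold always lies between NC and TC, which is why the three values are usually reported together.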
SMART (a Simple Modular Architecture Research Tool) allows the identification and annotation of genetically mobile domains and the analysis of domain architectures. More than 500 domain families found in signalling, extracellular and chromatin-associated proteins are detectable. These domains are extensively annotated with respect to phyletic distributions, functional class, tertiary structures and functionally important residues. Each domain found in a non-redundant protein database, as well as search parameters and
taxonomic information, are stored in a relational database system. User interfaces to this database allow searches for proteins containing specific combinations of domains in defined taxa.
Features
You can use SMART in two different modes: normal or genomic. The main difference is in the underlying
protein database used. In Normal SMART, the database contains Swiss-Prot, SP-TrEMBL and stable
Ensembl proteomes. In Genomic SMART, only the proteomes of completely sequenced genomes are used:
Ensembl for metazoans and Swiss-Prot for the rest.
The protein database in Normal SMART has significant redundancy, even though identical proteins are
removed. If you use SMART to explore domain architectures, or want to find exact domain counts in various
genomes, consider switching to Genomic mode. The numbers in the domain annotation pages will be more
accurate, and there will not be many protein fragments corresponding to the same gene in the architecture
query results. Remember, though, that you are then exploring a limited set of genomes.
Different color schemes are used to easily identify the mode we are in
Normal mode Genomic mode
Alignment
Representation of a prediction of the amino acids in tertiary structures of homologues that overlay in three dimensions. Alignments held by SMART are mostly based on published observations (see domain annotations for details), but are updated and edited manually.
Alignment block
Ungapped alignments that usually represent a single secondary structure
Bits scores
Alignment scores are reported by HMMer and BLAST as bits scores. The likelihood that the query sequence is a bona fide homologue of the database sequence is compared to the likelihood that the sequence was instead generated by a random model. Taking the logarithm (to base 2) of this likelihood ratio gives the bits score.
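The definition above is a base-2 log-likelihood ratio, which can be written out in a few lines of Python. This is only an illustration of the formula; the function name and the example likelihoods are made up, and real tools compute the two likelihoods from their own models.

```python
import math


def bits_score(likelihood_homologue, likelihood_random):
    """Base-2 logarithm of the ratio between the two model likelihoods."""
    return math.log2(likelihood_homologue / likelihood_random)


# A sequence that is 4 times more likely under the homology model scores 2 bits.
print(bits_score(0.5, 0.125))
```

Each additional bit therefore corresponds to a doubling of the likelihood ratio in favour of homology, and a negative score means the random model explains the sequence better.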
BLAST: Basic local alignment search tool
An excellent database searching tool developed at the National Center for Biotechnology Information (NCBI) ([1] [2] [3] [4]). SMART uses NCBI-BLAST for detection of outlier homologues and homologues of known structure. WU-BLAST is used for nrdb searches with user-supplied sequences.
Cellular Role
Chromatin-associated: Chromatin is the tangled fibrous complex of DNA and protein within a eukaryotic nucleus.
Interaction (with the environment): Molecules that sense cellular environmental change, such as osmolarity, light flux, acidity, ion concentration, etc.
Metabolic: Enzymes that catalyze reactions in living cells that transform organic molecules.
Replication: The process of making an identical copy of a section of duplex (double-stranded) DNA, using existing DNA as a template for the synthesis of new DNA strands.
Signalling: Proteins that participate in a pathway that is initiated by a molecular stimulus and terminated by a cellular response.
Transport: Transporters (permeases) are proteins that straddle the cell membrane and carry specific nutrients, ions, etc. across the membrane.
Translation: The process in which the genetic code carried by messenger RNA directs the synthesis of proteins from amino acids.
Transcription: The synthesis of an RNA copy from a sequence of DNA (a gene); the first step in gene expression.
Coiled coils
Intimately-associated bundles of long alpha-helices ([1] [2] [3]). Coiled coils are detected in SMART using the method of Lupas et al. ([4], COILS home). Coiled coils predictions are indicated on the second line in SMART's graphical output.
Domain
Conserved structural entities with distinctive secondary structure content and a hydrophobic core. In small disulphide-rich and Zn2+-binding or Ca2+-binding domains, the hydrophobic core may be
provided by cystines and metal ions, respectively. Homologous domains with common functions
usually show sequence similarities.
Domain composition
Proteins with the same domain composition have at least one copy of each of the domains of the query.
Domain organisation
Proteins having all the domains of the query in the same order. (Additional domains are allowed.)
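The difference between domain composition and domain organisation is easiest to see as two predicates over ordered domain lists. A minimal Python sketch, assuming each protein is represented as a list of domain names in N- to C-terminal order; the function names are hypothetical and this is not SMART's own implementation.

```python
def same_composition(query, protein):
    """True if the protein has at least one copy of every query domain."""
    return set(query) <= set(protein)


def same_organisation(query, protein):
    """True if the query domains occur in the protein in the same order.

    Additional domains may be interspersed, so this is a subsequence test.
    """
    remaining = iter(protein)
    # 'domain in remaining' advances the iterator until the domain is found.
    return all(domain in remaining for domain in query)


query = ["SH3", "SH2", "TyrKc"]
print(same_composition(query, ["SH2", "SH3", "TyrKc"]))
print(same_organisation(query, ["SH2", "SH3", "TyrKc"]))
print(same_organisation(query, ["SH3", "SH2", "PH", "TyrKc"]))
```

A protein can thus share the query's composition while failing the stricter organisation test, as the second call shows.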
Entrez
A WWW-based system that allows easy retrieval of sequence, structure, molecular biology and literature data (Entrez). SMART's domain annotation pages contain links to the Entrez system, thereby providing extensive literature, structure and sequence information.
E-value
This represents the number of sequences with a score greater than or equal to X expected absolutely by chance. The E-value connects the score (X) of an alignment between a user-supplied sequence and a database sequence, generated by any algorithm, with how many alignments with similar or greater scores would be expected from a search of a random sequence database of equivalent size. Since version 2.0, E-values are calculated using Hidden Markov Models, leading to more accurate estimates than before.
Extracellular Domains
Domain families that are most prevalent in proteins outside of the cytoplasm and the nucleus
Gap
A position in an alignment that represents a deletion within one sequence relative to another. Gap penalties are requirements for alignment algorithms in order to reduce excessively-gapped regions. Gaps in alignments represent insertions that usually occur in protruding loops or beta-bulges within protein structures.
Genomic database
Protein database used in SMART's Genomic mode. It contains data from completely sequenced genomes only. Ensembl data is used for Metazoan genomes and Swiss-Prot for others. A complete list of genomes in the database is available.
HMM: Hidden Markov model
HMMs are statistical models of the sequence consensus of a homologous family (see the docu). A particular class of HMMs has been shown to be equivalent to generalised profiles (8867839).
Applications of HMMs to sequence analysis are nicely provided by HMMer and SAM.
HMM consensus
The HMM consensus is a one-line summary of the corresponding HMM. The amino acid shown for the consensus is the highest probability amino acid at that position according to the HMM. Capital letters mean highly conserved residues (probability > 0.5 for protein models) (modified from the HMMer User's Guide).
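The consensus rule just described (most probable residue per position, capitalised above probability 0.5) can be sketched in Python. The input format, a list of per-position dictionaries of match emission probabilities, is an assumption made for this illustration and is not HMMer's file format.

```python
def hmm_consensus(match_probs, threshold=0.5):
    """Build a one-line consensus from per-position emission probabilities."""
    letters = []
    for column in match_probs:
        # Pick the most probable amino acid at this position.
        residue, prob = max(column.items(), key=lambda item: item[1])
        # Upper case marks highly conserved positions, lower case the rest.
        letters.append(residue.upper() if prob > threshold else residue.lower())
    return "".join(letters)


model = [
    {"A": 0.7, "G": 0.3},
    {"L": 0.4, "V": 0.35, "I": 0.25},
    {"K": 0.9, "R": 0.1},
]
print(hmm_consensus(model))
```

The middle position comes out in lower case because its best residue, although the most probable, does not exceed the 0.5 threshold.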
HMMer
The HMMer package ([1] [2]) provides multiple alignment and database searching capabilities. There are several programs in the package (see the docu), including one (hmmfs) that searches databases for non-overlapping LOCAL similarities (i.e. that match across at least part of the HMM) and another (hmmls) that searches databases for non-overlapping GLOBAL similarities (i.e. that match across the full HMM). These correspond approximately to profile-based searches using negative and positive profiles, respectively (see WiseTools). Database searches using hmmls or hmmfs provide alignment scores as bits scores.
Homology
Evolutionary descent from a common ancestor due to gene duplication
Intracellular Domains
Domain families that are most prevalent in proteins within the cytoplasm
Localisation
Numbers of domains that are thought, from SwissProt annotations, to be present in different cellular compartments (cytoplasm, extracellular space, nucleus, and membrane-associated) are shown in annotation pages.
Motif
Sequence motifs are short conserved regions of polypeptides. Sets of sequence motifs need not necessarily represent homologues.
NRDB: non-redundant database
A database that contains no identical pairs of sequences. It can contain multiple sequences originating from the same gene (fragments, alternative splicing products). SMART in Standard mode uses such a database.
ORF
Open reading frame
Outlier homologues
These are often difficult to detect using HMM methodology. A complementary approach to their
detection is to query a database of sequences taken from multiple sequence alignments using BLAST. Selecting this option will also activate searches against sequence databases derived from proteins of known structure. A simple BLAST search of the PDB is performed, together with a search of RPS_Blast profiles derived from SCOP. These profiles were kindly provided by Steffen Schmidt (see Schmidt et al., J Chem Inf Comput Sci 2002 (42) 405-7).
P-value
This represents a probability that, given a database of a particular size, random sequences score higher than a value X. P-values are generated by the BLAST algorithm that has been integrated into SMART.
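The P-value and the E-value defined earlier in this glossary are closely related. The sketch below assumes the usual Poisson approximation from BLAST-style statistics, in which the probability of at least one chance hit is P = 1 - exp(-E); the glossary itself does not state this formula, and the function name is hypothetical.

```python
import math


def p_value_from_e_value(e_value):
    """Probability of at least one chance hit, given the expected number E."""
    # -expm1(-E) equals 1 - exp(-E) but stays accurate for very small E.
    return -math.expm1(-e_value)


for e in (1e-5, 0.1, 1.0, 10.0):
    print(e, p_value_from_e_value(e))
```

For small values the two measures are almost identical, while for large E-values the P-value saturates towards 1.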
PDB: protein data bank
PDB is an archive of experimentally-determined three-dimensional structures (Brookhaven Nat. Labs or EBI). Domain families represented in SMART and in the PDB are annotated as being of known structure; links are provided in SMART to the PDB via PDBsum and MMDB. PDBsum links can be used to access a variety of sequence-based and structure-based tools, whereas MMDB provides access to literature information and structure similarities.
PFAM
Pfam is a database of protein domain families represented as (i) multiple alignments and (ii) HMM-profiles ([1] [2]). Pfam WWW servers allow comparison of user-supplied sequences with the Pfam database (Sanger Center and Washington Univ.). SMART contains a facility to search the Pfam collection using HMMer.
Prokaryotic domains
SMART now also searches for domains found in two-component regulatory systems. These can be found mainly in prokaryotes, but a few were also found in eukaryotes like yeast and plants.
Profile
A profile is a table of position-specific scores and gap penalties, representing a homologous family, that may be used to search sequence databases (Ref. [1] [2] [3]). In CLUSTAL-W-derived profiles, those sequences that are more distantly related are assigned higher weights ([4] [5] [6]). Issues in profile-based database searching are discussed in Bork & Gibson (1996) [7].
ProfileScan
An excellent WWW server that allows a user to compare a protein or DNA sequence against a database of profiles (located at the ISREC).
PROSITE
This is a dictionary of protein sites and motif patterns. Some SMART domain annotations contain links to PROSITE.
Schnipsel database: domain sequence database
Schnipsel is a German word meaning snippet or fragment. The schnipsel database consists of the sequences of all domains found with SMART in NRDB. Outliers of a family often cannot be detected by a profile, yet are detectable by pairwise similarity to one or more established members of a sequence family. So searching against the schnipsel database gives complementary information to the profile searches.
Searched domains
In the first version of SMART, only eukaryotic signalling domains could be searched. In 1998 we extended this set with prokaryotic signalling and extracellular domains. In the input page you can choose whether you want to search only cytoplasmic (prokaryotic + eukaryotic), extracellular or all domains.
Secondary Literature
The secondary literature is derived by the following procedure: for each of the hand-selected papers referenced by a domain, 100 neighbouring papers are retrieved using Medline. If one of these neighbouring papers is referenced from more than two original papers, it is included in the secondary literature list.
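The secondary literature procedure amounts to counting how many hand-selected papers each neighbouring paper is linked to. A minimal Python sketch, assuming the Medline neighbour lists have already been retrieved and are passed in as a dictionary; the function name, the identifiers and the data are hypothetical.

```python
from collections import Counter


def secondary_literature(neighbours_by_paper):
    """Return neighbours that are linked to more than two original papers."""
    counts = Counter()
    for neighbours in neighbours_by_paper.values():
        # Use a set so a neighbour is counted once per original paper.
        counts.update(set(neighbours))
    return sorted(paper for paper, n in counts.items() if n > 2)


neighbours = {
    "paper1": ["a", "b"],
    "paper2": ["a", "c"],
    "paper3": ["a", "b"],
}
print(secondary_literature(neighbours))
```

Only "a" qualifies here, since it is a neighbour of three original papers while "b" is a neighbour of just two.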
Seed Alignment
Alignment that contains only one of each pair of homologues that are represented in a CLUSTALW-derived phylogenetic tree linked by a branch of length less than a distance of 0.2 (see the related article).
SEG
A program of Wootton & Federhen [1] that detects regions of the query sequence that have low compositional complexity [2].
Sequence ID or ACC
Sequence identifiers or accession codes may be entered via the SMART homepage to initiate a query. You can use either Uniprot or Ensembl sequence identifiers.
Signalling Domains
The original set of domains used in SMART were collected as those that satisfied one or both of two criteria:
1. cytoplasmic domains that possess kinase, phosphatase, ubiquitin ligase or phospholipase enzymatic activities, or those that stimulate GTPase-activation or guanine nucleotide exchange
2. cytoplasmic domains that occur in at least two proteins with different domain
organisations, of which one also contains a domain that satisfies criterion 1
These domains mostly mediate or regulate the transduction of an extracellular signal towards the nucleus, resulting in the initiation of a cellular response. More recently, prokaryotic two-component signalling domains have been added to the SMART set.
SignalP
This program predicts the presence and location of signal peptide cleavage sites in amino acid sequences (SignalP home page).
SMART: Simple Modular Architecture Research Tool
Our main goal in providing this tool is to allow automatic identification and annotation of domains in user-supplied protein sequences.
Species
Numbers of domains present in a variety of selected taxa (animal, archaea, bacteria, fungi, plants and protozoa) are shown in annotation pages.
SwissProt
The SwissProt database is an extensively annotated and non-redundant collection of protein sequences. SwissProt annotations have been mined for SMART-derived annotations of alignments.
TMHMM2
This program predicts the location and topology of transmembrane helices (TMHMM2).
Thresholds
For each of the domains found by SMART, a combination of thresholds is used to distinguish between true and false hits. The different thresholds are described in the SMART paper.
WiseTools
A package that is based on database searches using profiles. Profiles may be generated using PairWise and then compared with sequence databases using SearchWise. Scores are generated for alignments that match the whole of the profile (using a positive profile) or else that match at least part of the profile (using a negative profile). Only the top-scoring optimal alignment of each
sequence is reported; hence SMART relies on re-iterating the search for new repeats until none are reported that score above threshold. Score thresholds have been set manually that are considered to represent a score just above the top-scoring true negative. Additional thresholds have been estimated for domains that are repeated in single polypeptides. A more recent package allows comparison of DNA sequences at the level of their conceptual translations, regardless of sequence error and introns (see Wise2).
What's new
Changes from version 6.0 to 7.0
Full text search engine
Perform a full text search of SMART and Pfam domain annotations, plus the complete protein descriptions for Uniprot and Ensembl proteins.
metaSMART
Explore and compare domain architectures in various publicly available metagenomics datasets.
iTOL export and visualization
Domain architecture analysis results can be exported and visualized in interactive Tree Of
Life. The new option can be found in the protein list function select list.
User interface cleanup
Various small changes to the UI, resulting in faster and easier navigation.
Changes from version 5.1 to 6.0
Metabolic pathways information
SMART domains and proteins available in the genomic mode now have basic metabolic pathways information. It is generated by mapping our genomic mode protein database to the KEGG orthologous groups.
Updated genomic mode protein database
Greatly expanded genomic mode database now includes sequences from 630 completely sequenced species.
Changes from version 5.0 to 5.1
SMART webservice
You can access SMART using our webservice. Check the WSDL file for details. The webservice is still under active development and only works for sequences and IDs which are in our database (complete Uniref100 and stable Ensembl genomes).
SMART DAS server
SMART DAS server is available at URL http://smart.embl.de/smart/das. It provides all the protein features from our database (SMART domains, signal peptides, transmembrane regions and coiled coils) for all Uniref100 and Ensembl proteins.
If you need help with these services, or have questions/feedback, please contact us.
Changes from version 4.1 to 5.0
New protein database in Normal mode
SMART now uses Uniprot as the main source of protein sequences. All Ensembl proteomes (except pre-releases) are also included. To lower redundancy in the database, the following procedure is used:
o only one copy of 100% identical proteins is kept (different IDs are still available)
o each species' proteins are separated
o CD-HIT clustering with 96% identity cutoff is performed on each species separately
o longest member of each protein cluster is used as the representative
o only representative cluster members and single proteins (i.e. proteins which are not
members of any clusters) are used in all domain architecture queries and for domain counts in the annotation pages
Even though the number of proteins in the database is almost doubled (current version has around 29 million proteins), the redundancy should be minimal.
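The representative-selection step of the redundancy procedure above can be sketched in Python. The clustering itself is assumed to have been done already (for example by CD-HIT) and is passed in as lists of protein IDs, with unclustered proteins given as one-member lists; the function name, IDs and sequences are hypothetical.

```python
def pick_representatives(clusters, sequences):
    """Choose the longest member of each cluster as its representative."""
    representatives = []
    for members in clusters:
        # max() keeps the first member when several share the maximum length.
        longest = max(members, key=lambda pid: len(sequences[pid]))
        representatives.append(longest)
    return representatives


sequences = {"P1": "MKT", "P2": "MKTAYIAK", "P3": "MKTAY", "P4": "MSDE"}
clusters = [["P1", "P2", "P3"], ["P4"]]
print(pick_representatives(clusters, sequences))
```

Only the IDs returned here would then be used for domain architecture queries and domain counts, which is what keeps fragments of the same gene from being counted repeatedly.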
Domain architecture invention dating
As a further step from the single domain to the understanding of multi-domain proteins, SMART now predicts the taxonomic class where the concept of a protein, that is, its domain architecture, was invented. The domain architecture is defined as the linear order of all SMART domains in the protein sequence. To derive the point of its invention, all proteins with the same domain architecture are mapped onto NCBI's taxonomy. The last common ancestor of all organisms containing at least one protein with the domain architecture is defined as the point of its origin.
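Finding the last common ancestor reduces to taking the longest shared prefix of the organisms' taxonomic lineages. A minimal Python sketch, assuming each lineage is given as a root-first list of taxon names; the function name and the example lineages are illustrative and are not taken from SMART.

```python
def last_common_ancestor(lineages):
    """Return the deepest taxon shared by all root-first lineages."""
    ancestor = None
    # zip() walks the lineages rank by rank and stops at the shortest one.
    for taxa in zip(*lineages):
        if len(set(taxa)) != 1:
            break
        ancestor = taxa[0]
    return ancestor


lineages = [
    ["cellular organisms", "Eukaryota", "Metazoa", "Chordata"],
    ["cellular organisms", "Eukaryota", "Metazoa", "Arthropoda"],
    ["cellular organisms", "Eukaryota", "Fungi"],
]
print(last_common_ancestor(lineages))
```

With these three lineages the architecture would be dated to Eukaryota, because the fungal protein forces the origin above Metazoa.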
Changes from version 4.0 to 4.1
Two modes of operation: Normal and Genomic
For more details, visit the change mode page.
Intrinsic protein disorder prediction
You can now include DisEMBL prediction of protein disorder in your search parameters. The results are displayed as blue regions in protein schematics. DisEMBL's HOTLOOPS and REM465 methods are currently used. Visit the DisEMBL page for more information on the method.
Catalytic activity check
SMART now includes data on specific requirements for catalytic activity for some of our domains (50 catalytic domains at the moment; full list here). If the required amino acids are not present in the predicted domain, it will be marked as Inactive. The domain annotation page will show you the details on which amino acids are missing and the links to relevant literature (Pubmed link).
Taxonomic trees
Architecture query results are now displayed as simple taxonomic trees. In addition to individual proteins, you can select any taxonomic node (or multiple nodes) and display all the proteins in those nodes. The Evolution section of domain annotation pages is now also represented as a taxonomic tree.
The basic tree contains only several hand-picked representative species, with a link to the full tree.
User interface redesigned
SMART has been completely rewritten and all pages conform to the XHTML 1.0 Strict and CSS Level 2 standards. We recommend a modern standards-compliant browser for the best experience. Mozilla Firefox and Opera are our favorites.
Changes from version 3.5 to 4.0
Intron positions are shown in protein schematics
For proteins that match any of the Ensembl predictions, SMART will show intron positions as vertical coloured lines in graphical representations (see example). This information is retrieved from a pre-calculated mapping of Ensembl gene structures to protein sequences.
The vertical line at the end of the protein is not an actual intron, but a mark to show that intron mapping was performed. If that is the only line, there are no introns annotated. If there is no line at all, there is no data available in Ensembl for that particular sequence.
You can switch off intron display on your SMART preferences page.
Alternative splicing information
Since SMART now incorporates Ensembl genomes, the Additional information page shows a list of alternative splices of the gene encoding the analyzed protein (if there are any). It is possible to either display SMART protein annotation for any of the alternative splices, or get a graphical multiple sequence alignment of all of them.
Orthology information
There are 2 separate sets of orthologs for each Ensembl protein: 1:1 reciprocal best matches in other genomes, and orthologous groups with reciprocal best hits from all genomes analyzed (i.e. each of these proteins has exactly one ortholog in all 6 genomes). This data is displayed on the Additional information page.
Graphical multiple sequence alignments
Orthologous groups and different alternative splices can be displayed as graphical multiple
sequence alignments. Proteins are aligned using ClustalW. Domains, intrinsic features and introns are mapped onto the alignment, with their positions adjusted according to gaps (black boxes).
Changes from version 3.4 to 3.5
Features of all Ensembl genomes are stored in SMART
You can use standard Ensembl protein identifiers (for example ENSP00000264122) and sequences in all queries.
Batch access
Quickly display results for hundreds of sequences. Currently limited to 500 sequences per access, 2000 per day.
Smarter handling of identical sequences
If there are multiple IDs associated with a particular sequence, an extra table will be displayed
showing all of them.
Changes from version 3.3 to 3.4
Search structure based profiles using RPS-Blast
Clicking on the search schnipsel and structures checkbox will now also initiate a search of profiles based on scop domain families using RPS-Blast. These profiles were kindly provided by Steffen Schmidt (see Schmidt et al., J Chem Inf Comput Sci 2002 (42) 405-7).
Improved Architecture analysis
In addition to standard Domain selection querying, it is now possible to do queries based on GO (Gene ontology; click here for more info) terms associated with domains. In the first step, you get a list of domains matching the GO terms entered. After selecting the domains of interest from the list, proteins containing those domains are displayed. Use the Taxonomic selection box to limit the results based on taxonomic ranges.
Pfam domains are stored in the database
SMART database now contains precomputed results for all Pfam domains. To use Pfam domains in the architecture queries, prepend the domain name with Pfam: (for example TyrKc AND Pfam:Fz AND TRANS).
Try our SMART Toolbar for Mozilla web browser. Click here for more info.
Changes from version 3.2 to 3.3
Fantastic new protein picture generator
Proteins are now displayed as dynamically generated PNG images. This means that you can download the entire protein representation as a single image. You have our permission to do this and use these diagrams in any way that you like; do acknowledge us, though. All domain bubbles have been script-generated using The Gimp and its Perl-Fu extension. The script is GPL'd and is available for download.
Fancy colour alignments
All alignments we generate are coloured in using Leo Goodstadt's excellent CHROMA program. CHROMA is available from here and it's great. You may experience some problems if you're using a klunky browser. This will be fixed when you change your browser.
Excellent transmembrane domains
These are now calculated using the fine TMHMM2 program (the main web site is here), kindly provided by Anders Krogh and co-workers. You can read about the method here.
Selective SMART
We now store intrinsic features such as transmembrane domains and signal peptides in our database. This means that these can be queried for (e.g. SIGNAL AND TyrKc). The feature names are: signal peptide, SIGNAL; transmembrane domain, TRANS; coiled coil, COIL.
Technical changes
SMART code has been modified to run under the Apache mod_perl module. SMART is now using Apache::DBI for persistent database connections. The database engine has been updated to PostgreSQL 7.1. These changes resulted in significant speed improvements.
Bug fixes as usual
The show_many_proteins script now uses the POST method, so there is no longer a limit on the number of proteins you can display. Fixed a display problem with proteins having thousands of different representations (if you try to display those, make sure you have a good browser (like Mozilla or Opera :-) ) and a bunch of RAM).
Changes from version 3.1 to 3.2
Start page
There is an indicator of the current database status in the header now. If the database is down or is being updated, you'll know immediately.
Literature
Changed literature identifiers to PMID. New secondary literature generator that parses all
neighbouring papers, not just the first 100. Numerous small bug fixes and improvements.
Changes from version 3.0 to 3.1
Startup page
The start page now includes selective SMART and allows searching for keywords in the annotation of domains.
Schnipsel Blast
The results of a schnipsel Blast search are included in the bubble diagram. Additionally, PDB is searched.
Taxonomic breakdown
When selecting multiple proteins, e.g. via selective SMART or the annotation pages, an overview of the taxonomy of all species (tax break) is offered.
Links
Links don't contain version numbers. This allows stable links from external sources.
Selective SMART
Allows searching for multiple copies of domains and is case insensitive. You can now search for e.g. Sh3 AND sH3 AND sh3.
Annotation
You can align your query sequence to the SMART alignment using hmmalign.
Update of underlying database
SMART now uses PostgreSQL 6.5.2.
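The multi-copy, case-insensitive behaviour of selective SMART queries described in this release can be imitated with a small Python sketch. It assumes a query made only of terms joined by the literal operator " AND " and a protein given as a list of feature names; the function name is hypothetical and this is not SMART's actual query parser.

```python
from collections import Counter


def matches_query(query, protein_features):
    """True if the protein has enough copies of every term in the AND query."""
    # Repeating a term (in any letter case) asks for that many copies.
    wanted = Counter(term.strip().lower() for term in query.split(" AND "))
    present = Counter(feature.lower() for feature in protein_features)
    return all(present[name] >= copies for name, copies in wanted.items())


print(matches_query("Sh3 AND sH3 AND sh3", ["SH3", "SH2", "SH3", "SH3"]))
print(matches_query("Sh3 AND sH3 AND sh3", ["SH3", "SH3"]))
print(matches_query("SIGNAL AND TyrKc", ["SIGNAL", "TyrKc", "TRANS"]))
```

Counting terms rather than collecting them in a set is what lets a repeated domain name express a minimum copy number.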
Changes from version 2.0 to 3.0
Digest output: SMART now only produces a single diagram representing a best interpretation of all the annotation that has been performed. A comprehensive summary of the results is also provided in table format.
Selective SMART
Selective SMART allows you to look for proteins with combinations of specific domains in different species or taxonomic ranges.
Alert SMART: The SMART database is updated about once a week. If you are interested in specific domains, or combinations of them, in specific taxonomic ranges, you can use the SMART alerting service. This provides the identities of newly deposited proteins that match your query.
Domain queries: You can ask for proteins having the same domain order or composition as your query protein.
SMARTed Genomes: You can get the result of SMARTing the genomes of Caenorhabditis elegans and Saccharomyces cerevisiae via the annotation pages.
Faster PFAM searches: The PFAM searches now run on a PVM cluster.
Changes from version 1.03 to 2.0
Domain coverage
The original set of signalling domains has now been extended to include extracellular domains.
Default search method is now HMMer: The previously used underlying methodology of SMART (i.e. SWise) tended to over-extend gaps, leading to problems in defining domain borders. Although a SWise search of the SMART database remains possible (using the Wise searching SMART database option available from the Home Page), the default searching method is now HMMer2. This allows improved statistical estimates (Expectation- or E-values) of the significance of a domain hit.
Complete rewrite of the annotation pages: As with the previous version (1.03), annotation pages provide information concerning domain functions and include hyperlinks to PubMed. However, we now offer automatically derived data showing the taxonomic range and the predicted cellular localisation of proteins containing the domain in question.
Literature database
In addition to the manually derived literature sources given in version 1.03, we now provide automatically derived secondary literature from the annotation pages. These are extracted from PubMed and are cross-linked to additional abstracts.
The CATH domain structure database: new protocols and classification levels give a more comprehensive resource for exploring evolution
Lesley H. Greene, Tony E. Lewis, Sarah Addou, Alison Cuff, Tim Dallman, Mark Dibley, Oliver Redfern, Frances Pearl, Rekha Nambudiry, Adam Reid, Ian Sillitoe, Corin Yeats, Janet M. Thornton and Christine A. Orengo
WHAT'S NEW
We report the latest release (version 3.0) of the CATH protein domain database (http://www.cathdb.info). There has been a 20% increase in the number of structural domains classified in CATH, up to 86 151 domains. Release 3.0 comprises 1110 fold groups and 2147 homologous superfamilies. To cope with the increases in diverse structural homologues being determined by the structural genomics initiatives, more sensitive methods have been developed for identifying boundaries in multi-domain proteins and for recognising homologues. The CATH classification update is now being driven by an integrated pipeline that links these automated procedures with validation steps, which have been made easier by the provision of information-rich web pages summarising comparison scores and relevant links to external sites for each domain being classified. An analysis of the population of domains in the CATH hierarchy and several domain characteristics are presented for version 3.0. We also report an update of the CATH Dictionary of Homologous Structures (CATH-DHS), which now contains multiple structural alignments, consensus information and functional annotations for 1459 well-populated superfamilies in CATH. CATH is directly linked to the Gene3D database, which is a projection of CATH structural data onto ∼2 million sequences in completed genomes and UniProt.
GENERAL INTRODUCTION
The number of new structures being deposited in the Protein Data Bank (PDB) continues to grow at a considerable rate. In addition, structures being targeted by worldwide structural genomics initiatives are more likely to be novel or only very remotely related to domains previously classified. Only 2% of structures currently solved by conventional crystallography or NMR are likely to adopt novel folds (see Figures 1 and 2). A higher proportion of new folds is expected among structures solved by structural genomics. Although the influx of more diverse structures and subsequent analysis will inform our understanding of how domains evolve, it has resulted in increasing lags between the numbers of structures being deposited and classified. In response to this situation, we have significantly improved our automated and manual protocols for domain boundary assignment and homologue recognition.
Figure 1. Annual decrease in the percentage of new structures classified in CATH which are observed to possess a novel fold. The raw data for years 1972–2005 was fit to a single exponential equation by nonlinear regression using Sigma Plot (SPSS Version …
Figure 2. Annual proportion of protein structures deposited in the PDB which are classified in CATH, rejected or pending classification. The colour scheme reflects different categories of PDB chains. Black: not accepted by the CATH criteria. Red: unprocessed chains …
Significant changes have been implemented in the CATH classification protocol to achieve a more highly automated system. First, a seamless flow of structures between the constituent programs has been achieved by building a pipeline which integrates web services for each major comparison stage in the classification (see Figure 3). Secondly, completely automatic decisions are now being made for new protein chains with close relatives already assigned in the CATH database. There are two situations that preclude the CATH update process from being fully automated: we rely on expert manual curation for particularly challenging protein domain boundary assignments (DomChop stage) and also for classifications of remote folds and homologues (HomCheck stage). These two manual stages will remain an integral part of the system (Figure 3).
Figure 3. Flow diagram of the CATH classification pipeline. This schematic illustrates the processes involved in classifying newly determined structures in CATH. The CATH update protocol workflow, from new chain to assigned domain, is split into two main processes …
In this paper, we report our ongoing development of the automated procedures. These critical new features should better enable CATH to keep pace with the PDB (3) and facilitate its development. Key statistics on the domain structure populations and characteristics are also presented.
A REVISED CATH CLASSIFICATION HIERARCHY: CATHSOLID
In order to provide information on the sequence diversity between superfamily members, we have introduced additional levels into the CATH hierarchy. The CATH hierarchical classification scheme now consists of nine levels. Class is derived from secondary structure content, and Architecture describes the gross orientation of secondary structures, independent of connectivity. The Topology level clusters structures into fold groups according to their topological connections and numbers of secondary structures. The Homologous superfamilies cluster proteins with highly similar structures, sequences and/or functions (4,5). The new extension of the CATH classification system now includes five 'SOLID' sequence levels. S, O, L and I further divide domains within the H-level using multi-linkage clustering based on similarities in sequence identity (35, 60, 95 and 100%) (see Table 1). The D-level acts as a counter within the I-level and is appended to the classification hierarchy to ensure that every domain in CATH has a unique CATHsolid identification code (see Table 1). Specific details on the nature of the SOLID-levels can be found in the 'General Information' section of the CATH website, http://www.cathdb.info. CATH only includes experimentally determined protein structures with a resolution of 4 Å or better, 40 residues in length or longer, and having 70% or more side chains resolved.
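To make the nine-level scheme concrete, here is a minimal Python sketch that splits a CATHsolid code into its levels and reports how deep two domains agree in the hierarchy. It assumes the code is written as nine dot-separated integers; the example identifiers, the class name and the function names are hypothetical and are not taken from the CATH software.

```python
from typing import NamedTuple


class CathSolidId(NamedTuple):
    """The nine CATHSOLID levels, in hierarchy order."""
    class_: int                  # C: secondary structure content
    architecture: int            # A: gross orientation of secondary structures
    topology: int                # T: fold group
    homologous_superfamily: int  # H
    s: int                       # S, O, L, I: sequence identity clusters
    o: int                       #   (35, 60, 95 and 100% in the text;
    l: int                       #   the one-to-one order is presumed)
    i: int
    d: int                       # D: counter within the I-level


def parse_cathsolid(code: str) -> CathSolidId:
    """Parse a dot-separated code such as '3.40.50.200.1.1.1.1.1'."""
    parts = code.split(".")
    if len(parts) != 9:
        raise ValueError(f"expected 9 levels, got {len(parts)}: {code!r}")
    return CathSolidId(*(int(part) for part in parts))


def shared_depth(a: CathSolidId, b: CathSolidId) -> int:
    """Number of leading levels on which two identifiers agree."""
    depth = 0
    for level_a, level_b in zip(a, b):
        if level_a != level_b:
            break
        depth += 1
    return depth


first = parse_cathsolid("3.40.50.200.1.1.1.1.1")   # hypothetical domain
second = parse_cathsolid("3.40.50.200.2.1.1.1.1")  # hypothetical domain
print(shared_depth(first, second))  # 4: same superfamily, different S cluster
```

A shared depth of 4 or more means the two domains fall in the same homologous superfamily; a depth of 9 would mean the two codes are identical, which the D-level counter is meant to prevent for distinct domains.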
Table 1. CATH version 3.0 statistics
DOMAIN BOUNDARY ASSIGNMENTS
We have further improved our automated domain boundary prediction method, CATHEDRAL (4). This is used to search a newly determined multi-domain structure against a library of representative structures from different fold groups in the CATH database to recognise constituent domains. CATHEDRAL performs an initial rapid secondary structure comparison between structures, using graph theory to identify putative fold matches, which are then more carefully aligned using a slower, more accurate dynamic programming method. A new scoring scheme has been implemented which combines information on the class of the domains being compared, their sizes, similarity in the structural environments and the number of equivalent residues. A support vector machine is used to combine the different scores and select the best fold match for each putative domain in a new multi-domain protein.
Benchmarking against a set of 964 'difficult' multi-domain chains, whose 1593 constituent domains were remotely related to folds in CATH (<35% sequence identity) and originated from 245 distinct fold groups and 462 superfamilies, showed that 90% of domains within these chains could be assigned to the correct fold group, and for 78% of them the domain boundaries were within ±15 residues of boundaries assigned by careful manual validation. Larger variations in domain boundaries are often due to the fact that in many families significant structural variation can occur during evolution, so that distant relatives vary considerably in size. If no close relative has been classified in CATH, it is likely that the only CATHEDRAL match will be to a relative with significant structural embellishments, thus making it harder to determine the correct boundaries.
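The ±15 residue agreement used in this benchmark can be sketched as a simple comparison of predicted and manually validated segment boundaries. The Python below is illustrative only: representing a domain as a list of (start, end) residue segments and requiring every segment end to fall within the tolerance are assumptions, since the text does not spell out the exact benchmarking rule.

```python
Segment = tuple[int, int]  # (start residue, end residue), hypothetical representation


def boundaries_agree(predicted: list[Segment],
                     reference: list[Segment],
                     tolerance: int = 15) -> bool:
    """Return True if each predicted segment boundary lies within
    `tolerance` residues of the corresponding reference boundary."""
    if len(predicted) != len(reference):
        return False  # a different number of segments counts as disagreement
    pairs = zip(sorted(predicted), sorted(reference))
    return all(
        abs(pred_start - ref_start) <= tolerance
        and abs(pred_end - ref_end) <= tolerance
        for (pred_start, pred_end), (ref_start, ref_end) in pairs
    )


# A predicted domain 1-120 against a curated domain 5-130 agrees within 15 residues.
print(boundaries_agree([(1, 120)], [(5, 130)]))
```

The fraction of benchmark domains for which such a check succeeds would correspond to the 78% figure quoted above, under the stated assumptions.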
Since domain boundary assignment of remote homologues is one of the most time-consuming stages in the classification, we combine multiple sources of information for each new structure on a web page to guide manual curation. Pages display scores from a range of algorithms, which include structure-based methods, CATHEDRAL (4), SSAP (6,7), DETECTIVE (8), PUU (9) and DOMAK (10); sequence-based methods such as hidden Markov models (HMMs) (11); and relevant literature. These pages are now viewable for information on putative boundaries for new multi-domain structures currently being classified in CATH (e.g. http://www.cathdb.info/cgi-bin/cath/Chain.pl?chain_id=2g3aA).
For protein chains which are closely related to chains that are already chopped in CATH, an automated protocol has been developed (ChopClose). ChopClose identifies any previously chopped chains that have sufficiently high sequence identity and overlap with the query chain. Using SSAP (6), the query is aligned against each of these chains in turn, and in each case the domain boundaries are inherited across the alignment. The process of inheriting the boundaries often requires some adjustments to be made to account for insertions, deletions or unresolved residues. If the inheritance from one of the chains meets various criteria (SSAP score ≥80, sequence identity >80%, RMSD ≤6.0 Å, longest end extension ≤10 residues, etc.), then the resulting boundaries are used to chop the chain automatically (AutoChop). For cases where ChopClose's best result does not meet all the criteria for automatic chopping, it is provided as support information for a manual domain boundary assignment.
Refer to Figure 3 for the location of AutoChop/ChopClose within the CATH update protocol.
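The decision between automatic chopping and manual curation can be sketched as a threshold check. The Python below is a hedged illustration, not the ChopClose implementation: the field names are invented, only the four criteria listed in the text are encoded (the text's "etc." implies there are others), the RMSD cutoff is read as 6.0 Å, and choosing the "best" result by highest SSAP score is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class InheritanceResult:
    """Outcome of inheriting boundaries from one previously chopped chain."""
    template_chain: str
    ssap_score: float
    sequence_identity: float    # percent
    rmsd: float                 # in angstroms
    longest_end_extension: int  # residues


def passes_autochop(result: InheritanceResult) -> bool:
    """Apply the four criteria named in the text (others exist but are not listed)."""
    return (result.ssap_score >= 80
            and result.sequence_identity > 80
            and result.rmsd <= 6.0
            and result.longest_end_extension <= 10)


def decide(results: list[InheritanceResult]) -> tuple[str, Optional[InheritanceResult]]:
    """Return ('AutoChop', best) or ('manual', best); best is None without candidates."""
    if not results:
        return "manual", None
    best = max(results, key=lambda r: r.ssap_score)  # assumed ranking rule
    return ("AutoChop" if passes_autochop(best) else "manual"), best


candidates = [
    InheritanceResult("1abcA", ssap_score=91.2, sequence_identity=93.0,
                      rmsd=1.4, longest_end_extension=3),   # hypothetical values
    InheritanceResult("2xyzB", ssap_score=84.0, sequence_identity=82.5,
                      rmsd=2.9, longest_end_extension=12),  # hypothetical values
]
print(decide(candidates)[0])
```

When the best candidate fails a criterion, the sketch still returns it, mirroring the text's statement that the best ChopClose result is passed on as support information for manual assignment.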
NEW HOMOLOGUE RECOGNITION METHODS
We have assessed a number of HMM-based protocols for improving homologue recognition. A new protocol (Samosa), exploiting models built using multiple structure alignments to improve accuracy, gives some improvements in sensitivity (4–5%). However, a protocol exploiting an 8-fold expanded HMM library based on sequence relatives of structural domains gives an increase of nearly 10% in sensitivity (12). In addition, HMM–HMM-based approaches have been implemented using the PRC protocol of Madera and co-workers (http://supfam.org/PRC). These allow recognition of extremely remote homologues, some of which are not easily detected by the structure comparison methods (discussed further below). HMM-based database scans developed for the CATH classification protocol are collectively referred to as HMMscan below.
For some very remotely related homologues, confidence in an assignment can be improved by combining information from multiple prediction methods. We have investigated the benefits of using machine learning methods to do this automatically. A neural network was trained using a dataset of 14 000 diverse homologues (<35% sequence identity) and 14 000 non-homologous pairs, with data from different homologue comparison methods including structure comparison (CATHEDRAL, SSAP), sequence comparison (HMM–HMM) and information on functional similarity. The latter was obtained by comparing EC classification codes between close relatives of the distant homologues and using a semantic similarity scoring scheme for comparing GO terms based on a method developed by Lord et al. (13). On a separate validation set of 14 000 homologous pairs and 14 000 non-homologous pairs, 97% of the homologues can be recognised at an error rate of <4%.
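For readers who want to see how figures like "97% recognised at an error rate of <4%" are computed, the Python sketch below derives both quantities from counts on a validation set. The counts are hypothetical, chosen only to match the set sizes in the text, and the definition of error rate as the fraction of non-homologous pairs wrongly accepted is an assumption; the text does not define it.

```python
def sensitivity(true_positives: int, false_negatives: int) -> float:
    """Fraction of true homologous pairs that the classifier recognises."""
    return true_positives / (true_positives + false_negatives)


def error_rate(false_positives: int, true_negatives: int) -> float:
    """Fraction of non-homologous pairs wrongly accepted (assumed definition)."""
    return false_positives / (false_positives + true_negatives)


# Hypothetical counts on 14 000 homologous and 14 000 non-homologous pairs.
recognised, missed = 13_580, 420
wrongly_accepted, correctly_rejected = 420, 13_580

print(f"sensitivity = {sensitivity(recognised, missed):.2%}")
print(f"error rate  = {error_rate(wrongly_accepted, correctly_rejected):.2%}")
```

With a score-producing classifier such as a neural network, moving the acceptance threshold trades one quantity against the other, which is why the two numbers are reported together.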
NEW UPDATE PROTOCOL
Automatic methods
Previously, CATH data was generated using a group of independent programs and flat files. Over the past two years, we have developed an update protocol for CATH that is driven by a suite of programs with a central library and a PostgreSQL database system. A classification pipeline has been established which links, in a completely automated fashion, the different programs that analyse the sequences and structures of both protein chains and domains. The CATH update protocol can essentially be divided into two parts: domain boundary assignment and domain homology classification (see Figure 3). The aim of the protocol is to minimise manual assignment and provide as much support as possible when manual validation is necessary.
Processing in the two parts of the classification protocol is similar, requiring related meta-data and the triggering of the same automated algorithms. Methods include pairwise sequence similarity comparisons and scans by other homologue detection or fold recognition algorithms, such as HMMscan and CATHEDRAL, that provide data for either manual or automated assignment. Many of the automated steps in the protocol have been established as web services, and the pipeline integrates automated steps together with 'holding stages', in which domains are held prior to processing and await the completion of manual validation of predictions (see below).
Web pages to support manual validation
For each manual stage in the classification (domain boundary assignment, DomChop, and homologue recognition, HomCheck), we have developed a suite of web pages bringing together all available meta-data from prediction algorithms (e.g. DBS, CATHEDRAL and HMMscan for DomChop; CATHEDRAL and HMMscan for HomCheck), information from the literature, and information from other family classifications with relevant data (e.g. Pfam). For each protein or domain shown on the pages, information on the statistical significance of matches is presented. The web pages will shortly be made viewable and will provide interim data on protein chains and domains not fully classified in CATH, for biologists interested in any entries pending classification.
OVERVIEW OF THE CURRENT RELEASE (VERSION 3.0)
Assigning domain boundaries and relationships between protein structures is computationally challenging. Since the last CATH release (version 2.6), the number of domains in the CATH database has increased by 20% in version 3.0 and now totals 86 151. This is a more than 10-fold increase in the number of domains classified in CATH since its creation. Improvements in automation, and also in the web-based resources used to aid manual validation, have allowed us to increase the proportion of hard-to-classify structures processed in CATH, and this is reflected in a significant increase in the proportion of new folds in the database (now more than 1000). The detailed breakdown of numbers of domains in the nine CATH levels is given in Table 1. We conducted an analysis of the domains in version 3.0 and have derived statistics for several fundamental features.
Percentage of new topologies
An analysis of the percentage of new folds arising from the early 1970s to the present is shown in Figure 1. The number of new folds has been decreasing over time with respect to the number of new structures being deposited, and it can be seen that currently approximately 2% of new structures classified in CATH are observed to be novel folds. For comparison, the number of domain structures solved over time is also graphically represented in Figure 1.
Number of domains within a protein chain
Integral to the construction of the CATH database is designating domain boundaries. We conducted an analysis of the number of chains versus the number of domains in a chain. It is interesting to note that 64% of all protein structures currently solved and classified in CATH are single-domain chains (data not shown). The next most prevalent are two-domain chains (27%), and beyond this the number of chains containing three or more domains rapidly decreases. The average size of the single-domain chains is 159 residues.
The CATH Dictionary of Homologous Superfamilies (CATH-DHS)
The CATH-DHS has also been recently updated. Data on structural similarity and superfamily variability is presented as a significant update to the Dictionary of Homologous Superfamilies (DHS) web resource (14). The DHS also provides functional annotations of domains within each H-level (superfamily) in CATH v2.5.1.
For each superfamily, pair-wise structural similarity scores between relatives, measured by SSAP, are presented. The DHS now contains 3307 multiple structural alignments for 1459 superfamilies. For each superfamily, multiple alignments are generated for all the relatives and also for subgroups of structurally similar relatives and sequence-similar relatives. Alignments are performed using the residue-based CORA algorithm (15) and presented both as CORAPLOTS (14) and in the form of a 2DSEC diagram (16), alongside co-ordinate data of the superposed structures in PDB format. Sequence representations of the alignments are available to download in FASTA format. In the CORAPLOT images of the multiple alignment, residues in each domain are coloured according to ligand binding and residue type. EquivSEC plots are also shown that describe the variability in orientation and packing between equivalent secondary structures (16).
To identify sequence relatives for CATH superfamilies, sequences from UniProt (17) were scanned against HMMs of all CATH domains (12). Homologous sequences were identified as those hits with an E-value < 0.01 and a 60% residue overlap with the CATH domain. This protocol recognised over one million domain sequences in UniProt which could be integrated in the CATH-DHS. The harvested sequences in each superfamily were compared against other relatives by BLAST (18) to determine the pair-wise sequence identity and then clustered at appropriate levels of sequence identity (35 and 95%) using multi-linkage clustering. Information and links to other functional databases, ENZYME (19), GO (Gene Ontology Consortium, 2000), KEGG (20), COG (21) and SWISSPROT (22), are also included by BLASTing the sequences from each superfamily against sequences provided by these resources. Only hits with 95% sequence identity and an 80% residue overlap were used to annotate sequences.
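The two filtering steps described here (accepting HMM hits as homologues, then accepting BLAST hits for annotation transfer) amount to threshold tests on each hit. The following Python sketch shows the idea; the record layout and function names are hypothetical, the default E-value cutoff of 0.01 is an assumed reading of the source's threshold and is exposed as a parameter, and treating the overlap and identity limits as inclusive minimums is also an assumption.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One match between a UniProt sequence and a CATH domain (hypothetical record)."""
    sequence_id: str
    superfamily: str
    evalue: float
    overlap: float             # fraction of the domain covered, 0.0-1.0
    identity: float = 0.0      # percent sequence identity (used for BLAST hits)


def is_homologue(hit: Hit, max_evalue: float = 0.01, min_overlap: float = 0.60) -> bool:
    """HMM scan filter: significant E-value and sufficient residue overlap."""
    return hit.evalue < max_evalue and hit.overlap >= min_overlap


def can_annotate(hit: Hit, min_identity: float = 95.0, min_overlap: float = 0.80) -> bool:
    """Annotation transfer filter: near-identical sequence with high overlap."""
    return hit.identity >= min_identity and hit.overlap >= min_overlap


hits = [
    Hit("SEQ_A", "3.40.50.200", evalue=1e-12, overlap=0.85, identity=97.0),
    Hit("SEQ_B", "3.40.50.200", evalue=0.5, overlap=0.90, identity=40.0),
    Hit("SEQ_C", "3.40.50.200", evalue=1e-5, overlap=0.40, identity=99.0),
]
print([hit.sequence_id for hit in hits if is_homologue(hit)])
```

The stricter annotation filter reflects the text's caution: sequences are harvested as relatives on fairly permissive evidence, but functional labels are only copied between near-identical sequences.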
Recent analysis of structural and functional divergence in highly populated CATH superfamilies (>5 structural relatives with <35% sequence identity) has been undertaken using data from the DHS. The 2DSEC algorithm was used to analyse multiple structural alignments of families and identify highly conserved structural cores and secondary structure embellishments, or decorations, to the common core. In some large superfamilies, extensive embellishments were observed outside the core, and although these secondary structure insertions were frequently discontinuous in the protein chain, they were often co-located in 3D space (16). In many cases, manual inspection revealed that the embellishment had aggregated to form a larger structural feature that was modifying the active site of the domain or creating new surfaces for domain or protein interactions. Data collected in the DHS clearly shows a relationship between structural divergence within a superfamily, sequence divergence of this superfamily amongst predicted domains in the genomes, and the number of distinct functional groups that can be identified for the superfamily (see Figure 4).
Figure 4. Relationship between sequence variability, structural variability and functional diversity in CATH superfamilies. Structural variation in a CATH superfamily as measured by the number of diverse structural subgroups (SSAP score <80 between groups) …
LATERAL LINKS ACROSS THE CATH HIERARCHY TO CAPTURE EVOLUTIONARY DIVERGENCE AND EXPLORE THE STRUCTURAL CONTINUUM
Our analysis of structural divergence in CATH superfamilies (16) has revealed families where significant changes in the structures had occurred. In some cases, 5-fold differences in the sizes of domains were identified, and sometimes it was apparent that the 'folds' of these very diverse relatives had effectively changed. Therefore, in these superfamilies more than one fold group can be identified, effectively breaking the hierarchical nature of the CATH classification, which implies that each relative within a CATH homologous superfamily should belong to the same CAT fold group.
In addition, an 'all versus all' HMM–HMM scan between all superfamily representatives revealed several cases of extremely remote homologues which had been classified into separate superfamilies and yet match with significant E-values. Structure comparison had failed to detect the relationship between these superfamilies because the structural divergence of the relatives was so extreme, sometimes constituting a change in architecture as well as fold group. In these cases, homology was only suggested by the HMM-based scans and then manually validated by considering functional information and detailed evidence from the literature. In order to capture information on these distant homologies, links have been created between the superfamilies, both on our web pages and in the CATH database. The data can now be found as a link from the CATH homepage (http://www.cathdb.info).
In the near future, we also plan to provide web pages presenting cases of significant structural overlaps between superfamilies or fold groups. For these cases, we are not currently able to find any additional evidence to support a distant evolutionary relationship, and these examples highlight the recurrence of large structural motifs between some folds and the existence of a structural continuum in some regions of fold space.
FEATURES
The CATH database can be accessed at http://www.cathdb.info. The web interface may be browsed, or alternatively searched with PDB codes or CATH domain identifiers. There is also a facility for keyword searches. With the version 3.0 release, we now make the raw and processed data files available (including, for example, CATH domain PDB files, sequences and dssp files); they can be accessed through the CATH database main page. The Gene3D resource can be accessed through the CATH database or directly at http://www.cathdb.info/Gene3D. The DHS can be accessed through the CATH database or directly at http://www.cathdb.info/bsm/dhs.
Structural similarity is assessed using an automatic method (SSAP) (3,4), which scores 100 for identical proteins and generally returns scores above 80 for homologous proteins. More distantly related folds generally give scores above 70 (Topology or fold level), though in the absence of any sequence or functional similarity this may simply represent examples of convergent evolution, reinforcing the hypothesis that there exists a limited number of folds in nature (5,6).
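The SSAP score bands quoted above can be summarised as a small lookup. The Python sketch below is a rule-of-thumb reading of the text only: the band labels and the function name are invented, the text says scores "generally" fall in these ranges, and a real classification also weighs sequence and functional evidence.

```python
def interpret_ssap(score: float) -> str:
    """Map an SSAP score (0-100) to the rough interpretation given in the text."""
    if not 0 <= score <= 100:
        raise ValueError(f"SSAP scores range from 0 to 100, got {score}")
    if score == 100:
        return "identical proteins"
    if score > 80:
        return "likely homologous"
    if score > 70:
        return "similar fold (Topology level), homology unconfirmed"
    return "no significant structural similarity"


for score in (100, 86.5, 74, 55):
    print(score, "->", interpret_ssap(score))
```

The middle band is deliberately labelled as unconfirmed homology, since the text notes that fold similarity without sequence or functional similarity may reflect convergent evolution.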
Abstract
We report the latest release (version 1.4) of the CATH protein domains database (http://www.biochem.ucl.ac.uk/bsm/cath). This is a hierarchical classification of 13 359 protein domain structures into evolutionary families and structural groupings. We currently identify 827 homologous families in which the proteins have both structural similarity and sequence and/or functional similarity. These can be further clustered into 593 fold groups and 32 distinct architectures. Using our structural classification and associated data on protein functions stored in the database (EC identifiers, SWISS-PROT keywords and information from the Enzyme database and literature), we have been able to analyse the correlation between the 3D structure and function. More than 96% of folds in the PDB are associated with a single homologous family. However, within the superfolds, three or more different functions are observed. Considering enzyme functions, more than 95% of clearly homologous families exhibit either single or closely related functions, as demonstrated by the EC identifiers of their relatives. Our analysis supports the view that determining structures, for example as part of a 'structural genomics' initiative, will make a major contribution to interpreting genome data.
Introduction FOR CATH
The CATH classification of protein domain structures was established in 1993 (1) as a hierarchical clustering of protein domain structures into evolutionary families and structural groupings, depending on sequence and structure similarity. There are four major levels, corresponding to protein class, architecture, topology or fold, and homologous family (Fig. 1). Since 1995, information about these structural groups and protein families has been accessible over the Web (http://www.biochem.ucl.ac.uk/bsm/cath), together with summary information about each individual protein structure (PDBsum) (2).
CATH consists of both phylogenetic and phenetic descriptors for protein domain relationships. At the lowest levels in the hierarchy, proteins are grouped into evolutionary families (Homologous families) for having either significant sequence similarity (≥35% identity) or high structural similarity and some sequence similarity (≥20% identity).
The CATH database of protein structures contains approximately 18 000 domains organized according to their (C)lass, (A)rchitecture, (T)opology and (H)omologous superfamily. Relationships between evolutionarily related structures (homologues) within the database have been used to test the sensitivity of various sequence search methods in order to identify relatives in GenBank and other sequence databases. Subsequent application of the most sensitive and efficient algorithms, gapped BLAST and the profile-based method Position Specific Iterated Basic Local Alignment Tool (PSI-BLAST), could be used to assign structural data to between 22 and 36% of microbial genomes in order to improve functional annotation and enhance understanding of biological mechanism. However, on a cautionary note, an analysis of functional conservation within fold groups and homologous superfamilies in the CATH database revealed that, whilst function was conserved in nearly 55% of enzyme families, function had diverged considerably in some highly populated families. In these families, functional properties should be inherited far more cautiously, and the probable effects of substitutions in key functional residues must be carefully assessed.
Figure 1
Schematic representation of the (C)lass, (A)rchitecture and (T)opology/fold levels in the CATH database.
Figure 2
Snapshot of a web page showing data available in the CATH dictionary of homologous superfamilies for the subtilisin family (CATH id 3.40.50.200). Tables display the PDB codes for non-identical relatives in the family, together with EC identifier codes and information about the enzyme reactions. The multiple structural alignment shown has been coloured according to secondary structure assignments (red for helix, blue for strands).
The Architecture level in CATH groups proteins whose folds have similar 3D arrangements of secondary structures (e.g. barrel, sandwich or propellor), regardless of their connectivity, whilst the top level, Class, simply reflects the proportion of α-helix or β-strand secondary structures. Three major classes are recognised, mainly-α, mainly-β and α–β, since analysis revealed considerable overlap between the α+β and alternating α/β classes originally described by Levitt and Chothia (7).
Before classification, multidomain proteins are first separated into their constituent folds using a consensus method which seeks agreement between three independent algorithms (8). Whilst the protocol for updating CATH is largely automatic (9), several stages require manual validation, in particular establishing domain boundaries in proteins for which no consensus could be reached and checking the relationships of very distant homologues and proteins having borderline fold similarity. Although there are plans to assign the more regular architectures automatically, all architecture groupings are currently assigned manually.
A homologous family Dictionary is now available within CATH, which contains functional data, where available, for each protein within a homologous family. This includes EC identifiers, SWISS-PROT keywords and information from the Enzyme database or the literature (Fig. 2). Multiple structure-based alignments are also available, coloured according to secondary structure assignments or residue properties, and there are schematic plots showing domain representations annotated by protein-ligand interactions (DOMPLOTS) (A.E. Todd, C.A. Orengo and J.M. Thornton, submitted to Protein Engng). The topology of each domain is illustrated by schematic TOPS diagrams (http://www3.ebi.ac.uk/tops) (10).
Figure 3
CATH wheel plot showing the population of homologous families in different fold groups, architectures and classes. The wheel is coloured according to protein class (red, mainly-α; green, mainly-β; yellow, α–β; blue, few secondary structures). The size of the outer wheel represents the number of homologous families in CATH, whilst each band in the outer wheel corresponds to a single fold family. The size of each fold 'band' therefore reflects the number of homologous families having that fold. It can be seen that most fold families contain a single homologous family. The superfold families are shown as paler bands containing many homologous families. The inner wheel shows the population of homologous families in the different architectures.
We have also recently set up a Web Server (11) which enables the user to scan the CATH database with a newly determined protein structure and identify possible fold similarities or evolutionary relationships. There are also plans to incorporate sequence searches (using BLAST or PSI-BLAST) (12) to identify a probable fold for a new sequence.
The latest release of CATH (version 1.4, April 1998) contains 9342 protein chains from the PDB (13), which divide into 13 359 domain folds. Currently, 32 different architectures are recognised. Since the last release, three new architectures have been described, including the five-bladed α-β propellor. Grouping proteins on the basis of sequence, structure and functional similarity gives 827 evolutionary homologous families (H-level), whilst recognising more distant structural similarity with no accompanying sequence or function similarity gives rise to 593 different fold groups (T-level).
The population of the different levels in the CATH hierarchy is illustrated by the CATH wheel shown in Figure 3. It can be seen that several highly populated fold families, which we describe as superfolds (6) as they support a diverse range of sequences and more than three different functions, still account for nearly 30% of non-homologous structures.
Implications for Structural Genomics
As the sequence databases grow rapidly, the need to interpret these sequences and assign functions to specific genes becomes increasingly important. Many techniques exist for matching protein sequences and thereby inheriting functional information. However, for very distant homologues there is often no detectable sequence similarity, despite conservation of 3D structure and function. For these cases, evolutionary relationships, and thereby functions, can only be assigned by comparing the structures.
Therefore a number of structural genomics initiatives are being proposed (14) which aim to identify all the folds in nature, with the ultimate goal of being able to predict the function of a new protein from its known or probable structure. The important questions to ask are: how many more folds do we need to determine before we have the complete set, and how confident can we be in assigning function between proteins having similar structures?
In the current genomes, on average only 30–46% of sequences can be assigned to a structural family by recognising sequence similarity to a protein of known structure (15,16). With only ~600 unique structures currently in the PDB, compared with ~20 000 sequence families, it is clear that we still need to determine many more structures if we are to understand biology at the molecular level.
However, analysis of recently deposited structural data is very revealing. Figure 4a illustrates the distribution of 2159 new structural domains classified in the 10 months from June 1997 to March 1998. A large proportion of these (79%) were clearly homologous (≥30% identity) to proteins of known structure.
Of the remaining 443 structures (Fig. 4b), corresponding to 'new' sequences, we found only 8% were novel folds, the remainder resembling a previously determined structure. Many of these, 199 (45%), could be identified as clear homologues by having significant structure and sequence similarity (SSAP ≥80 and ≥20% sequence identity). A further 169 (38%) were probable homologues: although the sequence identity was below 20%, they had functional similarity and/or gave significant scores using sequence search methods designed to detect very distant homologues (PSI-BLAST) (12). There remained a further 40 (9%) proteins which were analogous, i.e. they had the same fold as a previous entry but neither the sequence nor the function gave definite evidence of a common ancestor.
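The percentages quoted for the 443 new-sequence structures can be checked with a few lines of arithmetic. The following is only a sketch: the three counts are taken from the text, while the number of novel folds is derived here as the remainder, which is an assumption because the text gives it only as a percentage.

```python
# Counts quoted in the text for the 443 structures with 'new' sequences.
total = 443
counts = {
    "clear homologues": 199,
    "probable homologues": 169,
    "analogues": 40,
}
# Assumption: the novel folds are whatever is left over.
counts["novel folds"] = total - sum(counts.values())

# Percentage of the 443 structures in each group, rounded to whole numbers.
percent = {name: round(100 * n / total) for name, n in counts.items()}

# Share recognised as homologues (clear + probable) using structural data.
homologous_share = round(
    100 * (counts["clear homologues"] + counts["probable homologues"]) / total
)

for name, n in counts.items():
    print(f"{name}: {n} ({percent[name]}%)")
print(f"homologues recognised with structural data: {homologous_share}%")
```

Under this assumption the remainder is 35 structures, which rounds to the 8% of novel folds reported, and the two homologue groups together give the 83% figure used later in the text.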
At the homologous superfamily level in CATH, a more detailed analysis of enzyme functions showed that the majority of homologous enzyme families in CATH (>90%) contained proteins for which the first three EC identifiers were the same. Considering those families where homologues have significant sequence identity (≥20%) after structural alignment, 95% were found to have a single EC identifier, whilst for families where proteins have more than 30% sequence similarity we observed that 98% had a single EC code.
Although assigning function on the basis of homology is common practice, it is clear that some caution should be exercised, particularly where there is little or no sequence similarity. There are also some clear examples where homologues with significant sequence similarity perform different functions. The role of 'gene recruitment' is especially clear in the eye lens proteins, which function as enzymes in other cellular environments but which are used as structural proteins in this context (17). The extent of such 'gene recruitment' and context-sensitive function is really not known at this time. For enzymes it is clear that catalytic function can change and evolve, usually to act on a different but related substrate. Similarly, within the lipocalin family (CATH id 2.40.130.10) several proteins are found with very similar structures which bind different fatty acids in the same region at the base of the β-barrel (e.g. retinol, bilin, biotin).
Nearly half of the homologous families where two or more different EC numbers were observed belong to the superfolds. This suggests that if a new protein is assigned to a superfold family, more caution should be used when inheriting functional information, as there appears to be greater tolerance to changes in sequence, and ultimately function, for these families. However, it is interesting to note that many of these were TIM barrel or Rossmann folds. These are superfolds in which the substrate or ligand commonly binds in the same place. This is in the base of the β-barrel for the TIMs and at the crossover of the polypeptide chain for the doubly wound Rossmann structures.
Assignment of Function Through Structure
One of the reasons for determining structures is to derive more information to facilitate the assignment of function. From our analysis of proteins in CATH we suggest that structural data can help to assign function in several ways.
i The structural data allow recognition of more distant homologues compared with sequence data. In our analysis, 83% of structures with novel sequences could be assigned as homologues in this way (note that such assignment of function is again subject to the caveats imposed by 'gene recruitment' discussed above).
ii The structural data allow detailed inspection of the functional site, to suggest if and how the function may have evolved. For example, if an enzyme has evolved to act on a different substrate, the binding site may reveal or at least suggest possible changes in the substrate.
iii For the superfolds, similarity of structure does not necessarily mean similarity of function. However, the active site/binding sites are often conserved, e.g. in the TIM barrel or Rossmann fold structures the ligand always binds at the same end of the barrel or sheet.
iv Some methods have already been developed, and will increasingly be the focus of attention over the next few years, which aim to predict function ab initio from structure. For example, enzymes can often be identified by the presence of a major cleft, which also locates the active site (18). Similarly, critical surface patches which are used for molecular recognition in binding other proteins or ligands may be identified using knowledge-based approaches (19,20).
In summary, extrapolating the data from Figure 4 to a new genome, we can expect that of the 54–70% of sequences which currently have no obvious sequence matches in the PDB, we will find nearly 80–90% to be homologous to a known family using the structural data alone. For the singlet folds this will almost certainly reveal some clues to the function. For the superfolds, some folds will reveal information on the functional class (e.g. enzyme for TIM barrels) or the location of the active site, if not the specific function. Only 10–20% will be expected to be 'novel' folds. For these, the ab initio methods referred to above may provide some clues to guide experiments. Therefore it is clear that determining structures, as part of a 'structural genomics' initiative for example, will make a major contribution to interpreting genome data.
Jail
Just another interface library
Interfaces of macromolecules are a valuable basis to analyse the process of molecular recognition. JAIL classifies not only the interfaces between domain architectures but also those between protein chains and those between proteins and nucleic acids.
Of course not
Figure: Gt-alpha/Gi-alpha chimera (PDB-ID 1GOT); interfaces of 1GOT.
Interacting proteins are difficult to crystallize and are rarely present within the Protein Data Bank. Nevertheless, it is essential to analyse the interacting parts of the proteins to understand the process of protein-protein docking.
To overcome this problem we have built up the JAIL database. Since interacting domains exhibit structural features similar to those of proteins, all known interfaces between interacting domains of the SCOP database were extracted and classified in JAIL.
Only a part of all protein structures are included in SCOP. In particular, new PDB entries are not yet annotated. To overcome this problem, all interfaces between protein chains were additionally calculated and included in the database. This type of interface also comprises the interacting parts of the assumed biological units. The last important type of interface provided here is composed of the interacting parts between proteins and nucleic acids. Overall, the data set consists of about 180,000 interfaces.
JAIL is a convenient tool to browse through the interface library and to analyze single interfaces. However, more general questions require large-scale analysis. For this purpose, a detailed form enables the compiling of comprehensive non-redundant data sets for download.
How is an interface defined?
A complete residue is part of an interface if at least one atom of the amino acid is located within 4.5 Ångström of any atom of the interacting domain or chain. One part of an interface must consist of at least 5 C-alpha atoms in the case of protein chains. In the case of nucleic acids, the relevant atoms are calculated on the basis of the phosphorus atoms of the RNA/DNA backbone.
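The distance criterion above can be sketched in a few lines of Python. This is an illustration only, not JAIL's implementation: the atom representation (a residue id plus coordinates), the function names and the toy coordinates are all hypothetical, and the size check reads "at least 5 C-alpha atoms" as "at least 5 residues", which is an assumption.

```python
import math

CUTOFF = 4.5       # Ångström, the distance quoted in the definition
MIN_RESIDUES = 5   # assumed reading of "at least 5 C-alpha atoms"


def interface_residues(atoms_a, atoms_b, cutoff=CUTOFF):
    """Return ids of residues in partner A that have at least one atom
    within `cutoff` of any atom of partner B.

    Each atom is a tuple (residue_id, (x, y, z)); this layout is hypothetical.
    """
    selected = set()
    for res_id, pos_a in atoms_a:
        if res_id in selected:
            continue  # one close atom is enough to include the whole residue
        if any(math.dist(pos_a, pos_b) <= cutoff for _, pos_b in atoms_b):
            selected.add(res_id)
    return selected


def large_enough(residues, minimum=MIN_RESIDUES):
    """Size filter for one side of a protein-chain interface (assumed rule)."""
    return len(residues) >= minimum


# Toy coordinates, invented purely for illustration.
chain_a = [(1, (0.0, 0.0, 0.0)), (1, (1.5, 0.0, 0.0)), (2, (10.0, 0.0, 0.0))]
chain_b = [(101, (4.0, 0.0, 0.0)), (102, (30.0, 0.0, 0.0))]

side_a = interface_residues(chain_a, chain_b)
side_b = interface_residues(chain_b, chain_a)
print(side_a, side_b, large_enough(side_a))
```

The all-against-all loop is quadratic in the number of atoms; for real structures a spatial index would normally be used, but the selection rule stays the same.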
What are biounits?
The primary coordinate file deposited in the PDB generally contains one asymmetric unit. The asymmetric unit is the smallest portion of a crystal structure to which crystallographic symmetry can be applied to generate one unit cell. The biological molecule (biounit) is believed to be the functional unit of the protein. Frequently those units can be assumed or calculated when additional information is available. The biological units of many proteins are deposited in a separate section of the PDB database and can be used for interface calculations.
In which way are redundant interfaces excluded in the download section?
Redundancy is excluded in two different ways: by structure and by sequence. The sequence clustering is based on the CD-HIT program. The structural clustering is defined by the protein families and superfamilies of the SCOP classification. The database classifies proteins by domain architecture.
Which settings in the download section are best for my own research?
The selection of the datasets depends on the type of interactions (protein-protein or protein-nucleic acid) and the level of diversity that is desired. A maximum sequence identity of 50% results in a higher diversity than a setting of 95%. The default settings include interfaces of domain-domain interactions as well as interfaces between interacting chains. All interfaces of chains that were already treated by the SCOP domain interfaces are excluded by default. This procedure results in a high number of interfaces that are still diverse enough for statistical analysis.
What is meant by 'show conservation' in Jmol?
The conservation of protein sequences is defined by the mutation rates at each amino acid position. For JAIL this information was retrieved from ConSurf. ConSurf is a derived database merging structural and sequence information.
Database scheme
The search form accepts the following identifiers:
PDB-Id, e.g. 1aay
SCOP-Id, e.g. d1az0a_
EC-Number, e.g. 3.1.1
Accession number, e.g. P03697
Protein name, e.g. capsid protein
The search can be restricted to the following interface types: Domain/Domain (SCOP), Chain/Chain, Protein/Nucleic, Biounit/Biounit, or None. A full-text keyword search and a SCOP text search are also provided, and the results can be limited to interfaces having a given location (intra, inter, or don't care).
MMDB
Experimentally resolved structures of proteins, RNA and DNA, derived from the Protein Data Bank (PDB), with value-added features such as explicit chemical graphs, computationally identified 3D domains (compact substructures) that are used to identify similar 3D structures, as well as links to literature, similar sequences, information about chemicals bound to the structures, and more. These connections make it possible, for example, to find 3D structures for homologs of a protein sequence of interest, then interactively view the sequence-structure relationships, active sites, bound chemicals, journal articles, and more.
Three-dimensional structures are now known within many protein families, and it is quite likely, in searching a sequence database, that one will encounter a homolog with known structure. The goal of Entrez's 3D-structure database is to make this information, and the functional annotation it can provide, easily accessible to molecular biologists. To this end Entrez's search engine provides three powerful features: (i) Sequence and structure neighbors: one may select all sequences similar to one of interest, for example, and link to any known 3D structures. (ii) Links between databases: one may search by term matching in MEDLINE, for example, and link to 3D structures reported in these articles. (iii) Sequence and structure visualization: identifying a homolog with known structure, one may view molecular-graphic and alignment displays to infer approximate 3D structure. In this article we focus on two features of Entrez's Molecular Modeling Database (MMDB) not described previously: links from individual biopolymer chains within 3D structures to a systematic taxonomy of organisms represented in molecular databases, and links from individual chains (and compact 3D domains within them) to structure neighbors, other chains (and 3D domains) with similar 3D structure. MMDB may be accessed at http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Structure
SUPERFAMILY is a database of structural and functional annotation for all proteins and genomes [1][2][3][4][5]. The SUPERFAMILY annotation is based on a collection of hidden Markov models, which represent structural protein domains at the SCOP superfamily level [6]. A superfamily groups together domains which have an evolutionary relationship. The annotation is produced by scanning protein sequences from completely sequenced genomes against the hidden Markov models.
For each protein you can
Submit sequences for SCOP classification
View domain organisation sequence alignments and protein sequence details
For each genome you can
Examine superfamily assignments phylogenetic trees domain organisation lists and
networks
Check for over- and under-represented superfamilies within a genome
For each superfamily you can
Inspect SCOP classification, functional annotation, Gene Ontology annotation, InterPro abstract and genome assignments
Explore taxonomic distribution of a superfamily across the tree of life
All annotation models and the database dump are freely available for download to everyone
Purpose
SUPERFAMILY classifies amino acid sequences into known structural domains especially
into SCOP superfamilies The superfamilies are groups of proteins which have structural
evidence to support a common evolutionary ancestor but may not have detectable
sequence homology
Major Features
Sequence search
Submit your protein or DNA sequence for SCOP superfamily and family level classification
Keyword search
Search for superfamily, family or species names plus sequence, SCOP, PDB or hidden Markov model IDs
Domain assignments
Domain assignments, alignments and architectures for completely sequenced eukaryotic and prokaryotic organisms, plus sequence collections
Comparative genomics tools
Browse unusual (over- and under-represented) superfamilies and families, adjacent domain pair lists and graphs, unique domain pairs, domain combinations, domain architecture co-occurrence networks and domain distribution across taxonomic kingdoms for each organism
Genome statistics
For each genome: number of sequences, number of sequences with assignment, percentage of sequences with assignment, percentage total sequence coverage, number of domains assigned, number of superfamilies assigned, number of families assigned, average superfamily size, percentage produced by duplication, average sequence length, average length matched, number of domain pairs and number of unique domain architectures
Gene Ontology
Domain-centric Gene Ontology (GO), automatically annotated by Hai Fang
Phenotype Ontology
Domain-centric phenotype/anatomy ontology including Disease Ontology, Human Phenotype, Mouse Phenotype, Worm Phenotype, Yeast Phenotype, Fly Phenotype, Fly Anatomy, Zebrafish Anatomy, Xenopus Anatomy, Arabidopsis Plant
Superfamily annotation
InterPro abstracts for 1052 superfamilies and Gene Ontology (GO) annotation for 763 superfamilies
Functional annotation
Functional annotation of SCOP 1.73 superfamilies by Christine Vogel
Phylogenetic trees
Trees are generated using heuristic parsimony methods and are based on protein domain architecture data for all genomes in SUPERFAMILY. Genome combinations or specific clades can be displayed as individual trees
Similar domain architectures
Find the 10 domain architectures which are most similar to a domain architecture of interest
Hidden Markov models
Produce SCOP domain assignments for your sequences using the SUPERFAMILY models. HMM visualisation by Martin Madera, e.g. model 0045110
Profile comparison
Find remote domain matches when the HMM search fails to find a significant match. Profile comparison (PRC) for aligning and scoring two profile hidden Markov models, by Martin Madera
Web services
Distributed Annotation Server and linking to SUPERFAMILY
Downloads
Sequences, assignments, models, MySQL database and scripts, updated weekly
The SUPERFAMILY database provides protein domain assignments, at the SCOP superfamily level, for the predicted protein sequences in over 400 completed genomes. A superfamily groups together domains of different families which have a common evolutionary ancestor based on structural, functional and sequence data. SUPERFAMILY domain assignments are generated using an expert curated set of profile hidden Markov models. All models and structural assignments are available for browsing and download from http://supfam.org. The web interface includes services such as domain architectures and alignment details for all protein assignments, searchable domain combinations, domain occurrence network visualization, detection of over- or under-represented superfamilies for a given genome by comparison with other genomes, assignment of manually submitted sequences and keyword searches. In this update we describe the SUPERFAMILY database and outline two major developments: (i) incorporation of family level assignments and (ii) a superfamily-level functional annotation. The SUPERFAMILY database can be used for general protein evolution and superfamily-specific studies, genomic annotation, and structural genomics target suggestion and assessment.
Pfam 27.0 (March 2013, 14831 families)
Pfam also generates higher-level groupings of related families, known as clans. A clan is a collection of Pfam-A entries which are related by similarity of sequence, structure or profile-HMM.
You can find data in Pfam in various ways:
Sequence search: analyze your protein sequence for Pfam matches
View a Pfam family: view Pfam family annotation and alignments
View a clan: see groups of related families
View a sequence: look at the domain organisation of a protein sequence
View a structure: find the domains on a PDB structure
Keyword search: query Pfam by keywords
Jump to: enter any type of accession or ID to jump to the page for a Pfam family or clan, UniProt sequence, PDB structure, etc.
Browse Pfam
You can use the links below to find lists of families, clans or proteomes which begin with the chosen letter (or number). You can also see a list of Pfam families which are new to this release, or the list of the twenty largest families in terms of number of sequences.
Glossary of terms used in Pfam
These are some of the commonly used terms in the Pfam website
Alignment coordinates
HMMER3 reports two sets of domain coordinates for each profile HMM match. The envelope coordinates delineate the region on the sequence where the match has been probabilistically determined to lie, whereas the alignment coordinates delineate the region over which HMMER is confident that the alignment of the sequence to the profile HMM is correct. Our full alignments contain the envelope coordinates from HMMER3.
Architecture
The collection of domains that are present on a protein.
Clan
A collection of related Pfam entries. The relationship may be defined by similarity of sequence, structure or profile-HMM.
Domain
A structural unit.
Domain score
The score of a single domain aligned to an HMM. Note that for HMMER2, if there was more than one domain, the sequence score was the sum of all the domain scores for that Pfam entry. This is not quite true for HMMER3.
DUF
Domain of unknown function.
Envelope coordinates
See Alignment coordinates.
Family
A collection of related protein regions.
Full alignment
An alignment of the set of related sequences which score higher than the manually set threshold values for the HMMs of a particular Pfam entry.
Gathering threshold (GA)
Also called the gathering cut-off, this value is the search threshold used to build the full alignment. The gathering threshold is assigned by a curator when the family is built. The GA is the minimum score a sequence must attain in order to belong to the full alignment of a Pfam entry. For each Pfam HMM we have two GA cutoff values, a sequence cutoff and a domain cutoff.
HMMER
The suite of programs that Pfam uses to build and search HMMs. Since Pfam release 24.0 we have used HMMER version 3 to make Pfam. See the HMMER site for more information.
Hidden Markov model (HMM)
An HMM is a probabilistic model. In Pfam we use HMMs to transform the information contained within a multiple sequence alignment into a position-specific scoring system. We search our HMMs against the UniProt protein database to find homologous sequences.
HMMER3
The suite of programs that Pfam uses to build and search HMMs. See the HMMER site for more information.
iPfam
A resource that describes domain-domain interactions that are observed in PDB entries. Where two or more Pfam domains occur in a single structure, it analyses them to see if they are close enough to form an interaction. If they are close enough, it calculates the bonds forming the interaction.
Metaseq
A collection of sequences derived from various metagenomics datasets.
Motif
A short unit found outside globular domains.
Noise cutoff (NC)
The bit scores of the highest scoring match not in the full alignment.
Pfam-A
An HMM-based, hand-curated Pfam entry which is built using a small number of representative sequences. We manually set a threshold value for each HMM and search our models against the UniProt database. All of the sequences which score above the threshold for a Pfam entry are included in the entry's full alignment.
Pfam-B
An automatically generated alignment which is formed by taking a cluster of sequences from the ADDA database and removing Pfam-A residues from them. Since Pfam-B families are automatically generated, we recommend that you verify that the sequences in a Pfam-B family are related using other methods, such as BLAST. For Pfam 24.0 we have made HMMs for the first (and therefore largest) 20000 Pfam-B families. Users can search their sequences against the Pfam-B HMMs in addition to the Pfam-A HMMs when performing both single-sequence searches and batch searches on the website.
Posterior probability
HMMER3 reports a posterior probability for each residue that matches a match or insert state in the profile HMM. A high posterior probability shows that the alignment of the amino acid to the match/insert state is likely to be correct, whereas a low posterior probability indicates that there is alignment uncertainty. This is indicated on a scale, with 10 being the highest certainty down to 1 being complete uncertainty. Within Pfam we display this information as a heat map view, where green residues indicate high posterior probability and red ones indicate a lower posterior probability.
Repeat
A short unit which is unstable in isolation but forms a stable structure when multiple copies are present.
Seed alignment
An alignment of a set of representative sequences for a Pfam entry. We use this alignment to construct the HMMs for the Pfam entry.
Sequence score
The total score of a sequence aligned to an HMM. If there is more than one domain, the sequence score is the sum of all the domain scores for that Pfam entry. If there is only a single domain, the sequence and the domain scores for the protein will be identical. We use the sequence score to determine whether a sequence belongs to the full alignment of a particular Pfam entry.
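The relationship between domain scores, the sequence score and the two gathering cutoffs can be illustrated with a small Python sketch. It follows the simple "sum of domain scores" description given in this glossary (which, as noted under Domain score, is not quite how HMMER3 computes the sequence score), and the way the two cutoffs are combined here is an assumption. The function names and the example scores are hypothetical.

```python
def sequence_score(domain_scores):
    """Sequence score as described above: the sum of all domain bit scores."""
    return sum(domain_scores)


def domains_in_full_alignment(domain_scores, seq_cutoff, dom_cutoff):
    """Return the domain scores that would enter the full alignment.

    Assumed rule: the sequence score must reach the sequence GA cutoff,
    and each individual domain must reach the domain GA cutoff.
    """
    if sequence_score(domain_scores) < seq_cutoff:
        return []
    return [score for score in domain_scores if score >= dom_cutoff]


# Invented bit scores for two hypothetical sequences matching one Pfam entry.
passing = domains_in_full_alignment([18.5, 7.0], seq_cutoff=25.0, dom_cutoff=10.0)
failing = domains_in_full_alignment([12.0, 11.0], seq_cutoff=25.0, dom_cutoff=10.0)
print(passing, failing)
```

In the first case the sequence as a whole passes (25.5 bits) but only the stronger domain is kept; in the second, both domains exceed the domain cutoff, yet the sequence score of 23.0 falls below the sequence cutoff, so nothing is included.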
Trusted cutoff (TC)
The bit scores of the lowest scoring match in the full alignment
SMART (a Simple Modular Architecture Research Tool) allows the identification and annotation of genetically mobile domains and the analysis of domain architectures. More than 500 domain families found in signalling, extracellular and chromatin-associated proteins are detectable. These domains are extensively annotated with respect to phyletic distributions, functional class, tertiary structures and functionally important residues. Each domain found in a non-redundant protein database, as well as search parameters and taxonomic information, are stored in a relational database system. User interfaces to this database allow searches for proteins containing specific combinations of domains in defined taxa.
Features
You can use SMART in two different modes: normal or genomic. The main difference is in the underlying protein database used. In Normal SMART, the database contains Swiss-Prot, SP-TrEMBL and stable Ensembl proteomes. In Genomic SMART, only the proteomes of completely sequenced genomes are used: Ensembl for metazoans and Swiss-Prot for the rest.
The protein database in Normal SMART has significant redundancy, even though identical proteins are removed. If you use SMART to explore domain architectures, or want to find exact domain counts in various genomes, consider switching to Genomic mode. The numbers in the domain annotation pages will be more accurate, and there will not be many protein fragments corresponding to the same gene in the architecture query results. Remember that we are exploring a limited set of genomes, though.
Different color schemes are used to easily identify the mode we are in.
Alignment
Representation of a prediction of the amino acids in tertiary structures of homologues that overlay in three dimensions. Alignments held by SMART are mostly based on published observations (see domain annotations for details), but are updated and edited manually.
Alignment block
Ungapped alignments that usually represent a single secondary structure.
Bits scores
Alignment scores are reported by HMMer and BLAST as bits scores. The likelihood that the query sequence is a bona fide homologue of the database sequence is compared to the likelihood that the sequence was instead generated by a random model. Taking the logarithm (to base 2) of this likelihood ratio gives the bits score.
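The definition of the bits score as a base-2 log-likelihood ratio can be written out directly. The sketch below is only an illustration of that definition; the function names and the example likelihoods are hypothetical, and real tools compute the score from their scoring model rather than from two explicit probabilities.

```python
import math


def bits_score(likelihood_homologue, likelihood_random):
    """Base-2 logarithm of the likelihood ratio (homologue model vs random model)."""
    return math.log2(likelihood_homologue / likelihood_random)


def likelihood_ratio(bits):
    """Inverse relation: how many times more likely the homologue model is."""
    return 2.0 ** bits


# Invented likelihoods: the sequence is 2**20 times more likely under the
# homologue model than under the random model.
score = bits_score(2.0 ** -10, 2.0 ** -30)
print(score, likelihood_ratio(score))
```

Each additional bit therefore corresponds to a doubling of the likelihood ratio, and a score of 0 means the two models explain the sequence equally well.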
BLAST: Basic local alignment search tool
An excellent database searching tool developed at the National Center for Biotechnology Information (NCBI) ([1] [2] [3] [4]). SMART uses NCBI-BLAST for detection of outlier homologues and homologues of known structure. WU-BLAST is used for nrdb searches with user-supplied sequences.
Cellular Role
Chromatin-associated: Chromatin is the tangled fibrous complex of DNA and protein within a eukaryotic nucleus.
Interaction (with the environment): Molecules that sense cellular environmental change such as osmolarity, light flux, acidity, ion concentration, etc.
Metabolic: Enzymes that catalyze reactions in living cells that transform organic molecules.
Replication: The process of making an identical copy of a section of duplex (double-stranded) DNA, using existing DNA as a template for the synthesis of new DNA strands.
Signalling: Proteins that participate in a pathway that is initiated by a molecular stimulus and terminated by a cellular response.
Transport: Transporters (permeases) are proteins that straddle the cell membrane and carry specific nutrients, ions, etc. across the membrane.
Translation: The process in which the genetic code carried by messenger RNA directs the synthesis of proteins from amino acids.
Transcription: The synthesis of an RNA copy from a sequence of DNA (a gene); the first step in gene expression.
Coiled coils
Intimately-associated bundles of long alpha-helices ([1] [2] [3]). Coiled coils are detected in SMART using the method of Lupas et al. ([4], COILS home). Coiled coil predictions are indicated on the second line in SMART's graphical output.
Domain
Conserved structural entities with distinctive secondary structure content and a hydrophobic core. In small disulphide-rich and Zn2+-binding or Ca2+-binding domains, the hydrophobic core may be provided by cystines and metal ions, respectively. Homologous domains with common functions usually show sequence similarities.
Domain composition
Proteins with the same domain composition have at least one copy of each of the domains of the query.
Domain organisation
Proteins having all the domains of the query in the same order (additional domains are allowed).
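The difference between domain composition and domain organisation can be made concrete with a short Python sketch. It is not SMART's implementation: the function names are hypothetical, and reading "additional domains are allowed" as "extra domains may appear anywhere, including between the query domains" is an assumption.

```python
def same_composition(query, protein):
    """True if the protein has at least one copy of each domain in the query."""
    return set(query) <= set(protein)


def same_organisation(query, protein):
    """True if all query domains occur in the protein in the same order.

    Extra domains are allowed anywhere (assumed reading), so this is an
    ordered-subsequence test.
    """
    remaining = iter(protein)
    # `domain in remaining` advances the iterator until it finds the domain,
    # so later query domains can only match further along the protein.
    return all(domain in remaining for domain in query)


# Domain architectures written N- to C-terminus; names used only as examples.
query = ["SH3", "SH2", "TyrKc"]
with_extra = ["PH", "SH3", "SH2", "TyrKc"]
shuffled = ["SH2", "SH3", "TyrKc"]

print(same_composition(query, with_extra), same_organisation(query, with_extra))
print(same_composition(query, shuffled), same_organisation(query, shuffled))
```

A protein with the same organisation always has the same composition, but not the other way round: the shuffled example contains every query domain yet fails the ordering test.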
Entrez
A WWW-based system that allows easy retrieval of sequence structure molecular biology andliterature data (Entrez) SMARTs domain annotation pages contain links to the Entrez systemthereby providing extensive literature structure and sequence information
E-value
This represents the number of sequences with a score greater-than or equal to X expectedabsolutely by chance The E-value connects the score (X) of an alignment between a user-supplied sequence and a database sequence generated by any algorithm with how manyalignments with similar or greater scores that would be expected from a search of a randomsequence database of equivalent size Since version 20 E-values are calculated using HiddenMarkov Models leading to more accurate estimates than before
Extracellular Domains
Domain families that are most prevalent in proteins outside of the cytoplasm and the nucleus
Gap
A position in an alignment that represents a deletion within one sequence relative to another. Gap penalties are requirements for alignment algorithms in order to reduce excessively gapped regions. Gaps in alignments represent insertions that usually occur in protruding loops or beta-bulges within protein structures.
Genomic database
Protein database used in SMART's Genomic mode. It contains data from completely sequenced genomes only. Ensembl data is used for Metazoan genomes and Swiss-Prot for others. A complete list of genomes in the database is available.
HMM Hidden Markov model
HMMs are statistical models of the sequence consensus of a homologous family (see the docu). A particular class of HMMs has been shown to be equivalent to generalised profiles (8867839).
Applications of HMMs to sequence analysis are nicely provided by HMMer and SAM.
HMM consensus
The HMM consensus is a one-line summary of the corresponding HMM. The amino acid shown for the consensus is the highest-probability amino acid at that position according to the HMM. Capital letters mean highly conserved residues (probability > 0.5 for protein models) (modified from the HMMer User's Guide).
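The consensus rule described above can be sketched in a few lines of Python. This is an illustration only, assuming the match-state emission probabilities are available as one dictionary per position; the data structure, function name and toy model are assumptions and do not reflect HMMer's own file format or code.

```python
def hmm_consensus(emissions, threshold=0.5):
    """Build a one-line consensus from per-position emission probabilities.

    emissions: list of dicts mapping residue letter -> probability.
    The most probable residue is shown at each position, in upper case
    if its probability exceeds the threshold and in lower case otherwise.
    """
    consensus = []
    for column in emissions:
        residue, prob = max(column.items(), key=lambda item: item[1])
        consensus.append(residue.upper() if prob > threshold else residue.lower())
    return "".join(consensus)


# Toy three-position model (made-up numbers).
toy_model = [
    {"G": 0.9, "A": 0.1},              # highly conserved glycine
    {"K": 0.4, "R": 0.35, "Q": 0.25},  # weakly conserved position
    {"S": 0.5, "T": 0.5},              # exactly 0.5 is not above the threshold
]
```

For the toy model, only the first position is reported in upper case, because it is the only one whose top probability is strictly greater than 0.5.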
HMMer
The HMMer package ([1], [2]) provides multiple alignment and database searching capabilities. There are several programs in the package (see the docu), including one (hmmfs) that searches databases for non-overlapping LOCAL similarities (i.e. that match across at least part of the HMM) and another (hmmls) that searches databases for non-overlapping GLOBAL similarities (i.e. that match across the full HMM). These correspond approximately to profile-based searches using negative and positive profiles, respectively (see WiseTools). Database searches using hmmls or hmmfs provide alignment scores as bit scores.
Homology
Evolutionary descent from a common ancestor due to gene duplication
Intracellular Domains
Domain families that are most prevalent in proteins within the cytoplasm
Localisation
Numbers of domains that are thought, from SwissProt annotations, to be present in different cellular compartments (cytoplasm, extracellular space, nucleus and membrane-associated) are shown in annotation pages.
Motif
Sequence motifs are short conserved regions of polypeptides. Sets of sequence motifs need not necessarily represent homologues.
NRDB non-redundant database
A database that contains no identical pairs of sequences. It can contain multiple sequences originating from the same gene (fragments, alternative splicing products). SMART in Standard mode uses such a database.
ORF
Open reading frame
Outlier homologues
These are often difficult to detect using HMM methodology. A complementary approach to their detection is to query a database of sequences taken from multiple sequence alignments using BLAST. Selecting this option will also activate searches against sequence databases derived from proteins of known structure. A simple BLAST search of the PDB is performed, together with a search of RPS-Blast profiles derived from SCOP. These profiles were kindly provided by Steffen Schmidt (see Schmidt et al., J Chem Inf Comput Sci 2002 (42) 405-7).
P-value
This represents the probability that, given a database of a particular size, random sequences score higher than a value X. P-values are generated by the BLAST algorithm that has been integrated into SMART.
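The E-value and P-value entries describe two views of the same chance expectation. A small sketch of how they relate, assuming the Poisson approximation commonly used in BLAST statistics (P = 1 - exp(-E)); the function names are illustrative, and SMART's own calculation is not described in the text above.

```python
import math


def p_value_from_e_value(e_value):
    """Probability of at least one chance hit, given the expected number E.

    Assumes the number of chance hits is Poisson distributed with mean E.
    """
    return -math.expm1(-e_value)  # 1 - exp(-E), numerically stable for small E


def e_value_from_p_value(p_value):
    """Inverse of the relation above: E = -ln(1 - P)."""
    return -math.log1p(-p_value)
```

Under this assumption the two values are nearly identical when E is small (for example below 0.01), and they diverge as E approaches or exceeds 1, where P saturates towards 1.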
PDB protein data bank
PDB is an archive of experimentally determined three-dimensional structures (Brookhaven Nat. Labs or EBI). Domain families represented in SMART and in the PDB are annotated as being of known structure; links are provided in SMART to the PDB via PDBsum and MMDB. PDBsum links can be used to access a variety of sequence-based and structure-based tools, whereas MMDB provides access to literature information and structure similarities.
PFAM
Pfam is a database of protein domain families represented as (i) multiple alignments and (ii) HMM-profiles ([1], [2]). Pfam WWW servers allow comparison of user-supplied sequences with the Pfam database (Sanger Center and Washington Univ.). SMART contains a facility to search the Pfam collection using HMMer.
Prokaryotic domains
SMART now also searches for domains found in two-component regulatory systems. These are found mainly in prokaryotes, but a few have also been found in eukaryotes such as yeast and plants.
Profile
A profile is a table of position-specific scores and gap penalties, representing a homologous family, that may be used to search sequence databases (Ref. [1], [2], [3]). In CLUSTAL-W-derived profiles, those sequences that are more distantly related are assigned higher weights ([4], [5], [6]). Issues in profile-based database searching are discussed in Bork & Gibson (1996) [7].
ProfileScan
An excellent WWW server that allows a user to compare a protein or DNA sequence against a database of profiles (located at the ISREC).
PROSITE
This is a dictionary of protein sites and motif patterns. Some SMART domain annotations contain links to PROSITE.
Schnipsel database domain sequence database
Schnipsel is a German word meaning snippet or fragment. The schnipsel database consists of the sequences of all domains found with SMART in NRDB. Outliers of a family often cannot be detected by a profile, yet are detectable by pairwise similarity to one or more established members of a sequence family. Searching against the schnipsel database therefore gives information complementary to the profile searches.
Searched domains
In the first version of SMART, only eukaryotic signalling domains could be searched. In 1998 we extended this set with prokaryotic signalling and extracellular domains. On the input page you can choose whether you want to search only cytoplasmic (prokaryotic + eukaryotic), extracellular or all domains.
Secondary Literature
The secondary literature is derived by the following procedure. For each of the hand-selected papers referenced by a domain, 100 neighbouring papers are retrieved using Medline. If one of these neighbouring papers is referenced from more than two original papers, it is included in the secondary literature list.
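The procedure above amounts to counting how many hand-selected papers list a given paper among their neighbours. A minimal sketch, assuming a caller-supplied lookup function that stands in for the Medline neighbour retrieval (get_neighbours is hypothetical, as are the paper identifiers):

```python
from collections import Counter


def secondary_literature(primary_papers, get_neighbours,
                         n_neighbours=100, min_sources=3):
    """Return papers that neighbour more than two of the primary papers.

    get_neighbours(paper) is a hypothetical stand-in for the Medline
    neighbour lookup; it should return a ranked list of paper identifiers.
    'More than two original papers' is implemented as min_sources=3.
    """
    counts = Counter()
    for paper in primary_papers:
        # Count each neighbour at most once per primary paper.
        counts.update(set(get_neighbours(paper)[:n_neighbours]))
    return sorted(p for p, n in counts.items() if n >= min_sources)


# Toy neighbour table with made-up identifiers.
toy_neighbours = {
    "p1": ["a", "b", "c"],
    "p2": ["a", "b"],
    "p3": ["a", "c"],
}
selected = secondary_literature(["p1", "p2", "p3"], toy_neighbours.get)
```

In the toy table only paper "a" is a neighbour of more than two primary papers, so it is the only entry selected.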
Seed Alignment
Alignment that contains only one of each pair of homologues that are represented in a CLUSTALW-derived phylogenetic tree linked by a branch of length less than a distance of 0.2 (see the related article).
SEG
A program of Wootton & Federhen [1] that detects regions of the query sequence that have low compositional complexity [2].
Sequence ID or ACC
Sequence identifiers or accession codes may be entered via the SMART homepage to initiate a query. You can use either Uniprot or Ensembl sequence identifiers.
Signalling Domains
The original set of domains used in SMART were collected as those that satisfied one or both of two criteria:
1. cytoplasmic domains that possess kinase, phosphatase, ubiquitin ligase or phospholipase enzymatic activities, or those that stimulate GTPase activation or guanine nucleotide exchange;
2. cytoplasmic domains that occur in at least two proteins with different domain organisations, of which one also contains a domain that satisfies criterion 1.
These domains mostly mediate or regulate the transduction of an extracellular signal towards the nucleus, resulting in the initiation of a cellular response. More recently, prokaryotic two-component signalling domains have been added to the SMART set.
SignalP
This program predicts the presence and location of signal peptide cleavage sites in amino acid sequences (SignalP home page).
SMART Simple Modular Architecture Research Tool
Our main goal in providing this tool is to allow automatic identification and annotation of domains in user-supplied protein sequences.
Species
Numbers of domains present in a variety of selected taxa (animal, archaea, bacteria, fungi, plants and protozoa) are shown in annotation pages.
SwissProt
The SwissProt database is an extensively annotated and non-redundant collection of protein sequences. SwissProt annotations have been mined for SMART-derived annotations of alignments.
TMHMM2
This program predicts the location and topology of transmembrane helices (TMHMM2).
Thresholds
For each of the domains found by SMART, a combination of thresholds is used to distinguish between true and false hits. The different thresholds are described in the SMART paper.
WiseTools
A package that is based on database searches using profiles. Profiles may be generated using PairWise and then compared with sequence databases using SearchWise. Scores are generated for alignments that match the whole of the profile (using a positive profile) or else that match at least part of the profile (using a negative profile). Only the top-scoring optimal alignment of each
sequence is reported; hence SMART relies on re-iterating the search for new repeats until none are reported that score above threshold. Score thresholds have been set manually that are considered to represent a score just above the top-scoring true negative. Additional thresholds have been estimated for domains that are repeated in single polypeptides. A more recent package allows comparison of DNA sequences at the level of their conceptual translations, regardless of sequence error and introns (see Wise2).
What's new
Changes from version 6.0 to 7.0
Full text search engine
Perform a full text search of SMART and Pfam domain annotations, plus the complete protein descriptions for Uniprot and Ensembl proteins.
metaSMART
Explore and compare domain architectures in various publicly available metagenomics datasets.
iTOL export and visualization
Domain architecture analysis results can be exported and visualized in interactive Tree Of Life. The new option can be found in the protein list function select list.
User interface cleanup
Various small changes to the UI, resulting in faster and easier navigation.
Changes from version 5.1 to 6.0
Metabolic pathways information
SMART domains and proteins available in the genomic mode now have basic metabolic pathways information. It is generated by mapping our genomic mode protein database to the KEGG orthologous groups.
Updated genomic mode protein database
Greatly expanded genomic mode database now includes sequences from 630 completely sequenced species.
Changes from version 5.0 to 5.1
SMART webservice
You can access SMART using our webservice. Check the WSDL file for details. The webservice is still under active development and only works for sequences and IDs which are in our database (complete Uniref100 and stable Ensembl genomes).
SMART DAS server
SMART DAS server is available at URL http://smart.embl.de/smart/das. It provides all the protein features from our database (SMART domains, signal peptides, transmembrane regions and coiled coils) for all Uniref100 and Ensembl proteins.
If you need help with these services or have questions/feedback, please contact us.
Changes from version 4.1 to 5.0
New protein database in Normal mode
SMART now uses Uniprot as the main source of protein sequences. All Ensembl proteomes (except pre-releases) are also included. To lower redundancy in the database, the following procedure is used:
o only one copy of 100% identical proteins is kept (different IDs are still available)
o each species' proteins are separated
o CD-HIT clustering with a 96% identity cutoff is performed on each species separately
o the longest member of each protein cluster is used as the representative
o only representative cluster members and single proteins (i.e. proteins which are not members of any clusters) are used in all domain architecture queries and for domain counts in the annotation pages
Even though the number of proteins in the database is almost doubled (current version has around 2.9 million proteins), the redundancy should be minimal.
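Two of the steps in the procedure above (collapsing identical sequences and choosing the longest cluster member) can be sketched as follows. The clustering itself is done by CD-HIT and is not reproduced here; the cluster lists are assumed to come from its output, and all function names and sequences are illustrative.

```python
def collapse_identical(proteins):
    """Keep one copy of each 100% identical sequence, remembering all IDs.

    proteins: iterable of (protein_id, sequence) pairs.
    Returns a dict mapping sequence -> list of IDs sharing it.
    """
    ids_by_sequence = {}
    for protein_id, sequence in proteins:
        ids_by_sequence.setdefault(sequence, []).append(protein_id)
    return ids_by_sequence


def pick_representatives(clusters, singletons):
    """Choose the longest member of each cluster, then add unclustered proteins.

    clusters: list of clusters, each a list of (protein_id, sequence) pairs,
              assumed to be parsed from a per-species CD-HIT run.
    singletons: list of (protein_id, sequence) pairs not in any cluster.
    """
    representatives = [
        max(cluster, key=lambda member: len(member[1]))[0] for cluster in clusters
    ]
    return representatives + [protein_id for protein_id, _ in singletons]
```

Only the identifiers returned by pick_representatives would then be used for architecture queries and domain counts, while the mapping from collapse_identical keeps the alternative IDs available.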
Domain architecture invention dating
As a further step from the single domain to the understanding of multi-domain proteins, SMART now predicts the taxonomic class where the concept of a protein, that is its domain architecture, was invented. The domain architecture is defined as the linear order of all SMART domains in the protein sequence. To derive the point of its invention, all proteins with the same domain architecture are mapped onto NCBI's taxonomy. The last common ancestor of all organisms containing at least one protein with the domain architecture is defined as the point of its origin.
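The last-common-ancestor step can be illustrated with a short sketch, assuming each organism's taxonomy is available as a root-to-leaf list of taxon names. The lineages below are simplified for illustration and are not complete NCBI lineages; the function name is an assumption.

```python
def last_common_ancestor(lineages):
    """Return the deepest taxon shared by all root-to-leaf lineages.

    lineages: list of lists of taxon names, each ordered from the root down.
    Returns None if the list is empty or the lineages share no root.
    """
    ancestor = None
    # zip stops at the shortest lineage; walk down while all taxa agree.
    for taxa_at_depth in zip(*lineages):
        if all(taxon == taxa_at_depth[0] for taxon in taxa_at_depth):
            ancestor = taxa_at_depth[0]
        else:
            break
    return ancestor


# Simplified, illustrative lineages for organisms sharing an architecture.
human = ["cellular organisms", "Eukaryota", "Opisthokonta", "Metazoa",
         "Chordata", "Homo sapiens"]
fly = ["cellular organisms", "Eukaryota", "Opisthokonta", "Metazoa",
       "Arthropoda", "Drosophila melanogaster"]
yeast = ["cellular organisms", "Eukaryota", "Opisthokonta", "Fungi",
         "Saccharomyces cerevisiae"]
```

With these toy lineages, an architecture seen only in human and fly would be dated to Metazoa, while one also found in yeast would be dated to Opisthokonta.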
Changes from version 4.0 to 4.1
Two modes of operation: Normal and Genomic
For more details visit the change mode page.
Intrinsic protein disorder prediction
You can now include DisEMBL prediction of protein disorder in your search parameters. The results are displayed as blue regions in protein schematics. DisEMBL's HOTLOOPS and REM465 methods are currently used. Visit the DisEMBL page for more information on the method.
Catalytic activity check
SMART now includes data on specific requirements for catalytic activity for some of our domains (50 catalytic domains at the moment; full list here). If the required amino acids are not present in the predicted domain, it will be marked as 'Inactive'. The domain annotation page will show you the details on which amino acids are missing and the links to relevant literature (Pubmed link).
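A check of this kind can be sketched as a lookup of required residues at given positions. The sketch assumes the required positions have already been mapped onto the predicted domain sequence (1-based), which in practice would need the alignment to the domain model. The function name, positions and residues are hypothetical and are not SMART's actual data.

```python
def missing_catalytic_residues(domain_sequence, required):
    """List required positions whose residue is absent or not allowed.

    domain_sequence: sequence of the predicted domain.
    required: dict mapping a 1-based position -> set of allowed residues.
    Returns a list of (position, observed_residue); '-' marks a position
    beyond the end of the domain. An empty list means 'active'.
    """
    missing = []
    for position, allowed in sorted(required.items()):
        if position <= len(domain_sequence):
            observed = domain_sequence[position - 1]
        else:
            observed = "-"
        if observed not in allowed:
            missing.append((position, observed))
    return missing


# Hypothetical requirement: Asp at position 3, His or Asn at position 5.
toy_requirement = {3: {"D"}, 5: {"H", "N"}}
```

A domain would be flagged as inactive whenever the returned list is non-empty, and the list itself gives the detail of which amino acids are missing.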
Taxonomic trees
Architecture query results are now displayed as simple taxonomic trees. In addition to individual proteins, you can select any taxonomic node (or multiple nodes) and display all the proteins in those nodes. The Evolution section of domain annotation pages is now also represented as a taxonomic tree.
The basic tree contains only several hand-picked representative species, with a link to the full tree.
User interface redesigned
SMART has been completely rewritten and all pages conform to the XHTML 1.0 Strict and CSS Level 2 standards. We recommend a modern standards-compliant browser for the best experience. Mozilla Firefox and Opera are our favorites.
Changes from version 3.5 to 4.0
Intron positions are shown in protein schematics
For proteins that match any of the Ensembl predictions, SMART will show intron positions as vertical coloured lines in graphical representations (see example). This information is retrieved from a pre-calculated mapping of Ensembl gene structures to protein sequences.
The vertical line at the end of the protein is not an actual intron, but a mark to show that intron mapping was performed. If that is the only line, there are no introns annotated. If there is no line at all, there is no data available in Ensembl for that particular sequence.
You can switch off intron display on your SMART preferences page.
Alternative splicing information
Since SMART now incorporates Ensembl genomes, the 'Additional information' page shows a list of alternative splices of the gene encoding the analyzed protein (if there are any). It is possible either to display SMART protein annotation for any of the alternative splices or to get a graphical multiple sequence alignment of all of them.
Orthology information
There are 2 separate sets of orthologs for each Ensembl protein: 1:1 reciprocal best matches in other genomes, and orthologous groups with reciprocal best hits from all genomes analyzed (i.e. each of these proteins has exactly one ortholog in all 6 genomes). This data is displayed on the 'Additional information' page.
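The idea of 1:1 reciprocal best matches between two genomes can be sketched as follows, assuming pairwise similarity scores are already available in both search directions. The score tables and identifiers are made up, and the way SMART actually computes its ortholog sets is not described in the text.

```python
def best_hits(scores):
    """Map each query to its highest-scoring subject.

    scores: dict mapping (query_id, subject_id) -> similarity score.
    """
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Pairs (a, b) where b is a's best hit and a is b's best hit."""
    forward = best_hits(a_vs_b)
    backward = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if backward.get(b) == a)


# Toy score tables for two genomes, A and B (made-up numbers).
a_vs_b = {("a1", "b1"): 200, ("a1", "b2"): 50,
          ("a2", "b1"): 120, ("a2", "b2"): 90}
b_vs_a = {("b1", "a1"): 190, ("b1", "a2"): 110,
          ("b2", "a2"): 95, ("b2", "a1"): 40}
```

In the toy data, a2's best hit is b1, but b1's best hit is a1, so only the pair (a1, b1) qualifies as a reciprocal best match.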
Graphical multiple sequence alignments
Orthologous groups and different alternative splices can be displayed as graphical multiple sequence alignments. Proteins are aligned using ClustalW. Domains, intrinsic features and introns are mapped onto the alignment, with their positions adjusted according to gaps (black boxes).
Changes from version 3.4 to 3.5
Features of all Ensembl genomes are stored in SMART
You can use standard Ensembl protein identifiers (for example ENSP00000264122) and sequences in all queries.
Batch access
Quickly display results for hundreds of sequences. Currently limited to 500 sequences per access, 2000 per day.
Smarter handling of identical sequences
If there are multiple IDs associated with a particular sequence, an extra table will be displayed showing all of them.
Changes from version 3.3 to 3.4
Search structure based profiles using RPS-Blast
Clicking on the 'search schnipsel and structures' checkbox will now also initiate a search of profiles based on scop domain families using RPS-Blast. These profiles were kindly provided by Steffen Schmidt (see Schmidt et al., J Chem Inf Comput Sci 2002 (42) 405-7).
Improved Architecture analysis
In addition to standard Domain selection querying, it is now possible to do queries based on GO (Gene ontology, click here for more info) terms associated with domains. In the first step you get a list of domains matching the GO terms entered. After selecting the domains of interest from the list, proteins containing those domains are displayed. Use the Taxonomic selection box to limit the results based on taxonomic ranges.
Pfam domains are stored in the database
SMART database now contains precomputed results for all Pfam domains. To use Pfam domains in the architecture queries, prepend the domain name with Pfam: (for example TyrKc AND Pfam:Fz AND TRANS).
Try our SMART Toolbar for Mozilla web browser. Click here for more info.
Changes from version 3.2 to 3.3
Fantastic new protein picture generator
Proteins are now displayed as dynamically generated PNG images. This means that you can download the entire protein representation as a single image. You have our permission to do this and use these diagrams in any way that you like; do acknowledge us, though. All domain bubbles have been script-generated using The Gimp and its Perl-Fu extension. The script is GPL'd and is available for download.
Fancy colour alignments
All alignments we generate are coloured in using Leo Goodstadt's excellent CHROMA program. CHROMA is available from here, and it's great. You may experience some problems if you're using a klunky browser. This will be fixed when you change your browser.
Excellent transmembrane domains
These are now calculated using the fine TMHMM2 program (the main web site is here), kindly provided by Anders Krogh and co-workers. You can read about the method here.
Selective SMART
We now store intrinsic features such as transmembrane domains and signal peptides in our database. This means that these can be queried for (e.g. SIGNAL AND TyrKc). The feature names are: signal peptide, SIGNAL; transmembrane domain, TRANS; coiled coil, COIL.
Technical changes
SMART code has been modified to run under the Apache mod_perl module. SMART is now using Apache::DBI for persistent database connections. The database engine has been updated to PostgreSQL 7.1. These changes resulted in significant speed improvements.
Bug fixes
As usual. The show_many_proteins script now uses the POST method, so there is no longer a limit on the number of proteins you can display. Fixed a display problem with proteins having thousands of different representations (if you try to display those, make sure you have a good browser (like Mozilla or Opera :-) ) and a bunch of RAM).
Changes from version 3.1 to 3.2
Start page
There is an indicator of the current database status in the header now. If the database is down or is being updated, you'll know immediately.
Literature
Changed literature identifiers to PMID. New secondary literature generator that parses all neighbouring papers, not just the first 100.
Numerous small bug fixes and improvements.
Changes from version 3.0 to 3.1
Startup page
The start page now includes selective SMART and allows searching for keywords in the annotation of domains.
Schnipsel Blast
The results of a schnipsel Blast search are included in the bubble diagram. Additionally, PDB is searched.
Taxonomic breakdown
When selecting multiple proteins, e.g. via selective SMART or the annotation pages, an overview of the taxonomy of all species (tax break) is offered.
Links
Links don't contain version numbers. This allows stable links from external sources.
Selective SMART
Allows searching for multiple copies of domains and is case-insensitive. You can now search for e.g. Sh3 AND sH3 AND sh3.
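The behaviour described for selective SMART (case-insensitive names, with repeated terms meaning multiple copies) can be sketched as a count comparison. This is an illustration only: it handles AND-only queries like the examples given in the text, and the function names are assumptions rather than SMART's implementation.

```python
import re
from collections import Counter


def parse_and_query(query):
    """Turn 'Sh3 AND sH3 AND sh3' into required copy numbers per name."""
    terms = re.split(r"\s+AND\s+", query.strip())
    return Counter(term.lower() for term in terms if term)


def protein_matches(query, features):
    """True if the protein has at least the requested copies of each term.

    features: list of domain and intrinsic feature names found in a protein
              (for example 'SH3', 'TyrKc', 'SIGNAL', 'TRANS').
    """
    required = parse_and_query(query)
    present = Counter(name.lower() for name in features)
    return all(present[name] >= copies for name, copies in required.items())
```

For instance, the query Sh3 AND sH3 AND sh3 is read as a request for at least three SH3 domains, so a protein with only two copies would not match.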
Annotation
You can align your query sequence to the SMART alignment using hmmalign.
Update of underlying database
SMART now uses PostgreSQL 6.5.2.
Changes from version 2.0 to 3.0
Digest output
SMART now only produces a single diagram, representing a best interpretation of all the annotation that has been performed. A comprehensive summary of the results is also provided in table format.
selective SMART
Selective SMART allows you to look for proteins with combinations of specific domains in different species or taxonomic ranges.
alert SMART
The SMART database gets updated about once a week. If you are interested in specific domains or combinations of them in specific taxonomic ranges, you can use the SMART alerting service. This provides the identities of newly deposited proteins that match your query.
Domain queries
You can ask for proteins having the same domain order or composition as your query protein.
SMARTed Genomes
You can get the result of SMARTing the genomes of Caenorhabditis elegans and Saccharomyces cerevisiae via the annotation pages.
Faster PFAM searches
The PFAM searches now run on a PVM cluster.
Changes from version 1.03 to 2.0
Domain coverage
The original set of signalling domains has now been extended to include extracellular domains.
Default search method is now HMMer
The previously used underlying methodology of SMART (i.e. SWise) tended to over-extend gaps, leading to problems in defining domain borders. Although a SWise search of the SMART database remains possible (using the 'Wise searching SMART database' option available from the Home Page), the default searching method is now HMMer2. This now allows improved statistical estimates (Expectation- or E-values) of the significance of a domain hit.
Complete rewrite of the annotation pages
As with the previous version (1.03), annotation pages provide information concerning domain functions and include hyperlinks to PubMed. However, we now offer automatically derived data showing the taxonomic range and the predicted cellular localisation of proteins containing the domain in question.
Literature database
In addition to the manually derived literature sources given in version 1.03, we now provide automatically derived secondary literature from the annotation pages. These are extracted from PubMed and are cross-linked to additional abstracts.
The CATH domain structure database: new protocols and classification levels give a more comprehensive resource for exploring evolution
Lesley H. Greene, Tony E. Lewis, Sarah Addou, Alison Cuff, Tim Dallman, Mark Dibley, Oliver Redfern, Frances Pearl, Rekha Nambudiry, Adam Reid, Ian Sillitoe, Corin Yeats, Janet M. Thornton and Christine A. Orengo
WHAT'S NEW
We report the latest release (version 3.0) of the CATH protein domain database (http://www.cathdb.info). There has been a 20% increase in the number of structural domains classified in CATH, up to 86 151 domains. Release 3.0 comprises 1110 fold groups and 2147 homologous superfamilies. To cope with the increases in diverse structural homologues being determined by the structural genomics initiatives, more sensitive methods have been developed for identifying boundaries in multi-domain proteins and for recognising homologues. The CATH classification update is now being driven by an integrated pipeline that links these automated procedures with validation steps that have been made easier by the provision of information-rich web pages summarising comparison scores and relevant links to external sites for each domain being classified. An analysis of the population of domains in the CATH hierarchy and several domain characteristics are presented for version 3.0. We also report an update of the CATH Dictionary of homologous structures (CATH-DHS), which now contains multiple structural alignments, consensus information and functional annotations for 1459 well-populated superfamilies in CATH. CATH is directly linked to the Gene3D database, which is a projection of CATH structural data onto ~2 million sequences in completed genomes and UniProt.
GENERAL INTRODUCTION
The numbers of new structures being deposited in the Protein Data Bank (PDB) continue to grow at a considerable rate. In addition, structures being targeted by worldwide structural genomics initiatives are more likely to be novel or only very remotely related to domains previously classified. Only 2% of structures currently solved by conventional crystallography or NMR are likely to adopt novel folds (see Figures 1 and 2). A higher proportion of new folds is expected among the structures solved by structural genomics. Although the influx of more diverse structures and subsequent analysis will inform our understanding of how domains evolve, it has resulted in increasing lags between the numbers of structures being deposited and classified. In response to this situation, we have significantly improved our automated and manual protocols for domain boundary assignment and homologue recognition.
Figure 1. Annual decrease in the percentage of new structures classified in CATH which are observed to possess a novel fold. The raw data for years 1972–2005 was fit to a single exponential equation by nonlinear regression using Sigma Plot (SPSS Version
Figure 2. Annual proportion of protein structures deposited in the PDB which are classified in CATH, rejected or pending classification. The colour scheme reflects different categories of PDB chains. Black: not accepted by the CATH criteria. Red: unprocessed chains
Significant changes have been implemented in the CATH classification protocol to achieve a more highly automated system. A seamless flow of structures between the constituent programs has been achieved by building a pipeline which integrates web services for each major comparison stage in the classification (see Figure 3). Secondly, completely automatic decisions are now being made for new protein chains with close relatives already assigned in the CATH database. There are two situations that preclude the CATH update process from being fully automated. We rely on expert manual curation for particularly challenging protein domain boundary assignments (DomChop stage) and also for classifications of remote folds and homologues (HomCheck stage). These two manual stages will remain an integral part of the system (Figure 3).
Figure 3. Flow diagram of the CATH classification pipeline. This schematic illustrates the processes involved in classifying newly determined structures in CATH. The CATH update protocol workflow from new chain to assigned domain is split into two main processes
In this paper we report our ongoing development of the automated procedures. These critical new features should better enable CATH to keep pace with the PDB (3) and facilitate its development. Key statistics on the domain structure populations and characteristics are also presented.
A REVISED CATH CLASSIFICATION HIERARCHY: CATHSOLID
In order to provide information on the sequence diversity between superfamily members, we have introduced additional levels into the CATH hierarchy. The CATH hierarchical classification scheme now consists of nine levels. Class is derived from secondary structure content, and Architecture describes the gross orientation of secondary structures, independent of connectivity. The Topology level clusters structures into fold groups according to their topological connections and numbers of secondary structures. The Homologous superfamilies cluster proteins with highly similar structures, sequences and/or functions (4,5). The new extension of the CATH classification system now includes five 'SOLID' sequence levels. S, O, L and I further divide domains within the H-level using multi-linkage clustering based on similarities in sequence identity (35, 60, 95 and 100%) (see Table 1). The D-level acts as a counter within the I-level and is appended to the classification hierarchy to ensure that every domain in CATH has a unique CATHsolid identification code (see Table 1). Specific details on the nature of the SOLID-levels can be found in the 'General Information' section of the CATH website, http://www.cathdb.info. CATH only includes experimentally determined protein structures with a 4 Å resolution or better, 40 residues in length or longer, and having 70% or more side chains resolved.
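The inclusion criteria and the four identity thresholds quoted above can be written down as a small sketch. It is a simplification: real S, O, L and I assignments come from multi-linkage clustering over all domains in a superfamily, whereas the second function only compares a single pairwise identity against each threshold. The function names and the handling of structures without a resolution value are assumptions.

```python
# Sequence identity thresholds (percent) for the S, O, L and I levels,
# in the order given in the text.
SOLID_IDENTITY_THRESHOLDS = {"S": 35, "O": 60, "L": 95, "I": 100}


def meets_cath_criteria(resolution_angstrom, n_residues, fraction_side_chains):
    """Apply the three stated inclusion criteria to one structure.

    resolution_angstrom: experimental resolution (4 A or better is required).
    n_residues: length in residues (40 or longer is required).
    fraction_side_chains: fraction of side chains resolved (0.70 or more).
    """
    return (resolution_angstrom <= 4.0
            and n_residues >= 40
            and fraction_side_chains >= 0.70)


def levels_reached(percent_identity):
    """Levels whose identity threshold a single pairwise identity reaches.

    Simplification: a real cluster assignment depends on linkage across
    the whole superfamily, not on one pairwise comparison.
    """
    return [level for level, threshold in SOLID_IDENTITY_THRESHOLDS.items()
            if percent_identity >= threshold]
```

Under this simplification, two domains sharing 72% identity would reach the S and O thresholds but not L or I.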
Table 1. CATH version 3.0 statistics
DOMAIN BOUNDARY ASSIGNMENTS
We have further improved our automated domain boundary prediction method, CATHEDRAL (4). This is used to search a newly determined multi-domain structure against a library of representative structures from different fold groups in the CATH database to recognise constituent domains. CATHEDRAL performs an initial rapid secondary structure comparison between structures, using graph theory to identify putative fold matches, which are then more carefully aligned using a slower, more accurate dynamic programming method. A new scoring scheme has been implemented which combines information on the class of the domains being compared, their sizes, similarity in the structural environments and the number of equivalent residues. A support vector machine is used to combine the different scores and select the best fold match for each putative domain in a new multi-domain protein.
Benchmarking against a set of 964 'difficult' multi-domain chains whose 1593 constituent domains were remotely related to folds in CATH (<35% sequence identity) and originated from 245 distinct fold groups and 462 superfamilies showed that 90% of domains within these chains could be assigned to the correct fold group, and for 78% of them the domain boundaries were within ±15 residues of boundaries assigned by careful manual validation. Larger variations in domain boundaries are often due to the fact that in many families significant structural variation can occur during evolution, so that distant relatives vary considerably in size. If no close relative has been classified in CATH, it is likely that the only CATHEDRAL match will be to a relative with significant structural embellishments, thus making it harder to determine the correct boundaries.
Since domain boundary assignment of remote homologues is one of the most time-consuming stages in the classification, we combine multiple sources of information for each new structure on a web page to guide manual curation. Pages display scores from a range of algorithms, which include structure-based methods [CATHEDRAL (4), SSAP (6,7), DETECTIVE (8), PUU (9), DOMAK (10)], sequence-based methods such as hidden Markov models (HMMs) (11), and relevant literature. These pages are now viewable for information on putative boundaries for new multi-domain structures currently
being classified in CATH (eg httpwwwcathdbinfocgi- bincathChainplchain_id=2g3aA )
For protein chains which are closely related to chains that are already chopped inCATH an automated protocol has been developed (ChopClose) ChopClose identifies
any previously chopped chains that have sufficiently high sequence identity andoverlap with the query chain Using SSAP (6) the query is aligned against each of these chains in turn and in each case the domain boundaries are inherited across thealignment The process of inheriting the boundaries often requires some adjustmentsto be made to account for insertions deletions or unresolved residues If theinheritance from one of the chains meets various criteria (SSAP score ge80 sequenceidentity gt80 RMSD le60 Aring longest end extension le10 residues etc) then theresulting boundaries are used to chop the chain automatically mdash AutoChop For cases where ChopCloses best result does not meet all the criteria for automatic choppingit is provided as support information for a manual domain boundary assignment
Refer to Figure 3 for the location of AutoChop/ChopClose within the CATH update protocol.
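The AutoChop acceptance test can be sketched as a simple threshold check. This is only an illustration, assuming the four criteria quoted in the text are the whole test (the original list ends with "etc.", so the real protocol applies further checks). The class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class InheritedChopping:
    """Summary of one boundary inheritance from a previously chopped chain (hypothetical record)."""
    ssap_score: float             # SSAP structural alignment score (0-100)
    seq_identity: float           # sequence identity to the chopped chain, in percent
    rmsd: float                   # RMSD of the alignment, in Angstroms
    longest_end_extension: int    # longest unaligned end extension, in residues


def passes_autochop(result: InheritedChopping) -> bool:
    """Return True if the four thresholds quoted in the text are all met.

    Assumption: only these four criteria are checked; the source mentions
    additional, unspecified criteria ("etc.").
    """
    return (
        result.ssap_score >= 80
        and result.seq_identity > 80
        and result.rmsd <= 6.0
        and result.longest_end_extension <= 10
    )


good = InheritedChopping(ssap_score=91.2, seq_identity=95.0, rmsd=1.3, longest_end_extension=4)
weak = InheritedChopping(ssap_score=78.0, seq_identity=85.0, rmsd=2.0, longest_end_extension=3)
print(passes_autochop(good), passes_autochop(weak))
```

A result that fails the check would not be discarded; as the text says, it is passed on as support information for manual boundary assignment.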
NEW HOMOLOGUE RECOGNITION METHODS
We have assessed a number of HMM-based protocols for improving homologue recognition. A new protocol (Samosa), exploiting models built using multiple structure alignments to improve accuracy, gives some improvements in sensitivity (4–5%). However, a protocol exploiting an 8-fold expanded HMM library based on sequence relatives of structural domains gives an increase of nearly 10% in sensitivity (12). In addition, HMM–HMM-based approaches have been implemented using the PRC protocol of Madera and co-workers (http://supfam.org/PRC). These allow recognition of extremely remote homologues, some of which are not easily detected by the structure comparison methods (discussed further below). HMM-based database scans developed for the CATH classification protocol are collectively referred to as HMMscan below.
For some very remotely related homologues, confidence in an assignment can be improved by combining information from multiple prediction methods. We have investigated the benefits of using machine learning methods to do this automatically. A neural network was trained using a dataset of 14 000 diverse homologues (<35% sequence identity) and 14 000 non-homologous pairs, with data from different homologue comparison methods, including structure comparison (CATHEDRAL, SSAP), sequence comparison (HMM–HMM) and information on functional similarity. The latter was obtained by comparing EC classification codes between close relatives of the distant homologues and using a semantic similarity scoring scheme for comparing GO terms, based on a method developed by Lord et al. (13). On a separate validation set of 14 000 homologous pairs and 14 000 non-homologous pairs, 97% of the homologues can be recognised at an error rate of <4%.
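The validation figures quoted here (fraction of homologues recognised at a given error rate) can be computed from classifier scores as sketched below. This is a minimal illustration, not the authors' evaluation code. It assumes that "error rate" means the fraction of non-homologous pairs scoring at or above the decision threshold, which the text does not define, and the function name and toy scores are hypothetical.

```python
def sensitivity_and_error_rate(homolog_scores, non_homolog_scores, threshold):
    """Return (sensitivity, error_rate) for a score threshold.

    sensitivity: fraction of true homologous pairs scoring >= threshold.
    error_rate:  fraction of non-homologous pairs scoring >= threshold
                 (an assumed definition; the source does not spell it out).
    """
    true_pos = sum(score >= threshold for score in homolog_scores)
    false_pos = sum(score >= threshold for score in non_homolog_scores)
    return true_pos / len(homolog_scores), false_pos / len(non_homolog_scores)


# Toy scores standing in for neural-network outputs on a validation set.
homologs = [0.9, 0.8, 0.7, 0.2]
non_homologs = [0.1, 0.3, 0.75, 0.05]
print(sensitivity_and_error_rate(homologs, non_homologs, threshold=0.5))
```

Sweeping the threshold and reading off the sensitivity at the point where the error rate first drops below 4% would give a figure comparable to the one reported.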
NEW UPDATE PROTOCOL
Automatic methods
Previously, CATH data was generated using a group of independent
programs and flat files. Over the past two years we have developed an
update protocol for CATH that is driven by a suite of programs with a
central library and a PostgreSQL database system. A classification
pipeline has been established which links, in a completely automated
fashion, the different programs that analyse the sequences and structures
of both protein chains and domains. The CATH update protocol can
essentially be divided into two parts: domain boundary assignment and
domain homology classification (see Figure 3). The aim of the protocol is
to minimise manual assignment and provide as much support as
possible when manual validation is necessary.
Processing in both parts of the classification protocol is similar, requiring related meta-data and the triggering of the same automated algorithms. Methods include pairwise sequence similarity comparisons and scans by other homologue detection or fold recognition algorithms, such as HMMscan and CATHEDRAL, that provide data for either manual or automated assignment. Many of the automated steps in the protocol have been established as a web service, and the pipeline integrates automated steps together with 'holding stages', in which domains are held prior to processing and await the completion of manual validation of predictions (see below).
Web pages to support manual validation
For each manual stage in the classification (domain boundary assignment, DomChop, and homologue recognition, HomCheck), we have developed a suite of web pages bringing together all available meta-data from prediction algorithms (e.g. DBS, CATHEDRAL, HMMscan for DomChop; CATHEDRAL, HMMscan for HomCheck) and information from the literature and from other family classifications with relevant data (e.g. Pfam). For each protein or domain shown on the pages, information on the statistical significance of matches is presented. The web pages will shortly be made viewable and will provide interim data on protein chains and domains not fully classified in CATH for biologists interested in any entries pending classification.
OVERVIEW OF THE CURRENT RELEASE (VERSION 3.0)
Assigning domain boundaries and relationships between protein structures is computationally challenging. Since the last CATH release (version 2.6), the number of domains in the CATH database has increased by 20% in version 3.0 and now totals 86 151. This is a more than 10-fold increase in the number of domains classified in CATH since its creation. Improvements in automation, and also in the web-based resources used to aid manual validation, have allowed us to increase the proportion of hard-to-classify structures processed in CATH, and this is reflected in a significant increase in the proportion of new folds in the database, now more than 1000. The detailed breakdown of numbers of domains in the nine CATH levels is given in Table 1. We conducted an analysis of the domains in version 3.0 and have derived statistics for several fundamental features.
Percentage of new topologies
An analysis of the percentage of new folds arising from the early 1970s to the present is shown in Figure 1. The number of new folds has been decreasing over time with respect to the number of new structures being deposited, and it can be seen that currently approximately 2% of new structures classified in CATH are observed to be novel folds. For comparison, the number of domain structures solved over time is also graphically represented in Figure 1.
Number of domains within a protein chain
Integral to the construction of the CATH database is designating domain boundaries. We conducted an analysis of the number of chains versus the number of domains in a chain. It is interesting to note that 64% of all protein structures currently solved and classified in CATH are single-domain chains (data not shown). The next most prevalent are two-domain chains (27%), and following this we find that the number of chains containing three or more domains rapidly decreases. The average size of the single-domain chains is 159 residues.
The CATH Dictionary of Homologous Superfamilies (CATH-DHS)
The CATH-DHS has also been recently updated. Data on structural similarity and superfamily variability is presented as a significant update to the Dictionary of Homologous Superfamilies (DHS) web resource (14). The DHS also provides functional annotations of domains within each H-level (superfamily) in CATH v2.5.1.
For each superfamily, pairwise structural similarity scores between relatives, measured by SSAP, are presented. The DHS now contains 3307 multiple structural alignments for 1459 superfamilies. For each superfamily, multiple alignments are generated for all the relatives and also for subgroups of structurally similar relatives and sequence-similar relatives. Alignments are performed using the residue-based CORA algorithm (15) and presented both as CORAPLOTS (14) and in the form of a 2DSEC diagram (16), alongside co-ordinate data of the superposed structures in PDB format. Sequence representations of the alignments are available to download in FASTA format. In the CORAPLOT images of the multiple alignment, residues in each domain are coloured according to ligand binding and residue type. EquivSEC plots are also shown that describe the variability in orientation and packing between equivalent secondary structures (16).
To identify sequence relatives for CATH superfamilies, sequences from UniProt (17) were scanned against HMMs of all CATH domains (12). Homologous sequences were identified as those hits with an E-value <0.01 and a 60% residue overlap with the CATH domain. This protocol recognised over one million domain sequences in UniProt which could be integrated in the CATH-DHS. The harvested sequences in each superfamily were compared against other relatives by BLAST (18) to determine the pairwise sequence identity and then clustered at appropriate levels of sequence identity (35% and 95%) using multi-linkage clustering. Information and links to other functional databases, ENZYME (19), GO (Gene Ontology Consortium, 2000), KEGG (20), COG (21) and SWISS-PROT (22), are also included by BLASTing the sequences from each superfamily against sequences provided by these resources. Only hits with 95% sequence identity and an 80% residue overlap were used to annotate sequences.
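The hit-acceptance rule for harvesting sequence relatives can be sketched as a two-condition filter. This is a minimal illustration under stated assumptions: "residue overlap" is taken to mean the fraction of the CATH domain's residues covered by the aligned region, the 60% figure is treated as a minimum, and the function and parameter names are hypothetical.

```python
def is_homologous_hit(evalue, aligned_residues, domain_length,
                      max_evalue=0.01, min_overlap=0.60):
    """Accept an HMM hit if E-value < max_evalue and overlap >= min_overlap.

    Assumption: overlap = aligned residues / length of the CATH domain.
    """
    overlap = aligned_residues / domain_length
    return evalue < max_evalue and overlap >= min_overlap


# Hypothetical hits against a 150-residue CATH domain.
print(is_homologous_hit(1e-5, aligned_residues=120, domain_length=150))  # significant, 80% overlap
print(is_homologous_hit(1e-5, aligned_residues=60, domain_length=150))   # significant, only 40% overlap
print(is_homologous_hit(0.5, aligned_residues=140, domain_length=150))   # good overlap, weak E-value
```

The stricter annotation transfer step described in the text (95% identity, 80% overlap) could reuse the same shape of check with an identity threshold in place of the E-value.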
Recent analysis of structural and functional divergence in highly populated CATH superfamilies (>5 structural relatives with <35% sequence identity) has been undertaken using data from the DHS. The 2DSEC algorithm was used to analyse multiple structural alignments of families and identify highly conserved structural cores and secondary structure embellishments, or decorations, to the common core. In some large superfamilies, extensive embellishments were observed outside the core, and although these secondary structure insertions were frequently discontinuous in the protein chain, they were often co-located in 3D space (16). In many cases, manual inspection revealed that the embellishments had aggregated to form a larger structural feature that was modifying the active site of the domain or creating new surfaces for domain or protein interactions. Data collected in the DHS clearly show a relationship between structural divergence within a superfamily, sequence divergence of this superfamily amongst predicted domains in the genomes, and the number of distinct functional groups that can be identified for the superfamily (see Figure 4).
Figure 4. Relationship between sequence variability, structural variability and functional diversity in CATH superfamilies. Structural variation in a CATH superfamily, as measured by the number of diverse structural subgroups (SSAP score <80 between groups).
LATERAL LINKS ACROSS THE CATH HIERARCHY TO CAPTURE EVOLUTIONARY DIVERGENCE AND EXPLORE THE STRUCTURAL CONTINUUM
Our analysis of structural divergence in CATH superfamilies (16) has revealed families where significant changes in the structures had occurred. In some cases, 5-fold differences in the sizes of domains were identified, and sometimes it was apparent that the 'folds' of these very diverse relatives had effectively changed. Therefore, in these superfamilies more than one fold group can be identified, effectively breaking the hierarchical nature of the CATH classification, which implies that each relative within a CATH homologous superfamily should belong to the same CAT fold group.
In addition, an 'all versus all' HMM–HMM scan between all superfamily representatives revealed several cases of extremely remote homologues which had been classified into separate superfamilies and yet match with significant E-values. Structure comparison had failed to detect the relationship between these superfamilies because the structural divergence of the relatives was so extreme, sometimes constituting a change in architecture as well as fold group. In these cases, homology was only suggested by the HMM-based scans and then manually validated by considering functional information and detailed evidence from the literature. In order to capture information on these distant homologies, links have been created between the superfamilies, both on our web pages and in the CATH database. The data can now be found as a link from the CATH homepage (http://www.cathdb.info).
In the near future we also plan to provide web pages presenting cases of significant structural overlaps between superfamilies or fold groups. For these cases we are not currently able to find any additional evidence to support a distant evolutionary relationship, and these examples highlight the recurrence of large structural motifs between some folds and the existence of a structural continuum in some regions of fold space.
FEATURES
The CATH database can be accessed at http://www.cathdb.info. The web interface may be browsed or, alternatively, searched with PDB codes or CATH domain identifiers. There is also a facility for keyword searches. With the version 3.0 release we now make the raw and processed data files available, which include, for example, CATH domain PDB files, sequences and dssp files, and they can be accessed through the CATH database main page. The Gene3D resource can be accessed through the CATH database or directly at http://www.cathdb.info/Gene3D. The DHS can be accessed through the CATH database or directly at http://www.cathdb.info/bsm/dhs.
Structural similarity is assessed using an automatic method (SSAP) (3,4), which scores 100 for
identical proteins and generally returns scores above 80 for homologous proteins. More distantly
related folds generally give scores above 70 (Topology or fold level), though in the absence of any
sequence or functional similarity this may simply represent examples of convergent evolution, reinforcing the hypothesis that there exists a limited number of folds in nature (5,6).
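The SSAP score bands described here can be written as a small lookup. This is a rough sketch: the text says the cut-offs apply "generally", so they are rules of thumb and not hard limits, and the function name and category labels are hypothetical.

```python
def interpret_ssap(score):
    """Map an SSAP score (0-100) onto the rough bands quoted in the text."""
    if score >= 100:
        return "identical"
    if score > 80:
        return "likely homologous"
    if score > 70:
        # May reflect convergent evolution if there is no sequence or functional similarity.
        return "similar fold"
    return "no clear structural relationship"


for s in (100, 85, 75, 50):
    print(s, interpret_ssap(s))
```

A score in the "similar fold" band needs supporting sequence or functional evidence before a common ancestor is assumed.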
Abstract
We report the latest release (version 1.4) of the CATH protein domains database
(http://www.biochem.ucl.ac.uk/bsm/cath). This is a hierarchical classification of 13 359 protein domain structures into evolutionary families and structural groupings. We currently identify 827
homologous families in which the proteins have both structural similarity and sequence and/or
functional similarity. These can be further clustered into 593 fold groups and 32 distinct architectures.
Using our structural classification and associated data on protein functions stored in the database (EC
identifiers, SWISS-PROT keywords and information from the Enzyme database and literature), we have been able to analyse the correlation between the 3D structure and function. More than 96% of
folds in the PDB are associated with a single homologous family. However, within the superfolds,
three or more different functions are observed. Considering enzyme functions, more than 95% of clearly homologous families exhibit either single or closely related functions, as demonstrated by the
EC identifiers of their relatives. Our analysis supports the view that determining structures, for
example as part of a 'structural genomics' initiative, will make a major contribution to interpreting genome data.
Introduction FOR CATH
The CATH classification of protein domain structures was established in 1993 (1) as a
hierarchical clustering of protein domain structures into evolutionary families and structural
groupings, depending on sequence and structure similarity. There are four major levels,
corresponding to protein class, architecture, topology or fold, and homologous family (Fig. 1).
Since 1995, information about these structural groups and protein families has been accessible
over the Web (http://www.biochem.ucl.ac.uk/bsm/cath), together with summary information
about each individual protein structure (PDBsum) (2).
CATH consists of both phylogenetic and phenetic descriptors for protein domain relationships.
At the lowest levels in the hierarchy, proteins are grouped into evolutionary families
(homologous families) for having either significant sequence similarity (≥35% identity) or high
structural similarity and some sequence similarity (≥20% identity).
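The two-route rule for placing proteins in the same homologous family can be sketched as follows. The sequence identity thresholds come from the text. The definition of "high structural similarity" as an SSAP score of at least 80 is an assumption, borrowed from the "SSAP ≥80 and ≥20% sequence identity" criterion quoted later in these notes, and the function name is hypothetical.

```python
def same_homologous_family(seq_identity, ssap_score, ssap_cutoff=80.0):
    """Decide whether a pair qualifies as homologous under the two routes in the text.

    Route 1: significant sequence similarity (>= 35% identity).
    Route 2: high structural similarity (assumed: SSAP >= ssap_cutoff)
             together with some sequence similarity (>= 20% identity).
    """
    if seq_identity >= 35:
        return True
    return ssap_score >= ssap_cutoff and seq_identity >= 20


print(same_homologous_family(seq_identity=42, ssap_score=75))  # route 1
print(same_homologous_family(seq_identity=24, ssap_score=86))  # route 2
print(same_homologous_family(seq_identity=12, ssap_score=90))  # structure alone is not enough
```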
The CATH database of protein structures contains approximately 18 000 domains
organized according to their (C)lass, (A)rchitecture, (T)opology and (H)omologous
superfamily. Relationships between evolutionarily related structures (homologues) within the
database have been used to test the sensitivity of various sequence search methods in order to
identify relatives in GenBank and other sequence databases. Subsequent application of the most
sensitive and efficient algorithms, gapped BLAST and the profile-based method Position-Specific
Iterated Basic Local Alignment Search Tool (PSI-BLAST), could be used to assign structural data to
between 22% and 36% of microbial genomes in order to improve functional annotation and
enhance understanding of biological mechanism. However, on a cautionary note, an analysis of
functional conservation within fold groups and homologous superfamilies in the CATH database
revealed that, whilst function was conserved in nearly 55% of enzyme families, function had
diverged considerably in some highly populated families. In these families, functional properties
should be inherited far more cautiously, and the probable effects of substitutions in key functional
residues must be carefully assessed.
Figure 1
Schematic representation of the (C)lass, (A)rchitecture and (T)opology/fold levels in the CATH
database.
Figure 2
Snapshot of a web page showing data available in the CATH dictionary of homologous superfamilies
for the subtilisin family (CATH id 3.40.50.200). Tables display the PDB codes for non-identical relatives in the family, together with EC identifier codes and information about the enzyme reactions.
The multiple structural alignment shown has been coloured according to secondary structure
assignments (red for helix, blue for strands).
The Architecture level in CATH groups proteins whose folds have similar 3D arrangements of secondary structures (e.g. barrel, sandwich or propeller), regardless of their connectivity, whilst the
top level, Class, simply reflects the proportion of α-helix or β-strand secondary structures. Three
major classes are recognised: mainly-α, mainly-β and α-β, since analysis revealed considerable overlap between the α+β and alternating α/β classes originally described by Levitt and Chothia (7).
Before classification, multidomain proteins are first separated into their constituent folds using a
consensus method which seeks agreement between three independent algorithms (8). Whilst the protocol for updating CATH is largely automatic (9), several stages require manual validation, in
particular establishing domain boundaries in proteins for which no consensus could be reached and checking the relationships of very distant homologues and proteins having borderline fold similarity.
Although there are plans to assign the more regular architectures automatically, all architecture
groupings are currently assigned manually.
A homologous family Dictionary is now available within CATH, which contains functional data, where available, for each protein within a homologous family. This includes EC identifiers, SWISS-PROT keywords and information from the Enzyme database or the literature (Fig. 2). Multiple
structure-based alignments are also available, coloured according to secondary structure assignments
or residue properties, and there are schematic plots showing domain representations annotated by protein-ligand interactions (DOMPLOTS) (A.E. Todd, C.A. Orengo and J.M. Thornton, submitted
to Protein Engng). The topology of each domain is illustrated by schematic TOPS diagrams
(http://www3.ebi.ac.uk/tops; 10).
Figure 3
CATH wheel plot showing the population of homologous families in different fold groups, architectures and classes. The wheel is coloured according to protein class (red, mainly-α; green,
mainly-β; yellow, α-β; blue, few secondary structures). The size of the outer wheel represents the
number of homologous families in CATH, whilst each band in the outer wheel corresponds to a single
fold family. The size of each fold 'band' therefore reflects the number of homologous families having
that fold. It can be seen that most fold families contain a single homologous family. The superfold families are shown as paler bands containing many homologous families. The inner wheel shows the
population of homologous families in the different architectures.
We have also recently set up a Web Server (11) which enables the user to scan the CATH database with a newly determined protein structure and identify possible fold similarities or evolutionary relationships. There are also plans to incorporate sequence searches (using BLAST or PSI-BLAST)
(12) to identify a probable fold for a new sequence.
The latest release of CATH (version 1.4, April 1998) contains 9342 protein chains from the PDB (13),
which divide into 13 359 domain folds. Currently, 32 different architectures are recognised. Since the
last release, three new architectures have been described, including the five-bladed α-β propeller.
Grouping proteins on the basis of sequence, structure and functional similarity gives 827 evolutionary
homologous families (H-level), whilst recognising more distant structural similarity with no
accompanying sequence or function similarity gives rise to 593 different fold groups (T-level).
The population of the different levels in the CATH hierarchy is illustrated by the CATH wheel shown
in Figure 3. It can be seen that several highly populated fold families, which we describe as superfolds (6) as they support a diverse range of sequences and more than three different functions, still account
for nearly 30% of non-homologous structures.
Implications for Structural Genomics
As the sequence databases grow rapidly, the need to interpret these sequences and assign functions to
specific genes becomes increasingly important. Many techniques exist for matching protein sequences and thereby inheriting functional information. However, for very distant homologues there is often no
detectable sequence similarity despite conservation of 3D structure and function. For these cases,
evolutionary relationships, and thereby functions, can only be assigned by comparing the structures.
Therefore, a number of structural genomics initiatives are being proposed (14) which aim to identify
all the folds in nature, with the ultimate goal of being able to predict the function of a new protein from
its known or probable structure. The important questions to ask are: how many more folds do we need
to determine before we have the complete set, and how confident can we be in assigning function between proteins having similar structures?
In the current genomes, on average only 30–46% of sequences can be assigned to a structural family
by recognising sequence similarity to a protein of known structure (15,16). With only ∼600 unique
structures currently in the PDB, compared with ∼20 000 sequence families, it is clear that we still
need to determine many more structures if we are to understand biology at the molecular level.
However, analysis of recently deposited structural data is very revealing. Figure 4a illustrates the
distribution of 2159 new structural domains classified in the 10 months from June 1997 to March
1998. A large proportion of these (79%) were clearly homologous (≥30% identity) to proteins of known structure.
Of the remaining 443 structures (Fig. 4b), corresponding to 'new' sequences, we found only 8% were
novel folds, the remainder resembling a previously determined structure. Many of these, 199 (45%),
could be identified as clear homologues by having significant structure and sequence similarity (SSAP
≥80 and ≥20% sequence identity). A further 169 (38%) were probable homologues, as although the sequence identity was below 20% they had functional similarity and/or gave significant scores using
sequence search methods designed to detect very distant homologues (PSI-BLAST) (12). There
remained a further 40 (9%) proteins which were analogous, i.e. they had the same fold as a previous
entry but neither the sequence nor the function gave definite evidence of a common ancestor.
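The percentages quoted for the 443 'new sequence' structures can be checked with a few lines of arithmetic. The three category counts are taken from the text. The count of novel folds is not stated there, so it is derived here as the remainder, which is an assumption.

```python
total = 443
counts = {
    "clear homologues": 199,
    "probable homologues": 169,
    "analogues": 40,
}
# Derived, not quoted in the text: whatever is left is taken to be novel folds.
counts["novel folds"] = total - sum(counts.values())

percentages = {name: round(100 * n / total) for name, n in counts.items()}
homologue_pct = round(100 * (counts["clear homologues"] + counts["probable homologues"]) / total)

for name, pct in percentages.items():
    print(f"{name}: {counts[name]} ({pct}%)")
print(f"assigned as homologues by structure: {homologue_pct}%")
```

The rounded shares reproduce the 45%, 38%, 9% and 8% in the text. The combined homologue share is the 83% cited later under "Assignment of Function Through Structure".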
At the homologous superfamily level in CATH, a more detailed analysis of enzyme functions showed
that the majority of homologous enzyme families in CATH (>90%) contained proteins for which the
first three EC identifiers were the same. Considering those families where homologues have
significant sequence identity (≥20%) after structural alignment, 95% were found to have a single EC
identifier, whilst for families where proteins have more than 30% sequence similarity we observed
that 98% had a single EC code.
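Comparing EC numbers at a chosen depth, as in the "first three EC identifiers" test above, can be sketched like this. The helper names are hypothetical and the EC numbers in the example are purely illustrative inputs, not data from the analysis.

```python
def ec_prefix(ec_number, levels=3):
    """Return the first `levels` fields of a dotted EC number as a tuple."""
    return tuple(ec_number.split(".")[:levels])


def family_shares_ec(ec_numbers, levels=3):
    """True if every EC number in the family agrees on its first `levels` fields."""
    return len({ec_prefix(ec, levels) for ec in ec_numbers}) == 1


family = ["3.4.21.62", "3.4.21.14"]          # illustrative EC numbers only
print(family_shares_ec(family, levels=3))    # same class, subclass and sub-subclass
print(family_shares_ec(family, levels=4))    # differ in the final (serial) field
```

Setting `levels=4` corresponds to the stricter "single EC code" comparison quoted for the families with higher sequence identity.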
Although assigning function on the basis of homology is common practice, it is clear that some
caution should be exercised, particularly where there is little or no sequence similarity. There are also
some clear examples where homologues with significant sequence similarity perform different
functions. The role of 'gene recruitment' is especially clear in the eye lens proteins, which function as
enzymes in other cellular environments but which are used as structural proteins in this context (17).
The extent of such 'gene recruitment' and context-sensitive function is really not known at this time.
For enzymes, it is clear that catalytic function can change and evolve, usually to act on a different but
related substrate. Similarly, within the lipocalin family (CATH id 2.40.130.10), several proteins are found with very similar structures which bind different fatty acids in the same region at the base of
the β-barrel (e.g. retinol, bilin, biotin).
Nearly half of the homologous families where two or more different EC numbers were observed
belong to the superfolds. This suggests that if a new protein is assigned to a superfold family, more
caution should be used when inheriting functional information, as there appears to be greater tolerance to changes in sequence, and ultimately function, for these families. However, it is interesting to note
that many of these were TIM barrel or Rossmann folds. These are superfolds in which the substrate or
ligand commonly binds in the same place: in the base of the β-barrel for the TIMs and at the
crossover of the polypeptide chain for the doubly wound Rossmann structures.
Assignment of Function Through Structure
One of the reasons for determining structures is to derive more information to facilitate the assignment of function. From our analysis of proteins in CATH, we suggest that structural data can help to assign
function in several ways:
i. The structural data allow recognition of more distant homologues compared with sequence data. In our analysis, 83% of structures with novel sequences could be assigned as homologues in this way (note that such assignment of function is again subject to the caveats
imposed by 'gene recruitment', discussed above).
ii. The structural data allow detailed inspection of the functional site, to suggest if and
how the function may have evolved. For example, if an enzyme has evolved to act on a
different substrate, the binding site may reveal, or at least suggest, possible changes in the
substrate.
iii. For the superfolds, similarity of structure does not necessarily mean similarity of
function. However, the active site/binding sites are often conserved, e.g. in the TIM barrel or
Rossmann fold structures the ligand always binds at the same end of the barrel or sheet.
iv. Some methods have already been developed, and will increasingly be the focus of attention over the next few years, which aim to predict function ab initio from structure. For
example, enzymes can often be identified by the presence of a major cleft which also locates
the active site (18). Similarly, critical surface patches which are used for molecular
recognition in binding other proteins or ligands may be identified using knowledge-based approaches (19,20).
In summary, extrapolating the data from Figure 4 to a new genome, we can expect that, of the 54–70%
of sequences which currently have no obvious sequence matches in the PDB, we will find nearly 80–90% to be homologous to a known family using the structural data alone. For the singlet folds this
will almost certainly reveal some clues to the function. For the superfolds, some folds will reveal
information on the functional class (e.g. enzyme for TIM barrels) or the location of the active site, if
not the specific function. Only 10–20% will be expected to be 'novel' folds. For these, the ab
initio methods referred to above may provide some clues to guide experiments. Therefore, it is clear
that determining structures, as part of a 'structural genomics' initiative for example, will make a
major contribution to interpreting genome data.
JAIL
Just another interface library
Interfaces of macromolecules are a valuable basis to analyse the process of molecular recognition. JAIL classifies not only the interfaces between domain architectures but also those between protein chains and those between proteins and nucleic acids.
Of course not
Gt-alpha/Gi-alpha chimera (PDB-ID 1GOT). Interfaces of 1GOT.
Interacting proteins are difficult to crystallize and are rarely present within the Protein Data Bank.
Nevertheless, it is essential to analyse the interacting parts of the proteins to understand the
process of protein-protein docking.
To overcome this problem we have built up the JAIL database. Since interacting domains
exhibit structural features similar to those of proteins, all known interfaces between interacting
domains of the SCOP database were extracted and classified in JAIL.
Only a part of all protein structures is included in SCOP. In particular, new PDB entries are
not yet annotated. To overcome this problem, all interfaces between protein
chains were additionally calculated and included in the database. This type of interface also comprises
the interacting parts of the assumed biological units. The last important type of interface
provided here is composed of the interacting parts between proteins and nucleic acids.
Overall, the data set consists of about 180 000 interfaces.
JAIL is a convenient tool to browse through the interface library and to analyse single
interfaces. However, more general questions require large-scale analysis. For this purpose,
a detailed form enables the compiling of comprehensive non-redundant data sets for
download.
How is an interface defined?
A complete residue is part of an interface if at least one atom of the amino acid is located
within a range of 4.5 Ångström of any atom of the interacting domain or chain. One part of
an interface must consist of at least 5 C-alpha atoms in the case of protein chains. In the case of nucleic acids, the relevant atoms are calculated on the basis of the phosphorus atoms
of the RNA/DNA backbone.
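The distance-based interface definition can be sketched in plain Python. This is a minimal illustration, not JAIL's implementation. It assumes coordinates are already available as (x, y, z) tuples in Ångström, that "within a range of 4.5" includes exactly 4.5, and that "at least 5 C-alpha atoms" amounts to at least 5 selected residues. All names are hypothetical.

```python
import math

CUTOFF = 4.5  # Angstrom, as quoted in the text


def interface_residues(residues, partner_atoms, cutoff=CUTOFF):
    """Return ids of residues with at least one atom within `cutoff` of any partner atom.

    residues:      dict mapping residue id -> list of (x, y, z) atom coordinates
    partner_atoms: list of (x, y, z) coordinates of the interacting domain or chain
    """
    selected = []
    for res_id, atoms in residues.items():
        if any(math.dist(atom, other) <= cutoff for atom in atoms for other in partner_atoms):
            selected.append(res_id)  # the complete residue counts once any atom is close enough
    return selected


def is_valid_protein_side(selected, min_residues=5):
    """One side of a protein interface needs at least 5 C-alpha atoms (one per residue)."""
    return len(selected) >= min_residues


# Toy coordinates: residue A1 lies near the partner atom, A2 does not.
chain_a = {"A1": [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)], "A2": [(10.0, 0.0, 0.0)]}
chain_b_atoms = [(4.0, 0.0, 0.0)]
side = interface_residues(chain_a, chain_b_atoms)
print(side, is_valid_protein_side(side))
```

The all-against-all loop is quadratic in the number of atoms. For whole PDB entries a spatial index such as a grid or k-d tree would be the usual choice, but the selection rule stays the same.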
What are biounits?
The primary coordinate file deposited in the PDB generally contains one asymmetric unit.
The asymmetric unit is the smallest portion of a crystal structure to which crystallographic
symmetry can be applied to generate one unit cell. The biological molecule (biounit) is believed
to be the functional unit of the protein. Frequently, those units can be assumed or calculated when additional information is available. The biological units of many proteins are deposited
in a separate section of the PDB database and can be used for interface calculations.
In which way are redundant interfaces excluded in the download section?
Redundancy is excluded in two different ways: by structure and by sequence. The
sequence clustering is based on the CD-HIT program. The structural clustering is defined by
the protein families and superfamilies of the SCOP classification. The database classifies
proteins by domain architecture.
Which settings in the download section are best for my own research?
The selection of the datasets depends on the type of interactions (protein-protein or
protein-nucleic acid) and the level of diversity that is desired. A maximum sequence identity of 50%
results in a higher diversity than a setting of 95%. The default settings include interfaces of
domain-domain interactions as well as interfaces between interacting chains. All interfaces of
chains that were already covered by the SCOP domain interfaces are excluded by default.
This procedure results in a high number of interfaces that are still diverse enough for
statistical analysis.
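To see why a lower identity cutoff gives a smaller but more diverse set, here is a toy greedy filter. It is not CD-HIT and does not reproduce its algorithm: the identity measure assumes equal-length, already aligned sequences, and the function names and example sequences are invented for the illustration.

```python
def identity(a, b):
    """Fraction of identical positions between two pre-aligned sequences."""
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


def greedy_representatives(seqs, max_identity):
    """Keep a sequence only if its identity to every representative kept
    so far is below `max_identity` (longest sequences are considered first)."""
    reps = []
    for s in sorted(seqs, key=len, reverse=True):
        if all(identity(s, r) < max_identity for r in reps):
            reps.append(s)
    return reps


seqs = ["ACDEFGHIKL", "ACDEFGHIKV", "ACDEFAAAAA", "WWWWWWWWWW"]
strict = greedy_representatives(seqs, 0.50)  # few, very different sequences
loose = greedy_representatives(seqs, 0.95)   # only near-duplicates removed
```

With these made-up sequences the 50% setting keeps two representatives and the 95% setting keeps all four, which mirrors the trade-off between diversity and data set size described above.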
What is meant by "show conservation" in Jmol?
The conservation of protein sequences is defined by the mutation rates at each amino acid
position. For JAIL, this information was retrieved from ConSurf. ConSurf is a derived
database merging structural and sequence information.
Database scheme
Search for:
PDB-Id, e.g. 1aay
SCOP-Id, e.g. d1az0a_
EC-Number, e.g. 311
Accession number, e.g. P03697
Protein name, e.g. capsid protein
Search in the following interface types:
Domain/Domain (SCOP)
Chain/Chain
Protein/Nucleic
Biounit/Biounit
None
Full-text search: keyword
SCOP text search: keyword
Consider only interfaces having the following location:
intra, inter, don't care
MMDB
Experimentally resolved structures of proteins, RNA and DNA, derived from the Protein Data Bank (PDB), with value-added features such as explicit chemical graphs, computationally identified 3D domains (compact substructures) that are used to identify similar 3D structures, as well as links to literature, similar sequences, information about chemicals bound to the structures, and more. These connections make it possible, for example, to find 3D structures for homologs of a protein sequence of interest, then interactively view the sequence-structure relationships, active sites, bound chemicals, journal articles, and more.
Three-dimensional structures are now known within many protein families, and it is quite likely, in searching a sequence database, that one will encounter a homolog with known structure. The goal of Entrez's 3D-structure database is to make this information, and the functional annotation it can provide, easily accessible to molecular biologists. To this end, Entrez's search engine provides three powerful features: (i) Sequence and structure neighbors: one may select all sequences similar to one of interest, for example, and link to any known 3D structures. (ii) Links between databases: one may search by term matching in MEDLINE, for example, and link to 3D structures reported in these articles. (iii) Sequence and structure visualization: identifying a homolog with known structure, one may view molecular-graphic and alignment displays to infer approximate 3D structure. In this article we focus on two features of Entrez's Molecular Modeling Database (MMDB) not described previously: links from individual biopolymer chains within 3D structures to a systematic taxonomy of organisms represented in molecular databases, and links from individual chains (and compact 3D domains within them) to structure neighbors, other chains (and 3D domains) with similar 3D structure. MMDB may be accessed at http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Structure
SUPERFAMILY is a database of structural and functional annotation for all
proteins and genomes.[1][2][3][4][5]
The SUPERFAMILY annotation is based on a collection of hidden Markov models, which
represent structural protein domains at the SCOP superfamily level.[6] A superfamily groups
together domains which have an evolutionary relationship. The annotation is produced by
scanning protein sequences from completely sequenced genomes against the hidden Markov
models.
For each protein you can:
Submit sequences for SCOP classification
View domain organisation, sequence alignments and protein sequence details
For each genome you can:
Examine superfamily assignments, phylogenetic trees, domain organisation lists and
networks
Check for over- and under-represented superfamilies within a genome
For each superfamily you can:
Inspect SCOP classification, functional annotation, Gene Ontology annotation, InterPro
abstract and genome assignments
Explore taxonomic distribution of a superfamily across the tree of life
All annotation, models and the database dump are freely available for download to everyone.
Purpose
SUPERFAMILY classifies amino acid sequences into known structural domains, especially
into SCOP superfamilies. The superfamilies are groups of proteins which have structural
evidence to support a common evolutionary ancestor but may not have detectable
sequence homology.
Major Features
Sequence search: Submit your protein or DNA sequence for SCOP superfamily and family level classification.
Keyword search: Search for superfamily, family or species names, plus sequence, SCOP, PDB or hidden Markov model IDs.
Domain assignments: Domain assignments, alignments and architectures for completely sequenced eukaryotic and prokaryotic organisms, plus sequence collections.
Comparative genomics tools: Browse unusual (over- and under-represented) superfamilies and families, adjacent domain pair lists and graphs, unique domain pairs, domain combinations, domain architecture co-occurrence networks and domain distribution across taxonomic kingdoms for each organism.
Genome statistics: For each genome: number of sequences, number of sequences with assignment, percentage of sequences with assignment, percentage total sequence coverage, number of domains assigned, number of superfamilies assigned, number of families assigned, average superfamily size, percentage produced by duplication, average sequence length, average length matched, number of domain pairs and number of unique domain architectures.
Gene Ontology: Domain-centric Gene Ontology (GO), automatically annotated, by Hai Fang.
Phenotype Ontology: Domain-centric phenotype/anatomy ontology, including Disease Ontology, Human Phenotype, Mouse Phenotype, Worm Phenotype, Yeast Phenotype, Fly Phenotype, Fly Anatomy, Zebrafish Anatomy, Xenopus Anatomy, Arabidopsis Plant.
Superfamily annotation: InterPro abstracts for 1052 superfamilies and Gene Ontology (GO) annotation for 763 superfamilies.
Functional annotation: Functional annotation of SCOP 1.73 superfamilies by Christine Vogel.
Phylogenetic trees: Trees are generated using heuristic parsimony methods and are based on protein domain architecture data for all genomes in SUPERFAMILY. Genome combinations or specific clades can be displayed as individual trees.
Similar domain architectures: Find the 10 domain architectures which are most similar to a domain architecture of interest.
Hidden Markov models: Produce SCOP domain assignments for your sequences using the SUPERFAMILY models. HMM visualisation by Martin Madera, e.g. model 0045110.
Profile comparison: Find remote domain matches when the HMM search fails to find a significant match. Profile comparison (PRC), for aligning and scoring two profile hidden Markov models, by Martin Madera.
Web services: Distributed Annotation Server and linking to SUPERFAMILY.
Downloads: Sequences, assignments, models, MySQL database and scripts, updated weekly.
The SUPERFAMILY database provides protein domain assignments, at the SCOP superfamily
level, for the predicted protein sequences in over 400 completed genomes. A superfamily groups
together domains of different families which have a common evolutionary ancestor based on
structural, functional and sequence data. SUPERFAMILY domain assignments are generated
using an expert curated set of profile hidden Markov models. All models and structural
assignments are available for browsing and download from http://supfam.org. The web interface
includes services such as domain architectures and alignment details for all protein assignments,
searchable domain combinations, domain occurrence network visualization, detection of over- or
under-represented superfamilies for a given genome by comparison with other genomes,
assignment of manually submitted sequences and keyword searches. In this update we describe
the SUPERFAMILY database and outline two major developments: (i) incorporation of family
level assignments and (ii) a superfamily-level functional annotation. The SUPERFAMILY
database can be used for general protein evolution and superfamily-specific studies, genomic
annotation, and structural genomics target suggestion and assessment.
See groups of related families
Look at the domain organisation of a protein sequence
Find the domains on a PDB structure
Query Pfam by keywords
Enter any type of accession or ID to jump to the page for a Pfam family or clan,
UniProt sequence, PDB structure, etc.
Browse Pfam
You can use the links below to find lists of families, clans or proteomes which begin with the chosen letter (or number). You can also see a list of Pfam families which are new to this release, or the list of the twenty largest families in terms of number of sequences.
Glossary of terms used in Pfam
These are some of the commonly used terms in the Pfam website.
Alignment coordinates
HMMER3 reports two sets of domain coordinates for each profile HMM match. The envelope coordinates delineate the region on the sequence where the match has been probabilistically determined to lie, whereas the alignment coordinates delineate the region over which HMMER is confident that the alignment of the sequence to the profile HMM is correct. Our full alignments contain the envelope coordinates from HMMER3.
Architecture
The collection of domains that are present on a protein.
Clan
A collection of related Pfam entries. The relationship may be defined by similarity of sequence, structure or profile-HMM.
Domain
A structural unit.
Domain score
The score of a single domain aligned to an HMM. Note that, for HMMER2, if there was more than one domain, the sequence score was the sum of all the domain scores for that Pfam entry. This is not quite true for HMMER3.
DUF
Domain of unknown function.
Envelope coordinates
See Alignment coordinates.
Family
A collection of related protein regions.
Full alignment
An alignment of the set of related sequences which score higher than the manually set threshold values for the HMMs of a particular Pfam entry.
Gathering threshold (GA)
Also called the gathering cut-off, this value is the search threshold used to build the full alignment. The gathering threshold is assigned by a curator when the family is built. The GA is the minimum score a sequence must attain in order to belong to the full alignment of a Pfam entry. For each Pfam HMM we have two GA cutoff values: a sequence cutoff and a domain cutoff.
HMMER
The suite of programs that Pfam uses to build and search HMMs. Since Pfam release 24.0 we have used HMMER version 3 to make Pfam. See the HMMER site for more information.
Hidden Markov model (HMM)
An HMM is a probabilistic model. In Pfam we use HMMs to transform the information contained within a multiple sequence alignment into a position-specific scoring system. We search our HMMs against the UniProt protein database to find homologous sequences.
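The idea of turning a multiple sequence alignment into a position-specific scoring system can be illustrated with a much simpler object than a profile HMM. The sketch below builds a log-odds score table from an ungapped alignment; it has no insert or delete states, uses a uniform background and a flat pseudocount, and is therefore only a simplified stand-in, not what HMMER builds. All names are made up for the example.

```python
from collections import Counter
from math import log2

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(alignment, pseudocount=1.0):
    """Return one dict per alignment column mapping residue -> log2 odds
    against a uniform background (an assumption of this sketch)."""
    background = 1.0 / len(ALPHABET)
    profile = []
    for column in zip(*alignment):
        counts = Counter(column)
        total = len(column) + pseudocount * len(ALPHABET)
        profile.append({
            aa: log2(((counts[aa] + pseudocount) / total) / background)
            for aa in ALPHABET
        })
    return profile


def score(profile, sequence):
    """Sum the position-specific scores of an ungapped, same-length sequence."""
    return sum(column[aa] for column, aa in zip(profile, sequence))


profile = build_profile(["ACD", "ACE", "ACD"])
family_like = score(profile, "ACD")
unrelated = score(profile, "WWW")
```

A sequence resembling the alignment scores positive at each conserved column, while an unrelated one scores negative, which is the behaviour the glossary entry describes in words.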
HMMER3
The suite of programs that Pfam uses to build and search HMMs. See the HMMER site for more information.
iPfam
A resource that describes domain-domain interactions that are observed in PDB entries. Where two or more Pfam domains occur in a single structure, it analyses them to see if they are close enough to form an interaction. If they are close enough, it calculates the bonds forming the interaction.
Metaseq
A collection of sequences derived from various metagenomics datasets.
Motif
A short unit found outside globular domains.
Noise cutoff (NC)
The bit score of the highest scoring match not in the full alignment.
Pfam-A
An HMM-based, hand-curated Pfam entry which is built using a small number of representative sequences. We manually set a threshold value for each HMM and search our models against the UniProt database. All of the sequences which score above the threshold for a Pfam entry are included in the entry's full alignment.
Pfam-B
An automatically generated alignment which is formed by taking a cluster of sequences from the ADDA database and removing Pfam-A residues from them. Since Pfam-B families are automatically generated, we recommend that you verify that the sequences in a Pfam-B family are related using other methods, such as BLAST. For Pfam 24.0 we have made HMMs for the first (and therefore largest) 20000 Pfam-B families. Users can search their sequences against the Pfam-B HMMs, in addition to the Pfam-A HMMs, when performing both single-sequence searches and batch searches on the website.
Posterior probability
HMMER3 reports a posterior probability for each residue that matches a match or insert state in the profile HMM. A high posterior probability shows that the alignment of the amino acid to the match/insert state is likely to be correct, whereas a low posterior probability indicates that there is alignment uncertainty. This is indicated on a scale with * being 10, the highest certainty, down to 1 being complete uncertainty. Within Pfam we display this information as a heat map view, where green residues indicate high posterior probability and red ones indicate a lower posterior probability.
Repeat
A short unit which is unstable in isolation but forms a stable structure when multiple copies are present.
Seed alignment
An alignment of a set of representative sequences for a Pfam entry. We use this alignment to construct the HMMs for the Pfam entry.
Sequence score
The total score of a sequence aligned to an HMM. If there is more than one domain, the sequence score is the sum of all the domain scores for that Pfam entry. If there is only a single domain, the sequence and the domain scores for the protein will be identical. We use the sequence score to determine whether a sequence belongs to the full alignment of a particular Pfam entry.
Trusted cutoff (TC)
The bit score of the lowest scoring match in the full alignment.
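The relationship between the sequence score, the gathering threshold (GA), the trusted cutoff (TC) and the noise cutoff (NC) can be made concrete with a small sketch. It follows the glossary wording only: the sequence score is taken as the plain sum of domain scores, and a single GA value is applied to sequence scores. The function names and the numbers are invented for illustration.

```python
def sequence_score(domain_scores):
    """Sequence score as the sum of its domain scores (glossary definition)."""
    return sum(domain_scores)


def trusted_and_noise_cutoffs(scores, gathering_threshold):
    """Split scores at the GA and return (TC, NC):
    TC = lowest score inside the full alignment,
    NC = highest score left outside it."""
    included = [s for s in scores if s >= gathering_threshold]
    excluded = [s for s in scores if s < gathering_threshold]
    tc = min(included) if included else None
    nc = max(excluded) if excluded else None
    return tc, nc


two_domain_protein = sequence_score([12.5, 14.0])
tc, nc = trusted_and_noise_cutoffs([55.2, 31.0, 24.9, 18.3], gathering_threshold=25.0)
```

By construction NC < GA <= TC, so the three values bracket the decision boundary chosen by the curator.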
SMART (a Simple Modular Architecture Research Tool) allows the identification and annotation of genetically mobile domains and the analysis of domain architectures. More than 500 domain families found in signalling, extracellular and chromatin-associated proteins are detectable. These domains are extensively annotated with respect to phyletic distributions, functional class, tertiary structures and functionally important residues. Each domain found in a non-redundant protein database, as well as search parameters and taxonomic information, is stored in a relational database system. User interfaces to this database allow searches for proteins containing specific combinations of domains in defined taxa.
Features
You can use SMART in two different modes: normal or genomic. The main difference is in the underlying protein database used. In Normal SMART, the database contains Swiss-Prot, SP-TrEMBL and stable Ensembl proteomes. In Genomic SMART, only the proteomes of completely sequenced genomes are used: Ensembl for metazoans and Swiss-Prot for the rest.
The protein database in Normal SMART has significant redundancy, even though identical proteins are removed. If you use SMART to explore domain architectures, or want to find exact domain counts in various genomes, consider switching to Genomic mode. The numbers in the domain annotation pages will be more accurate, and there will not be many protein fragments corresponding to the same gene in the architecture query results. Remember, though, that we are exploring a limited set of genomes.
Different color schemes are used to easily identify the mode we are in: Normal mode or Genomic mode.
Alignment
Representation of a prediction of the amino acids in tertiary structures of homologues that overlay in three dimensions. Alignments held by SMART are mostly based on published observations (see domain annotations for details), but are updated and edited manually.
Alignment block
Ungapped alignments that usually represent a single secondary structure.
Bits scores
Alignment scores are reported by HMMer and BLAST as bits scores. The likelihood that the query sequence is a bona fide homologue of the database sequence is compared to the likelihood that the sequence was instead generated by a random model. Taking the logarithm (to base 2) of this likelihood ratio gives the bits score.
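Written as a formula, the definition above reads as follows. The symbols are chosen here for illustration: x is the query sequence, M the homology model and R the random model.

```latex
S_{\text{bits}} = \log_2 \frac{P(x \mid M)}{P(x \mid R)}
```

For example, a sequence that is 1024 times more likely under the homology model than under the random model has a bits score of \log_2 1024 = 10, and every additional bit doubles that likelihood ratio.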
BLAST: Basic local alignment search tool
An excellent database searching tool developed at the National Center for Biotechnology Information (NCBI) ([1] [2] [3] [4]). SMART uses NCBI-BLAST for detection of outlier homologues and homologues of known structure. WU-BLAST is used for nrdb searches with user-supplied sequences.
Cellular Role
Chromatin-associated: Chromatin is the tangled fibrous complex of DNA and protein within a eukaryotic nucleus.
Interaction (with the environment): Molecules that sense cellular environmental change, such as osmolarity, light flux, acidity, ion concentration, etc.
Metabolic: Enzymes that catalyze reactions in living cells that transform organic molecules.
Replication: The process of making an identical copy of a section of duplex (double-stranded) DNA, using existing DNA as a template for the synthesis of new DNA strands.
Signalling: Proteins that participate in a pathway that is initiated by a molecular stimulus and terminated by a cellular response.
Transport: Transporters (permeases) are proteins that straddle the cell membrane and carry specific nutrients, ions, etc. across the membrane.
Translation: The process in which the genetic code carried by messenger RNA directs the synthesis of proteins from amino acids.
Transcription: The synthesis of an RNA copy from a sequence of DNA (a gene); the first step in gene expression.
Coiled coils
Intimately-associated bundles of long alpha-helices ([1] [2] [3]). Coiled coils are detected in SMART using the method of Lupas et al ([4], COILS home). Coiled coil predictions are indicated on the second line in SMART's graphical output.
Domain
Conserved structural entities with distinctive secondary structure content and a hydrophobic core. In small disulphide-rich and Zn2+-binding or Ca2+-binding domains, the hydrophobic core may be provided by cystines and metal ions, respectively. Homologous domains with common functions usually show sequence similarities.
Domain composition
Proteins with the same domain composition have at least one copy of each of the domains of the query.
Domain organisation
Proteins having all the domains of the query in the same order. (Additional domains are allowed.)
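The difference between the two definitions is easy to miss, so here is a minimal sketch. Proteins are represented as lists of domain names from N- to C-terminus (an assumed layout), and "same order, additional domains allowed" is read as an ordered-subsequence test; both function names are hypothetical.

```python
def same_composition(query, protein):
    """True if the protein has at least one copy of every query domain."""
    return set(query) <= set(protein)


def same_organisation(query, protein):
    """True if the query domains occur in the protein in the same order,
    possibly with extra domains in between (ordered-subsequence test)."""
    remaining = iter(protein)
    # `domain in remaining` advances the iterator past the first match,
    # so later query domains must be found further along the protein.
    return all(domain in remaining for domain in query)


query = ["SH3", "SH2"]
shuffled = ["SH2", "SH3", "TyrKc"]
extended = ["SH3", "PH", "SH2", "TyrKc"]
```

Here `shuffled` matches the query by composition only, while `extended` matches by both composition and organisation.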
Entrez
A WWW-based system that allows easy retrieval of sequence, structure, molecular biology and literature data (Entrez). SMART's domain annotation pages contain links to the Entrez system, thereby providing extensive literature, structure and sequence information.
E-value
This represents the number of sequences with a score greater than or equal to X expected absolutely by chance. The E-value connects the score (X) of an alignment between a user-supplied sequence and a database sequence, generated by any algorithm, with how many alignments with similar or greater scores would be expected from a search of a random sequence database of equivalent size. Since version 2.0, E-values are calculated using Hidden Markov Models, leading to more accurate estimates than before.
Extracellular Domains
Domain families that are most prevalent in proteins outside of the cytoplasm and the nucleus.
Gap
A position in an alignment that represents a deletion within one sequence relative to another. Gap penalties are requirements for alignment algorithms in order to reduce excessively-gapped regions. Gaps in alignments represent insertions that usually occur in protruding loops or beta-bulges within protein structures.
Genomic database
Protein database used in SMART's Genomic mode. It contains data from completely sequenced genomes only. Ensembl data is used for Metazoan genomes and Swiss-Prot for others. A complete list of genomes in the database is available.
HMM: Hidden Markov model
HMMs are statistical models of the sequence consensus of a homologous family (see the documentation). A particular class of HMMs has been shown to be equivalent to generalised profiles (8867839).
Applications of HMMs to sequence analysis are nicely provided by HMMer and SAM.
HMM consensus
The HMM consensus is a one-line summary of the corresponding HMM. The amino acid shown for the consensus is the highest probability amino acid at that position according to the HMM. Capital letters mean highly conserved residues (probability > 0.5 for protein models) (modified from the HMMer User's Guide).
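The rule for the consensus line can be sketched directly. The input format (one dictionary of residue probabilities per match position) and the function name are assumptions for this example; the 0.5 threshold is the one quoted above for protein models.

```python
def hmm_consensus(match_probabilities, threshold=0.5):
    """Build a one-line consensus: the most probable residue per position,
    upper case if its probability exceeds `threshold`, lower case otherwise."""
    letters = []
    for probs in match_probabilities:
        residue, p = max(probs.items(), key=lambda item: item[1])
        letters.append(residue.upper() if p > threshold else residue.lower())
    return "".join(letters)


positions = [
    {"A": 0.90, "C": 0.10},             # strongly conserved
    {"G": 0.40, "S": 0.35, "T": 0.25},  # best residue below the threshold
    {"L": 0.51, "I": 0.49},             # just above the threshold
]
consensus = hmm_consensus(positions)
```

For these made-up probabilities the consensus is "AgL": the lower-case letter flags a position where no single residue dominates.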
HMMer
The HMMer package ([1] [2]) provides multiple alignment and database searching capabilities. There are several programs in the package (see the documentation), including one (hmmfs) that searches databases for non-overlapping LOCAL similarities (i.e. that match across at least part of the HMM) and another (hmmls) that searches databases for non-overlapping GLOBAL similarities (i.e. that match across the full HMM). These correspond approximately to profile-based searches using negative and positive profiles, respectively (see WiseTools). Database searches using hmmls or hmmfs provide alignment scores as bits scores.
Homology
Evolutionary descent from a common ancestor due to gene duplication.
Intracellular Domains
Domain families that are most prevalent in proteins within the cytoplasm.
Localisation
Numbers of domains that are thought, from SwissProt annotations, to be present in different cellular compartments (cytoplasm, extracellular space, nucleus and membrane-associated) are shown in annotation pages.
Motif
Sequence motifs are short conserved regions of polypeptides. Sets of sequence motifs need not necessarily represent homologues.
NRDB: non-redundant database
A database that contains no identical pairs of sequences. It can contain multiple sequences originating from the same gene (fragments, alternative splicing products). SMART in Standard mode uses such a database.
ORF
Open reading frame.
Outlier homologues
These are often difficult to detect using HMM methodology. A complementary approach to their detection is to query a database of sequences taken from multiple sequence alignments using BLAST. Selecting this option will also activate searches against sequence databases derived from proteins of known structure. A simple BLAST search of the PDB is performed, together with a search of RPS_Blast profiles derived from SCOP. These profiles were kindly provided by Steffen Schmidt (see Schmidt et al, J Chem Inf Comput Sci 2002 (42) 405-7).
P-value
This represents the probability that, given a database of a particular size, random sequences score higher than a value X. P-values are generated by the BLAST algorithm that has been integrated into SMART.
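E-values and P-values are two views of the same chance expectation. In BLAST statistics the number of chance hits scoring at least X is modelled as Poisson distributed, which gives P = 1 - exp(-E). The sketch below applies that standard relation; whether SMART converts values in exactly this way is an assumption, and the function names are invented.

```python
from math import expm1, log1p


def e_to_p(e_value):
    """P(at least one chance hit) for an expected count E: 1 - exp(-E)."""
    return -expm1(-e_value)


def p_to_e(p_value):
    """Inverse relation: E = -ln(1 - P)."""
    return -log1p(-p_value)


small = e_to_p(0.001)   # for small E, P and E are nearly equal
large = e_to_p(2.0)     # for larger E, P saturates below 1
```

The practical consequence is that the two numbers are interchangeable for significant hits (E well below 0.01), while for weak hits only the E-value keeps growing with the number of expected chance matches.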
PDB: protein data bank
PDB is an archive of experimentally-determined three-dimensional structures (Brookhaven Nat. Labs or EBI). Domain families represented in SMART and in the PDB are annotated as being of known structure; links are provided in SMART to the PDB via PDBsum and MMDB. PDBsum links can be used to access a variety of sequence-based and structure-based tools, whereas MMDB provides access to literature information and structure similarities.
PFAM
Pfam is a database of protein domain families represented as (i) multiple alignments and (ii) HMM-profiles ([1] [2]). Pfam WWW servers allow comparison of user-supplied sequences with the Pfam database (Sanger Center and Washington Univ.). SMART contains a facility to search the Pfam collection using HMMer.
Prokaryotic domains
SMART now also searches for domains found in two-component regulatory systems. These can be found mainly in prokaryotes, but a few were also found in eukaryotes like yeast and plants.
Profile
A profile is a table of position-specific scores and gap penalties, representing a homologous family, that may be used to search sequence databases (Ref [1] [2] [3]). In CLUSTAL-W-derived profiles, those sequences that are more distantly related are assigned higher weights ([4] [5] [6]). Issues in profile-based database searching are discussed in Bork & Gibson (1996) [7].
ProfileScan
An excellent WWW server that allows a user to compare a protein or DNA sequence against a database of profiles (located at the ISREC).
PROSITE
This is a dictionary of protein sites and motif patterns. Some SMART domain annotations contain links to PROSITE.
Schnipsel database: domain sequence database
Schnipsel is a German word meaning snippet or fragment. The schnipsel database consists of the sequences of all domains found with SMART in NRDB. Outliers of a family often cannot be detected by a profile, yet are detectable by pairwise similarity to one or more established members of a sequence family. So searching against the schnipsel database gives complementary information to the profile searches.
Searched domains
In the first version of SMART, only eukaryotic signalling domains could be searched. In 1998 we extended this set with prokaryotic signalling and extracellular domains. On the input page you can choose whether you want to search only cytoplasmic (prokaryotic + eukaryotic), extracellular or all domains.
Secondary Literature
The secondary literature is derived by the following procedure: for each of the hand-selected papers referenced by a domain, 100 neighbouring papers are retrieved using Medline. If one of these neighbouring papers is referenced from more than two original papers, it is included in the secondary literature list.
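The selection procedure for the secondary literature amounts to a vote count. In the sketch below the Medline neighbour lookup is replaced by a ready-made dictionary, so the data, the paper IDs and the function name are all placeholders; only the counting rule ("more than two original papers") comes from the text.

```python
from collections import Counter


def secondary_literature(neighbours_by_paper, min_sources=3):
    """Return neighbouring papers that are linked from at least
    `min_sources` original papers (i.e. more than two)."""
    votes = Counter()
    for neighbours in neighbours_by_paper.values():
        # Count each neighbour once per original paper.
        for paper in set(neighbours):
            votes[paper] += 1
    return sorted(paper for paper, n in votes.items() if n >= min_sources)


# Placeholder data: original paper -> its retrieved neighbours.
neighbours_by_paper = {
    "p1": ["a", "b"],
    "p2": ["a", "c"],
    "p3": ["a", "b"],
    "p4": ["b"],
}
selected = secondary_literature(neighbours_by_paper)
```

In the real procedure each list would hold 100 neighbours; the threshold keeps only papers that are close to several of the hand-selected references at once.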
Seed Alignment
Alignment that contains only one of each pair of homologues that are represented in a CLUSTALW-derived phylogenetic tree linked by a branch of length less than a distance of 0.2 (see the related article).
SEG
A program of Wootton & Federhen [1] that detects regions of the query sequence that have low compositional complexity [2].
Sequence ID or ACC
Sequence identifiers or accession codes may be entered via the SMART homepage to initiate a query. You can use either Uniprot or Ensembl sequence identifiers.
Signalling Domains
The original set of domains used in SMART were collected as those that satisfied one or both of two criteria:
1. cytoplasmic domains that possess kinase, phosphatase, ubiquitin ligase or phospholipase enzymatic activities, or those that stimulate GTPase-activation or guanine nucleotide exchange
2. cytoplasmic domains that occur in at least two proteins with different domain organisations, of which one also contains a domain that satisfies criterion 1
These domains mostly mediate or regulate the transduction of an extracellular signal towards the nucleus, resulting in the initiation of a cellular response. More recently, prokaryotic two-component signalling domains have been added to the SMART set.
SignalP
This program predicts the presence and location of signal peptide cleavage sites in amino acid sequences (SignalP home page).
SMART: Simple Modular Architecture Research Tool
Our main goal in providing this tool is to allow automatic identification and annotation of domains in user-supplied protein sequences.
Species
Numbers of domains present in a variety of selected taxa (animal, archaea, bacteria, fungi, plants and protozoa) are shown in annotation pages.
SwissProt
The SwissProt database is an extensively annotated and non-redundant collection of protein sequences. SwissProt annotations have been mined for SMART-derived annotations of alignments.
TMHMM2
This program predicts the location and topology of transmembrane helices (TMHMM2).
Thresholds
For each of the domains found by SMART, a combination of thresholds is used to distinguish between true and false hits. The different thresholds are described in the SMART paper.
WiseTools
A package that is based on database searches using profiles. Profiles may be generated using PairWise and then compared with sequence databases using SearchWise. Scores are generated for alignments that match the whole of the profile (using a positive profile) or else that match at least part of the profile (using a negative profile). Only the top-scoring optimal alignment of each sequence is reported; hence SMART relies on re-iterating the search for new repeats until none are reported that score above threshold. Score thresholds have been set manually that are considered to represent a score just above the top-scoring true negative. Additional thresholds have been estimated for domains that are repeated in single polypeptides. A more recent package allows comparison of DNA sequences at the level of their conceptual translations, regardless of sequence error and introns (see Wise2).
What's new
Changes from version 6.0 to 7.0
Full text search engine
Perform a full text search of SMART and Pfam domain annotations, plus the complete protein descriptions for Uniprot and Ensembl proteins.
metaSMART
Explore and compare domain architectures in various publicly available metagenomics datasets.
iTOL export and visualization
Domain architecture analysis results can be exported and visualized in interactive Tree Of Life. The new option can be found in the protein list function select list.
User interface cleanup
Various small changes to the UI, resulting in faster and easier navigation.
Changes from version 5.1 to 6.0
Metabolic pathways information
SMART domains and proteins available in the genomic mode now have basic metabolic pathways information. It is generated by mapping our genomic mode protein database to the KEGG orthologous groups.
Updated genomic mode protein database
The greatly expanded genomic mode database now includes sequences from 630 completely sequenced species.
Changes from version 5.0 to 5.1
SMART webservice
You can access SMART using our webservice. Check the WSDL file for details. The webservice is still under active development and only works for sequences and IDs which are in our database (complete Uniref100 and stable Ensembl genomes).
SMART DAS server
The SMART DAS server is available at URL http://smart.embl.de/smart/das. It provides all the protein features from our database (SMART domains, signal peptides, transmembrane regions and coiled coils) for all Uniref100 and Ensembl proteins.
If you need help with these services, or have questions/feedback, please contact us.
Changes from version 4.1 to 5.0
New protein database in Normal mode
SMART now uses Uniprot as the main source of protein sequences. All Ensembl proteomes (except pre-releases) are also included. To lower redundancy in the database, the following procedure is used:
o only one copy of 100% identical proteins is kept (different IDs are still available)
o each species' proteins are separated
o CD-HIT clustering with 96% identity cutoff is performed on each species separately
o longest member of each protein cluster is used as the representative
o only representative cluster members and single proteins (i.e. proteins which are not members of any clusters) are used in all domain architecture queries and for domain counts in the annotation pages
Even though the number of proteins in the database is almost doubled (current version has around 2.9 million proteins), the redundancy should be minimal.
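Two of the steps in this procedure are simple enough to sketch without running CD-HIT itself: collapsing 100% identical proteins while keeping all their IDs, and choosing the longest member of each cluster as its representative. The cluster lists below stand in for CD-HIT output, and the function names and sequences are invented for the example.

```python
from collections import defaultdict


def collapse_identical(proteins):
    """Keep one copy per distinct sequence and remember every ID that had it.

    `proteins` is a list of (protein ID, sequence) pairs.
    """
    ids_by_sequence = defaultdict(list)
    for protein_id, sequence in proteins:
        ids_by_sequence[sequence].append(protein_id)
    return dict(ids_by_sequence)


def cluster_representatives(clusters):
    """Pick the longest sequence of each cluster as its representative."""
    return [max(cluster, key=len) for cluster in clusters]


unique = collapse_identical([("A1", "MKT"), ("A2", "MKT"), ("B1", "MKTL")])
# Stand-in for per-species CD-HIT clusters at the 96% identity cutoff.
representatives = cluster_representatives([["MKT", "MKTL"], ["GG"]])
```

A single-member cluster simply returns its only sequence, which matches the rule that unclustered proteins are kept alongside the cluster representatives.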
Domain architecture invention dating
As a further step from the single domain to the understanding of multi-domain proteins, SMART now predicts the taxonomic class where the concept of a protein, that is its domain architecture, was invented. The domain architecture is defined as the linear order of all SMART domains in the protein sequence. To derive the point of its invention, all proteins with the same domain architecture are mapped onto NCBI's taxonomy. The last common ancestor of all organisms containing at least one protein with the domain architecture is defined as the point of its origin.
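Finding the last common ancestor is a common-prefix computation once each organism is written as a lineage from the root of the taxonomy. The sketch below assumes such root-first lists are already available; the lineages shown are shortened and the function name is hypothetical.

```python
def last_common_ancestor(lineages):
    """Return the deepest taxon shared by all root-first lineages,
    or None if they share nothing."""
    ancestor = None
    for taxa_at_depth in zip(*lineages):
        if len(set(taxa_at_depth)) != 1:
            break  # the lineages diverge at this depth
        ancestor = taxa_at_depth[0]
    return ancestor


# Organisms that carry a protein with the architecture of interest.
lineages = [
    ["cellular organisms", "Eukaryota", "Metazoa", "Chordata"],
    ["cellular organisms", "Eukaryota", "Metazoa", "Arthropoda"],
    ["cellular organisms", "Eukaryota", "Fungi"],
]
origin = last_common_ancestor(lineages)
```

For this example the architecture would be dated to Eukaryota: one fungal protein is enough to push the inferred origin above Metazoa.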
Changes from version 4.0 to 4.1
Two modes of operation Normal and Genomic
For more details visit the change mode page
Intrinsic protein disorder prediction
You can now include DisEMBL prediction of protein disorder in your search parameters. The results are displayed as blue regions in protein schematics. DisEMBL's HOTLOOPS and REM465 methods are currently used. Visit the DisEMBL page for more information on the method.
Catalytic activity check
SMART now includes data on specific requirements for catalytic activity for some of our domains (50 catalytic domains at the moment, full list here). If the required amino acids are not present in the predicted domain, it will be marked as Inactive. The domain annotation page will show you the details on which amino acids are missing and links to the relevant literature (PubMed link).
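A minimal sketch of such a check is shown below. It assumes the requirements are stored as a mapping from a 1-based position within the predicted domain to the set of residues allowed there; this data layout and the function names are illustrative assumptions, not SMART's internal representation.

```python
def missing_catalytic_residues(domain_sequence, required):
    """List (position, found_residue) pairs that violate the requirements.

    `required` maps a 1-based position in the domain to the set of amino
    acids (one-letter codes) accepted at that position. A position beyond
    the end of the domain is reported with None as the found residue.
    """
    missing = []
    for position in sorted(required):
        if 1 <= position <= len(domain_sequence):
            found = domain_sequence[position - 1]
        else:
            found = None
        if found not in required[position]:
            missing.append((position, found))
    return missing


def activity_label(domain_sequence, required):
    """Return "Inactive" if any required residue is absent, else "Active"."""
    if missing_catalytic_residues(domain_sequence, required):
        return "Inactive"
    return "Active"
```

Returning the list of missing positions, rather than only a flag, mirrors the annotation page behaviour described above, where the missing amino acids are reported.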
Taxonomic trees
Architecture query results are now displayed as simple taxonomic trees. In addition to individual proteins, you can select any taxonomic node (or multiple nodes) and display all the proteins in those nodes. The Evolution section of domain annotation pages is now also represented as a taxonomic tree. The basic tree contains only several hand-picked representative species, with a link to the full tree.
User interface redesigned
SMART has been completely rewritten, and all pages conform to the XHTML 1.0 Strict and CSS Level 2 standards. We recommend a modern standards-compliant browser for the best experience. Mozilla Firefox and Opera are our favorites.
Changes from version 3.5 to 4.0
Intron positions are shown in protein schematics
For proteins that match any of the Ensembl predictions, SMART will show intron positions as vertical coloured lines in graphical representations (see example). This information is retrieved from a pre-calculated mapping of Ensembl gene structures to protein sequences.
The vertical line at the end of the protein is not an actual intron, but a mark to show that intron mapping was performed. If that is the only line, there are no introns annotated. If there is no line at all, there is no data available in Ensembl for that particular sequence.
You can switch off intron display on your SMART preferences page
Alternative splicing information
Since SMART now incorporates Ensembl genomes, the Additional information page shows a list of alternative splices of the gene encoding the analyzed protein (if there are any). It is possible to either display the SMART protein annotation for any of the alternative splices or get a graphical multiple sequence alignment of all of them.
Orthology information
There are 2 separate sets of orthologs for each Ensembl protein: 1:1 reciprocal best matches in other genomes, and orthologous groups with reciprocal best hits from all genomes analyzed (i.e. each of these proteins has exactly one ortholog in all 6 genomes). This data is displayed on the Additional information page.
Graphical multiple sequence alignments
Orthologous groups and different alternative splices can be displayed as graphical multiple sequence alignments. Proteins are aligned using ClustalW. Domains, intrinsic features and introns are mapped onto the alignment, with their positions adjusted according to gaps (black boxes).
Changes from version 3.4 to 3.5
Features of all Ensembl genomes are stored in SMART
You can use standard Ensembl protein identifiers (for example, ENSP00000264122) and sequences in all queries.
Batch access
Quickly display results for hundreds of sequences. Currently limited to 500 sequences per access, 2000 per day.
Smarter handling of identical sequences
If there are multiple IDs associated with a particular sequence, an extra table will be displayed showing all of them.
Changes from version 3.3 to 3.4
Search structure based profiles using RPS-Blast
Clicking on the search schnipsel and structures checkbox will now also initiate a search of profiles based on SCOP domain families using RPS-Blast. These profiles were kindly provided by Steffen Schmidt (see Schmidt et al., J Chem Inf Comput Sci 2002 (42) 405-7).
Improved Architecture analysis
In addition to standard Domain selection querying, it is now possible to do queries based on GO (Gene Ontology, click here for more info) terms associated with domains. In the first step, you get a list of domains matching the GO terms entered. After selecting the domains of interest from the list, proteins containing those domains are displayed. Use the Taxonomic selection box to limit the results based on taxonomic ranges.
Pfam domains are stored in the database
The SMART database now contains precomputed results for all Pfam domains. To use Pfam domains in the architecture queries, prepend the domain name with Pfam: (for example, TyrKc AND Pfam:Fz AND TRANS).
Try our SMART Toolbar for the Mozilla web browser. Click here for more info.
Changes from version 3.2 to 3.3
Fantastic new protein picture generator
Proteins are now displayed as dynamically generated PNG images. This means that you can download the entire protein representation as a single image. You have our permission to do this and use these diagrams in any way that you like; do acknowledge us, though. All domain bubbles have been script-generated using The Gimp and its Perl-Fu extension. The script is GPL'd and is available for download.
Fancy colour alignments
All alignments we generate are coloured in using Leo Goodstadt's excellent CHROMA program. CHROMA is available from here and it's great. You may experience some problems if you're using a klunky browser. This will be fixed when you change your browser.
Excellent transmembrane domains
These are now calculated using the fine TMHMM2 program (the main web site is here), kindly provided by Anders Krogh and co-workers. You can read about the method here.
Selective SMART
We now store intrinsic features such as transmembrane domains and signal peptides in our database. This means that these can be queried for (e.g. SIGNAL AND TyrKc). The feature names are: signal peptide, SIGNAL; transmembrane domain, TRANS; coiled coil, COIL.
Technical changes
SMART code has been modified to run under the Apache mod_perl module. SMART is now using Apache::DBI for persistent database connections. The database engine has been updated to PostgreSQL 7.1. These changes resulted in significant speed improvements.
Bug fixes, as usual
The show_many_proteins script now uses the POST method, so there is no longer a limit on the number of proteins you can display. Fixed a display problem with proteins having thousands of different representations (if you try to display those, make sure you have a good browser (like Mozilla or Opera :-) and a bunch of RAM).
Changes from version 3.1 to 3.2
Start page
There is now an indicator of the current database status in the header. If the database is down or is being updated, you'll know immediately.
Literature
Changed literature identifiers to PMID. New secondary literature generator that parses all neighbouring papers, not just the first 100.
Numerous small bug fixes and improvements
Changes from version 3.0 to 3.1
Startup page
The start page now includes selective SMART and allows searching for keywords in the annotation of domains.
Schnipsel Blast
The results of a schnipsel Blast search are included in the bubble diagram. Additionally, PDB is searched.
Taxonomic breakdown
When selecting multiple proteins, e.g. via selective SMART or the annotation pages, an overview of the taxonomy of all species (tax break) is offered.
Links
Links don't contain version numbers. This allows stable links from external sources.
Selective SMART
Selective SMART now allows searching for multiple copies of domains and is case insensitive. You can now search for, e.g., Sh3 AND sH3 AND sh3.
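How a case-insensitive AND query with repeated domains might be evaluated can be sketched as follows. This is an assumption-laden illustration rather than SMART's query engine: it handles only the AND operator, treats each repetition of a name as a request for one more copy, and the function names are hypothetical.

```python
import re
from collections import Counter


def parse_query(query):
    """Turn 'Sh3 AND sH3 AND TyrKc' into required copy counts per domain.

    Domain names are lower-cased so that matching is case insensitive.
    """
    terms = re.split(r"\s+AND\s+", query.strip(), flags=re.IGNORECASE)
    return Counter(term.strip().lower() for term in terms if term.strip())


def matches(query, protein_domains):
    """True if the protein has at least the requested copies of each domain."""
    required = parse_query(query)
    present = Counter(name.lower() for name in protein_domains)
    return all(present[name] >= count for name, count in required.items())
```

Under these assumptions, the query Sh3 AND sH3 AND sh3 is satisfied only by proteins carrying three or more SH3 domains.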
Annotation
You can align your query sequence to the SMART alignment using hmmalign.
Update of underlying database
SMART now uses PostgreSQL 6.5.2.
Changes from version 2.0 to 3.0
Digest output
SMART now only produces a single diagram representing a best interpretation of all the annotation that has been performed. A comprehensive summary of the results is also provided in table format.
Selective SMART
Selective SMART allows you to look for proteins with combinations of specific domains in different species or taxonomic ranges.
Alert SMART
The SMART database gets updated about once a week. If you are interested in specific domains, or combinations of them, in specific taxonomic ranges, you can use the SMART alerting service. This provides the identities of newly deposited proteins that match your query.
Domain queries
You can ask for proteins having the same domain order and composition as your query protein.
SMARTed Genomes
You can get the result of SMARTing the genomes of Caenorhabditis elegans and Saccharomyces cerevisiae via the annotation pages.
Faster PFAM searches
The PFAM searches now run on a PVM cluster.
Changes from version 1.03 to 2.0
Domain coverage
The original set of signalling domains has now been extended to include extracellular domains.
Default search method is now HMMer
The previously used underlying methodology of SMART (i.e. SWise) tended to over-extend gaps, leading to problems in defining domain borders. Although a SWise search of the SMART database remains possible (using the Wise searching SMART database option available from the Home Page), the default searching method is now HMMer2. This now allows improved statistical estimates (Expectation- or E-values) of the significance of a domain hit.
Complete rewrite of the annotation pages
As with the previous version (1.03), annotation pages provide information concerning domain functions and include hyperlinks to PubMed. However, we now offer automatically derived data showing the taxonomic range and the predicted cellular localisation of proteins containing the domain in question.
Literature database
In addition to the manually derived literature sources given in version 1.03, we now provide automatically derived secondary literature from the annotation pages. These are extracted from PubMed and are cross-linked to additional abstracts.
The CATH domain structure database: new protocols and classification levels give a more comprehensive resource for exploring evolution
Lesley H. Greene, Tony E. Lewis, Sarah Addou, Alison Cuff, Tim Dallman, Mark Dibley, Oliver Redfern, Frances Pearl, Rekha Nambudiry, Adam Reid, Ian Sillitoe, Corin Yeats, Janet M. Thornton and Christine A. Orengo
WHAT'S NEW
We report the latest release (version 3.0) of the CATH protein domain database (http://www.cathdb.info). There has been a 20% increase in the number of structural domains classified in CATH, up to 86 151 domains. Release 3.0 comprises 1110 fold groups and 2147 homologous superfamilies. To cope with the increases in diverse structural homologues being determined by the structural genomics initiatives, more sensitive methods have been developed for identifying boundaries in multi-domain proteins and for recognising homologues. The CATH classification update is now being driven by an integrated pipeline that links these automated procedures with validation steps that have been made easier by the provision of information-rich web pages summarising comparison scores and relevant links to external sites for each domain being classified. An analysis of the population of domains in the CATH hierarchy and several domain characteristics are presented for version 3.0. We also report an update of the CATH Dictionary of Homologous Structures (CATH-DHS), which now contains multiple structural alignments, consensus information and functional annotations for 1459 well-populated superfamilies in CATH. CATH is directly linked to the Gene3D database, which is a projection of CATH structural data onto ∼2 million sequences in completed genomes and UniProt.
GENERAL INTRODUCTION
The number of new structures being deposited in the Protein Data Bank (PDB) continues to grow at a considerable rate. In addition, structures being targeted by worldwide structural genomics initiatives are more likely to be novel or only very remotely related to domains previously classified. Only 2% of structures currently solved by conventional crystallography or NMR are likely to adopt novel folds (see Figures 1 and 2). A higher proportion of new folds is expected among structural genomics structures. Although the influx of more diverse structures and subsequent analysis will inform our understanding of how domains evolve, it has resulted in increasing lags between the numbers of structures being deposited and classified. In response to this situation, we have significantly improved our automated and manual protocols for domain boundary assignment and homologue recognition.
Figure 1. Annual decrease in the percentage of new structures classified in CATH which are observed to possess a novel fold. The raw data for years 1972–2005 was fit to a single exponential equation by nonlinear regression using Sigma Plot (SPSS Version
Figure 2. Annual proportion of protein structures deposited in the PDB which are classified in CATH, rejected or pending classification. The colour scheme reflects different categories of PDB chains. Black: not accepted by the CATH criteria. Red: unprocessed chains
Significant changes have been implemented in the CATH classification protocol to achieve a more highly automated system. First, a seamless flow of structures between the constituent programs has been achieved by building a pipeline which integrates web services for each major comparison stage in the classification (see Figure 3). Secondly, completely automatic decisions are now being made for new protein chains with close relatives already assigned in the CATH database. There are two situations that preclude the CATH update process from being fully automated. We rely on expert manual curation for particularly challenging protein domain boundary assignments (DomChop stage) and also for classifications of remote folds and homologues (HomCheck stage). These two manual stages will remain an integral part of the system (Figure 3).
Figure 3. Flow diagram of the CATH classification pipeline. This schematic illustrates the processes involved in classifying newly determined structures in CATH. The CATH update protocol workflow from new chain to assigned domain is split into two main processes
In this paper, we report our ongoing development of the automated procedures. These critical new features should better enable CATH to keep pace with the PDB (3) and facilitate its development. Key statistics on the domain structure populations and characteristics are also presented.
A REVISED CATH CLASSIFICATION HIERARCHY: CATHSOLID
In order to provide information on the sequence diversity between superfamily members, we have introduced additional levels into the CATH hierarchy. The CATH hierarchical classification scheme now consists of nine levels. Class is derived from secondary structure content, and Architecture describes the gross orientation of secondary structures, independent of connectivity. The Topology level clusters structures into fold groups according to their topological connections and numbers of secondary structures. The Homologous superfamilies cluster proteins with highly similar structures, sequences and/or functions (4,5). The new extension of the CATH classification system now includes five 'SOLID' sequence levels. S, O, L and I further divide domains within the H-level using multi-linkage clustering based on similarities in sequence identity (35, 60, 95 and 100%) (see Table 1). The D-level acts as a counter within the I-level and is appended to the classification hierarchy to ensure that every domain in CATH has a unique CATHsolid identification code (see Table 1). Specific details on the nature of the SOLID-levels can be found in the 'General Information' section of the CATH website, http://www.cathdb.info. CATH only includes experimentally determined protein structures with a 4 Å resolution or better, 40 residues in length or longer, and having 70% or more of side chains resolved.
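The nine-level identifier and the inclusion criteria can be made concrete with a short sketch. It assumes a CATHsolid code is written as nine dot-separated integers in the order C.A.T.H.S.O.L.I.D; the function names and the example code used below are illustrative, not taken from the CATH software.

```python
CATHSOLID_LEVELS = ("C", "A", "T", "H", "S", "O", "L", "I", "D")


def parse_cathsolid(code):
    """Split a dot-separated CATHsolid code into its nine named levels."""
    parts = code.split(".")
    if len(parts) != len(CATHSOLID_LEVELS):
        raise ValueError(f"expected 9 levels, got {len(parts)}: {code!r}")
    return dict(zip(CATHSOLID_LEVELS, (int(part) for part in parts)))


def superfamily(code):
    """Return the C.A.T.H prefix, i.e. the homologous superfamily."""
    levels = parse_cathsolid(code)
    return ".".join(str(levels[name]) for name in "CATH")


def meets_inclusion_criteria(resolution_angstrom, n_residues, side_chain_fraction):
    """Apply the stated thresholds: <= 4 A, >= 40 residues, >= 70% side chains."""
    return (
        resolution_angstrom <= 4.0
        and n_residues >= 40
        and side_chain_fraction >= 0.70
    )
```

For a hypothetical code such as 3.40.50.300.1.2.1.1.3, the first four numbers identify the superfamily, the next four place the domain in the 35, 60, 95 and 100% sequence identity clusters, and the last is the counter.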
Table 1. CATH version 3.0 statistics
DOMAIN BOUNDARY ASSIGNMENTS
We have further improved our automated domain boundary prediction method, CATHEDRAL (4). This is used to search a newly determined multi-domain structure against a library of representative structures from different fold groups in the CATH database to recognise constituent domains. CATHEDRAL performs an initial rapid secondary structure comparison between structures, using graph theory to identify putative fold matches, which are then more carefully aligned using a slower, more accurate dynamic programming method. A new scoring scheme has been implemented which combines information on the class of the domains being compared, their sizes, similarity in the structural environments and the number of equivalent residues. A support vector machine is used to combine the different scores and select the best fold match for each putative domain in a new multi-domain protein.
Benchmarking against a set of 964 'difficult' multi-domain chains, whose 1593 constituent domains were remotely related to folds in CATH (<35% sequence identity) and originated from 245 distinct fold groups and 462 superfamilies, showed that 90% of domains within these chains could be assigned to the correct fold group, and for 78% of them the domain boundaries were within ±15 residues of boundaries assigned by careful manual validation. Larger variations in domain boundaries are often due to the fact that in many families significant structural variation can occur during evolution, so that distant relatives vary considerably in size. If no close relative has been classified in CATH, it is likely that the only CATHEDRAL match will be to a relative with significant structural embellishments, thus making it harder to determine the correct boundaries.
Since domain boundary assignment of remote homologues is one of the most time-consuming stages in the classification, we combine multiple sources of information for each new structure on a web page to guide manual curation. Pages display scores from a range of algorithms, which include structure-based methods, CATHEDRAL (4), SSAP (6,7), DETECTIVE (8), PUU (9), DOMAK (10); sequence-based methods such as hidden Markov Models (HMMs) (11); and relevant literature. These pages are now viewable for information on putative boundaries for new multi-domain structures currently
being classified in CATH (e.g. http://www.cathdb.info/cgi-bin/cath/Chain.pl?chain_id=2g3aA).
For protein chains which are closely related to chains that are already chopped in CATH, an automated protocol has been developed (ChopClose). ChopClose identifies any previously chopped chains that have sufficiently high sequence identity and overlap with the query chain. Using SSAP (6), the query is aligned against each of these chains in turn, and in each case the domain boundaries are inherited across the alignment. The process of inheriting the boundaries often requires some adjustments to be made to account for insertions, deletions or unresolved residues. If the inheritance from one of the chains meets various criteria (SSAP score ≥80, sequence identity >80%, RMSD ≤6.0 Å, longest end extension ≤10 residues, etc.), then the resulting boundaries are used to chop the chain automatically (AutoChop). For cases where ChopClose's best result does not meet all the criteria for automatic chopping, it is provided as support information for a manual domain boundary assignment.
Refer to Figure 3 for the location of AutoChop/ChopClose within the CATH update protocol.
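The decision between automatic chopping and manual curation can be sketched as a simple threshold check. Only the four criteria named in the text are encoded here; the text ends its list with "etc.", so the real protocol applies further checks that this sketch omits. The class and function names are hypothetical, and ranking candidates by SSAP score is an assumption.

```python
from dataclasses import dataclass


@dataclass
class InheritanceResult:
    """Scores for boundaries inherited from one previously chopped chain."""
    ssap_score: float
    sequence_identity: float      # percent
    rmsd: float                   # in angstroms
    longest_end_extension: int    # residues


def passes_autochop(result):
    """Check the four thresholds quoted in the text (others are omitted)."""
    return (
        result.ssap_score >= 80
        and result.sequence_identity > 80
        and result.rmsd <= 6.0
        and result.longest_end_extension <= 10
    )


def decide(results):
    """Return ('AutoChop', best) or ('manual', best) for a query chain.

    The best candidate is taken to be the one with the highest SSAP score,
    which is an assumption made for this sketch.
    """
    if not results:
        return ("manual", None)
    best = max(results, key=lambda r: r.ssap_score)
    return ("AutoChop" if passes_autochop(best) else "manual", best)
```

In the manual case the best candidate is still returned, matching the description that ChopClose's best result is passed on as supporting information for the curator.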
NEW HOMOLOGUE RECOGNITION METHODS
We have assessed a number of HMM-based protocols for improving homologue recognition. A new protocol (Samosa), exploiting models built using multiple structure alignments to improve accuracy, gives some improvements in sensitivity (4–5%). However, a protocol exploiting an 8-fold expanded HMM library based on sequence relatives of structural domains gives an increase of nearly 10% in sensitivity (12). In addition, HMM–HMM-based approaches have been implemented using the PRC protocol of Madera and co-workers (http://supfam.org/PRC). These allow recognition of extremely remote homologues, some of which are not easily detected by the structure comparison methods (discussed further below). HMM-based database scans developed for the CATH classification protocol are collectively referred to as HMMscan below.
For some very remotely related homologues, confidence in an assignment can be improved by combining information from multiple prediction methods. We have investigated the benefits of using machine learning methods to do this automatically. A neural network was trained using a dataset of 14 000 diverse homologues (<35% sequence identity) and 14 000 non-homologous pairs, with data from different homologue comparison methods including structure comparison (CATHEDRAL, SSAP), sequence comparison (HMM–HMM) and information on functional similarity. The latter was obtained by comparing EC classification codes between close relatives of the distant homologues and using a semantic similarity scoring
scheme for comparing GO terms, based on a method developed by Lord et al. (13). On a separate validation set of 14 000 homologous pairs and 14 000 non-homologous pairs, 97% of the homologues can be recognised at an error rate of <4%.
NEW UPDATE PROTOCOL
Automatic methods
Previously, CATH data was generated using a group of independent programs and flat files. Over the past two years we have developed an update protocol for CATH that is driven by a suite of programs with a central library and a PostgreSQL database system. A classification pipeline has been established which links, in a completely automated fashion, the different programs that analyse the sequences and structures of both protein chains and domains. The CATH update protocol can essentially be divided into two parts: domain boundary assignment and domain homology classification (see Figure 3). The aim of the protocol is to minimise manual assignment and provide as much support as possible when manual validation is necessary.
Processing in both parts of the classification protocol is similar, requiring related meta-data and the triggering of the same automated algorithms. Methods include pairwise sequence similarity comparisons and scans by other homologue detection or fold recognition algorithms, such as HMMscan and CATHEDRAL, that provide data for either manual or automated assignment. Many of the automated steps in the protocol have been established as a web service, and the pipeline integrates automated steps together with 'holding stages' in which domains are held prior to processing and await the completion of manual validation of predictions (see below).
Web pages to support manual validation
For each manual stage in the classification (domain boundary assignment, DomChop, and homologue recognition, HomCheck), we have developed a suite of web pages bringing together all available meta-data from prediction algorithms (e.g. DBS, CATHEDRAL, HMMscan for DomChop; CATHEDRAL, HMMscan for HomCheck), information from the literature and information from other family classifications with relevant data (e.g. Pfam). For each protein or domain shown on the pages, information on the statistical significance of matches is presented. The web pages will shortly be made viewable and will provide interim data on protein chains and domains not fully classified in CATH, for biologists interested in any entries pending classification.
OVERVIEW OF THE CURRENT RELEASE (VERSION 3.0)
Assigning domain boundaries and relationships between protein structures is computationally challenging. Since the last CATH release (version 2.6), the number of domains in the CATH database has increased by 20% in version 3.0 and now totals 86 151. This is a more than 10-fold increase in the number of domains classified in CATH since its creation. Improvements in automation, and also in the web-based resources used to aid manual validation, have allowed us to increase the proportion of hard-to-classify structures processed in CATH, and this is reflected in a significant increase in the number of new folds in the database, now more than 1000. The detailed breakdown of numbers of domains in the nine CATH levels is given in Table 1. We conducted an analysis of the domains in version 3.0 and have derived statistics for several fundamental features.
Percentage of new topologies
An analysis of the percentage of new folds arising from the early 1970s to the present is shown in Figure 1. The number of new folds has been decreasing over time with respect to the number of new structures being deposited, and it can be seen that currently approximately 2% of new structures classified in CATH are observed to be novel folds. For comparison, the number of domain structures solved over time is also graphically represented in Figure 1.
Number of domains within a protein chain
Integral to the construction of the CATH database is designating domain boundaries. We conducted an analysis of the number of chains versus the number of domains in a chain. It is interesting to note that 64% of all protein structures currently solved and classified in CATH are single-domain chains (data not shown). The next most prevalent are two-domain chains (27%), and following this we find that the number of chains containing three or more domains rapidly decreases. The average size of the single-domain chains is 159 residues in length.
The CATH Dictionary of Homologous Superfamilies (CATH-DHS)
The CATH-DHS has also been recently updated. Data on structural similarity and superfamily variability is presented as a significant update to the Dictionary of Homologous Superfamilies (DHS) web resource (14). The DHS also provides functional annotations of domains within each H-level (superfamily) in CATH v2.5.1.
For each superfamily, pair-wise structural similarity scores between relatives, measured by SSAP, are presented. The DHS now contains 3307 multiple structural alignments for 1459 superfamilies. For each superfamily, multiple alignments are generated for all the relatives and also for subgroups of structurally similar relatives and sequence-similar relatives. Alignments are performed using the residue-based CORA algorithm (15) and presented both as CORAPLOTS (14) and in the form of a 2DSEC diagram (16), alongside co-ordinate data of the superposed structures in PDB format. Sequence representations of the alignments are available to download in FASTA format. In the CORAPLOT images of the multiple alignment, residues in each domain are coloured according to ligand binding and residue type. EquivSEC plots are also shown that describe the variability in orientation and packing between equivalent secondary structures (16).
To identify sequence relatives for CATH superfamilies, sequences from UniProt (17) were scanned against HMMs of all CATH domains (12). Homologous sequences were identified as those hits with an E-value < 0.001 and a 60% residue overlap with the CATH domain. This protocol recognised over one million domain sequences in UniProt which could be integrated in the CATH-DHS. The harvested sequences in each superfamily were compared against other relatives by BLAST (18) to determine the pair-wise sequence identity and then clustered at appropriate levels of sequence identity (35 and 95%) using multi-linkage clustering. Information and links to other functional databases, ENZYME (19), GO (Gene Ontology Consortium 2000), KEGG (20), COG (21) and SWISSPROT (22), are also included by BLASTing the sequences from each superfamily against sequences provided by these resources. Only hits with 95% sequence identity and an 80% residue overlap were used to annotate sequences.
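The acceptance rule for HMM hits can be written as a small filter. The sketch assumes that "residue overlap" means the fraction of the CATH domain's residues covered by the alignment, and it treats the 60% figure as a minimum; both readings are assumptions, and the function and parameter names are hypothetical.

```python
def is_homologous_hit(e_value, aligned_residues, domain_length,
                      max_e_value=0.001, min_overlap=0.60):
    """Accept a hit if it is significant and covers enough of the domain.

    `aligned_residues` is the number of domain residues covered by the hit
    and `domain_length` is the full length of the CATH domain.
    """
    if domain_length <= 0:
        raise ValueError("domain_length must be positive")
    overlap = aligned_residues / domain_length
    return e_value < max_e_value and overlap >= min_overlap
```

The same function could be reused for the annotation step by passing stricter thresholds, although that step is based on sequence identity rather than E-value, so a separate check would be more faithful.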
Recent analysis of structural and functional divergence in highly populated CATH superfamilies (>5 structural relatives with <35% sequence identity) has been undertaken using data from the DHS. The 2DSEC algorithm was used to analyse multiple structural alignments of families and identify highly conserved structural cores and secondary structure embellishments or decorations to the common core. In some large superfamilies, extensive embellishments were observed outside the core, and although these secondary structure insertions were frequently discontinuous in the protein chain, they were often co-located in 3D space (16). In many cases, manual inspection revealed that the embellishment had aggregated to form a larger structural feature that was modifying the active site of the domain or creating new surfaces for domain or protein interactions. Data collected in the DHS clearly shows a relationship between structural divergence within a superfamily, sequence divergence of this superfamily amongst predicted domains in the genomes, and the number of distinct functional groups that can be identified for the superfamily (see Figure 4).
Figure 4. Relationship between sequence variability, structural variability and functional diversity in CATH superfamilies. Structural variation in a CATH superfamily as measured by the number of diverse structural subgroups (SSAP score <80 between groups)
LATERAL LINKS ACROSS THE CATH HIERARCHY TO CAPTURE EVOLUTIONARY DIVERGENCE AND EXPLORE THE STRUCTURAL CONTINUUM
Our analysis of structural divergence in CATH superfamilies (16) has revealed families where significant changes in the structures had occurred. In some cases 5-fold differences in the sizes of domains were identified, and sometimes it was apparent that the 'folds' of these very diverse relatives had effectively changed. Therefore, in these superfamilies more than one fold group can be identified, effectively breaking the hierarchical nature of the CATH classification, which implies that each relative within a CATH homologous superfamily should belong to the same CAT fold group.
In addition, an 'all versus all' HMM–HMM scan between all superfamily representatives revealed several cases of extremely remote homologues which had been classified into separate superfamilies and yet match with significant E-values. Structure comparison had failed to detect the relationship between these superfamilies because the structural divergence of the relatives was so extreme, sometimes constituting a change in architecture as well as fold group. In these cases, homology was only suggested by the HMM-based scans and then manually validated by considering functional information and detailed evidence from literature. In order to capture information on these distant homologies, links have been created between the superfamilies, both on our web pages and in the CATH database. The data can now be found as a link from the CATH homepage (http://www.cathdb.info).
In the near future, we also plan to provide web pages presenting cases of significant structural overlaps between superfamilies or fold groups. For these cases, we are not currently able to find any additional evidence to support a distant evolutionary relationship, and these examples highlight the recurrence of large structural motifs between some folds and the existence of a structural continuum in some regions of fold space.
FEATURES
The CATH database can be accessed at http://www.cathdb.info. The web interface may be browsed or alternatively searched with PDB codes or CATH domain identifiers. There is also a facility for keyword searches. With the version 3.0 release, we now make the raw and processed data files available, which include, for example, CATH domain PDB files, sequences and dssp files, and they can be accessed through the CATH database main page. The Gene3D resource can be accessed through the CATH database or directly at http://www.cathdb.info/Gene3D. The DHS can be accessed through the CATH database or directly at http://www.cathdb.info/bsm/dhs.
Structural similarity is assessed using an automatic method (SSAP) (3,4), which scores 100 for identical proteins and generally returns scores above 80 for homologous proteins. More distantly related folds generally give scores above 70 (Topology or fold level), though in the absence of any sequence or functional similarity this may simply represent examples of convergent evolution, reinforcing the hypothesis that there exists a limited number of folds in nature (5,6).
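The score bands described above can be sketched as a small lookup. This is only an illustration: the function name is hypothetical, and because the text says the scores "generally" fall in these ranges, the cut-offs are rough guides rather than strict rules.

```python
def interpret_ssap(score: float) -> str:
    """Map an SSAP score (0-100) to the rough similarity band quoted in the text."""
    if not 0 <= score <= 100:
        raise ValueError("SSAP scores lie between 0 and 100")
    if score == 100:
        return "identical"
    if score > 80:
        # Typical of homologous proteins.
        return "likely homologous"
    if score > 70:
        # Similar fold (T-level); may be convergent evolution rather than homology.
        return "similar fold"
    return "no clear structural similarity"


for s in (100, 86.5, 74, 52):
    print(s, "->", interpret_ssap(s))
```

A score in the "similar fold" band should still be checked against sequence or functional evidence before any evolutionary relationship is claimed.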
Abstract
We report the latest release (version 1.4) of the CATH protein domains database (http://www.biochem.ucl.ac.uk/bsm/cath). This is a hierarchical classification of 13 359 protein domain structures into evolutionary families and structural groupings. We currently identify 827 homologous families in which the proteins have both structural similarity and sequence and/or functional similarity. These can be further clustered into 593 fold groups and 32 distinct architectures. Using our structural classification and associated data on protein functions stored in the database (EC identifiers, SWISS-PROT keywords and information from the Enzyme database and literature) we have been able to analyse the correlation between the 3D structure and function. More than 96% of folds in the PDB are associated with a single homologous family. However, within the superfolds three or more different functions are observed. Considering enzyme functions, more than 95% of clearly homologous families exhibit either single or closely related functions, as demonstrated by the EC identifiers of their relatives. Our analysis supports the view that determining structures, for example as part of a 'structural genomics' initiative, will make a major contribution to interpreting genome data.
Introduction FOR CATH
The CATH classification of protein domain structures was established in 1993 (1) as a hierarchical clustering of protein domain structures into evolutionary families and structural groupings, depending on sequence and structure similarity. There are four major levels, corresponding to protein class, architecture, topology or fold, and homologous family (Fig. 1). Since 1995 information about these structural groups and protein families has been accessible over the Web (http://www.biochem.ucl.ac.uk/bsm/cath), together with summary information about each individual protein structure (PDBsum) (2).
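The four-level hierarchy can be made concrete with a short sketch that splits a CATH identifier into its levels. It assumes identifiers are written in the dotted four-number form (for example 3.40.50.200), and that class numbers 1 to 4 stand for mainly-alpha, mainly-beta, alpha-beta and few secondary structures; the type and function names are hypothetical.

```python
from typing import NamedTuple

# Assumed mapping of class numbers to the class names used in these notes.
CLASS_NAMES = {
    1: "mainly-alpha",
    2: "mainly-beta",
    3: "alpha-beta",
    4: "few secondary structures",
}


class CathId(NamedTuple):
    cls: int                # C: class
    architecture: int       # A: architecture
    topology: int           # T: topology (fold)
    homologous_family: int  # H: homologous family


def parse_cath_id(text: str) -> CathId:
    """Parse a dotted identifier such as '3.40.50.200' into its four levels."""
    parts = text.strip().split(".")
    if len(parts) != 4 or not all(p.isdigit() for p in parts):
        raise ValueError(f"expected four dot-separated numbers, got {text!r}")
    return CathId(*(int(p) for p in parts))


def same_fold(a: CathId, b: CathId) -> bool:
    """Two domains share a fold group when their C, A and T levels agree."""
    return a[:3] == b[:3]


subtilisin = parse_cath_id("3.40.50.200")
print(CLASS_NAMES[subtilisin.cls], subtilisin)
```

Because each level is nested in the one above it, comparing a prefix of the identifier is enough to decide whether two domains share a class, an architecture or a fold.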
CATH consists of both phylogenetic and phenetic descriptors for protein domain relationships. At the lowest levels in the hierarchy, proteins are grouped into evolutionary families (homologous families) for having either significant sequence similarity (≥35% identity) or high structural similarity and some sequence similarity (≥20% identity).
The CATH database of protein structures contains approximately 18 000 domains organized according to their (C)lass, (A)rchitecture, (T)opology and (H)omologous superfamily. Relationships between evolutionarily related structures (homologues) within the database have been used to test the sensitivity of various sequence search methods in order to identify relatives in GenBank and other sequence databases. Subsequent application of the most sensitive and efficient algorithms, gapped BLAST and the profile-based method Position-Specific Iterated Basic Local Alignment Search Tool (PSI-BLAST), could be used to assign structural data to between 22% and 36% of microbial genomes in order to improve functional annotation and enhance understanding of biological mechanism. However, on a cautionary note, an analysis of functional conservation within fold groups and homologous superfamilies in the CATH database revealed that whilst function was conserved in nearly 55% of enzyme families, function had diverged considerably in some highly populated families. In these families functional properties should be inherited far more cautiously, and the probable effects of substitutions in key functional residues must be carefully assessed.
Figure 1
Schematic representation of the (C)lass, (A)rchitecture and (T)opology/fold levels in the CATH database.
Figure 2
Snapshot of a web page showing data available in the CATH dictionary of homologous superfamilies for the subtilisin family (CATH id 3.40.50.200). Tables display the PDB codes for non-identical relatives in the family, together with EC identifier codes and information about the enzyme reactions. The multiple structural alignment shown has been coloured according to secondary structure assignments (red for helix, blue for strands).
The Architecture level in CATH groups proteins whose folds have similar 3D arrangements of secondary structures (e.g. barrel, sandwich or propeller), regardless of their connectivity, whilst the top level, Class, simply reflects the proportion of α-helix or β-strand secondary structures. Three major classes are recognised, mainly-α, mainly-β and α-β, since analysis revealed considerable overlap between the α+β and alternating α/β classes originally described by Levitt and Chothia (7).
Before classification, multidomain proteins are first separated into their constituent folds using a consensus method which seeks agreement between three independent algorithms (8). Whilst the protocol for updating CATH is largely automatic (9), several stages require manual validation, in particular establishing domain boundaries in proteins for which no consensus could be reached and checking the relationships of very distant homologues and proteins having borderline fold similarity. Although there are plans to assign the more regular architectures automatically, all architecture groupings are currently assigned manually.
A homologous family Dictionary is now available within CATH, which contains functional data, where available, for each protein within a homologous family. This includes EC identifiers, SWISS-PROT keywords and information from the Enzyme database or the literature (Fig. 2). Multiple structure-based alignments are also available, coloured according to secondary structure assignments or residue properties, and there are schematic plots showing domain representations annotated by protein-ligand interactions (DOMPLOTS) (A.E. Todd, C.A. Orengo and J.M. Thornton, submitted to Protein Engng). The topology of each domain is illustrated by schematic TOPS diagrams (http://www3.ebi.ac.uk/tops; 10).
Figure 3
CATH wheel plot showing the population of homologous families in different fold groups, architectures and classes. The wheel is coloured according to protein class (red, mainly-α; green, mainly-β; yellow, α-β; blue, few secondary structures). The size of the outer wheel represents the number of homologous families in CATH, whilst each band in the outer wheel corresponds to a single fold family. The size of each fold 'band' therefore reflects the number of homologous families having that fold. It can be seen that most fold families contain a single homologous family. The superfold families are shown as paler bands containing many homologous families. The inner wheel shows the population of homologous families in the different architectures.
We have also recently set up a Web Server (11) which enables the user to scan the CATH database with a newly determined protein structure and identify possible fold similarities or evolutionary relationships. There are also plans to incorporate sequence searches (using BLAST or PSI-BLAST) (12) to identify a probable fold for a new sequence.
The latest release of CATH (version 1.4, April 1998) contains 9342 protein chains from the PDB (13), which divide into 13 359 domain folds. Currently 32 different architectures are recognised. Since the last release three new architectures have been described, including the five-bladed α-β propeller. Grouping proteins on the basis of sequence, structure and functional similarity gives 827 evolutionary homologous families (H-level), whilst recognising more distant structural similarity with no accompanying sequence or function similarity gives rise to 593 different fold groups (T-level).
The population of the different levels in the CATH hierarchy is illustrated by the CATH wheel shown in Figure 3. It can be seen that several highly populated fold families, which we describe as superfolds (6) as they support a diverse range of sequences and more than three different functions, still account for nearly 30% of non-homologous structures.
Implications for Structural Genomics
As the sequence databases grow rapidly, the need to interpret these sequences and assign functions to specific genes becomes increasingly important. Many techniques exist for matching protein sequences and thereby inheriting functional information. However, for very distant homologues there is often no detectable sequence similarity despite conservation of 3D structure and function. For these cases evolutionary relationships, and thereby functions, can only be assigned by comparing the structures. Therefore a number of structural genomics initiatives are being proposed (14) which aim to identify all the folds in nature, with the ultimate goal of being able to predict the function of a new protein from its known or probable structure. The important questions to ask are: how many more folds do we need to determine before we have the complete set, and how confident can we be in assigning function between proteins having similar structures?
In the current genomes, on average only 30–46% of sequences can be assigned to a structural family by recognising sequence similarity to a protein of known structure (15,16). With only ~600 unique structures currently in the PDB, compared with ~20 000 sequence families, it is clear that we still need to determine many more structures if we are to understand biology at the molecular level.
However, analysis of recently deposited structural data is very revealing. Figure 4a illustrates the distribution of 2159 new structural domains classified in the 10 months from June 1997 to March 1998. A large proportion of these (79%) were clearly homologous (≥30% identity) to proteins of known structure.
Of the remaining 443 structures (Fig. 4b), corresponding to 'new' sequences, we found only 8% were novel folds, the remainder resembling a previously determined structure. Many of these, 199 (45%), could be identified as clear homologues by having significant structure and sequence similarity (SSAP ≥80 and ≥20% sequence identity). A further 169 (38%) were probable homologues: although the sequence identity was below 20%, they had functional similarity and/or gave significant scores using sequence search methods designed to detect very distant homologues (PSI-BLAST) (12). There remained a further 40 (9%) proteins which were analogous, i.e. they had the same fold as a previous entry but neither the sequence nor the function gave definite evidence of a common ancestor.
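The breakdown of the 443 structures can be checked with a few lines of arithmetic. The three counts are the ones quoted above; the number of novel folds is not stated explicitly, so it is derived here as the remainder, which is an assumption.

```python
# Counts quoted in the text for the 443 structures with 'new' sequences.
TOTAL = 443
counts = {
    "clear homologue": 199,
    "probable homologue": 169,
    "analogue": 40,
}
# Assumption: novel folds are whatever remains after the three quoted groups.
counts["novel fold"] = TOTAL - sum(counts.values())


def percent(n: int, total: int = TOTAL) -> int:
    """Share of the total, rounded to a whole percent."""
    return round(100 * n / total)


shares = {name: percent(n) for name, n in counts.items()}
# Clear plus probable homologues: structures assignable to a known family.
homologue_share = percent(counts["clear homologue"] + counts["probable homologue"])

for name, n in counts.items():
    print(f"{name:<20}{n:>5}{shares[name]:>5}%")
print(f"{'all homologues':<20}{'':>5}{homologue_share:>5}%")
```

The derived remainder reproduces the quoted share of novel folds, and the combined homologue share matches the figure used later when discussing assignment of function through structure.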
At the homologous superfamily level in CATH, a more detailed analysis of enzyme functions showed that the majority of homologous enzyme families in CATH (>90%) contained proteins for which the first three EC identifiers were the same. Considering those families where homologues have significant sequence identity (≥20%) after structural alignment, 95% were found to have a single EC identifier, whilst for families where proteins have more than 30% sequence similarity we observed that 98% had a single EC code.
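The comparison of EC identifiers within a family can be sketched as counting how many leading EC levels all relatives share. The function name is hypothetical and the EC numbers in the example are illustrative inputs, not data taken from the analysis.

```python
def shared_ec_levels(ec_numbers: list[str]) -> int:
    """Return how many leading EC levels (0 to 4) are identical across all inputs."""
    split = [ec.split(".") for ec in ec_numbers]
    shared = 0
    # zip(*split) walks the levels in parallel: first numbers, second numbers, ...
    for level in zip(*split):
        # Stop at the first level that differs or is unassigned ('-').
        if len(set(level)) != 1 or level[0] == "-":
            break
        shared += 1
    return shared


family = ["3.4.21.62", "3.4.21.64", "3.4.21.62"]  # illustrative EC numbers
levels = shared_ec_levels(family)
print(levels, "shared levels;",
      "single EC code" if levels == 4 else
      "closely related functions" if levels == 3 else
      "divergent functions")
```

A family whose members agree on the first three levels differs only in the final, substrate-specific number, which is the sense in which the functions are described as closely related.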
Although assigning function on the basis of homology is common practice, it is clear that some caution should be exercised, particularly where there is little or no sequence similarity. There are also some clear examples where homologues with significant sequence similarity perform different functions. The role of 'gene recruitment' is especially clear in the eye lens proteins, which function as enzymes in other cellular environments but which are used as structural proteins in this context (17). The extent of such 'gene recruitment' and context-sensitive function is really not known at this time.
For enzymes it is clear that catalytic function can change and evolve, usually to act on a different but related substrate. Similarly, within the lipocalin family (CATH id 24013010) several proteins are found with very similar structures which bind different fatty acids in the same region at the base of the β-barrel (e.g. retinol, bilin, biotin).
Nearly half of the homologous families where two or more different EC numbers were observed belong to the superfolds. This suggests that if a new protein is assigned to a superfold family, more caution should be used when inheriting functional information, as there appears to be greater tolerance to changes in sequence, and ultimately function, for these families. However, it is interesting to note that many of these were TIM barrel or Rossmann folds. These are superfolds in which the substrate or ligand commonly binds in the same place: in the base of the β-barrel for the TIMs, and at the crossover of the polypeptide chain for the doubly wound Rossmann structures.
Assignment of Function Through Structure
One of the reasons for determining structures is to derive more information to facilitate the assignment of function. From our analysis of proteins in CATH we suggest that structural data can help to assign function in several ways:
i. The structural data allow recognition of more distant homologues compared with sequence data; in our analysis 83% of structures with novel sequences could be assigned as homologues in this way (note that such assignment of function is again subject to the caveats imposed by 'gene recruitment' discussed above).
ii. The structural data allow detailed inspection of the functional site, to suggest if and how the function may have evolved. For example, if an enzyme has evolved to act on a different substrate, the binding site may reveal or at least suggest possible changes in the substrate.
iii. For the superfolds, similarity of structure does not necessarily mean similarity of function. However, the active site/binding sites are often conserved, e.g. in the TIM barrel or Rossmann fold structures the ligand always binds at the same end of the barrel or sheet.
iv. Some methods have already been developed, and will increasingly be the focus of attention over the next few years, which aim to predict function ab initio from structure. For example, enzymes can often be identified by the presence of a major cleft, which also locates the active site (18). Similarly, critical surface patches which are used for molecular recognition in binding other proteins or ligands may be identified using knowledge-based approaches (19,20).
In summary, extrapolating the data from Figure 4 to a new genome, we can expect that of the 54–70% of sequences which currently have no obvious sequence matches in the PDB, we will find nearly 80–90% to be homologous to a known family using the structural data alone. For the singlet folds this will almost certainly reveal some clues to the function. For the superfolds, some folds will reveal information on the functional class (e.g. enzyme for TIM barrels) or the location of the active site, if not the specific function. Only 10–20% will be expected to be 'novel' folds. For these, the ab initio methods referred to above may provide some clues to guide experiments. Therefore it is clear that determining structures, as part of a 'structural genomics' initiative for example, will make a major contribution to interpreting genome data.
Jail
Just another interface library
Interfaces of macromolecules are a valuable basis for analysing the process of molecular recognition. JAIL classifies not only the interfaces between domain architectures but also those between protein chains and those between proteins and nucleic acids.
Figure: Gt-alpha/Gi-alpha chimera (PDB-ID 1GOT) and the interfaces of 1GOT.
Interacting proteins are difficult to crystallize and are rarely present within the Protein Data Bank. Nevertheless, it is essential to analyse the interacting parts of the proteins to understand the process of protein-protein docking.
To overcome this problem we have built up the JAIL database. Since interacting domains exhibit structural features similar to those of interacting proteins, all known interfaces between interacting domains of the SCOP database were extracted and classified in JAIL.
Only a part of all protein structures is included in SCOP. In particular, new PDB entries are not yet annotated. To overcome this problem, all interfaces between protein chains were additionally calculated and included in the database. This type of interface also comprises the interacting parts of the assumed biological units. The last important type of interface provided here is composed of the interacting parts between proteins and nucleic acids. Overall, the data set consists of about 180 000 interfaces.
JAIL is a convenient tool for browsing through the interface library and analysing single interfaces. However, more general questions require large-scale analysis. For this purpose a detailed form enables the compiling of comprehensive non-redundant data sets for download.
How is an interface defined?
A complete residue is part of an interface if at least one atom of the amino acid is located within 4.5 Ångström of any atom of the interacting domain or chain. Each part of an interface must consist of at least 5 C-alpha atoms in the case of protein chains. In the case of nucleic acids, the relevant atoms are calculated on the basis of the phosphorus atoms of the RNA/DNA backbone.
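The distance criterion above can be sketched in plain Python. This is not JAIL's own code: the atom representation (residue id plus x, y, z), the function names, the reading of the cut-off as 4.5 Ångström, and the reading of "at least 5 C-alpha atoms" as at least 5 residues on each side are all assumptions made for illustration.

```python
import math

CUTOFF = 4.5      # Angstrom; assumed reading of the distance threshold
MIN_RESIDUES = 5  # one C-alpha per residue, so 5 C-alpha atoms = 5 residues

# An atom is represented as (residue_id, x, y, z).
Atom = tuple[int, float, float, float]


def interface_residues(side: list[Atom], partner: list[Atom],
                       cutoff: float = CUTOFF) -> set[int]:
    """Residues of `side` with at least one atom within `cutoff` of any partner atom."""
    hits: set[int] = set()
    for res, x, y, z in side:
        if res in hits:
            continue  # the whole residue already counts as interface
        for _, u, v, w in partner:
            if math.dist((x, y, z), (u, v, w)) <= cutoff:
                hits.add(res)
                break
    return hits


def find_interface(a: list[Atom], b: list[Atom]):
    """Return the interface residues of both sides, or None if either side is too small."""
    side_a = interface_residues(a, b)
    side_b = interface_residues(b, a)
    if len(side_a) < MIN_RESIDUES or len(side_b) < MIN_RESIDUES:
        return None
    return side_a, side_b


# Toy data: two parallel strands of single-atom residues, 4.0 Angstrom apart.
chain_a = [(i, 3.8 * i, 0.0, 0.0) for i in range(5)]
chain_b = [(i, 3.8 * i, 4.0, 0.0) for i in range(5)]
print(find_interface(chain_a, chain_b))
```

The all-against-all loop is quadratic in the number of atoms; for whole PDB entries a spatial index such as a grid or k-d tree would be the usual choice.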
What are biounits?
The primary coordinate file deposited in the PDB generally contains one asymmetric unit. The asymmetric unit is the smallest portion of a crystal structure to which crystallographic symmetry can be applied to generate one unit cell. The biological molecule (biounit) is believed to be the functional unit of the protein. Frequently those units can be assumed or calculated when additional information is available. The biological units of many proteins are deposited in a separate section of the PDB database and can be used for interface calculations.
In which way are redundant interfaces excluded in the download section?
The redundancy is excluded in two different ways: by structure and by sequence. The sequence clustering is based on the CD-HIT program. The structural clustering is defined by the protein families and superfamilies of the SCOP classification. The database classifies proteins by domain architecture.
Which settings in the download section are best for my own research?
The selection of the datasets depends on the type of interactions (protein-protein or protein-nucleic acid) and the level of diversity that is desired. A maximal sequence identity of 50% results in a higher diversity than a setting of 95%. The default settings include interfaces of domain-domain interactions as well as interfaces between interacting chains. All interfaces of chains that are already covered by the SCOP domain interfaces are excluded by default. This procedure results in a high number of interfaces that are still diverse enough for statistical analysis.
What is meant by 'show conservation' in Jmol?
The conservation of protein sequences is defined by the mutation rates at each amino acid position. For JAIL this information was retrieved from ConSurf. ConSurf is a derived database merging structural and sequence information.
[Figure: database scheme]
Search options
The search form accepts any of the following:
PDB-Id, e.g. 1aay
SCOP-Id, e.g. d1az0a_
EC-Number, e.g. 3.1.1
Accession number, e.g. P03697
Protein name, e.g. capsid protein
Searches can be restricted to the following interface types: Domain/Domain (SCOP), Chain/Chain, Protein/Nucleic and Biounit/Biounit.
A full-text keyword search and a SCOP text keyword search are also available.
Results can be limited to interfaces having a given location: intra, inter, or don't care.
MMDB
Experimentally resolved structures of proteins, RNA and DNA, derived from the Protein Data Bank (PDB), with value-added features such as explicit chemical graphs and computationally identified 3D domains (compact substructures) that are used to identify similar 3D structures, as well as links to literature, similar sequences, information about chemicals bound to the structures, and more. These connections make it possible, for example, to find 3D structures for homologs of a protein sequence of interest, then interactively view the sequence-structure relationships, active sites, bound chemicals, journal articles, and more.
Three-dimensional structures are now known within many protein families, and it is quite likely, in searching a sequence database, that one will encounter a homolog with known structure. The goal of Entrez's 3D-structure database is to make this information, and the functional annotation it can provide, easily accessible to molecular biologists. To this end Entrez's search engine provides three powerful features: (i) Sequence and structure neighbors: one may select all sequences similar to one of interest, for example, and link to any known 3D structures. (ii) Links between databases: one may search by term matching in MEDLINE, for example, and link to 3D structures reported in these articles. (iii) Sequence and structure visualization: identifying a homolog with known structure, one may view molecular-graphic and alignment displays to infer approximate 3D structure. In this article we focus on two features of Entrez's Molecular Modeling Database (MMDB) not described previously: links from individual biopolymer chains within 3D structures to a systematic taxonomy of organisms represented in molecular databases, and links from individual chains (and compact 3D domains within them) to structure neighbors, other chains (and 3D domains) with similar 3D structure. MMDB may be accessed at http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Structure
SUPERFAMILY is a database of structural and functional annotation for all proteins and genomes.[1][2][3][4][5] The SUPERFAMILY annotation is based on a collection of hidden Markov models, which represent structural protein domains at the SCOP superfamily level.[6] A superfamily groups together domains which have an evolutionary relationship. The annotation is produced by scanning protein sequences from completely sequenced genomes against the hidden Markov models.
For each protein you can
Submit sequences for SCOP classification
View domain organisation sequence alignments and protein sequence details
For each genome you can
Examine superfamily assignments phylogenetic trees domain organisation lists and
networks
Check for over- and under-represented superfamilies within a genome
For each superfamily you can
Inspect SCOP classification, functional annotation, Gene Ontology annotation, InterPro abstract and genome assignments
Explore taxonomic distribution of a superfamily across the tree of life
All annotation models and the database dump are freely available for download to everyone
Purpose
SUPERFAMILY classifies amino acid sequences into known structural domains, especially into SCOP superfamilies. The superfamilies are groups of proteins which have structural evidence to support a common evolutionary ancestor but may not have detectable sequence homology.
Major Features
Sequence search
Submit your protein or DNA sequence for SCOP superfamily and family level classification.
Keyword search
Search for superfamily, family or species names, plus sequence, SCOP, PDB or hidden Markov model IDs.
Domain assignments
Domain assignments, alignments and architectures for completely sequenced eukaryotic and prokaryotic organisms, plus sequence collections.
Comparative genomics tools
Browse unusual (over- and under-represented) superfamilies and families, adjacent domain pair lists and graphs, unique domain pairs, domain combinations, domain architecture co-occurrence networks and domain distribution across taxonomic kingdoms for each organism.
Genome statistics
For each genome: number of sequences, number of sequences with assignment, percentage of sequences with assignment, percentage total sequence coverage, number of domains assigned, number of superfamilies assigned, number of families assigned, average superfamily size, percentage produced by duplication, average sequence length, average length matched, number of domain pairs and number of unique domain architectures.
Gene Ontology
Domain-centric Gene Ontology (GO), automatically annotated, by Hai Fang.
Phenotype Ontology
Domain-centric phenotype/anatomy ontology, including Disease Ontology, Human Phenotype, Mouse Phenotype, Worm Phenotype, Yeast Phenotype, Fly Phenotype, Fly Anatomy, Zebrafish Anatomy, Xenopus Anatomy and Arabidopsis Plant.
Superfamily annotation
InterPro abstracts for 1052 superfamilies and Gene Ontology (GO) annotation for 763 superfamilies.
Functional annotation
Functional annotation of SCOP 1.73 superfamilies, by Christine Vogel.
Phylogenetic trees
Trees are generated using heuristic parsimony methods and are based on protein domain architecture data for all genomes in SUPERFAMILY. Genome combinations or specific clades can be displayed as individual trees.
Similar domain architectures
Find the 10 domain architectures which are most similar to a domain architecture of interest.
Hidden Markov models
Produce SCOP domain assignments for your sequences using the SUPERFAMILY models. HMM visualisation by Martin Madera, e.g. model 0045110.
Profile comparison
Find remote domain matches when the HMM search fails to find a significant match. Profile comparison (PRC), for aligning and scoring two profile hidden Markov models, by Martin Madera.
Web services
Distributed Annotation Server and linking to SUPERFAMILY.
Downloads
Sequences, assignments, models, MySQL database and scripts, updated weekly.
The SUPERFAMILY database provides protein domain assignments, at the SCOP superfamily level, for the predicted protein sequences in over 400 completed genomes. A superfamily groups together domains of different families which have a common evolutionary ancestor based on structural, functional and sequence data. SUPERFAMILY domain assignments are generated using an expert-curated set of profile hidden Markov models. All models and structural assignments are available for browsing and download from http://supfam.org. The web interface includes services such as domain architectures and alignment details for all protein assignments, searchable domain combinations, domain occurrence network visualization, detection of over- or under-represented superfamilies for a given genome by comparison with other genomes, assignment of manually submitted sequences and keyword searches. In this update we describe the SUPERFAMILY database and outline two major developments: (i) incorporation of family level assignments and (ii) a superfamily-level functional annotation. The SUPERFAMILY database can be used for general protein evolution and superfamily-specific studies, genomic annotation, and structural genomics target suggestion and assessment.
A structural unit
Domain score
The score of a single domain aligned to an HMM. Note that for HMMER2, if there was more than one domain, the sequence score was the sum of all the domain scores for that Pfam entry. This is not quite true for HMMER3.
DUF
Domain of unknown function
Envelope coordinates
See Alignment coordinates
Family
A collection of related protein regions
Full alignment
An alignment of the set of related sequences which score higher than the manually set threshold values for the HMMs of a particular Pfam entry.
Gathering threshold (GA)
Also called the gathering cut-off, this value is the search threshold used to build the full alignment. The gathering threshold is assigned by a curator when the family is built. The GA is the minimum score a sequence must attain in order to belong to the full alignment of a Pfam entry. For each Pfam HMM we have two GA cutoff values, a sequence cutoff and a domain cutoff.
HMMER
The suite of programs that Pfam uses to build and search HMMs. Since Pfam release 24.0 we have used HMMER version 3 to make Pfam. See the HMMER site for more information.
Hidden Markov model (HMM)
An HMM is a probabilistic model. In Pfam we use HMMs to transform the information contained within a multiple sequence alignment into a position-specific scoring system. We search our HMMs against the UniProt protein database to find homologous sequences.
HMMER3
The suite of programs that Pfam uses to build and search HMMs. See the HMMER site for more information.
iPfam
A resource that describes domain-domain interactions that are observed in PDB entries. Where two or more Pfam domains occur in a single structure, it analyses them to see if they are close enough to form an interaction. If they are close enough, it calculates the bonds forming the interaction.
Metaseq
A collection of sequences derived from various metagenomics datasets
Motif
A short unit found outside globular domains
Noise cutoff (NC)
The bit scores of the highest scoring match not in the full alignment
Pfam-A
An HMM-based, hand-curated Pfam entry which is built using a small number of representative sequences. We manually set a threshold value for each HMM and search our models against the UniProt database. All of the sequences which score above the threshold for a Pfam entry are included in the entry's full alignment.
Pfam-B
An automatically generated alignment which is formed by taking a cluster of sequences from the ADDA database and removing Pfam-A residues from them. Since Pfam-B families are automatically generated, we recommend that you verify that the sequences in a Pfam-B family are related using other methods, such as BLAST. For Pfam 24.0 we have made HMMs for the first (and therefore largest) 20 000 Pfam-B families. Users can search their sequences against the Pfam-B HMMs in addition to the Pfam-A HMMs when performing both single-sequence searches and batch searches on the website.
Posterior probability
HMMER3 reports a posterior probability for each residue that matches a match or insert state in the profile HMM. A high posterior probability shows that the alignment of the amino acid to the match/insert state is likely to be correct, whereas a low posterior probability indicates that there is alignment uncertainty. This is indicated on a scale from 10, the highest certainty, down to 1, complete uncertainty. Within Pfam we display this information as a heat map view, where green residues indicate high posterior probability and red ones indicate a lower posterior probability.
Repeat
A short unit which is unstable in isolation but forms a stable structure when
multiple copies are present
Seed alignment
An alignment of a set of representative sequences for a Pfam entry. We use this alignment to construct the HMMs for the Pfam entry.
Sequence score
The total score of a sequence aligned to an HMM. If there is more than one domain, the sequence score is the sum of all the domain scores for that Pfam entry. If there is only a single domain, the sequence score and the domain score for the protein will be identical. We use the sequence score to determine whether a sequence belongs to the full alignment of a particular Pfam entry.
Trusted cutoff (TC)
The bit score of the lowest-scoring match in the full alignment.
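As a rough illustration of the two definitions above, the sketch below sums per-domain bit scores into a sequence score and takes the trusted cutoff as the lowest sequence score among the members of a full alignment. The function names and the example scores are made up for illustration; this is not Pfam's own code.

```python
def sequence_score(domain_scores):
    """Sum the bit scores of all domains of one Pfam entry found on a sequence."""
    return sum(domain_scores)


def trusted_cutoff(full_alignment_scores):
    """Lowest bit score among the matches that made it into the full alignment."""
    return min(full_alignment_scores)


# Hypothetical per-domain bit scores for three sequences in a full alignment.
hits = {
    "seqA": [31.5, 12.0],       # two domains: scores are added
    "seqB": [27.25],            # single domain: sequence score equals domain score
    "seqC": [40.0, 8.5, 6.5],   # three domains
}
scores = {name: sequence_score(domains) for name, domains in hits.items()}
tc = trusted_cutoff(scores.values())
```

Here the single-domain sequence ends up defining the cutoff, because its sequence score is the lowest of the three.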
SMART (a Simple Modular Architecture Research Tool) allows the identification and annotation of genetically mobile domains and the analysis of domain architectures. More than 500 domain families found in signalling, extracellular and chromatin-associated proteins are detectable. These domains are extensively annotated with respect to phyletic distributions, functional class, tertiary structures and functionally important residues. Each domain found in a non-redundant protein database, as well as search parameters and taxonomic information, are stored in a relational database system. User interfaces to this database allow searches for proteins containing specific combinations of domains in defined taxa.
Features
You can use SMART in two different modes: normal or genomic. The main difference is in the underlying protein database used. In Normal SMART, the database contains Swiss-Prot, SP-TrEMBL and stable Ensembl proteomes. In Genomic SMART, only the proteomes of completely sequenced genomes are used: Ensembl for metazoans and Swiss-Prot for the rest.
The protein database in Normal SMART has significant redundancy, even though identical proteins are removed. If you use SMART to explore domain architectures, or want to find exact domain counts in various genomes, consider switching to Genomic mode. The numbers in the domain annotation pages will be more accurate, and there will not be many protein fragments corresponding to the same gene in the architecture query results. Remember that you are exploring a limited set of genomes, though.
Different color schemes are used to easily identify the mode you are in (Normal mode or Genomic mode).
Alignment
Representation of a prediction of the amino acids in tertiary structures of homologues that overlay in three dimensions. Alignments held by SMART are mostly based on published observations (see domain annotations for details), but are updated and edited manually.
Alignment block
Ungapped alignments that usually represent a single secondary structure
Bits scores
Alignment scores are reported by HMMer and BLAST as bits scores. The likelihood that the query sequence is a bona fide homologue of the database sequence is compared to the likelihood that the sequence was instead generated by a random model. Taking the logarithm (to base 2) of this likelihood ratio gives the bits score.
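The definition above amounts to a base-2 log-likelihood ratio. A minimal sketch follows; the function names and the example likelihoods are assumptions chosen for illustration, and real tools add corrections that are not shown here.

```python
import math


def bit_score(p_model, p_null):
    """log2 of the likelihood ratio: homology model versus random model."""
    return math.log2(p_model / p_null)


def bit_score_from_logs(ln_p_model, ln_p_null):
    """Same quantity from natural-log likelihoods, which avoids underflow
    when the raw probabilities are extremely small."""
    return (ln_p_model - ln_p_null) / math.log(2)


# A sequence 8 times more likely under the model than under the random
# model scores 3 bits; equal likelihoods score 0 bits.
three_bits = bit_score(8.0, 1.0)
zero_bits = bit_score(0.25, 0.25)
```

A positive score means the homology model explains the sequence better than the random model, and each additional bit doubles the likelihood ratio.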
BLAST Basic local alignment search tool
An excellent database searching tool developed at the National Center for Biotechnology Information (NCBI) ([1] [2] [3] [4]). SMART uses NCBI-BLAST for detection of outlier homologues and homologues of known structure. WU-BLAST is used for nrdb searches with user-supplied sequences.
Cellular Role
Chromatin-associated: Chromatin is the tangled fibrous complex of DNA and protein within a eukaryotic nucleus.
Interaction (with the environment): Molecules that sense cellular environmental change, such as osmolarity, light flux, acidity, ion concentration, etc.
Metabolic: Enzymes that catalyze reactions in living cells that transform organic molecules.
Replication: The process of making an identical copy of a section of duplex (double-stranded) DNA, using existing DNA as a template for the synthesis of new DNA strands.
Signalling: Proteins that participate in a pathway that is initiated by a molecular stimulus and terminated by a cellular response.
Transport: Transporters (permeases) are proteins that straddle the cell membrane and carry specific nutrients, ions, etc. across the membrane.
Translation: The process in which the genetic code carried by messenger RNA directs the synthesis of proteins from amino acids.
Transcription: The synthesis of an RNA copy from a sequence of DNA (a gene); the first step in gene expression.
Coiled coils
Intimately associated bundles of long alpha-helices ([1] [2] [3]). Coiled coils are detected in SMART using the method of Lupas et al. ([4], COILS home). Coiled coil predictions are indicated on the second line in SMART's graphical output.
Domain
Conserved structural entities with distinctive secondary structure content and a hydrophobic core. In small disulphide-rich and Zn2+-binding or Ca2+-binding domains, the hydrophobic core may be provided by cystines and metal ions, respectively. Homologous domains with common functions usually show sequence similarities.
Domain composition
Proteins with the same domain composition have at least one copy of each of the domains of the query.
Domain organisation
Proteins having all the domains of the query in the same order (additional domains are allowed).
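The difference between the two query types can be sketched as a set test versus an ordered-subsequence test. The function names are hypothetical, and the sketch assumes that additional domains may appear anywhere in the target protein; SMART's own implementation may differ.

```python
def same_composition(query, target):
    """True if the target has at least one copy of each domain of the query."""
    return set(query) <= set(target)


def same_organisation(query, target):
    """True if the query's domains occur in the target in the same order.

    Extra domains in the target are allowed. The check consumes the target
    iterator, so each query domain must be found after the previous one.
    """
    remaining = iter(target)
    return all(domain in remaining for domain in query)


query = ["SH3", "SH2", "TyrKc"]
with_extra = ["SH3", "PH", "SH2", "TyrKc"]   # same order, one extra domain
reordered = ["SH2", "SH3", "TyrKc"]          # same domains, different order
incomplete = ["SH3", "TyrKc"]                # one query domain missing
```

With these inputs, `with_extra` satisfies both tests, `reordered` satisfies only the composition test, and `incomplete` satisfies neither.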
Entrez
A WWW-based system that allows easy retrieval of sequence, structure, molecular biology and literature data (Entrez). SMART's domain annotation pages contain links to the Entrez system, thereby providing extensive literature, structure and sequence information.
E-value
This represents the number of sequences with a score greater than or equal to X expected absolutely by chance. The E-value connects the score (X) of an alignment between a user-supplied sequence and a database sequence, generated by any algorithm, with how many alignments with similar or greater scores would be expected from a search of a random sequence database of equivalent size. Since version 2.0, E-values are calculated using Hidden Markov Models, leading to more accurate estimates than before.
Extracellular Domains
Domain families that are most prevalent in proteins outside of the cytoplasm and the nucleus
Gap
A position in an alignment that represents a deletion within one sequence relative to another. Gap penalties are requirements for alignment algorithms in order to reduce excessively gapped regions. Gaps in alignments represent insertions that usually occur in protruding loops or beta-bulges within protein structures.
Genomic database
Protein database used in SMART's Genomic mode. It contains data from completely sequenced genomes only. Ensembl data is used for Metazoan genomes and Swiss-Prot for others. A complete list of genomes in the database is available.
HMM Hidden Markov model
HMMs are statistical models of the sequence consensus of a homologous family (see the docu). A particular class of HMMs has been shown to be equivalent to generalised profiles (8867839). Applications of HMMs to sequence analysis are nicely provided by HMMer and SAM.
HMM consensus
The HMM consensus is a one-line summary of the corresponding HMM. The amino acid shown for the consensus is the highest-probability amino acid at that position according to the HMM. Capital letters mean highly conserved residues (probability > 0.5 for protein models) (modified from the HMMer User's Guide).
HMMer
The HMMer package ([1] [2]) provides multiple alignment and database searching capabilities. There are several programs in the package (see the docu), including one (hmmfs) that searches databases for non-overlapping LOCAL similarities (i.e. that match across at least part of the HMM) and another (hmmls) that searches databases for non-overlapping GLOBAL similarities (i.e. that match across the full HMM). These correspond approximately to profile-based searches using negative and positive profiles, respectively (see WiseTools). Database searches using hmmls or hmmfs provide alignment scores as bits scores.
Homology
Evolutionary descent from a common ancestor due to gene duplication
Intracellular Domains
Domain families that are most prevalent in proteins within the cytoplasm
Localisation
Numbers of domains that are thought, from SwissProt annotations, to be present in different cellular compartments (cytoplasm, extracellular space, nucleus and membrane-associated) are shown in annotation pages.
Motif
Sequence motifs are short conserved regions of polypeptides. Sets of sequence motifs need not necessarily represent homologues.
NRDB non-redundant database
A database that contains no identical pairs of sequences. It can contain multiple sequences originating from the same gene (fragments, alternative splicing products). SMART in Standard mode uses such a database.
ORF
Open reading frame
Outlier homologues
These are often difficult to detect using HMM methodology. A complementary approach to their detection is to query a database of sequences taken from multiple sequence alignments using BLAST.
Selecting this option will also activate searches against sequence databases derived from proteins of known structure. A simple BLAST search of the PDB is performed, together with a search of RPS_Blast profiles derived from SCOP. These profiles were kindly provided by Steffen Schmidt (see Schmidt et al., J Chem Inf Comput Sci 2002 (42) 405-7).
P-value
This represents the probability that, given a database of a particular size, random sequences score higher than a value X. P-values are generated by the BLAST algorithm that has been integrated into SMART.
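BLAST-style statistics commonly relate the P-value to the E-value by treating the number of chance hits as Poisson distributed, so that the probability of at least one chance hit is P = 1 - exp(-E). The sketch below uses that textbook relation; whether SMART applies exactly this conversion is an assumption, and the function name is made up for illustration.

```python
import math


def p_value_from_e_value(e_value):
    """Probability of at least one chance hit, given the expected number E.

    Uses -expm1(-E), which equals 1 - exp(-E) but stays accurate for tiny E.
    """
    return -math.expm1(-e_value)


small = p_value_from_e_value(1e-5)   # for small E, P is almost equal to E
one = p_value_from_e_value(1.0)      # about 0.63
large = p_value_from_e_value(50.0)   # approaches 1 for large E
```

For small values the two numbers are practically interchangeable, which is why low E-values are often read directly as probabilities.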
PDB protein data bank
PDB is an archive of experimentally determined three-dimensional structures (Brookhaven Nat. Labs or EBI). Domain families represented in SMART and in the PDB are annotated as being of known structure; links are provided in SMART to the PDB via PDBsum and MMDB. PDBsum links can be used to access a variety of sequence-based and structure-based tools, whereas MMDB provides access to literature information and structure similarities.
PFAM
Pfam is a database of protein domain families represented as (i) multiple alignments and (ii) HMM-profiles ([1] [2]). Pfam WWW servers allow comparison of user-supplied sequences with the Pfam database (Sanger Center and Washington Univ.).
SMART contains a facility to search the Pfam collection using HMMer.
Prokaryotic domains
SMART now also searches for domains found in two-component regulatory systems. These can be found mainly in prokaryotes, but a few were also found in eukaryotes like yeast and plants.
Profile
A profile is a table of position-specific scores and gap penalties, representing a homologous family, that may be used to search sequence databases (Ref. [1] [2] [3]).
In CLUSTAL-W-derived profiles, those sequences that are more distantly related are assigned higher weights ([4] [5] [6]). Issues in profile-based database searching are discussed in Bork & Gibson (1996) [7].
ProfileScan
An excellent WWW server that allows a user to compare a protein or DNA sequence against a database of profiles (located at the ISREC).
PROSITE
This is a dictionary of protein sites and motif patterns. Some SMART domain annotations contain links to PROSITE.
Schnipsel database domain sequence database
Schnipsel is a German word meaning snippet or fragment. The schnipsel database consists of the sequences of all domains found with SMART in NRDB. Outliers of a family often cannot be detected by a profile, yet are detectable by pairwise similarity to one or more established members of a sequence family. So searching against the schnipsel database gives complementary information to the profile searches.
Searched domains
In the first version of SMART only eukaryotic signalling domains could be searched. In 1998 we extended this set with prokaryotic signalling and extracellular domains. In the input page you can choose whether you want to search only cytoplasmic (prokaryotic + eukaryotic), extracellular or all domains.
Secondary Literature
The secondary literature is derived by the following procedure. For each of the hand-selected papers referenced by a domain, 100 neighbouring papers are retrieved using Medline. If one of these neighbouring papers is referenced from more than two original papers, it is included in the secondary literature list.
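The selection rule described above can be sketched as a simple count over neighbour lists. The retrieval from Medline is not shown: the sketch assumes the neighbour lists have already been fetched, and the function name, parameter names and paper identifiers are hypothetical.

```python
from collections import Counter


def secondary_literature(neighbours_by_paper, max_neighbours=100, min_sources=3):
    """Return neighbouring papers linked from more than two original papers.

    neighbours_by_paper maps each hand-selected paper to its list of
    neighbouring papers. Only the first max_neighbours entries are used.
    """
    counts = Counter()
    for neighbours in neighbours_by_paper.values():
        # A set ensures a neighbour is counted once per original paper.
        for paper_id in set(neighbours[:max_neighbours]):
            counts[paper_id] += 1
    return sorted(p for p, n in counts.items() if n >= min_sources)


# Hypothetical identifiers: "a" neighbours three original papers,
# "b" and "c" only two each.
example = {
    "orig1": ["a", "b", "c"],
    "orig2": ["a", "b"],
    "orig3": ["a", "c"],
}
selected = secondary_literature(example)
```

"More than two" is implemented as a minimum of three sources, so only paper "a" is selected in this example.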
Seed Alignment
Alignment that contains only one of each pair of homologues that are represented in a CLUSTALW-derived phylogenetic tree linked by a branch of length less than a distance of 0.2 (see the related article).
SEG
A program of Wootton & Federhen [1] that detects regions of the query sequence that have low compositional complexity [2].
Sequence ID or ACC
Sequence identifiers or accession codes may be entered via the SMART homepage to initiate a query. You can use either Uniprot or Ensembl sequence identifiers.
Signalling Domains
The original set of domains used in SMART were collected as those that satisfied one or both of two criteria:
1. cytoplasmic domains that possess kinase, phosphatase, ubiquitin ligase or phospholipase enzymatic activities, or those that stimulate GTPase-activation or guanine nucleotide exchange
2. cytoplasmic domains that occur in at least two proteins with different domain organisations, of which one also contains a domain that satisfies criterion 1
These domains mostly mediate or regulate the transduction of an extracellular signal towards the nucleus, resulting in the initiation of a cellular response. More recently, prokaryotic two-component signalling domains have been added to the SMART set.
SignalP
This program predicts the presence and location of signal peptide cleavage sites in amino acid sequences (SignalP home page).
SMART Simple Modular Architecture Research Tool
Our main goal in providing this tool is to allow automatic identification and annotation of domains in user-supplied protein sequences.
Species
Numbers of domains present in a variety of selected taxa (animal, archaea, bacteria, fungi, plants and protozoa) are shown in annotation pages.
SwissProt
The SwissProt database is an extensively annotated and non-redundant collection of protein sequences. SwissProt annotations have been mined for SMART-derived annotations of alignments.
TMHMM2
This program predicts the location and topology of transmembrane helices (TMHMM2)
Thresholds
For each of the domains found by SMART, a combination of thresholds is used to distinguish between true and false hits. The different thresholds are described in the SMART paper.
WiseTools
A package that is based on database searches using profiles. Profiles may be generated using PairWise and then compared with sequence databases using SearchWise. Scores are generated for alignments that match the whole of the profile (using a positive profile) or else that match at least part of the profile (using a negative profile). Only the top-scoring optimal alignment of each sequence is reported; hence SMART relies on re-iterating the search for new repeats until none are reported that score above threshold. Score thresholds have been set manually that are considered to represent a score just above the top-scoring true negative. Additional thresholds have been estimated for domains that are repeated in single polypeptides. A more recent package allows comparison of DNA sequences at the level of their conceptual translations, regardless of sequence error and introns (see Wise2).
What's new
Changes from version 6.0 to 7.0
Full text search engine
Perform a full text search of SMART and Pfam domain annotations, plus the complete protein descriptions for Uniprot and Ensembl proteins.
metaSMART
Explore and compare domain architectures in various publicly available metagenomics datasets.
iTOL export and visualization
Domain architecture analysis results can be exported and visualized in interactive Tree Of Life. The new option can be found in the protein list function select list.
User interface cleanup
Various small changes to the UI, resulting in faster and easier navigation.
Changes from version 5.1 to 6.0
Metabolic pathways information
SMART domains and proteins available in the genomic mode now have basic metabolic pathways information. It is generated by mapping our genomic mode protein database to the KEGG orthologous groups.
Updated genomic mode protein database
The greatly expanded genomic mode database now includes sequences from 630 completely sequenced species.
Changes from version 5.0 to 5.1
SMART webservice
You can access SMART using our webservice. Check the WSDL file for details. The webservice is still under active development and only works for sequences and IDs which are in our database (complete Uniref100 and stable Ensembl genomes).
SMART DAS server
The SMART DAS server is available at URL http://smart.embl.de/smart/das. It provides all the protein features from our database (SMART domains, signal peptides, transmembrane regions and coiled coils) for all Uniref100 and Ensembl proteins.
If you need help with these services, or have questions or feedback, please contact us.
Changes from version 4.1 to 5.0
New protein database in Normal mode
SMART now uses Uniprot as the main source of protein sequences. All Ensembl proteomes (except pre-releases) are also included. To lower redundancy in the database, the following procedure is used:
o only one copy of 100% identical proteins is kept (different IDs are still available)
o each species' proteins are separated
o CD-HIT clustering with 96% identity cutoff is performed on each species separately
o longest member of each protein cluster is used as the representative
o only representative cluster members and single proteins (i.e. proteins which are not members of any clusters) are used in all domain architecture queries and for domain counts in the annotation pages
Even though the number of proteins in the database is almost doubled (current version has around 29 million proteins), the redundancy should be minimal.
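The redundancy-reduction procedure above can be sketched as follows. The clustering step itself is not implemented: the sketch assumes cluster labels (for example from a per-species CD-HIT run) are already available in a dictionary. It also assumes identical sequences are collapsed within a species. All names and sequences are hypothetical.

```python
from collections import defaultdict


def representatives(proteins, cluster_of):
    """Pick one representative protein per cluster and species.

    proteins: list of (protein_id, species, sequence) tuples.
    cluster_of: protein_id -> cluster label; proteins missing from the
    mapping are treated as single proteins (their own cluster).
    Returns (sorted representative IDs, aliases of collapsed duplicates).
    """
    # Keep one copy of identical sequences, remembering every ID.
    ids_by_sequence = defaultdict(list)
    for protein_id, species, sequence in proteins:
        ids_by_sequence[(species, sequence)].append(protein_id)

    aliases = {}
    groups = defaultdict(list)
    for (species, sequence), ids in ids_by_sequence.items():
        kept = ids[0]
        aliases[kept] = ids
        label = cluster_of.get(kept, kept)
        # Clusters are handled separately for each species.
        groups[(species, label)].append((len(sequence), kept))

    # The longest member of each cluster is its representative.
    reps = sorted(max(members)[1] for members in groups.values())
    return reps, aliases


proteins = [
    ("P1", "human", "MKTAYIAK"),
    ("P2", "human", "MKTAYIAK"),     # identical to P1
    ("P3", "human", "MKTAYIAKQR"),   # longer member of the same cluster
    ("P4", "human", "GGGG"),         # single protein, no cluster
    ("P5", "yeast", "MKTAYIAKQR"),   # other species, kept separately
]
reps, aliases = representatives(proteins, {"P1": "c1", "P3": "c1", "P5": "c1"})
```

In this example P3 represents the human cluster, P4 is kept as a single protein, and P5 is kept because clustering is done per species.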
Domain architecture invention dating
As a further step from the single domain to the understanding of multi-domain proteins, SMART now predicts the taxonomic class where the concept of a protein, that is its domain architecture, was invented. The domain architecture is defined as the linear order of all SMART domains in the protein sequence. To derive the point of its invention, all proteins with the same domain architecture are mapped onto NCBI's taxonomy. The last common ancestor of all organisms containing at least one protein with the domain architecture is defined as the point of its origin.
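Finding the last common ancestor can be sketched as taking the longest shared prefix of root-to-species lineages. The lineages below are abbreviated and typed in by hand for illustration; a real analysis would read them from the NCBI taxonomy, and the function name is an assumption.

```python
def last_common_ancestor(lineages):
    """Return the deepest taxon shared by all root-to-species lineages.

    Each lineage is a list of taxon names ordered from the root downwards.
    Returns None if the lineages share no taxon or the list is empty.
    """
    ancestor = None
    for taxa_at_depth in zip(*lineages):
        if len(set(taxa_at_depth)) != 1:
            break  # lineages diverge at this depth
        ancestor = taxa_at_depth[0]
    return ancestor


# Abbreviated lineages of three organisms that carry the same
# hypothetical domain architecture.
lineages = [
    ["cellular organisms", "Eukaryota", "Metazoa", "Chordata", "Homo sapiens"],
    ["cellular organisms", "Eukaryota", "Metazoa", "Arthropoda",
     "Drosophila melanogaster"],
    ["cellular organisms", "Eukaryota", "Fungi", "Saccharomyces cerevisiae"],
]
origin = last_common_ancestor(lineages)
```

With a fungal member present, the architecture is dated to Eukaryota; with only the two animals it would be dated to Metazoa.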
Changes from version 4.0 to 4.1
Two modes of operation: Normal and Genomic
For more details, visit the change mode page.
Intrinsic protein disorder prediction
You can now include DisEMBL prediction of protein disorder in your search parameters. The results are displayed as blue regions in protein schematics. DisEMBL's HOTLOOPS and REM465 methods are currently used. Visit the DisEMBL page for more information on the method.
Catalytic activity check
SMART now includes data on specific requirements for catalytic activity for some of our domains (50 catalytic domains at the moment; full list here). If the required amino acids are not present in the predicted domain, it will be marked as Inactive. The domain annotation page will show you the details on which amino acids are missing and the links to relevant literature (Pubmed link).
Taxonomic trees
Architecture query results are now displayed as simple taxonomic trees. In addition to individual proteins, you can select any taxonomic node (or multiple nodes) and display all the proteins in those nodes.
The Evolution section of domain annotation pages is now also represented as a taxonomic tree. The basic tree contains only several hand-picked representative species, with a link to the full tree.
User interface redesigned
SMART has been completely rewritten and all pages conform to the XHTML 1.0 Strict and CSS Level 2 standards. We recommend a modern standards-compliant browser for the best experience. Mozilla Firefox and Opera are our favorites.
Changes from version 3.5 to 4.0
Intron positions are shown in protein schematics
For proteins that match any of the Ensembl predictions, SMART will show intron positions as vertical coloured lines in graphical representations (see example). This information is retrieved from a pre-calculated mapping of Ensembl gene structures to protein sequences.
The vertical line at the end of the protein is not an actual intron, but a mark to show that intron mapping was performed. If that is the only line, there are no introns annotated. If there is no line at all, there is no data available in Ensembl for that particular sequence.
You can switch off intron display on your SMART preferences page.
Alternative splicing information
Since SMART now incorporates Ensembl genomes, the Additional information page shows a list of alternative splices of the gene encoding the analyzed protein (if there are any). It is possible to either display SMART protein annotation for any of the alternative splices or get a graphical multiple sequence alignment of all of them.
Orthology information
There are 2 separate sets of orthologs for each Ensembl protein: 1:1 reciprocal best matches in other genomes, and orthologous groups with reciprocal best hits from all genomes analyzed (i.e. each of these proteins has exactly one ortholog in all 6 genomes).
This data is displayed on the Additional information page.
Graphical multiple sequence alignments
Orthologous groups and different alternative splices can be displayed as graphical multiple sequence alignments. Proteins are aligned using ClustalW. Domains, intrinsic features and introns are mapped onto the alignment, with their positions adjusted according to gaps (black boxes).
Changes from version 3.4 to 3.5
Features of all Ensembl genomes are stored in SMART
You can use standard Ensembl protein identifiers (for example ENSP00000264122) and sequences in all queries.
Batch access
Quickly display results for hundreds of sequences. Currently limited to 500 sequences per access, 2000 per day.
Smarter handling of identical sequences
If there are multiple IDs associated with a particular sequence, an extra table will be displayed showing all of them.
Changes from version 3.3 to 3.4
Search structure based profiles using RPS-Blast
Clicking on the search schnipsel and structures checkbox will now also initiate a search of profiles based on scop domain families using RPS-Blast. These profiles were kindly provided by Steffen Schmidt (see Schmidt et al., J Chem Inf Comput Sci 2002 (42) 405-7).
Improved Architecture analysis
In addition to standard Domain selection querying, it is now possible to do queries based on GO (Gene Ontology; click here for more info) terms associated with domains. In the first step, you get a list of domains matching the GO terms entered. After selecting the domains of interest from the list, proteins containing those domains are displayed. Use the Taxonomic selection box to limit the results based on taxonomic ranges.
Pfam domains are stored in the database
The SMART database now contains precomputed results for all Pfam domains. To use Pfam domains in the architecture queries, prepend the domain name with Pfam: (for example: TyrKc AND Pfam:Fz AND TRANS).
Try our SMART Toolbar for the Mozilla web browser. Click here for more info.
Changes from version 3.2 to 3.3
Fantastic new protein picture generator
Proteins are now displayed as dynamically generated PNG images. This means that you can download the entire protein representation as a single image. You have our permission to do this and use these diagrams in any way that you like; do acknowledge us, though. All domain bubbles have been script-generated using The Gimp and its Perl-Fu extension. The script is GPL'd and is available for download.
Fancy colour alignments
All alignments we generate are coloured in using Leo Goodstadt's excellent CHROMA program. CHROMA is available from here, and it's great. You may experience some problems if you're using a klunky browser. This will be fixed when you change your browser.
Excellent transmembrane domains
These are now calculated using the fine TMHMM2 program (the main web site is here), kindly provided by Anders Krogh and co-workers. You can read about the method here.
Selective SMART
We now store intrinsic features such as transmembrane domains and signal peptides in our database. This means that these can be queried for (e.g. SIGNAL AND TyrKc). The feature names are: signal peptide, SIGNAL; transmembrane domain, TRANS; coiled coil, COIL.
Technical changes
SMART code has been modified to run under the Apache mod_perl module. SMART is now using Apache::DBI for persistent database connections. The database engine has been updated to PostgreSQL 7.1. These changes resulted in significant speed improvements.
Bug fixes
As usual. The show_many_proteins script now uses the POST method, so there is no longer a limit on the number of proteins you can display. Fixed a display problem with proteins having thousands of different representations (if you try to display those, make sure you have a good browser (like Mozilla or Opera :-) and a bunch of RAM).
Changes from version 3.1 to 3.2
Start page
There is an indicator of the current database status in the header now. If the database is down or is being updated, you'll know immediately.
Literature
Changed literature identifiers to PMID. New secondary literature generator that parses all neighbouring papers, not just the first 100. Numerous small bug fixes and improvements.
Changes from version 3.0 to 3.1
Startup page
The start page now includes selective SMART and allows searching for keywords in the annotation of domains.
Schnipsel Blast
The results of a schnipsel Blast search are included in the bubble diagram. Additionally, PDB is searched.
Taxonomic breakdown
When selecting multiple proteins, e.g. via selective SMART or the annotation pages, an overview of the taxonomy of all species (tax break) is offered.
Links
Links don't contain version numbers. This allows stable links from external sources.
Selective SMART
Allows searching for multiple copies of domains and is case insensitive. You can now search for e.g. Sh3 AND sH3 AND sh3.
Annotation
You can align your query sequence to the SMART alignment using hmmalign.
Update of underlying database
SMART now uses PostgreSQL 6.5.2.
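The selective SMART behaviour described in this changelog (AND-combined names, case insensitivity, and repeated names meaning multiple copies) can be mimicked with a small matcher. This is a sketch of the query semantics only, not SMART's implementation: the function name is hypothetical, and only the AND operator is handled.

```python
import re
from collections import Counter


def matches_query(query, protein_features):
    """Check whether a protein satisfies an AND-only domain query.

    Names are compared case-insensitively. A name repeated in the query
    requires that many copies of the domain or feature in the protein.
    """
    terms = re.split(r"\s+AND\s+", query.strip())
    required = Counter(term.lower() for term in terms)
    present = Counter(feature.lower() for feature in protein_features)
    return all(present[name] >= copies for name, copies in required.items())


three_sh3 = ["SH3", "SH3", "SH2", "SH3"]
two_sh3 = ["SH3", "SH2", "SH3"]
receptor = ["SIGNAL", "TRANS", "TyrKc"]
```

For instance, `matches_query("Sh3 AND sH3 AND sh3", three_sh3)` holds because three copies are present, while the same query fails for `two_sh3`.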
Changes from version 2.0 to 3.0
Digest output
SMART now only produces a single diagram representing a best interpretation of all the annotation that has been performed. A comprehensive summary of the results is also provided in table format.
Selective SMART
Selective SMART allows you to look for proteins with combinations of specific domains in different species or taxonomic ranges.
Alert SMART
The SMART database gets updated about once a week. If you are interested in specific domains, or combinations of them, in specific taxonomic ranges, you can use the SMART alerting service. This provides the identities of newly deposited proteins that match your query.
Domain queries
You can ask for proteins having the same domain order or composition as your query protein.
SMARTed Genomes
You can get the result of SMARTing the genomes of Caenorhabditis elegans and Saccharomyces cerevisiae via the annotation pages.
Faster PFAM searches
The PFAM searches now run on a PVM cluster.
Changes from version 1.03 to 2.0
Domain coverage
The original set of signalling domains has now been extended to include extracellular domains.
Default search method is now HMMer
The previously used underlying methodology of SMART (i.e. SWise) tended to over-extend gaps, leading to problems in defining domain borders. Although a SWise search of the SMART database remains possible (using the Wise searching SMART database option available from the Home Page), the default searching method is now HMMer2. This now allows improved statistical estimates (Expectation- or E-values) of the significance of a domain hit.
Complete rewrite of the annotation pages
As with the previous version (1.03), annotation pages provide information concerning domain functions and include hyperlinks to PubMed. However, we now offer automatically derived data showing the taxonomic range and the predicted cellular localisation of proteins containing the domain in question.
Literature database
In addition to the manually derived literature sources given in version 1.03, we now provide automatically derived secondary literature from the annotation pages. These are extracted from PubMed and are cross-linked to additional abstracts.
The CATH domain structure database: new protocols and classification levels give a more comprehensive resource for exploring evolution
Lesley H Greene, Tony E Lewis, Sarah Addou, Alison Cuff, Tim Dallman, Mark Dibley, Oliver Redfern, Frances Pearl, Rekha Nambudiry, Adam Reid, Ian Sillitoe, Corin Yeats, Janet M Thornton and Christine A Orengo
WHAT'S NEW
We report the latest release (version 3.0) of the CATH protein domain database (http://www.cathdb.info). There has been a 20% increase in the number of structural domains classified in CATH, up to 86 151 domains. Release 3.0 comprises 1110 fold groups and 2147 homologous superfamilies. To cope with the increases in diverse structural homologues being determined by the structural genomics initiatives, more sensitive methods have been developed for identifying boundaries in multi-domain proteins and for recognising homologues. The CATH classification update is now being driven by an integrated pipeline that links these automated procedures with validation steps that have been made easier by the provision of information-rich web pages summarising comparison scores and relevant links to external sites for each domain being classified. An analysis of the population of domains in the CATH hierarchy and several domain characteristics are presented for version 3.0. We also report an update of the CATH Dictionary of homologous structures (CATH-DHS), which now contains multiple structural alignments, consensus information and functional annotations for 1459 well populated superfamilies in CATH. CATH is directly linked to the Gene3D database, which is a projection of CATH structural data onto ~2 million sequences in completed genomes and UniProt.
GENERAL INTRODUCTION
The number of new structures being deposited in the Protein Data Bank (PDB) continues to grow at a considerable rate. In addition, structures being targeted by worldwide structural genomics initiatives are more likely to be novel or only very remotely related to domains previously classified. Only 2% of structures currently solved by conventional crystallography or NMR are likely to adopt novel folds (see Figures 1 and 2). A higher proportion of new folds are expected to be solved by structural genomics structures. Although the influx of more diverse structures and subsequent analysis will inform our understanding of how domains evolve, it has resulted in increasing lags between the numbers of structures being deposited and classified. In response to this situation, we have significantly improved our automated and manual protocols for domain boundary assignment and homologue recognition.
Figure 1. Annual decrease in the percentage of new structures classified in CATH which are observed to possess a novel fold. The raw data for years 1972–2005 was fit to a single exponential equation by nonlinear regression using Sigma Plot (SPSS Version
Figure 2. Annual proportion of protein structures deposited in the PDB which are classified in CATH, rejected or pending classification. The colour scheme reflects different categories of PDB chains. Black: not accepted by the CATH criteria. Red: unprocessed chains
Significant changes have been implemented in the CATH classification protocol to achieve a more highly automated system. A seamless flow of structures between the constituent programs has been achieved by building a pipeline which integrates web services for each major comparison stage in the classification (see Figure 3). Secondly, completely automatic decisions are now being made for new protein chains with close relatives already assigned in the CATH database. There are two situations that preclude the CATH update process from being fully automated. We rely on expert manual curation for particularly challenging protein domain boundary assignments (DomChop stage) and also for classifications of remote folds and homologues (HomCheck stage). These two manual stages will remain an integral part of the system (Figure 3).
Figure 3. Flow diagram of the CATH classification pipeline. This schematic illustrates the processes involved in classifying newly determined structures in CATH. The CATH update protocol workflow, from new chain to assigned domain, is split into two main processes
In this paper we report our ongoing development of the automated procedures. These critical new features should better enable CATH to keep pace with the PDB (3) and facilitate its development. Key statistics on the domain structure populations and characteristics are also presented.
A REVISED CATH CLASSIFICATION HIERARCHY: CATHSOLID
In order to provide information on the sequence diversity between superfamily members, we have introduced additional levels into the CATH hierarchy. The CATH hierarchical classification scheme now consists of nine levels. Class is derived from secondary structure content, and Architecture describes the gross orientation of secondary structures, independent of connectivity. The Topology level clusters structures into fold groups according to their topological connections and numbers of secondary structures. The Homologous superfamilies cluster proteins with highly similar structures, sequences and/or functions (4,5). The new extension of the CATH classification system now includes five 'SOLID' sequence levels. S, O, L and I further divide domains within the H-level using multi-linkage clustering based on similarities in sequence identity (35, 60, 95 and 100%) (see Table 1). The D-level acts as a counter within the I-level and is appended to the classification hierarchy to ensure that every domain in CATH has a unique CATHsolid identification code (see Table 1). Specific details on the nature of the SOLID-levels can be found in the 'General Information' section of the CATH website, http://www.cathdb.info. CATH only includes experimentally determined protein structures with a 4 Å resolution or better, 40 residues in length or longer and having 70% or more side chains resolved.
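The nine-level code and the inclusion criteria described above can be sketched as follows. This is a minimal illustration, not the CATH software: the function names are hypothetical, the dotted identifier format is an assumption, and the example identifier in the usage note is made up.

```python
# Hypothetical helpers; level letters and thresholds are taken from the text,
# the dotted "C.A.T.H.S.O.L.I.D" string format is an assumption.
LEVELS = ("C", "A", "T", "H", "S", "O", "L", "I", "D")


def parse_cathsolid(code: str) -> dict:
    """Split a dotted nine-level identifier into named integer levels."""
    parts = code.split(".")
    if len(parts) != len(LEVELS):
        raise ValueError(f"expected {len(LEVELS)} levels, got {len(parts)}")
    return dict(zip(LEVELS, map(int, parts)))


def superfamily(code: str) -> str:
    """Return the C.A.T.H prefix shared by all domains of one superfamily."""
    return ".".join(code.split(".")[:4])


def meets_inclusion_criteria(resolution: float, n_residues: int,
                             side_chains_resolved: float) -> bool:
    """Resolution in angstroms; side_chains_resolved as a fraction (0 to 1)."""
    return (resolution <= 4.0
            and n_residues >= 40
            and side_chains_resolved >= 0.70)
```

For instance, `parse_cathsolid("3.40.50.200.1.1.2.1.3")` (an invented identifier) yields a dictionary whose `"H"` entry is 200, and `superfamily` on the same string returns `"3.40.50.200"`.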
Table 1. CATH version 3.0 statistics
DOMAIN BOUNDARY ASSIGNMENTS
We have further improved our automated domain boundary prediction method, CATHEDRAL (4). This is used to search a newly determined multi-domain structure against a library of representative structures from different fold groups in the CATH database to recognise constituent domains. CATHEDRAL performs an initial rapid secondary structure comparison between structures, using graph theory to identify putative fold matches, which are then more carefully aligned using a slower, more accurate dynamic programming method. A new scoring scheme has been implemented which combines information on the class of the domains being compared, their sizes, similarity in the structural environments and the number of equivalent residues. A support vector machine is used to combine the different scores and select the best fold match for each putative domain in a new multi-domain protein.
Benchmarking against a set of 964 'difficult' multi-domain chains, whose 1593 constituent domains were remotely related to folds in CATH (<35% sequence identity) and originated from 245 distinct fold groups and 462 superfamilies, showed that 90% of domains within these chains could be assigned to the correct fold group, and for 78% of them the domain boundaries were within ±15 residues of boundaries assigned by careful manual validation. Larger variations in domain boundaries are often due to the fact that in many families significant structural variation can occur during evolution, so that distant relatives vary considerably in size. If no close relative has been classified in CATH, it is likely that the only CATHEDRAL match will be to a relative with significant structural embellishments, thus making it harder to determine the correct boundaries.
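The ±15 residue criterion used in the benchmark can be expressed as a small check. This is only a sketch: representing a chopping as a list of boundary residue positions is an assumption made here for illustration, and the function name is hypothetical.

```python
def boundaries_agree(predicted, reference, tolerance=15):
    """Return True if every predicted boundary lies within `tolerance`
    residues of the corresponding manually validated boundary.

    Both arguments are lists of residue positions (an assumed
    representation); the lists must contain the same number of boundaries.
    """
    if len(predicted) != len(reference):
        return False
    pairs = zip(sorted(predicted), sorted(reference))
    return all(abs(p - r) <= tolerance for p, r in pairs)
```

With a reference chopping at residues 120 and 245, a prediction of 110 and 250 passes, while 100 and 250 fails because the first boundary is 20 residues away.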
Since domain boundary assignment of remote homologues is one of the most time-consuming stages in the classification, we combine multiple sources of information for each new structure on a web page to guide manual curation. Pages display scores from a range of algorithms, which include structure-based methods: CATHEDRAL (4), SSAP (6,7), DETECTIVE (8), PUU (9), DOMAK (10); sequence-based methods such as hidden Markov Models (HMMs) (11); and relevant literature. These pages are now viewable for information on putative boundaries for new multi-domain structures currently
being classified in CATH (e.g. http://www.cathdb.info/cgi-bin/cath/Chain.pl?chain_id=2g3aA).
For protein chains which are closely related to chains that are already chopped in CATH, an automated protocol has been developed (ChopClose). ChopClose identifies any previously chopped chains that have sufficiently high sequence identity and overlap with the query chain. Using SSAP (6), the query is aligned against each of these chains in turn, and in each case the domain boundaries are inherited across the alignment. The process of inheriting the boundaries often requires some adjustments to be made to account for insertions, deletions or unresolved residues. If the inheritance from one of the chains meets various criteria (SSAP score ≥80, sequence identity >80%, RMSD ≤6.0 Å, longest end extension ≤10 residues, etc.), then the resulting boundaries are used to chop the chain automatically (AutoChop). For cases where ChopClose's best result does not meet all the criteria for automatic chopping, it is provided as support information for a manual domain boundary assignment.
Refer to Figure 3 for the location of AutoChop/ChopClose within the CATH update protocol.
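The acceptance test that separates AutoChop from manual assignment can be sketched as below. Only the four thresholds listed in the text are checked; the text ends the list with "etc.", so the real protocol applies further criteria that are not shown here. The class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class InheritedChopping:
    """Result of inheriting boundaries from one previously chopped chain."""
    ssap_score: float
    sequence_identity: float     # percent
    rmsd: float                  # angstroms
    longest_end_extension: int   # residues


def can_autochop(result: InheritedChopping) -> bool:
    """Apply the four thresholds named in the text (others are omitted)."""
    return (result.ssap_score >= 80
            and result.sequence_identity > 80
            and result.rmsd <= 6.0
            and result.longest_end_extension <= 10)
```

A result that fails this test is not discarded: as described above, it is passed on as supporting information for manual domain boundary assignment.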
NEW HOMOLOGUE RECOGNITION METHODS
We have assessed a number of HMM-based protocols for improving homologue recognition. A new protocol (Samosa), exploiting models built using multiple structure alignments to improve accuracy, gives some improvements in sensitivity (4–5%). However, a protocol exploiting an 8-fold expanded HMM library, based on sequence relatives of structural domains, gives an increase of nearly 10% in sensitivity (12). In addition, HMM–HMM-based approaches have been implemented using the PRC protocol of Madera and co-workers (http://supfam.org/PRC). These allow recognition of extremely remote homologues, some of which are not easily detected by the structure comparison methods (discussed further below). HMM-based database scans developed for the CATH classification protocol are collectively referred to as HMMscan below.
For some very remotely related homologues, confidence in an assignment can be improved by combining information from multiple prediction methods. We have investigated the benefits of using machine learning methods to do this automatically. A neural network was trained using a dataset of 14 000 diverse homologues (<35% sequence identity) and 14 000 non-homologous pairs, with data from different homologue comparison methods including structure comparison (CATHEDRAL, SSAP), sequence comparison (HMM–HMM) and information on functional similarity. The latter was obtained by comparing EC classification codes between close relatives of the distant homologues and using a semantic similarity scoring
scheme for comparing GO terms, based on a method developed by Lord et al. (13). On a separate validation set of 14 000 homologous pairs and 14 000 non-homologous pairs, 97% of the homologues can be recognised at an error rate of <4%.
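Figures such as "97% of homologues at an error rate below 4%" come from scoring a validation set at a chosen cut-off. A minimal sketch of that bookkeeping is given below. The definition of error rate used here (the fraction of accepted pairs that are non-homologous) is an assumption, since the text does not define it, and the function name is hypothetical.

```python
def coverage_and_error_rate(homologue_scores, non_homologue_scores, threshold):
    """Evaluate a combined score at one threshold.

    coverage   = accepted homologous pairs / all homologous pairs
    error rate = accepted non-homologous pairs / all accepted pairs
                 (assumed definition; the source does not spell it out)
    """
    true_pos = sum(s >= threshold for s in homologue_scores)
    false_pos = sum(s >= threshold for s in non_homologue_scores)
    accepted = true_pos + false_pos
    coverage = true_pos / len(homologue_scores)
    error_rate = false_pos / accepted if accepted else 0.0
    return coverage, error_rate
```

Sweeping `threshold` over the range of scores and keeping the lowest value whose error rate stays under the target gives the corresponding coverage.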
NEW UPDATE PROTOCOL
Automatic methods
Previously, CATH data was generated using a group of independent
programs and flat files. Over the past two years we have developed an
update protocol for CATH that is driven by a suite of programs with a
central library and a PostgreSQL database system. A classification
pipeline has been established which links, in a completely automated
fashion, the different programs that analyse the sequences and structures
of both protein chains and domains. The CATH update protocol can
essentially be divided into two parts: domain boundary assignment and
domain homology classification (see Figure 3). The aim of the protocol is
to minimise manual assignment and provide as much support as
possible when manual validation is necessary.
Processing of both parts of the classification protocol is similar, requiring related meta-data and the triggering of the same automated algorithms. Methods include pairwise sequence similarity comparisons and scans by other homologue detection or fold recognition algorithms, such as HMMscan and CATHEDRAL, that provide data for either manual or automated assignment. Many of the automated steps in the protocol have been established as web services, and the pipeline integrates the automated steps with 'holding stages' in which domains are held prior to processing and await the completion of manual validation of predictions (see below).
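The split between automated steps and holding stages can be pictured with a very small routing function. This is a schematic sketch only: the enum, the function and the boolean `confident` flag are hypothetical simplifications, and only the stage names DomChop and HomCheck come from the text.

```python
from enum import Enum


class Route(Enum):
    AUTOMATIC = "assigned automatically"
    DOMCHOP_HOLD = "held for manual DomChop validation"
    HOMCHECK_HOLD = "held for manual HomCheck validation"


def route(part: str, confident: bool) -> Route:
    """Decide where an entry goes next.

    part      -- "boundary" (domain boundary assignment) or
                 "homology" (domain homology classification)
    confident -- True when the automated evidence is strong enough,
                 e.g. a close relative is already classified
    """
    if part not in ("boundary", "homology"):
        raise ValueError(f"unknown part of the protocol: {part!r}")
    if confident:
        return Route.AUTOMATIC
    return Route.DOMCHOP_HOLD if part == "boundary" else Route.HOMCHECK_HOLD
```

The point of the sketch is that both halves of the protocol share one shape: run the automated scans, then either assign directly or park the entry until a curator has validated the prediction.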
Web pages to support manual validation
For each manual stage in the classification (domain boundary assignment, DomChop, and homologue recognition, HomCheck) we have developed a suite of web pages bringing together all available meta-data from prediction algorithms (e.g. DBS, CATHEDRAL, HMMscan for DomChop; CATHEDRAL, HMMscan for HomCheck), information from the literature and from other family classifications with relevant data (e.g. Pfam). For each protein or domain shown on the pages, information on the statistical significance of matches is presented. The web pages will shortly be made viewable and will provide interim data on protein chains and domains not fully classified in CATH, for biologists interested in any entries pending classification.
OVERVIEW OF THE CURRENT RELEASE (VERSION 3.0)
Assigning domain boundaries and relationships between protein structures is computationally challenging. Since the last CATH release (version 2.6), the number of domains in the CATH database has increased by 20% in version 3.0 and now totals 86 151. This is a more than 10-fold increase in the number of domains classified in CATH since its creation. Improvements in automation, and also in the web-based resources used to aid manual validation, have allowed us to increase the proportion of hard-to-classify structures processed in CATH, and this is reflected in a significant increase in the proportion of new folds in the database (now more than 1000). The detailed breakdown of numbers of domains in the nine CATH levels is given in Table 1. We conducted an analysis of the domains in version 3.0 and have derived statistics for several fundamental features.
Percentage of new topologies
An analysis of the percentage of new folds arising from the early 1970s to the present is shown in Figure 1. The number of new folds has been decreasing over time with respect to the number of new structures being deposited, and it can be seen that currently approximately 2% of new structures classified in CATH are observed to be novel folds. For comparison, the number of domain structures solved over time is also graphically represented in Figure 1.
Number of domains within a protein chain
Integral to the construction of the CATH database is designating domain boundaries. We conducted an analysis of the number of chains versus the number of domains in a chain. It is interesting to note that 64% of all protein structures currently solved and classified in CATH are single-domain chains (data not shown). The next most prevalent are two-domain chains (27%), and beyond this the number of chains containing three or more domains rapidly decreases. The average size of the single-domain chains is 159 residues.
The CATH Dictionary of Homologous Superfamilies (CATH-DHS)
The CATH-DHS has also been recently updated. Data on structural similarity and superfamily variability is presented as a significant update to the Dictionary of Homologous Superfamilies (DHS) web resource (14). The DHS also provides functional annotations of domains within each H-level (superfamily) in CATH v2.5.1.
For each superfamily, pair-wise structural similarity scores between relatives, measured by SSAP, are presented. The DHS now contains 3307 multiple structural alignments for 1459 superfamilies. For each superfamily, multiple alignments are generated for all the relatives and also for subgroups of structurally similar relatives and sequence-similar relatives. Alignments are performed using the residue-based CORA algorithm (15) and presented both as CORAPLOTS (14) and in the form of a 2DSEC diagram (16), alongside co-ordinate data of the superposed structures in PDB format. Sequence representations of the alignments are available to download in FASTA format. In the CORAPLOT images of the multiple alignment, residues in each domain are coloured according to ligand binding and residue type. EquivSEC plots are also shown that describe the variability in orientation and packing between equivalent secondary structures (16).
To identify sequence relatives for CATH superfamilies, sequences from UniProt (17) were scanned against HMMs of all CATH domains (12). Homologous sequences were identified as those hits with an E-value <0.01 and a 60% residue overlap with the CATH domain. This protocol recognised over one million domain sequences in UniProt which could be integrated in the CATH-DHS. The harvested sequences in each superfamily were compared against other relatives by BLAST (18) to determine the pair-wise sequence identity and then clustered at appropriate levels of sequence identity (35% and 95%) using multi-linkage clustering. Information from, and links to, other functional databases, ENZYME (19), GO (Gene Ontology Consortium, 2000), KEGG (20), COG (21) and SWISSPROT (22), are also included by BLASTing the sequences from each superfamily against sequences provided by these resources. Only hits with 95% sequence identity and an 80% residue overlap were used to annotate sequences.
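The hit-selection rule for harvesting sequence relatives can be sketched as a simple filter. The record layout and names below are hypothetical, and measuring overlap as aligned residues divided by domain length is an assumption about how the 60% overlap is computed.

```python
from dataclasses import dataclass


@dataclass
class HmmHit:
    sequence_id: str
    e_value: float
    aligned_residues: int   # residues of the CATH domain covered by the hit
    domain_length: int      # length of the CATH domain


def is_homologous_hit(hit: HmmHit, max_e_value: float = 0.01,
                      min_overlap: float = 0.60) -> bool:
    """Keep hits with E-value below the cut-off and sufficient overlap."""
    overlap = hit.aligned_residues / hit.domain_length
    return hit.e_value < max_e_value and overlap >= min_overlap


def harvest(hits):
    """Return the identifiers of all hits passing the filter."""
    return [h.sequence_id for h in hits if is_homologous_hit(h)]
```

The stricter annotation-transfer rule described above (95% sequence identity with 80% overlap) would follow the same pattern with identity in place of the E-value.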
Recent analysis of structural and functional divergence in highly populated CATH superfamilies (>5 structural relatives with <35% sequence identity) has been undertaken using data from the DHS. The 2DSEC algorithm was used to analyse multiple structural alignments of families and identify highly conserved structural cores and secondary structure embellishments, or decorations, to the common core. In some large superfamilies, extensive embellishments were observed outside the core, and although these secondary structure insertions were frequently discontinuous in the protein chain, they were often co-located in 3D space (16). In many cases, manual inspection revealed that the embellishments had aggregated to form a larger structural feature that was modifying the active site of the domain or creating new surfaces for domain or protein interactions. Data collected in the DHS clearly shows a relationship between structural divergence within a superfamily, sequence divergence of this superfamily amongst predicted domains in the genomes, and the number of distinct functional groups that can be identified for the superfamily (see Figure 4).
Figure 4. Relationship between sequence variability, structural variability and functional diversity in CATH superfamilies. Structural variation in a CATH superfamily as measured by the number of diverse structural subgroups (SSAP score <80 between groups)
LATERAL LINKS ACROSS THE CATH HIERARCHY TO CAPTURE EVOLUTIONARY DIVERGENCE AND EXPLORE THE STRUCTURAL CONTINUUM
Our analysis of structural divergence in CATH superfamilies (16) has revealed families where significant changes in the structures had occurred. In some cases 5-fold differences in the sizes of domains were identified, and sometimes it was apparent that the 'folds' of these very diverse relatives had effectively changed. Therefore, in these superfamilies more than one fold group can be identified, effectively breaking the hierarchical nature of the CATH classification, which implies that each relative within a CATH homologous superfamily should belong to the same CAT fold group.
In addition, an 'all versus all' HMM–HMM scan between all superfamily representatives revealed several cases of extremely remote homologues which had been classified into separate superfamilies and yet match with significant E-values. Structure comparison had failed to detect the relationship between these superfamilies because the structural divergence of the relatives was so extreme, sometimes constituting a change in architecture as well as fold group. In these cases homology was only suggested by the HMM-based scans and then manually validated by considering functional information and detailed evidence from literature. In order to capture information on these distant homologies, links have been created between the superfamilies, both on our web pages and in the CATH database. The data can now be found as a link from the CATH homepage (http://www.cathdb.info).
In the near future we also plan to provide web pages presenting cases of significant structural overlaps between superfamilies or fold groups. For these cases we are not currently able to find any additional evidence to support a distant evolutionary relationship, and these examples highlight the recurrence of large structural motifs between some folds and the existence of a structural continuum in some regions of fold space.
FEATURES
The CATH database can be accessed at http://www.cathdb.info. The web interface may be browsed or, alternatively, searched with PDB codes or CATH domain identifiers. There is also a facility for keyword searches. With the version 3.0 release we now make the raw and processed data files available, which include, for example, CATH domain PDB files, sequences and dssp files; they can be accessed through the CATH database main page. The Gene3D resource can be accessed through the CATH database or directly at http://www.cathdb.info/Gene3D. The DHS can be accessed through the CATH database or directly at http://www.cathdb.info/bsm/dhs.
Structural similarity is assessed using an automatic method (SSAP) (3,4), which scores 100 for
identical proteins and generally returns scores above 80 for homologous proteins. More distantly
related folds generally give scores above 70 (Topology or fold level), though in the absence of any
sequence or functional similarity this may simply represent examples of convergent evolution, reinforcing the hypothesis that there exists a limited number of folds in nature (5,6).
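The SSAP score ranges quoted above can be turned into a rough interpretation helper. This is a sketch with a hypothetical function name; the text says the scores "generally" fall in these ranges, so the labels are indicative only and do not replace checks on sequence or functional similarity.

```python
def interpret_ssap(score: float) -> str:
    """Map an SSAP score (0 to 100) to the rough category described in the text."""
    if not 0 <= score <= 100:
        raise ValueError("SSAP scores lie between 0 and 100")
    if score == 100:
        return "identical"
    if score > 80:
        return "typical of homologous proteins"
    if score > 70:
        return "typical of the same fold (topology level)"
    return "no clear structural similarity"
```

A score in the 70 to 80 band without supporting sequence or functional evidence is the case the text flags as possibly reflecting convergence rather than common ancestry.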
Abstract
We report the latest release (version 1.4) of the CATH protein domains database
(http://www.biochem.ucl.ac.uk/bsm/cath). This is a hierarchical classification of 13 359 protein domain structures into evolutionary families and structural groupings. We currently identify 827
homologous families in which the proteins have both structural similarity and sequence and/or
functional similarity. These can be further clustered into 593 fold groups and 32 distinct architectures.
Using our structural classification and associated data on protein functions stored in the database (EC
identifiers, SWISS-PROT keywords and information from the Enzyme database and literature), we have been able to analyse the correlation between the 3D structure and function. More than 96% of
folds in the PDB are associated with a single homologous family. However, within the superfolds
three or more different functions are observed. Considering enzyme functions, more than 95% of clearly homologous families exhibit either single or closely related functions, as demonstrated by the
EC identifiers of their relatives. Our analysis supports the view that determining structures, for
example as part of a 'structural genomics' initiative, will make a major contribution to interpreting genome data.
Introduction FOR CATH
The CATH classification of protein domain structures was established in 1993 (1) as a
hierarchical clustering of protein domain structures into evolutionary families and structural
groupings, depending on sequence and structure similarity. There are four major levels,
corresponding to protein class, architecture, topology or fold, and homologous family (Fig. 1).
Since 1995 information about these structural groups and protein families has been accessible
over the Web (http://www.biochem.ucl.ac.uk/bsm/cath), together with summary information
about each individual protein structure (PDBsum) (2).
CATH consists of both phylogenetic and phenetic descriptors for protein domain relationships.
At the lowest levels in the hierarchy, proteins are grouped into evolutionary families
(Homologous families) for having either significant sequence similarity (≥35% identity) or high
structural similarity and some sequence similarity (≥20% identity).
The CATH database of protein structures contains approximately 18 000 domains
organized according to their (C)lass, (A)rchitecture, (T)opology and (H)omologous
superfamily. Relationships between evolutionarily related structures (homologues) within the
database have been used to test the sensitivity of various sequence search methods in order to
identify relatives in GenBank and other sequence databases. Subsequent application of the most
sensitive and efficient algorithms, gapped BLAST and the profile-based method Position Specific
Iterated Basic Local Alignment Search Tool (PSI-BLAST), could be used to assign structural data to
between 22% and 36% of microbial genomes in order to improve functional annotation and
enhance understanding of biological mechanism. However, on a cautionary note, an analysis of
functional conservation within fold groups and homologous superfamilies in the CATH database
revealed that whilst function was conserved in nearly 55% of enzyme families, function had
diverged considerably in some highly populated families. In these families functional properties
should be inherited far more cautiously, and the probable effects of substitutions in key functional
residues must be carefully assessed.
Figure 1
Schematic representation of the (C)lass, (A)rchitecture and (T)opology/fold levels in the CATH
database.
Figure 2
Snapshot of a web page showing data available in the CATH dictionary of homologous superfamilies
for the subtilisin family (CATH id 3.40.50.200). Tables display the PDB codes for non-identical relatives in the family, together with EC identifier codes and information about the enzyme reactions.
The multiple structural alignment shown has been coloured according to secondary structure
assignments (red for helix, blue for strands).
The Architecture level in CATH groups proteins whose folds have similar 3D arrangements of secondary structures (e.g. barrel, sandwich or propeller), regardless of their connectivity, whilst the
top level, Class, simply reflects the proportion of α-helix or β-strand secondary structures. Three
major classes are recognised: mainly-α, mainly-β and α−β, since analysis revealed considerable overlap between the α+β and alternating α/β classes originally described by Levitt and Chothia (7).
Before classification, multidomain proteins are first separated into their constituent folds using a
consensus method which seeks agreement between three independent algorithms (8). Whilst the protocol for updating CATH is largely automatic (9), several stages require manual validation, in
particular establishing domain boundaries in proteins for which no consensus could be reached and checking the relationships of very distant homologues and proteins having borderline fold similarity.
Although there are plans to assign the more regular architectures automatically, all architecture
groupings are currently assigned manually.
A homologous family Dictionary is now available within CATH which contains functional data, where available, for each protein within a homologous family. This includes EC identifiers, SWISS-PROT keywords and information from the Enzyme database or the literature (Fig. 2). Multiple
structure-based alignments are also available, coloured according to secondary structure assignments
or residue properties, and there are schematic plots showing domain representations annotated by protein-ligand interactions (DOMPLOTS) (A.E. Todd, C.A. Orengo and J.M. Thornton, submitted
to Protein Engng). The topology of each domain is illustrated by schematic TOPS diagrams
(http://www3.ebi.ac.uk/tops; 10).
Figure 3
CATH wheel plot showing the population of homologous families in different fold groups, architectures and classes. The wheel is coloured according to protein class (red: mainly-α; green:
mainly-β; yellow: αβ; blue: few secondary structures). The size of the outer wheel represents the
number of homologous families in CATH, whilst each band in the outer wheel corresponds to a single
fold family. The size of each fold 'band' therefore reflects the number of homologous families having
that fold. It can be seen that most fold families contain a single homologous family. The superfold families are shown as paler bands containing many homologous families. The inner wheel shows the
population of homologous families in the different architectures.
We have also recently set up a Web Server (11) which enables the user to scan the CATH database with a newly determined protein structure and identify possible fold similarities or evolutionary relationships. There are also plans to incorporate sequence searches (using BLAST or PSI-BLAST)
(12) to identify a probable fold for a new sequence.
The latest release of CATH (version 1.4, April 1998) contains 9342 protein chains from the PDB (13),
which divide into 13 359 domain folds. Currently 32 different architectures are recognised. Since the
last release three new architectures have been described, including the five-bladed α-β propeller. Grouping proteins on the basis of sequence, structure and functional similarity gives 827 evolutionary
homologous families (H-level). Recognising more distant structural similarity, with no
accompanying sequence or functional similarity, gives rise to 593 different fold groups (T-level).
The population of the different levels in the CATH hierarchy is illustrated by the CATH wheel shown
in Figure 3. It can be seen that several highly populated fold families, which we describe as superfolds (6) as they support a diverse range of sequences and more than three different functions, still account
for nearly 30% of non-homologous structures.
Implications for Structural Genomics
As the sequence databases grow rapidly, the need to interpret these sequences and assign functions to
specific genes becomes increasingly important. Many techniques exist for matching protein sequences and thereby inheriting functional information. However, for very distant homologues there is often no
detectable sequence similarity, despite conservation of 3D structure and function. For these cases
evolutionary relationships, and thereby functions, can only be assigned by comparing the structures.
Therefore a number of structural genomics initiatives are being proposed (14) which aim to identify
all the folds in nature, with the ultimate goal of being able to predict the function of a new protein from
its known or probable structure. The important questions to ask are: how many more folds do we need
to determine before we have the complete set, and how confident can we be in assigning function between proteins having similar structures?
In the current genomes, on average only 30–46% of sequences can be assigned to a structural family
by recognising sequence similarity to a protein of known structure (15,16). With only ∼600 unique
structures currently in the PDB, compared with ∼20 000 sequence families, it is clear that we still
need to determine many more structures if we are to understand biology at the molecular level.
However, analysis of recently deposited structural data is very revealing. Figure 4a illustrates the
distribution of 2159 new structural domains classified in the 10 months from June 1997 to March
1998. A large proportion of these (79%) were clearly homologous (≥30% identity) to proteins of known structure.
Of the remaining 443 structures (Fig. 4b), corresponding to 'new' sequences, we found only 8% were
novel folds, the remainder resembling a previously determined structure. Many of these, 199 (45%),
could be identified as clear homologues by having significant structure and sequence similarity (SSAP
≥80 and ≥20% sequence identity). A further 169 (38%) were probable homologues: although the sequence identity was below 20%, they had functional similarity and/or gave significant scores using
sequence search methods designed to detect very distant homologues (PSI-BLAST) (12). There
remained a further 40 (9%) proteins which were analogous, i.e. they had the same fold as a previous
entry but neither the sequence nor the function gave definite evidence of a common ancestor.
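The four-way breakdown of the 443 'new' sequences can be sketched as a decision function, followed by a check that the quoted counts and percentages are mutually consistent. The function and its arguments are hypothetical simplifications of the criteria in the text, and the count of 35 novel folds is not stated in the source: it is the residual implied by the other three counts.

```python
def classify_new_structure(same_fold: bool, ssap: float, identity: float,
                           other_evidence: bool) -> str:
    """Simplified reading of the criteria in the text.

    other_evidence -- functional similarity and/or a significant score from
                      a distant-homologue sequence search such as PSI-BLAST
    """
    if not same_fold:
        return "novel fold"
    if ssap >= 80 and identity >= 20:
        return "clear homologue"
    if other_evidence:
        return "probable homologue"
    return "analogue"


# Counts quoted in the text for the 443 structures with new sequences.
TOTAL = 443
counts = {"clear homologue": 199, "probable homologue": 169, "analogue": 40}
counts["novel fold"] = TOTAL - sum(counts.values())   # implied residual

percentages = {name: round(100 * n / TOTAL) for name, n in counts.items()}
```

The rounded percentages come out at 45, 38, 9 and 8, matching the figures in the paragraph above.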
At the homologous superfamily level in CATH, a more detailed analysis of enzyme functions showed
that the majority of homologous enzyme families in CATH (>90%) contained proteins for which the
first three EC identifiers were the same. Considering those families where homologues have
significant sequence identity (≥20%) after structural alignment, 95% were found to have a single EC
identifier, whilst for families where proteins have more than 30% sequence similarity we observed
that 98% had a single EC code.
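Checking whether the members of a family share the first three EC identifiers is a simple prefix comparison. The sketch below uses hypothetical function names, and the EC numbers in the usage note are chosen purely for illustration.

```python
def ec_prefix(ec_number: str, levels: int = 3) -> tuple:
    """Return the first `levels` fields of an EC number such as '3.4.21.62'."""
    return tuple(ec_number.split(".")[:levels])


def share_ec_prefix(ec_numbers, levels: int = 3) -> bool:
    """True if every EC number in the family agrees at the first `levels` fields."""
    return len({ec_prefix(ec, levels) for ec in ec_numbers}) == 1
```

For example, `share_ec_prefix(["3.4.21.62", "3.4.21.66"])` is true at three levels, while passing `levels=4` makes it false because the final identifiers differ.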
Although assigning function on the basis of homology is common practice, it is clear that some
caution should be exercised, particularly where there is little or no sequence similarity. There are also
some clear examples where homologues with significant sequence similarity perform different
functions. The role of 'gene recruitment' is especially clear in the eye lens proteins, which function as
enzymes in other cellular environments but which are used as structural proteins in this context (17).
The extent of such 'gene recruitment' and context-sensitive function is really not known at this time.
For enzymes, it is clear that catalytic function can change and evolve, usually to act on a different but
related substrate. Similarly, within the lipocalin family (CATH id 2.40.130.10) several proteins are found with very similar structures which bind different fatty acids in the same region at the base of
the β-barrel (e.g. retinol, bilin, biotin).
Nearly half of the homologous families where two or more different EC numbers were observed
belong to the superfolds. This suggests that if a new protein is assigned to a superfold family, more
caution should be used when inheriting functional information, as there appears to be greater tolerance to changes in sequence, and ultimately function, for these families. However, it is interesting to note
that many of these were TIM barrel or Rossmann folds. These are superfolds in which the substrate or
ligand commonly binds in the same place: in the base of the β-barrel for the TIMs, and at the
crossover of the polypeptide chain for the doubly wound Rossmann structures.
Assignment of Function Through Structure
One of the reasons for determining structures is to derive more information to facilitate the assignment of function. From our analysis of proteins in CATH, we suggest that structural data can help to assign
function in several ways:
i. The structural data allow recognition of more distant homologues compared with sequence data. In our analysis 83% of structures with novel sequences could be assigned as homologues in this way (note that such assignment of function is again subject to the caveats imposed by 'gene recruitment', discussed above).
ii. The structural data allow detailed inspection of the functional site, to suggest if and how the function may have evolved. For example, if an enzyme has evolved to act on a different substrate, the binding site may reveal, or at least suggest, possible changes in the substrate.
iii. For the superfolds, similarity of structure does not necessarily mean similarity of function. However, the active site/binding sites are often conserved, e.g. in the TIM barrel or Rossmann fold structures the ligand always binds at the same end of the barrel or sheet.
iv. Some methods which aim to predict function ab initio from structure have already been developed and will increasingly be the focus of attention over the next few years. For example, enzymes can often be identified by the presence of a major cleft, which also locates the active site (18). Similarly, critical surface patches which are used for molecular recognition in binding other proteins or ligands may be identified using knowledge-based approaches (19,20).
In summary extrapolating the data from Figure 4 to a new genome we can expect that of the 54 ndash 70
of sequences which currently have no obvious sequence matches in the PDB we will find nearly 80 ndash 90 to be homologous to a known family using the structural data alone For the singlet folds this
7272019 Essential Info Notes-1
httpslidepdfcomreaderfullessential-info-notes-1 5157
will almost certainly reveal some clues to the function For the superfolds some folds will reveal
information on the functional class (eg enzyme for TIM barrels) or the location of the active site if
not the specific function Only 10 ndash20 will be expected to be novellsquo folds For these the ab
initio methods referred to above may provide some clues to guide experiments Therefore it is clear
that determining structures as part of a structural genomicslsquo initiative for example will make a
major contribution to interpreting genome data
JAIL
Just another interface library
Interfaces of macromolecules are a valuable basis for analysing the process
of molecular recognition. JAIL classifies not only the interfaces between
domain architectures but also those between protein chains and those
between proteins and nucleic acids.
Of course not
Gt-alphaGi-alpha chimera (PDB-ID 1GOT) Interfaces of 1GOT
Interacting proteins are difficult to crystallize and are rarely present in the Protein Data Bank (PDB).
Nevertheless, it is essential to analyse the interacting parts of proteins to understand the
process of protein-protein docking.
To overcome this problem, we have built up the JAIL database. Since interacting domains
exhibit structural features similar to those of proteins, all known interfaces between interacting
domains of the SCOP database were extracted and classified in JAIL.
Only a part of all protein structures is included in SCOP; in particular, new PDB entries are
not yet annotated. To overcome this problem, all interfaces between protein
chains were additionally calculated and included in the database. This type of interface also comprises
the interacting parts of the assumed biological units. The last important type of interface
provided here is composed of the interacting parts between proteins and nucleic acids.
Overall, the data set consists of about 180000 interfaces.
JAIL is a convenient tool for browsing the interface library and analysing single
interfaces. However, more general questions require large-scale analysis. For this purpose,
a detailed form enables the compilation of comprehensive non-redundant data sets for
download.
How is an interface defined?
A complete residue is part of an interface if at least one atom of the amino acid is located
within 4.5 Ångström of any atom of the interacting domain or chain. One part of
an interface must consist of at least 5 C-alpha atoms in the case of protein chains. In the
case of nucleic acids, the relevant atoms are calculated on the basis of the phosphorus atoms
of the RNA/DNA backbone.
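The distance criterion above can be illustrated with a small sketch. It is not JAIL's own code: the function names, the dictionary layout (residue id mapped to a list of atom coordinates) and the reading that each side of an interface needs at least 5 residues, one C-alpha atom per residue, are assumptions made for illustration only.

```python
import math

CUTOFF = 4.5  # distance cutoff in Angstrom, taken from the definition above
MIN_CA = 5    # minimum number of C-alpha atoms per side (assumed: one per residue)


def interface_residues(chain_a, chain_b, cutoff=CUTOFF):
    """Return ids of residues in chain_a that have at least one atom within
    `cutoff` of any atom of chain_b.

    Each chain is a dict mapping a residue id to a list of (x, y, z) atoms.
    This layout is hypothetical; a real application would parse a PDB file.
    """
    atoms_b = [atom for atoms in chain_b.values() for atom in atoms]
    selected = set()
    for res_id, atoms in chain_a.items():
        # The whole residue is selected as soon as one atom is close enough.
        if any(math.dist(a, b) <= cutoff for a in atoms for b in atoms_b):
            selected.add(res_id)
    return selected


def is_valid_interface(chain_a, chain_b, cutoff=CUTOFF, min_ca=MIN_CA):
    """Check that both sides of the interface reach the minimum size."""
    side_a = interface_residues(chain_a, chain_b, cutoff)
    side_b = interface_residues(chain_b, chain_a, cutoff)
    return len(side_a) >= min_ca and len(side_b) >= min_ca
```

The all-against-all distance loop is quadratic in the number of atoms; for whole-database calculations a spatial index would normally be used instead.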
What are biounits?
The primary coordinate file deposited in the PDB generally contains one asymmetric unit.
The asymmetric unit is the smallest portion of a crystal structure to which crystallographic
symmetry can be applied to generate one unit cell. The biological molecule (biounit) is believed
to be the functional unit of the protein. Frequently, those units can be assumed or calculated
when additional information is available. The biological units of many proteins are deposited
in a separate section of the PDB database and can be used for interface calculations.
How are redundant interfaces excluded in the download section?
Redundancy is excluded in two different ways: by structure and by sequence. The
sequence-based clustering uses the CD-HIT program. The structural clustering is defined by
the protein families and superfamilies of the SCOP classification. The database classifies
proteins by domain architecture.
Which settings in the download section are best for my own research?
The selection of the data sets depends on the type of interaction (protein-protein or
protein-nucleic acid) and the level of diversity that is desired. A maximum sequence identity of 50%
results in higher diversity than a setting of 95%. The default settings include interfaces of
domain-domain interactions as well as interfaces between interacting chains. All interfaces of
chains that are already covered by the SCOP domain interfaces are excluded by default.
This procedure results in a high number of interfaces that are still diverse enough for
statistical analysis.
What is meant by 'show conservation' in Jmol?
The conservation of protein sequences is defined by the mutation rates at each amino acid
position. For JAIL, this information was retrieved from ConSurf. ConSurf is a derived
database merging structural and sequence information.
Search options
Search for:
PDB-Id, e.g. 1aay
SCOP-Id, e.g. d1az0a_
EC-Number, e.g. 3.1.1
Accession number, e.g. P03697
Protein name, e.g. capsid protein
Search in the following interface types:
Domain/Domain (SCOP)
Chain/Chain
Protein/Nucleic
Biounit/Biounit
A full-text keyword search and a SCOP text search are also available. Searches can be
restricted to interfaces with a given location (intra, inter, or don't care).
MMDB
MMDB contains experimentally resolved structures of proteins, RNA and DNA, derived from the Protein Data
Bank (PDB), with value-added features such as explicit chemical graphs, computationally
identified 3D domains (compact substructures) that are used to identify similar 3D structures, as well as
links to literature, similar sequences, information about chemicals bound to the structures, and more.
These connections make it possible, for example, to find 3D structures for homologs of a protein
sequence of interest, then interactively view the sequence-structure relationships, active sites,
bound chemicals, journal articles, and more.
Three-dimensional structures are now known within many protein families and it is
quite likely, in searching a sequence database, that one will encounter a homolog
with known structure. The goal of Entrez's 3D-structure database is to make this
information, and the functional annotation it can provide, easily accessible to
molecular biologists. To this end, Entrez's search engine provides three powerful
features: (i) Sequence and structure neighbors: one may select all sequences similar
to one of interest, for example, and link to any known 3D structures. (ii) Links
between databases: one may search by term matching in MEDLINE, for example, and
link to 3D structures reported in these articles. (iii) Sequence and structure
visualization: identifying a homolog with known structure, one may view molecular-graphic
and alignment displays to infer approximate 3D structure. In this article we
focus on two features of Entrez's Molecular Modeling Database (MMDB) not
described previously: links from individual biopolymer chains within 3D structures
to a systematic taxonomy of organisms represented in molecular databases, and
links from individual chains (and compact 3D domains within them) to structure
neighbors, i.e. other chains (and 3D domains) with similar 3D structure. MMDB may be
accessed at http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Structure
SUPERFAMILY is a database of structural and functional annotation for all
proteins and genomes.[1][2][3][4][5]
The SUPERFAMILY annotation is based on a collection of hidden Markov models, which
represent structural protein domains at the SCOP superfamily level.[6] A superfamily groups
together domains which have an evolutionary relationship. The annotation is produced by
scanning protein sequences from completely sequenced genomes against the hidden Markov
models.
For each protein you can:
Submit sequences for SCOP classification
View domain organisation, sequence alignments and protein sequence details
For each genome you can:
Examine superfamily assignments, phylogenetic trees, domain organisation lists and
networks
Check for over- and under-represented superfamilies within a genome
For each superfamily you can:
Inspect SCOP classification, functional annotation, Gene Ontology annotation, InterPro
abstract and genome assignments
Explore taxonomic distribution of a superfamily across the tree of life
All annotation, models and the database dump are freely available for download to everyone.
Purpose
SUPERFAMILY classifies amino acid sequences into known structural domains, especially
into SCOP superfamilies. The superfamilies are groups of proteins which have structural
evidence to support a common evolutionary ancestor but may not have detectable
sequence homology.
Major Features
Sequence search: Submit your protein or DNA sequence for SCOP superfamily and family
level classification.
Keyword search: Search for superfamily, family or species names, plus
sequence, SCOP, PDB or hidden Markov model IDs.
Domain assignments: Domain assignments, alignments and architectures for completely
sequenced eukaryotic and prokaryotic organisms, plus sequence collections.
Comparative genomics tools: Browse unusual (over- and under-represented) superfamilies and families,
adjacent domain pair lists and graphs, unique domain pairs, domain
combinations, domain architecture co-occurrence networks and domain
distribution across taxonomic kingdoms for each organism.
Genome statistics: For each genome: number of sequences, number of sequences with
assignment, percentage of sequences with assignment, percentage total
sequence coverage, number of domains assigned, number of superfamilies
assigned, number of families assigned, average superfamily size,
percentage produced by duplication, average sequence length, average
length matched, number of domain pairs and number of unique domain
architectures.
Gene Ontology: Domain-centric Gene Ontology (GO) automatically annotated by Hai Fang.
Phenotype Ontology: Domain-centric phenotype/anatomy ontology, including Disease
Ontology, Human Phenotype, Mouse Phenotype, Worm Phenotype, Yeast
Phenotype, Fly Phenotype, Fly Anatomy, Zebrafish Anatomy, Xenopus Anatomy, Arabidopsis Plant.
Superfamily annotation: InterPro abstracts for 1052 superfamilies and Gene Ontology (GO)
annotation for 763 superfamilies.
Functional annotation: Functional annotation of SCOP 1.73 superfamilies by Christine Vogel.
Phylogenetic trees: Trees are generated using heuristic parsimony methods and are based on
protein domain architecture data for all genomes in SUPERFAMILY.
Genome combinations or specific clades can be displayed as individual
trees.
Similar domain architectures: Find the 10 domain architectures which are most similar to a domain
architecture of interest.
Hidden Markov models: Produce SCOP domain assignments for your sequences using the
SUPERFAMILY models. HMM visualisation by Martin Madera,
e.g. model 0045110.
Profile comparison: Find remote domain matches when the HMM search fails to find a
significant match. Profile comparison (PRC), for aligning and scoring two
profile hidden Markov models, by Martin Madera.
Web services: Distributed Annotation Server and linking to SUPERFAMILY.
Downloads: Sequences, assignments, models, MySQL database and scripts, updated weekly.
Recent news
3rd June 2013 genetrainer being launched at LeWeb conference
The SUPERFAMILY database provides protein domain assignments, at the SCOP superfamily
level, for the predicted protein sequences in over 400 completed genomes. A superfamily groups
together domains of different families which have a common evolutionary ancestor, based on
structural, functional and sequence data. SUPERFAMILY domain assignments are generated
using an expert-curated set of profile hidden Markov models. All models and structural
assignments are available for browsing and download from http://supfam.org. The web interface
includes services such as domain architectures and alignment details for all protein assignments,
searchable domain combinations, domain occurrence network visualization, detection of over- or
under-represented superfamilies for a given genome by comparison with other genomes,
assignment of manually submitted sequences and keyword searches. In this update we describe
the SUPERFAMILY database and outline two major developments: (i) incorporation of family
level assignments and (ii) a superfamily-level functional annotation. The SUPERFAMILY
database can be used for general protein evolution and superfamily-specific studies, genomic
annotation, and structural genomics target suggestion and assessment.
HMMER3
The suite of programs that Pfam uses to build and search HMMs. See the HMMER site for more information.
iPfam
A resource that describes domain-domain interactions that are observed in PDB entries. Where two or more Pfam domains occur in a single structure, it analyses them to see if they are close enough to form an interaction. If they are close enough, it calculates the bonds forming the interaction.
Metaseq
A collection of sequences derived from various metagenomics datasets.
Motif
A short unit found outside globular domains.
Noise cutoff (NC)
The bit scores of the highest scoring match not in the full alignment.
Pfam-A
An HMM-based, hand-curated Pfam entry which is built using a small number of
representative sequences. We manually set a threshold value for each HMM and search our models against the UniProt database. All of the sequences which score above the threshold for a Pfam entry are included in the entry's full alignment.
Pfam-B
An automatically generated alignment which is formed by taking a cluster of sequences from the ADDA database and removing Pfam-A residues from them. Since Pfam-B families are automatically generated, we recommend that you verify that the sequences in a Pfam-B family are related using other methods, such as BLAST. For Pfam 24.0 we have made HMMs for the first (and therefore largest) 20000 Pfam-B families. Users can search their sequences against the Pfam-B HMMs in addition to the Pfam-A HMMs when performing both single-sequence searches and batch searches on the website.
Posterior probability
HMMER3 reports a posterior probability for each residue that matches a match or insert state in the profile HMM. A high posterior probability shows that the alignment of the amino acid to the match/insert state is likely to be correct, whereas a low posterior probability indicates that there is alignment uncertainty. This is indicated on a scale from 10, the highest certainty, down to 1, complete uncertainty. Within Pfam we display this information as a heat map view, where green residues indicate high posterior probability and red ones indicate a lower posterior probability.
Repeat
A short unit which is unstable in isolation but forms a stable structure when
multiple copies are present
Seed alignment
An alignment of a set of representative sequences for a Pfam entry. We use this alignment to construct the HMMs for the Pfam entry.
Sequence score
The total score of a sequence aligned to an HMM. If there is more than one domain, the sequence score is the sum of all the domain scores for that Pfam entry. If there is only a single domain, the sequence score and the domain score for the protein will be identical. We use the sequence score to determine whether a sequence belongs to the full alignment of a particular Pfam entry.
Trusted cutoff (TC)
The bit scores of the lowest scoring match in the full alignment
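The relationship between the sequence score, the manually set threshold and the two cutoffs can be sketched as follows. This is only an illustration of the definitions above, not Pfam's pipeline: the function names and the example scores are invented, the comparison "at or above the threshold" is an assumption, and the sketch works with sequence scores only.

```python
def sequence_score(domain_scores):
    """Sequence score as defined above: the sum of all domain bit scores."""
    return sum(domain_scores)


def trusted_and_noise_cutoff(scores, threshold):
    """Derive TC and NC from a curated threshold.

    scores    -- dict mapping a sequence name to its sequence bit score
    threshold -- the manually set threshold for the entry (hypothetical value)
    Returns (TC, NC); either is None if the corresponding group is empty.
    """
    in_full = [s for s in scores.values() if s >= threshold]
    not_in_full = [s for s in scores.values() if s < threshold]
    tc = min(in_full) if in_full else None          # lowest score in the full alignment
    nc = max(not_in_full) if not_in_full else None  # highest score left out
    return tc, nc


# Made-up scores, for illustration only
scores = {
    "seqA": sequence_score([30.5, 12.0]),  # two domains
    "seqB": 25.0,
    "seqC": 18.0,
    "seqD": 9.5,
}
tc, nc = trusted_and_noise_cutoff(scores, threshold=20.0)
```

With these invented numbers the threshold of 20.0 lies between NC (18.0) and TC (25.0), which is the ordering the three definitions imply.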
SMART (a Simple Modular Architecture Research Tool) allows the identification and annotation of genetically mobile domains and the analysis of domain architectures. More than 500 domain families found in signalling, extracellular and chromatin-associated proteins are detectable. These domains are extensively annotated with respect to phyletic distributions, functional class, tertiary structures and functionally important residues. Each domain found in a non-redundant protein database, as well as search parameters and
taxonomic information, are stored in a relational database system. User interfaces to this database allow searches for proteins containing specific combinations of domains in defined taxa.
Features
You can use SMART in two different modes: normal or genomic. The main difference is in the underlying
protein database used. In Normal SMART, the database contains Swiss-Prot, SP-TrEMBL and stable
Ensembl proteomes. In Genomic SMART, only the proteomes of completely sequenced genomes are used:
Ensembl for metazoans and Swiss-Prot for the rest.
The protein database in Normal SMART has significant redundancy, even though identical proteins are
removed. If you use SMART to explore domain architectures, or want to find exact domain counts in various
genomes, consider switching to Genomic mode. The numbers in the domain annotation pages will be more
accurate, and there will not be many protein fragments corresponding to the same gene in the architecture
query results. Remember, though, that you are then exploring a limited set of genomes.
Different color schemes are used to make it easy to identify the current mode (Normal mode or Genomic mode).
Alignment
Representation of a prediction of the amino acids in tertiary structures of homologues that overlay in three dimensions. Alignments held by SMART are mostly based on published observations (see domain annotations for details), but are updated and edited manually.
Alignment block
Ungapped alignments that usually represent a single secondary structure.
Bits scores
Alignment scores are reported by HMMer and BLAST as bits scores. The likelihood that the query sequence is a bona fide homologue of the database sequence is compared to the likelihood that the sequence was instead generated by a random model. Taking the logarithm (to base 2) of this likelihood ratio gives the bits score.
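As a worked illustration of the definition above, the bits score is simply a base-2 log-likelihood ratio. The function and argument names below are chosen for this sketch and do not come from HMMer or BLAST.

```python
import math


def bits_score(likelihood_homologue, likelihood_random):
    """Base-2 logarithm of the likelihood ratio (model versus random)."""
    return math.log2(likelihood_homologue / likelihood_random)


def bits_from_log_likelihoods(loglik_homologue, loglik_random):
    """Same quantity from natural-log likelihoods, which avoids underflow
    when the raw likelihoods are extremely small."""
    return (loglik_homologue - loglik_random) / math.log(2)


# A sequence 1024 times more likely under the homology model than under
# the random model scores log2(1024) bits.
example = bits_score(2 ** -10, 2 ** -20)
```

Each additional bit therefore corresponds to a doubling of the likelihood ratio in favour of homology.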
BLAST: Basic local alignment search tool
An excellent database searching tool developed at the National Center for Biotechnology Information (NCBI) ([1] [2] [3] [4]). SMART uses NCBI-BLAST for detection of outlier homologues and homologues of known structure. WU-BLAST is used for nrdb searches with user-supplied sequences.
Cellular Role
Chromatin-associated: Chromatin is the tangled fibrous complex of DNA and protein within a eukaryotic nucleus.
Interaction (with the environment): Molecules that sense cellular environmental change such as osmolarity, light flux, acidity, ion concentration, etc.
Metabolic: Enzymes that catalyze reactions in living cells that transform organic molecules.
Replication: The process of making an identical copy of a section of duplex (double-stranded) DNA, using existing DNA as a template for the synthesis of new DNA strands.
Signalling: Proteins that participate in a pathway that is initiated by a molecular stimulus and terminated by a cellular response.
Transport: Transporters (permeases) are proteins that straddle the cell membrane and carry specific nutrients, ions, etc. across the membrane.
Translation: The process in which the genetic code carried by messenger RNA directs the synthesis of proteins from amino acids.
Transcription: The synthesis of an RNA copy from a sequence of DNA (a gene); the first step in gene expression.
Coiled coils
Intimately associated bundles of long alpha-helices ([1] [2] [3]). Coiled coils are detected in SMART using the method of Lupas et al. ([4], COILS home). Coiled coil predictions are indicated on the second line in SMART's graphical output.
Domain
Conserved structural entities with distinctive secondary structure content and a hydrophobic core. In small disulphide-rich and Zn2+-binding or Ca2+-binding domains, the hydrophobic core may be
provided by cystines and metal ions, respectively. Homologous domains with common functions
usually show sequence similarities.
Domain composition
Proteins with the same domain composition have at least one copy of each of the domains of the query.
Domain organisation
Proteins having all the domains of the query in the same order. (Additional domains are allowed.)
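The difference between the two definitions can be made concrete with a short sketch. It is one reading of the glossary text, not SMART's implementation: a protein is represented as an ordered list of domain names, and the function names are invented for this example.

```python
def same_composition(query, protein):
    """True if the protein has at least one copy of every query domain."""
    return set(query) <= set(protein)


def same_organisation(query, protein):
    """True if the query domains occur in the protein in the same order.

    Additional domains may be interspersed, so this is a subsequence test:
    each `in` consumes the iterator up to and including the match.
    """
    remaining = iter(protein)
    return all(domain in remaining for domain in query)


query = ["SH3", "SH2", "TyrKc"]
reordered = ["SH2", "SH3", "PH", "TyrKc"]       # same domains, different order
extended = ["PH", "SH3", "SH2", "C1", "TyrKc"]  # same order, extra domains
```

Here `reordered` matches the query by composition but not by organisation, while `extended` matches by both, since additional domains are allowed.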
Entrez
A WWW-based system that allows easy retrieval of sequence, structure, molecular biology and literature data (Entrez). SMART's domain annotation pages contain links to the Entrez system, thereby providing extensive literature, structure and sequence information.
E-value
This represents the number of sequences with a score greater than or equal to X expected absolutely by chance. The E-value connects the score (X) of an alignment between a user-supplied sequence and a database sequence, generated by any algorithm, with how many alignments with similar or greater scores would be expected from a search of a random sequence database of equivalent size. Since version 2.0, E-values are calculated using Hidden Markov Models, leading to more accurate estimates than before.
Extracellular Domains
Domain families that are most prevalent in proteins outside of the cytoplasm and the nucleus.
Gap
A position in an alignment that represents a deletion within one sequence relative to another. Gap penalties are requirements for alignment algorithms in order to reduce excessively gapped regions. Gaps in alignments represent insertions that usually occur in protruding loops or beta-bulges within protein structures.
Genomic database
Protein database used in SMART's Genomic mode. It contains data from completely sequenced genomes only. Ensembl data is used for Metazoan genomes and Swiss-Prot for others. A complete list of genomes in the database is available.
HMM: Hidden Markov model
HMMs are statistical models of the sequence consensus of a homologous family (see the documentation). A particular class of HMMs has been shown to be equivalent to generalised profiles (8867839).
Applications of HMMs to sequence analysis are nicely provided by HMMer and SAM.
HMM consensus
The HMM consensus is a one-line summary of the corresponding HMM. The amino acid shown for the consensus is the highest probability amino acid at that position according to the HMM. Capital letters mean highly conserved residues (probability > 0.5 for protein models) (modified from the HMMer User's Guide).
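The consensus rule above can be sketched in a few lines. The input layout (one dictionary of emission probabilities per match position) and the function name are assumptions made for this example; HMMer's own file format and code are not reproduced here.

```python
def hmm_consensus(emissions, threshold=0.5):
    """Build a one-line consensus from per-position emission probabilities.

    emissions -- list of dicts, one per position, mapping an amino acid
                 letter to its emission probability (hypothetical layout)
    threshold -- probability above which a residue is written in upper case
    """
    consensus = []
    for column in emissions:
        # Pick the highest-probability amino acid at this position.
        residue, prob = max(column.items(), key=lambda item: item[1])
        consensus.append(residue.upper() if prob > threshold else residue.lower())
    return "".join(consensus)


# Three illustrative positions with made-up probabilities
profile = [
    {"A": 0.70, "G": 0.30},
    {"L": 0.40, "I": 0.35, "V": 0.25},
    {"K": 0.90, "R": 0.10},
]
line = hmm_consensus(profile)
```

In this toy profile the first and third positions are highly conserved and appear in upper case, while the second position, whose best residue has probability 0.40, appears in lower case.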
HMMer
The HMMer package ([1] [2]) provides multiple alignment and database searching capabilities. There are several programs in the package (see the documentation), including one (hmmfs) that searches databases for non-overlapping LOCAL similarities (i.e. that match across at least part of the HMM) and another (hmmls) that searches databases for non-overlapping GLOBAL similarities (i.e. that match across the full HMM). These correspond approximately to profile-based searches using negative and positive profiles, respectively (see WiseTools). Database searches using hmmls or hmmfs provide alignment scores as bits scores.
Homology
Evolutionary descent from a common ancestor due to gene duplication.
Intracellular Domains
Domain families that are most prevalent in proteins within the cytoplasm.
Localisation
Numbers of domains that are thought, from SwissProt annotations, to be present in different cellular compartments (cytoplasm, extracellular space, nucleus, and membrane-associated) are shown in annotation pages.
Motif
Sequence motifs are short conserved regions of polypeptides. Sets of sequence motifs need not necessarily represent homologues.
NRDB: non-redundant database
A database that contains no identical pairs of sequences. It can contain multiple sequences originating from the same gene (fragments, alternative splicing products). SMART in Standard mode uses such a database.
ORF
Open reading frame.
Outlier homologues
These are often difficult to detect using HMM methodology. A complementary approach to their
detection is to query a database of sequences taken from multiple sequence alignments using BLAST. Selecting this option will also activate searches against sequence databases derived from proteins of known structure. A simple BLAST search of the PDB is performed, together with a search of RPS_Blast profiles derived from SCOP. These profiles were kindly provided by Steffen Schmidt (see Schmidt et al., J Chem Inf Comput Sci 2002 (42) 405-7).
P-value
This represents a probability that, given a database of a particular size, random sequences score higher than a value X. P-values are generated by the BLAST algorithm that has been integrated into SMART.
PDB: protein data bank
PDB is an archive of experimentally determined three-dimensional structures (Brookhaven Nat. Labs or EBI). Domain families represented in SMART and in the PDB are annotated as being of known structure; links are provided in SMART to the PDB via PDBsum and MMDB. PDBsum links can be used to access a variety of sequence-based and structure-based tools, whereas MMDB provides access to literature information and structure similarities.
PFAM
Pfam is a database of protein domain families represented as (i) multiple alignments and (ii) HMM-profiles ([1] [2]). Pfam WWW servers allow comparison of user-supplied sequences with the Pfam database (Sanger Center and Washington Univ.). SMART contains a facility to search the Pfam collection using HMMer.
Prokaryotic domains
SMART now also searches for domains found in two-component regulatory systems. These can be found mainly in prokaryotes, but a few were also found in eukaryotes such as yeast and plants.
Profile
A profile is a table of position-specific scores and gap penalties, representing a homologous family, that may be used to search sequence databases (Ref. [1] [2] [3]). In CLUSTAL-W-derived profiles, those sequences that are more distantly related are assigned higher weights ([4] [5] [6]). Issues in profile-based database searching are discussed in Bork & Gibson (1996) [7].
ProfileScan
An excellent WWW server that allows a user to compare a protein or DNA sequence against a database of profiles (located at the ISREC).
PROSITE
This is a dictionary of protein sites and motif patterns. Some SMART domain annotations contain links to PROSITE.
Schnipsel database: domain sequence database
Schnipsel is a German word meaning snippet or fragment. The schnipsel database consists of the sequences of all domains found with SMART in NRDB. Outliers of a family often cannot be detected by a profile, yet are detectable by pairwise similarity to one or more established members of a sequence family. So searching against the schnipsel database gives complementary information to the profile searches.
Searched domains
In the first version of SMART, only eukaryotic signalling domains could be searched. In 1998 we extended this set with prokaryotic signalling and extracellular domains. In the input page you can choose whether you want to search only cytoplasmic (prokaryotic + eukaryotic), extracellular, or all domains.
Secondary Literature
The secondary literature is derived by the following procedure. For each of the hand-selected papers referenced by a domain, 100 neighbouring papers are retrieved using Medline. If one of these neighbouring papers is referenced from more than two original papers, it is included in the secondary literature list.
Seed Alignment
Alignment that contains only one of each pair of homologues that are represented in a CLUSTALW-derived phylogenetic tree linked by a branch of length less than a distance of 0.2 (see the related article).
SEG
A program of Wootton & Federhen [1] that detects regions of the query sequence that have low compositional complexity [2].
Sequence ID or ACC
Sequence identifiers or accession codes may be entered via the SMART homepage to initiate a query. You can use either Uniprot or Ensembl sequence identifiers.
Signalling Domains
The original set of domains used in SMART were collected as those that satisfied one or both of two criteria:
1. cytoplasmic domains that possess kinase, phosphatase, ubiquitin ligase or phospholipase enzymatic activities, or those that stimulate GTPase-activation or guanine nucleotide exchange
2. cytoplasmic domains that occur in at least two proteins with different domain
organisations, of which one also contains a domain that satisfies criterion 1
These domains mostly mediate or regulate the transduction of an extracellular signal towards the nucleus, resulting in the initiation of a cellular response. More recently, prokaryotic two-component signalling domains have been added to the SMART set.
SignalP
This program predicts the presence and location of signal peptide cleavage sites in amino acid sequences (SignalP home page).
SMART: Simple Modular Architecture Research Tool
Our main goal in providing this tool is to allow automatic identification and annotation of domains in user-supplied protein sequences.
Species
Numbers of domains present in a variety of selected taxa (animal, archaea, bacteria, fungi, plants and protozoa) are shown in annotation pages.
SwissProt
The SwissProt database is an extensively annotated and non-redundant collection of protein sequences. SwissProt annotations have been mined for SMART-derived annotations of alignments.
TMHMM2
This program predicts the location and topology of transmembrane helices (TMHMM2).
Thresholds
For each of the domains found by SMART, a combination of thresholds is used to distinguish between true and false hits. The different thresholds are described in the SMART paper.
WiseTools
A package that is based on database searches using profiles. Profiles may be generated using PairWise and then compared with sequence databases using SearchWise. Scores are generated for alignments that match the whole of the profile (using a positive profile) or else that match at least part of the profile (using a negative profile). Only the top-scoring optimal alignment of each
sequence is reported; hence SMART relies on re-iterating the search for new repeats until none are reported that score above threshold. Score thresholds have been set manually that are considered to represent a score just above the top-scoring true negative. Additional thresholds have been estimated for domains that are repeated in single polypeptides. A more recent package allows comparison of DNA sequences at the level of their conceptual translations, regardless of sequence error and introns (see Wise2).
What's new
Changes from version 6.0 to 7.0
Full text search engine
Perform a full text search of SMART and Pfam domain annotations, plus the complete protein descriptions for Uniprot and Ensembl proteins.
metaSMART
Explore and compare domain architectures in various publicly available metagenomics datasets.
iTOL export and visualization
Domain architecture analysis results can be exported and visualized in interactive Tree Of Life. The new option can be found in the protein list function select list.
User interface cleanup
Various small changes to the UI, resulting in faster and easier navigation.
Changes from version 5.1 to 6.0
Metabolic pathways information
SMART domains and proteins available in the genomic mode now have basic metabolic pathways information. It is generated by mapping our genomic mode protein database to the KEGG orthologous groups.
Updated genomic mode protein database
The greatly expanded genomic mode database now includes sequences from 630 completely sequenced species.
Changes from version 5.0 to 5.1
SMART webservice
You can access SMART using our webservice. Check the WSDL file for details. The webservice is still under active development and only works for sequences and IDs which are in our database (complete Uniref100 and stable Ensembl genomes).
SMART DAS server
The SMART DAS server is available at http://smart.embl.de/smart/das. It provides all the protein features from our database (SMART domains, signal peptides, transmembrane regions and coiled coils) for all Uniref100 and Ensembl proteins.
If you need help with these services or have questions or feedback, please contact us.
Changes from version 4.1 to 5.0
New protein database in Normal mode
SMART now uses Uniprot as the main source of protein sequences. All Ensembl proteomes (except pre-releases) are also included. To lower redundancy in the database, the following procedure is used:
o only one copy of 100% identical proteins is kept (different IDs are still available)
o each species' proteins are separated
o CD-HIT clustering with a 96% identity cutoff is performed on each species separately
o the longest member of each protein cluster is used as the representative
o only representative cluster members and single proteins (i.e. proteins which are not members of any clusters) are used in all domain architecture queries and for domain counts in the annotation pages
Even though the number of proteins in the database has almost doubled (the current version has around 2.9 million proteins), the redundancy should be minimal.
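The redundancy-reduction steps above can be sketched in a few lines of Python. This is only an illustration: the record layout, the function names and the `cluster_of` mapping are assumptions made here, and the CD-HIT clustering itself is not reimplemented (its result is passed in as a precomputed mapping). Collapsing identical sequences regardless of species, and keeping the species of the first record seen, is likewise an assumption.

```python
from collections import defaultdict

def collapse_identical(records):
    """Keep one entry per distinct sequence, remembering every ID that shares it."""
    merged = {}
    for prot_id, species, seq in records:
        entry = merged.setdefault(seq, {"species": species, "ids": []})
        entry["ids"].append(prot_id)
    return merged

def pick_representatives(merged, cluster_of):
    """Return the longest sequence of each (species, cluster) group.

    cluster_of maps a sequence to a cluster label (hypothetical stand-in
    for the per-species CD-HIT output). Sequences missing from cluster_of
    are treated as single proteins, i.e. their own cluster.
    """
    groups = defaultdict(list)
    for seq, entry in merged.items():
        label = cluster_of.get(seq, seq)
        groups[(entry["species"], label)].append(seq)
    return {key: max(seqs, key=len) for key, seqs in groups.items()}

# Toy data, invented for illustration only.
records = [
    ("P1", "human", "MKTAYIAKQR"),
    ("P1_dup", "human", "MKTAYIAKQR"),   # identical sequence, second ID
    ("P2", "human", "MKTAYIAKQRQISF"),   # assumed to cluster with P1
    ("P3", "yeast", "MSLLTEVETY"),       # single protein, no cluster
]
cluster_of = {"MKTAYIAKQR": "c1", "MKTAYIAKQRQISF": "c1"}

merged = collapse_identical(records)
representatives = pick_representatives(merged, cluster_of)
```

Only the sequences in `representatives` would then feed domain architecture queries and domain counts, while `merged` keeps the alternative IDs available.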
Domain architecture invention dating
As a further step from the single domain to the understanding of multi-domain proteins, SMART now predicts the taxonomic class where the concept of a protein, that is, its domain architecture, was invented. The domain architecture is defined as the linear order of all SMART domains in the protein sequence. To derive the point of its invention, all proteins with the same domain architecture are mapped onto NCBI's taxonomy. The last common ancestor of all organisms containing at least one protein with the domain architecture is defined as the point of its origin.
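As a rough illustration of the last-common-ancestor step, the sketch below takes one root-to-species lineage per organism and returns the deepest taxon they all share. The function name and the abbreviated lineages are assumptions for this example; real NCBI lineages contain many more ranks.

```python
def last_common_ancestor(lineages):
    """Return the deepest taxon shared by every root-to-species lineage."""
    if not lineages:
        return None
    common = None
    # Walk the lineages rank by rank, stopping at the first disagreement.
    for level in zip(*lineages):
        if all(taxon == level[0] for taxon in level):
            common = level[0]
        else:
            break
    return common

# Abbreviated, illustrative lineages for organisms sharing one architecture.
lineages = [
    ["cellular organisms", "Eukaryota", "Metazoa", "Chordata", "Homo sapiens"],
    ["cellular organisms", "Eukaryota", "Metazoa", "Arthropoda", "Drosophila melanogaster"],
    ["cellular organisms", "Eukaryota", "Fungi", "Saccharomyces cerevisiae"],
]
origin = last_common_ancestor(lineages)
```

With these three lineages the predicted point of origin is the node where the animal and fungal branches meet.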
Changes from version 4.0 to 4.1
Two modes of operation: Normal and Genomic
For more details, visit the change mode page.
Intrinsic protein disorder prediction
You can now include DisEMBL prediction of protein disorder in your search parameters. The results are displayed as blue regions in protein schematics. DisEMBL's HOTLOOPS and REM465 methods are currently used. Visit the DisEMBL page for more information on the method.
Catalytic activity check
SMART now includes data on specific requirements for catalytic activity for some of our domains (50 catalytic domains at the moment). If the required amino acids are not present in the predicted domain, it will be marked as Inactive. The domain annotation page will show you the details on which amino acids are missing and the links to relevant literature.
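A minimal sketch of such a catalytic activity check follows. How SMART actually stores and compares the required residues is not described here, so the position-to-residue mapping, the function name and the example sequences are all hypothetical.

```python
def missing_catalytic_residues(domain_seq, required):
    """Report required positions whose residue is absent or not allowed.

    required maps a 1-based position within the domain to the set of
    residues accepted there (hypothetical representation).
    """
    missing = {}
    for pos, allowed in required.items():
        found = domain_seq[pos - 1] if 0 < pos <= len(domain_seq) else None
        if found not in allowed:
            missing[pos] = found
    return missing

# Invented requirement: Asp at position 3, Lys or Arg at position 6.
required = {3: {"D"}, 6: {"K", "R"}}

active_result = missing_catalytic_residues("GADLGKT", required)
inactive_result = missing_catalytic_residues("GANLGKT", required)

# A domain would be labelled Inactive when anything is missing.
status = "Inactive" if inactive_result else "Active"
```

The returned dictionary doubles as the detail an annotation page could display: which positions failed and what was found instead.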
Taxonomic trees
Architecture query results are now displayed as simple taxonomic trees. In addition to individual proteins, you can select any taxonomic node (or multiple nodes) and display all the proteins in those nodes. The Evolution section of domain annotation pages is now also represented as a taxonomic tree. The basic tree contains only several hand-picked representative species, with a link to the full tree.
User interface redesigned
SMART has been completely rewritten and all pages conform to the XHTML 1.0 Strict and CSS Level 2 standards. We recommend a modern standards-compliant browser for the best experience. Mozilla Firefox and Opera are our favorites.
Changes from version 3.5 to 4.0
Intron positions are shown in protein schematics
For proteins that match any of the Ensembl predictions, SMART will show intron positions as vertical coloured lines in graphical representations. This information is retrieved from a pre-calculated mapping of Ensembl gene structures to protein sequences.
The vertical line at the end of the protein is not an actual intron, but a mark to show that intron mapping was performed. If that is the only line, there are no introns annotated. If there is no line at all, there is no data available in Ensembl for that particular sequence.
You can switch off intron display on your SMART preferences page.
Alternative splicing information
Since SMART now incorporates Ensembl genomes, the Additional information page shows a list of alternative splices of the gene encoding the analyzed protein (if there are any). It is possible to either display the SMART protein annotation for any of the alternative splices or get a graphical multiple sequence alignment of all of them.
Orthology information
There are 2 separate sets of orthologs for each Ensembl protein: 1:1 reciprocal best matches in other genomes, and orthologous groups with reciprocal best hits from all genomes analyzed (i.e. each of these proteins has exactly one ortholog in all 6 genomes). This data is displayed on the Additional information page.
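The idea of 1:1 reciprocal best matches between two genomes can be sketched as below. The score dictionaries, identifiers and function names are invented for illustration; the similarity search that would produce the scores is outside the scope of this sketch.

```python
def best_hits(scores):
    """Map each query to its highest-scoring subject.

    scores maps (query, subject) pairs to a similarity score.
    """
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}

def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Pairs (a, b) where b is a's best hit and a is b's best hit."""
    forward = best_hits(a_vs_b)
    reverse = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)

# Hypothetical scores between proteins of genome A (h*) and genome B (m*).
a_vs_b = {("h1", "m1"): 950, ("h1", "m2"): 400,
          ("h2", "m2"): 700, ("h3", "m1"): 500}
b_vs_a = {("m1", "h1"): 940, ("m1", "h3"): 480,
          ("m2", "h1"): 420, ("m2", "h2"): 690}

pairs = reciprocal_best_hits(a_vs_b, b_vs_a)
```

Here `h3` finds `m1` as its best hit, but `m1` prefers `h1`, so `h3` is left without a 1:1 ortholog.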
Graphical multiple sequence alignments
Orthologous groups and different alternative splices can be displayed as graphical multiple sequence alignments. Proteins are aligned using ClustalW. Domains, intrinsic features and introns are mapped onto the alignment with their positions adjusted according to gaps (black boxes).
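Adjusting feature positions according to gaps amounts to translating ungapped residue numbers into alignment columns. The following sketch shows one way to do that; the function names and the toy aligned sequence are assumptions for illustration.

```python
def column_map(aligned_seq, gap="-"):
    """Map 1-based ungapped residue positions to 1-based alignment columns."""
    mapping = {}
    residue = 0
    for column, char in enumerate(aligned_seq, start=1):
        if char != gap:
            residue += 1
            mapping[residue] = column
    return mapping

def project_feature(aligned_seq, start, end):
    """Translate a feature given in residue numbers into alignment columns."""
    columns = column_map(aligned_seq)
    return columns[start], columns[end]

# Toy aligned sequence: residues 3..7 (T..A) span two gap regions.
aligned = "MK--TAYI-AKQ"
feature_columns = project_feature(aligned, 3, 7)
```

A domain covering residues 3 to 7 of the ungapped protein is therefore drawn from column 5 to column 10 of the alignment.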
Changes from version 3.4 to 3.5
Features of all Ensembl genomes are stored in SMART
You can use standard Ensembl protein identifiers (for example ENSP00000264122) and sequences in all queries.
Batch access
Quickly display results for hundreds of sequences. Currently limited to 500 sequences per access, 2000 per day.
Smarter handling of identical sequences
If there are multiple IDs associated with a particular sequence, an extra table will be displayed showing all of them.
Changes from version 3.3 to 3.4
Search structure based profiles using RPS-Blast
Clicking on the search schnipsel and structures checkbox will now also initiate a search of profiles based on SCOP domain families using RPS-Blast. These profiles were kindly provided by Steffen Schmidt (see Schmidt et al., J Chem Inf Comput Sci 2002 (42) 405-7).
Improved Architecture analysis
In addition to standard Domain selection querying, it is now possible to do queries based on GO (Gene Ontology) terms associated with domains. In the first step you get a list of domains matching the GO terms entered. After selecting the domains of interest from the list, proteins containing those domains are displayed. Use the Taxonomic selection box to limit the results based on taxonomic ranges.
Pfam domains are stored in the database
The SMART database now contains precomputed results for all Pfam domains. To use Pfam domains in the architecture queries, prepend the domain name with Pfam: (for example TyrKc AND Pfam:Fz AND TRANS).
Try our SMART Toolbar for the Mozilla web browser.
Changes from version 3.2 to 3.3
Fantastic new protein picture generator
Proteins are now displayed as dynamically generated PNG images. This means that you can download the entire protein representation as a single image. You have our permission to do this and use these diagrams in any way that you like; do acknowledge us, though. All domain bubbles have been script-generated using The Gimp and its Perl-Fu extension. The script is GPL'd and is available for download.
Fancy colour alignments
All alignments we generate are coloured in using Leo Goodstadt's excellent CHROMA program. CHROMA is available for download and it's great. You may experience some problems if you're using a klunky browser. This will be fixed when you change your browser.
Excellent transmembrane domains
These are now calculated using the fine TMHMM2 program, kindly provided by Anders Krogh and co-workers.
Selective SMART
We now store intrinsic features such as transmembrane domains and signal peptides in our database. This means that these can be queried for (e.g. SIGNAL AND TyrKc). The feature names are: signal peptide, SIGNAL; transmembrane domain, TRANS; coiled coil, COIL.
Technical changes
SMART code has been modified to run under the Apache mod_perl module. SMART is now using Apache::DBI for persistent database connections. The database engine has been updated to PostgreSQL 7.1. These changes resulted in significant speed improvements.
Bug fixes, as usual
The show_many_proteins script now uses the POST method, so there is no longer a limit on the number of proteins you can display. Fixed a display problem with proteins having thousands of different representations (if you try to display those, make sure you have a good browser, like Mozilla or Opera, and a bunch of RAM).
Changes from version 3.1 to 3.2
Start page
There is an indicator of the current database status in the header now. If the database is down or is being updated, you'll know immediately.
Literature
Changed literature identifiers to PMID. New secondary literature generator that parses all neighbouring papers, not just the first 100.
Numerous small bug fixes and improvements
Changes from version 3.0 to 3.1
Startup page
The start page now includes selective SMART and allows searching for keywords in the annotation of domains.
Schnipsel Blast
The results of a schnipsel Blast search are included in the bubble diagram. Additionally, PDB is searched.
Taxonomic breakdown
When selecting multiple proteins, e.g. via selective SMART or the annotation pages, an overview of the taxonomy of all species (tax break) is offered.
Links
Links don't contain version numbers. This allows stable links from external sources.
Selective SMART
Allows searching for multiple copies of domains and is case insensitive. You can now search for e.g. Sh3 AND sH3 AND sh3.
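The matching behaviour described for selective SMART (case-insensitive names, with repeated names meaning multiple copies) can be sketched as follows. This is not SMART's query engine: only the AND operator is handled, and the function names and example feature lists are assumptions.

```python
import re
from collections import Counter

def parse_query(query):
    """Split an AND-only query into a multiset of lower-cased names."""
    terms = re.split(r"\s+AND\s+", query.strip(), flags=re.IGNORECASE)
    return Counter(term.strip().lower() for term in terms if term.strip())

def matches(query, protein_features):
    """True if the protein carries every requested name often enough."""
    needed = parse_query(query)
    present = Counter(feature.lower() for feature in protein_features)
    return all(present[name] >= count for name, count in needed.items())

# Hypothetical proteins described by their ordered domain/feature names.
three_sh3 = ["SH3", "SH3", "SH3", "TyrKc"]
two_sh3 = ["SH3", "SH3", "TyrKc"]

hit_three = matches("Sh3 AND sH3 AND sh3", three_sh3)
hit_two = matches("Sh3 AND sH3 AND sh3", two_sh3)
```

Counting names with a multiset is what makes `Sh3 AND sH3 AND sh3` a request for three copies rather than a redundant single term.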
Annotation
You can align your query sequence to the SMART alignment using hmmalign.
Update of underlying database
SMART now uses PostgreSQL 6.5.2.
Changes from version 2.0 to 3.0
Digest output
SMART now only produces a single diagram representing a best interpretation of all the annotation that has been performed. A comprehensive summary of the results is also provided in table format.
Selective SMART
Selective SMART allows you to look for proteins with combinations of specific domains in different species or taxonomic ranges.
Alert SMART
The SMART database gets updated about once a week. If you are interested in specific domains, or combinations of them, in specific taxonomic ranges, you can use the SMART alerting service. This provides the identities of newly-deposited proteins that match your query.
Domain queries
You can ask for proteins having the same domain order/composition as your query protein.
SMARTed Genomes
You can get the result of SMARTing the genomes of Caenorhabditis elegans and Saccharomyces cerevisiae via the annotation pages.
Faster PFAM searches
The PFAM searches now run on a PVM cluster.
Changes from version 1.03 to 2.0
Domain coverage
The original set of signalling domains has now been extended to include extracellular domains.
Default search method is now HMMer
The previously-used underlying methodology of SMART (i.e. SWise) tended to over-extend gaps, leading to problems in defining domain borders. Although a SWise search of the SMART database remains possible (using the Wise searching SMART database option available from the Home Page), the default searching method is now HMMer2. This now allows improved statistical estimates (Expectation- or E-values) of the significance of a domain hit.
Complete rewrite of the annotation pages
As with the previous version (1.03), annotation pages provide information concerning domain functions and include hyperlinks to PubMed. However, we now offer automatically-derived data showing the taxonomic range and the predicted cellular localisation of proteins containing the domain in question.
Literature database
In addition to the manually derived literature sources given in version 1.03, we now provide automatically-derived secondary literature from the annotation pages. These are extracted from PubMed and are cross-linked to additional abstracts.
The CATH domain structure database: new protocols and classification levels give a more comprehensive resource for exploring evolution
Lesley H. Greene, Tony E. Lewis, Sarah Addou, Alison Cuff, Tim Dallman, Mark Dibley, Oliver Redfern, Frances Pearl, Rekha Nambudiry, Adam Reid, Ian Sillitoe, Corin Yeats, Janet M. Thornton and Christine A. Orengo
WHAT'S NEW
We report the latest release (version 3.0) of the CATH protein domain database (http://www.cathdb.info). There has been a 20% increase in the number of structural domains classified in CATH, up to 86 151 domains. Release 3.0 comprises 1110 fold groups and 2147 homologous superfamilies. To cope with the increases in diverse structural homologues being determined by the structural genomics initiatives, more sensitive methods have been developed for identifying boundaries in multi-domain proteins and for recognising homologues. The CATH classification update is now being driven by an integrated pipeline that links these automated procedures with validation steps that have been made easier by the provision of information-rich web pages summarising comparison scores and relevant links to external sites for each domain being classified. An analysis of the population of domains in the CATH hierarchy and several domain characteristics are presented for version 3.0. We also report an update of the CATH Dictionary of homologous structures (CATH-DHS), which now contains multiple structural alignments, consensus information and functional annotations for 1459 well populated superfamilies in CATH. CATH is directly linked to the Gene3D database, which is a projection of CATH structural data onto ~2 million sequences in completed genomes and UniProt.
GENERAL INTRODUCTION
The number of new structures being deposited in the Protein Data Bank (PDB) continues to grow at a considerable rate. In addition, structures being targeted by worldwide structural genomics initiatives are more likely to be novel or only very remotely related to domains previously classified. Only 2% of structures currently solved by conventional crystallography or NMR are likely to adopt novel folds (see Figures 1 and 2). A higher proportion of new folds is expected to come from structural genomics structures. Although the influx of more diverse structures and subsequent analysis will inform our understanding of how domains evolve, it has resulted in increasing lags between the numbers of structures being deposited and classified. In response to this situation, we have significantly improved our automated and manual protocols for domain boundary assignment and homologue recognition.
Figure 1. Annual decrease in the percentage of new structures classified in CATH which are observed to possess a novel fold. The raw data for years 1972-2005 was fit to a single exponential equation by nonlinear regression using Sigma Plot (SPSS Version
Figure 2. Annual proportion of protein structures deposited in the PDB which are classified in CATH, rejected or pending classification. The colour scheme reflects different categories of PDB chains. Black: not accepted by the CATH criteria. Red: unprocessed chains
Significant changes have been implemented in the CATH classification protocol to achieve a more highly automated system. A seamless flow of structures between the constituent programs has been achieved by building a pipeline which integrates web services for each major comparison stage in the classification (see Figure 3). Secondly, completely automatic decisions are now being made for new protein chains with close relatives already assigned in the CATH database. There are two situations that preclude the CATH update process from being fully automated. We rely on expert manual curation for particularly challenging protein domain boundary assignments (DomChop stage) and also for classifications of remote folds and homologues (HomCheck stage). These two manual stages will remain an integral part of the system (Figure 3).
Figure 3. Flow diagram of the CATH classification pipeline. This schematic illustrates the processes involved in classifying newly determined structures in CATH. The CATH update protocol workflow from new chain to assigned domain is split into two main processes
In this paper we report our ongoing development of the automated procedures. These critical new features should better enable CATH to keep pace with the PDB (3) and facilitate its development. Key statistics on the domain structure populations and characteristics are also presented.
A REVISED CATH CLASSIFICATION HIERARCHY: CATHSOLID
In order to provide information on the sequence diversity between superfamily members, we have introduced additional levels into the CATH hierarchy. The CATH hierarchical classification scheme now consists of nine levels. Class is derived from secondary structure content, and Architecture describes the gross orientation of secondary structures, independent of connectivity. The Topology level clusters structures into fold groups according to their topological connections and numbers of secondary structures. The Homologous superfamilies cluster proteins with highly similar structures, sequences and/or functions (4,5). The new extension of the CATH classification system now includes five 'SOLID' sequence levels. S, O, L and I further divide domains within the H-level using multi-linkage clustering based on similarities in sequence identity (35, 60, 95 and 100%) (see Table 1). The D-level acts as a counter within the I-level and is appended to the classification hierarchy to ensure that every domain in CATH has a unique CATHsolid identification code (see Table 1). Specific details on the nature of the SOLID-levels can be found in the 'General Information' section of the CATH website, http://www.cathdb.info. CATH only includes experimentally determined protein structures with a 4 Å resolution or better, 40 residues in length or longer, and having 70% or more side chains resolved.
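The inclusion criteria and the nine-level identifier can be illustrated with a short sketch. The class and function names are hypothetical, and joining the nine level numbers with dots is an assumption about the code's textual form rather than something stated in the text above.

```python
from dataclasses import dataclass

@dataclass
class Structure:
    resolution_angstrom: float            # lower is better
    length: int                           # residues
    fraction_side_chains_resolved: float  # 0.0 to 1.0

def meets_cath_criteria(structure):
    """Apply the three inclusion thresholds quoted in the text."""
    return (structure.resolution_angstrom <= 4.0
            and structure.length >= 40
            and structure.fraction_side_chains_resolved >= 0.70)

def cathsolid_code(c, a, t, h, s, o, l, i, d):
    """Join the nine level numbers; the dotted form is assumed here."""
    return ".".join(str(level) for level in (c, a, t, h, s, o, l, i, d))

good = Structure(resolution_angstrom=2.1, length=159,
                 fraction_side_chains_resolved=0.95)
too_short = Structure(resolution_angstrom=1.8, length=35,
                      fraction_side_chains_resolved=0.99)

code = cathsolid_code(3, 40, 50, 720, 1, 1, 1, 1, 2)
```

The first four numbers identify the C, A, T and H levels; the S, O, L and I numbers name the nested sequence clusters, and the final D counter separates domains that fall in the same 100% identity cluster.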
Table 1. CATH version 3.0 statistics
DOMAIN BOUNDARY ASSIGNMENTS
We have further improved our automated domain boundary prediction method, CATHEDRAL (4). This is used to search a newly determined multi-domain structure against a library of representative structures from different fold groups in the CATH database to recognise constituent domains. CATHEDRAL performs an initial rapid secondary structure comparison between structures, using graph theory to identify putative fold matches, which are then more carefully aligned using a slower, more accurate dynamic programming method. A new scoring scheme has been implemented which combines information on the class of the domains being compared, their sizes, similarity in the structural environments and the number of equivalent residues. A support vector machine is used to combine the different scores and select the best fold match for each putative domain in a new multi-domain protein.
Benchmarking against a set of 964 'difficult' multi-domain chains, whose 1593 constituent domains were remotely related to folds in CATH (<35% sequence identity) and originated from 245 distinct fold groups and 462 superfamilies, showed that 90% of domains within these chains could be assigned to the correct fold group, and for 78% of them the domain boundaries were within ±15 residues of boundaries assigned by careful manual validation. Larger variations in domain boundaries are often due to the fact that in many families significant structural variation can occur during evolution, so that distant relatives vary considerably in size. If no close relative has been classified in CATH, it is likely that the only CATHEDRAL match will be to a relative with significant structural embellishments, thus making it harder to determine the correct boundaries.
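One way to express the "within ±15 residues" comparison is sketched below. The exact scoring used in the benchmark is not given in the text, so this is an assumed simplification: each domain is a single contiguous (start, end) segment, and predicted and reference domains are paired in sequence order.

```python
def boundaries_agree(predicted, reference, tolerance=15):
    """Compare two domain assignments for one chain.

    Each assignment is a list of (start, end) residue pairs, one per
    domain. They agree when the domain counts match and every start
    and end differs by at most `tolerance` residues.
    """
    if len(predicted) != len(reference):
        return False
    return all(
        abs(p_start - r_start) <= tolerance and abs(p_end - r_end) <= tolerance
        for (p_start, p_end), (r_start, r_end)
        in zip(sorted(predicted), sorted(reference))
    )

# Invented example: manual curation versus two automated predictions.
manual = [(1, 110), (111, 260)]
close_prediction = [(1, 120), (121, 260)]
far_prediction = [(1, 150), (151, 260)]

close_ok = boundaries_agree(close_prediction, manual)
far_ok = boundaries_agree(far_prediction, manual)
```

Discontinuous domains, which do occur in multi-domain chains, would need a segment-wise extension of this comparison.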
Since domain boundary assignment of remote homologues is one of the most time-consuming stages in the classification, we combine multiple sources of information for each new structure on a web page to guide manual curation. Pages display scores from a range of algorithms, which include structure-based methods, CATHEDRAL (4), SSAP (6,7), DETECTIVE (8), PUU (9) and DOMAK (10); sequence-based methods such as hidden Markov Models (HMMs) (11); and relevant literature. These pages are now viewable for information on putative boundaries for new multi-domain structures currently being classified in CATH (e.g. http://www.cathdb.info/cgi-bin/cath/Chain.pl?chain_id=2g3aA).
For protein chains which are closely related to chains that are already chopped in CATH, an automated protocol has been developed (ChopClose). ChopClose identifies any previously chopped chains that have sufficiently high sequence identity and overlap with the query chain. Using SSAP (6), the query is aligned against each of these chains in turn, and in each case the domain boundaries are inherited across the alignment. The process of inheriting the boundaries often requires some adjustments to be made to account for insertions, deletions or unresolved residues. If the inheritance from one of the chains meets various criteria (SSAP score ≥80, sequence identity >80%, RMSD ≤6.0 Å, longest end extension ≤10 residues, etc.), then the resulting boundaries are used to chop the chain automatically (AutoChop). For cases where ChopClose's best result does not meet all the criteria for automatic chopping, it is provided as support information for a manual domain boundary assignment. Refer to Figure 3 for the location of AutoChop/ChopClose within the CATH update protocol.
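The decision between automatic chopping and manual review can be sketched as a simple threshold check. Only the four criteria named in the text are modelled (the text ends its list with "etc.", so the real protocol applies more), and the class, field and function names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Inheritance:
    """Result of inheriting boundaries from one previously chopped chain."""
    ssap_score: float
    sequence_identity_pct: float
    rmsd_angstrom: float
    longest_end_extension: int  # residues

def can_autochop(result):
    """Check the four thresholds quoted in the text; others are omitted."""
    return (result.ssap_score >= 80
            and result.sequence_identity_pct > 80
            and result.rmsd_angstrom <= 6.0
            and result.longest_end_extension <= 10)

def route(results):
    """Return 'AutoChop' if any inheritance passes, else 'manual'."""
    return "AutoChop" if any(can_autochop(r) for r in results) else "manual"

# Invented inheritance results for two query chains.
confident = [Inheritance(92.5, 97.0, 1.2, 3)]
borderline = [Inheritance(85.0, 78.0, 2.0, 4), Inheritance(79.0, 90.0, 3.1, 2)]

confident_route = route(confident)
borderline_route = route(borderline)
```

In the borderline case each candidate fails a different criterion, so the best result would only be shown to a curator as supporting information.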
NEW HOMOLOGUE RECOGNITION METHODS
We have assessed a number of HMM-based protocols for improving homologue recognition. A new protocol (Samosa), exploiting models built using multiple structure alignments to improve accuracy, gives some improvements in sensitivity (4-5%). However, a protocol exploiting an 8-fold expanded HMM library based on sequence relatives of structural domains gives an increase of nearly 10% in sensitivity (12). In addition, HMM-HMM-based approaches have been implemented using the PRC protocol of Madera and co-workers (http://supfam.org/PRC). These allow recognition of extremely remote homologues, some of which are not easily detected by the structure comparison methods (discussed further below). HMM-based database scans developed for the CATH classification protocol are collectively referred to as HMMscan below.
For some very remotely related homologues, confidence in an assignment can be improved by combining information from multiple prediction methods. We have investigated the benefits of using machine learning methods to do this automatically. A neural network was trained using a dataset of 14 000 diverse homologues (<35% sequence identity) and 14 000 non-homologous pairs, with data from different homologue comparison methods including structure comparison (CATHEDRAL, SSAP), sequence comparison (HMM-HMM) and information on functional similarity. The latter was obtained by comparing EC classification codes between close relatives of the distant homologues and using a semantic similarity scoring scheme for comparing GO terms based on a method developed by Lord et al. (13). On a separate validation set of 14 000 homologous pairs and 14 000 non-homologous pairs, 97% of the homologues can be recognised at an error rate of <4%.
NEW UPDATE PROTOCOL
Automatic methods
Previously, CATH data was generated using a group of independent programs and flat files. Over the past two years we have developed an update protocol for CATH that is driven by a suite of programs with a central library and a PostgreSQL database system. A classification pipeline has been established which links, in a completely automated fashion, the different programs that analyse the sequences and structures of both protein chains and domains. The CATH update protocol can essentially be divided into two parts: domain boundary assignment and domain homology classification (see Figure 3). The aim of the protocol is to minimise manual assignment and provide as much support as possible when manual validation is necessary.
Processing in both parts of the classification protocol is similar, requiring related meta-data and the triggering of the same automated algorithms. Methods include pairwise sequence similarity comparisons and scans by other homologue detection or fold recognition algorithms, such as HMMscan and CATHEDRAL, that provide data for either manual or automated assignment. Many of the automated steps in the protocol have been established as a web service, and the pipeline integrates automated steps together with 'holding stages' in which domains are held prior to processing and await the completion of manual validation of predictions (see below).
Web pages to support manual validation
For each manual stage in the classification (domain boundary assignment, DomChop, and homologue recognition, HomCheck), we have developed a suite of web pages bringing together all available meta-data from prediction algorithms (e.g. DBS, CATHEDRAL and HMMscan for DomChop; CATHEDRAL and HMMscan for HomCheck), information from the literature and information from other family classifications with relevant data (e.g. Pfam). For each protein or domain shown on the pages, information on the statistical significance of matches is presented. The web pages will shortly be made viewable and will provide interim data on protein chains and domains not fully classified in CATH for biologists interested in any entries pending classification.
OVERVIEW OF THE CURRENT RELEASE (VERSION 3.0)
Assigning domain boundaries and relationships between protein structures is computationally challenging. Since the last CATH release (version 2.6), the number of domains in the CATH database has increased by 20% in version 3.0 and now totals 86 151. This is a more than 10-fold increase in the number of domains classified in CATH since its creation. Improvements in automation, and also in the web-based resources used to aid manual validation, have allowed us to increase the proportion of hard-to-classify structures processed in CATH, and this is reflected in a significant increase in the proportion of new folds in the database, now more than 1000. The detailed breakdown of numbers of domains in the nine CATH levels is given in Table 1. We conducted an analysis of the domains in version 3.0 and have derived statistics for several fundamental features.
Percentage of new topologies
An analysis of the percentage of new folds arising from the early 1970s to the present is shown in Figure 1. The number of new folds has been decreasing over time with respect to the number of new structures being deposited, and it can be seen that currently approximately 2% of new structures classified in CATH are observed to be novel folds. For comparison, the number of domain structures solved over time is also graphically represented in Figure 1.
Number of domains within a protein chain
Integral to the construction of the CATH database is designating domain boundaries. We conducted an analysis of the number of chains versus the number of domains in a chain. It is interesting to note that 64% of all protein structures currently solved and classified in CATH are single-domain chains (data not shown). The next most prevalent are two-domain chains (27%), and following this we find that the number of chains containing three or more domains rapidly decreases. The average size of the single-domain chains is 159 residues.
The CATH Dictionary of Homologous Superfamilies (CATH-DHS)
The CATH-DHS has also been recently updated. Data on structural similarity and superfamily variability is presented as a significant update to the Dictionary of Homologous Superfamilies (DHS) web resource (14). The DHS also provides functional annotations of domains within each H-level (superfamily) in CATH v2.5.1. For each superfamily, pair-wise structural similarity scores between relatives, measured by SSAP, are presented. The DHS now contains 3307 multiple structural alignments for 1459 superfamilies. For each superfamily, multiple alignments are generated for all the relatives and also for subgroups of structurally similar relatives and sequence-similar relatives. Alignments are performed using the residue-based CORA algorithm (15) and presented both as CORAPLOTS (14) and in the form of a 2DSEC diagram (16), alongside co-ordinate data of the superposed structures in PDB format. Sequence representations of the alignments are available to download in FASTA format. In the CORAPLOT images of the multiple alignment, residues in each domain are coloured according to ligand binding and residue type. EquivSEC plots are also shown that describe the variability in orientation and packing between equivalent secondary structures (16).
To identify sequence relatives for CATH superfamilies, sequences from UniProt (17) were scanned against HMMs of all CATH domains (12). Homologous sequences were identified as those hits with an E-value < 0.01 and a 60% residue overlap with the CATH domain. This protocol recognised over one million domain sequences in UniProt which could be integrated in the CATH-DHS. The harvested sequences in each superfamily were compared against other relatives by BLAST (18) to determine the pair-wise sequence identity and then clustered at appropriate levels of sequence identity (35 and 95%) using multi-linkage clustering. Information and links to other functional databases, ENZYME (19), GO (Gene Ontology Consortium 2000), KEGG (20), COG (21) and SWISSPROT (22), are also included by BLASTing the sequences from each superfamily against sequences provided by these resources. Only hits with 95% sequence identity and an 80% residue overlap were used to annotate sequences.
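The hit-selection rule can be sketched as a small filter. The E-value cutoff is taken here as 0.01 and the overlap as the fraction of the CATH domain covered by the match; both readings, along with the record layout and names, are assumptions for illustration.

```python
def is_homologous_hit(evalue, aligned_residues, domain_length,
                      max_evalue=0.01, min_overlap=0.60):
    """Accept a hit with a significant E-value that covers enough of the domain."""
    overlap = aligned_residues / domain_length
    return evalue < max_evalue and overlap >= min_overlap

# Invented hits: (sequence ID, E-value, aligned residues, domain length).
hits = [
    ("seqA", 1e-20, 140, 159),   # strong and nearly full length
    ("seqB", 1e-5, 60, 159),     # significant but covers too little
    ("seqC", 0.5, 150, 159),     # long but not significant
]

accepted = [seq_id for seq_id, evalue, aligned, length in hits
            if is_homologous_hit(evalue, aligned, length)]
```

The stricter annotation rule in the text (95% identity with 80% overlap) could reuse the same shape of filter with an identity field added.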
Recent analysis of structural and functional divergence in highly populated CATH superfamilies (>5 structural relatives with <35% sequence identity) has been undertaken using data from the DHS. The 2DSEC algorithm was used to analyse multiple structural alignments of families and identify highly conserved structural cores and secondary structure embellishments or decorations to the common core. In some large superfamilies, extensive embellishments were observed outside the core, and although these secondary structure insertions were frequently discontinuous in the protein chain, they were often co-located in 3D space (16). In many cases manual inspection revealed that the embellishment had aggregated to form a larger structural feature that was modifying the active site of the domain or creating new surfaces for domain or protein interactions. Data collected in the DHS clearly shows a relationship between structural divergence within a superfamily, sequence divergence of this superfamily amongst predicted domains in the genomes, and the number of distinct functional groups that can be identified for the superfamily (see Figure 4).
Figure 4. Relationship between sequence variability, structural variability and functional diversity in CATH superfamilies. Structural variation in a CATH superfamily, as measured by the number of diverse structural subgroups (SSAP score <80 between groups)
LATERAL LINKS ACROSS THE CATH HIERARCHY TO CAPTURE EVOLUTIONARY DIVERGENCE AND EXPLORE THE STRUCTURAL CONTINUUM
Our analysis of structural divergence in CATH superfamilies (16) has revealed families where significant changes in the structures had occurred. In some cases, 5-fold differences in the sizes of domains were identified, and sometimes it was apparent that the 'folds' of these very diverse relatives had effectively changed. Therefore, in these superfamilies more than one fold group can be identified, effectively breaking the hierarchical nature of the CATH classification, which implies that each relative within a CATH homologous superfamily should belong to the same CAT fold group.
In addition, an 'all versus all' HMM-HMM scan between all superfamily representatives revealed several cases of extremely remote homologues which had been classified into separate superfamilies and yet match with significant E-values. Structure comparison had failed to detect the relationship between these superfamilies because the structural divergence of the relatives was so extreme, sometimes constituting a change in architecture as well as fold group. In these cases homology was only suggested by the HMM-based scans and then manually validated by considering functional information and detailed evidence from the literature. In order to capture information on these distant homologies, links have been created between the superfamilies, both on our web pages and in the CATH database. The data can now be found as a link from the CATH homepage (http://www.cathdb.info).
In the near future we also plan to provide web pages presenting cases of significant structural overlaps between superfamilies or fold groups. For these cases we are not currently able to find any additional evidence to support a distant evolutionary relationship, and these examples highlight the recurrence of large structural motifs between some folds and the existence of a structural continuum in some regions of fold space.
FEATURES
The CATH database can be accessed at http://www.cathdb.info. The web interface may be browsed or, alternatively, searched with PDB codes or CATH domain identifiers. There is also a facility for keyword searches. With the version 3.0 release we now make the raw and processed data files available, which include, for example, CATH domain PDB files, sequences and DSSP files; they can be accessed through the CATH database main page. The Gene3D resource can be accessed through the CATH database or directly at http://www.cathdb.info/Gene3D. The DHS can be accessed through the CATH database or directly at http://www.cathdb.info/bsm/dhs.
Structural similarity is assessed using an automatic method (SSAP) (3,4), which scores 100 for identical proteins and generally returns scores above 80 for homologous proteins. More distantly related folds generally give scores above 70 (Topology or fold level), though in the absence of any sequence or functional similarity this may simply represent examples of convergent evolution, reinforcing the hypothesis that there exists a limited number of folds in nature (5,6).
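As a rough illustration of how the SSAP score ranges quoted above might be read, the sketch below maps a score to the interpretation given in the text. The function name and the returned labels are invented for this example; the thresholds (100, 80, 70) are the only values taken from the notes, and a real assignment also weighs sequence and functional evidence.

```python
def interpret_ssap(score):
    """Map an SSAP score (0-100) to the rough interpretation given in the text."""
    if score >= 100:
        return "identical"
    if score > 80:
        return "probably homologous"
    if score > 70:
        return "similar fold (topology level)"
    return "no clear structural similarity"


# Invented example scores, for illustration only.
examples = {score: interpret_ssap(score) for score in (100, 86.5, 74, 52)}
```

A score in the 70 to 80 band on its own does not establish homology: as the text notes, without sequence or functional similarity it may reflect convergent evolution.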
Abstract
We report the latest release (version 1.4) of the CATH protein domains database (http://www.biochem.ucl.ac.uk/bsm/cath). This is a hierarchical classification of 13 359 protein domain structures into evolutionary families and structural groupings. We currently identify 827 homologous families in which the proteins have both structural similarity and sequence and/or functional similarity. These can be further clustered into 593 fold groups and 32 distinct architectures. Using our structural classification and associated data on protein functions stored in the database (EC identifiers, SWISS-PROT keywords and information from the Enzyme database and literature), we have been able to analyse the correlation between 3D structure and function. More than 96% of folds in the PDB are associated with a single homologous family. However, within the superfolds, three or more different functions are observed. Considering enzyme functions, more than 95% of clearly homologous families exhibit either single or closely related functions, as demonstrated by the EC identifiers of their relatives. Our analysis supports the view that determining structures, for example as part of a 'structural genomics' initiative, will make a major contribution to interpreting genome data.
Introduction FOR CATH
The CATH classification of protein domain structures was established in 1993 (1) as a hierarchical clustering of protein domain structures into evolutionary families and structural groupings, depending on sequence and structure similarity. There are four major levels, corresponding to protein class, architecture, topology or fold, and homologous family (Fig. 1). Since 1995, information about these structural groups and protein families has been accessible over the Web (http://www.biochem.ucl.ac.uk/bsm/cath), together with summary information about each individual protein structure (PDBsum) (2).
CATH consists of both phylogenetic and phenetic descriptors for protein domain relationships. At the lowest levels in the hierarchy, proteins are grouped into evolutionary families (homologous families) for having either significant sequence similarity (≥35% identity) or high structural similarity and some sequence similarity (≥20% identity).
The CATH database of protein structures contains approximately 18 000 domains organized according to their (C)lass, (A)rchitecture, (T)opology and (H)omologous superfamily. Relationships between evolutionarily related structures (homologues) within the database have been used to test the sensitivity of various sequence search methods in order to identify relatives in GenBank and other sequence databases. Subsequent application of the most sensitive and efficient algorithms, gapped BLAST and the profile-based method Position-Specific Iterated Basic Local Alignment Search Tool (PSI-BLAST), could be used to assign structural data to between 22% and 36% of microbial genomes in order to improve functional annotation and enhance understanding of biological mechanism. However, on a cautionary note, an analysis of functional conservation within fold groups and homologous superfamilies in the CATH database revealed that, whilst function was conserved in nearly 55% of enzyme families, function had diverged considerably in some highly populated families. In these families functional properties should be inherited far more cautiously, and the probable effects of substitutions in key functional residues must be carefully assessed.
Figure 1
Schematic representation of the (C)lass, (A)rchitecture and (T)opology/fold levels in the CATH database.
Figure 2
Snapshot of a web page showing data available in the CATH dictionary of homologous superfamilies for the subtilisin family (CATH id 3.40.50.200). Tables display the PDB codes for non-identical relatives in the family, together with EC identifier codes and information about the enzyme reactions. The multiple structural alignment shown has been coloured according to secondary structure assignments (red for helix, blue for strands).
The Architecture level in CATH groups proteins whose folds have similar 3D arrangements of secondary structures (e.g. barrel, sandwich or propeller), regardless of their connectivity, whilst the top level, Class, simply reflects the proportion of α-helix or β-strand secondary structures. Three major classes are recognised, mainly-α, mainly-β and α-β, since analysis revealed considerable overlap between the α+β and alternating α/β classes originally described by Levitt and Chothia (7).
Before classification, multidomain proteins are first separated into their constituent folds using a consensus method which seeks agreement between three independent algorithms (8). Whilst the protocol for updating CATH is largely automatic (9), several stages require manual validation, in particular establishing domain boundaries in proteins for which no consensus could be reached, and checking the relationships of very distant homologues and proteins having borderline fold similarity. Although there are plans to assign the more regular architectures automatically, all architecture groupings are currently assigned manually.
A homologous family Dictionary is now available within CATH, which contains functional data, where available, for each protein within a homologous family. This includes EC identifiers, SWISS-PROT keywords and information from the Enzyme database or the literature (Fig. 2). Multiple structure-based alignments are also available, coloured according to secondary structure assignments or residue properties, and there are schematic plots showing domain representations annotated by protein-ligand interactions (DOMPLOTS) (A.E. Todd, C.A. Orengo and J.M. Thornton, submitted to Protein Engng). The topology of each domain is illustrated by schematic TOPS diagrams (http://www3.ebi.ac.uk/tops; 10).
Figure 3
CATH wheel plot showing the population of homologous families in different fold groups, architectures and classes. The wheel is coloured according to protein class (red, mainly-α; green, mainly-β; yellow, αβ; blue, few secondary structures). The size of the outer wheel represents the number of homologous families in CATH, whilst each band in the outer wheel corresponds to a single fold family. The size of each fold 'band' therefore reflects the number of homologous families having that fold. It can be seen that most fold families contain a single homologous family. The superfold families are shown as paler bands containing many homologous families. The inner wheel shows the population of homologous families in the different architectures.
We have also recently set up a Web Server (11) which enables the user to scan the CATH database with a newly determined protein structure and identify possible fold similarities or evolutionary relationships. There are also plans to incorporate sequence searches (using BLAST or PSI-BLAST) (12) to identify a probable fold for a new sequence.
The latest release of CATH (version 1.4, April 1998) contains 9342 protein chains from the PDB (13), which divide into 13 359 domain folds. Currently 32 different architectures are recognised. Since the last release, three new architectures have been described, including the five-bladed α-β propeller. Grouping proteins on the basis of sequence, structure and functional similarity gives 827 evolutionary homologous families (H-level), whilst recognising more distant structural similarity with no accompanying sequence or function similarity gives rise to 593 different fold groups (T-level).
The population of the different levels in the CATH hierarchy is illustrated by the CATH wheel shown in Figure 3. It can be seen that several highly populated fold families, which we describe as superfolds (6) as they support a diverse range of sequences and more than three different functions, still account for nearly 30% of non-homologous structures.
Implications for Structural Genomics
As the sequence databases grow rapidly, the need to interpret these sequences and assign functions to specific genes becomes increasingly important. Many techniques exist for matching protein sequences and thereby inheriting functional information. However, for very distant homologues there is often no detectable sequence similarity despite conservation of 3D structure and function. For these cases, evolutionary relationships, and thereby functions, can only be assigned by comparing the structures. Therefore a number of structural genomics initiatives are being proposed (14) which aim to identify all the folds in nature, with the ultimate goal of being able to predict the function of a new protein from its known or probable structure. The important questions to ask are: how many more folds do we need to determine before we have the complete set, and how confident can we be in assigning function between proteins having similar structures?
In the current genomes, on average only 30-46% of sequences can be assigned to a structural family by recognising sequence similarity to a protein of known structure (15,16). With only ~600 unique structures currently in the PDB, compared with ~20 000 sequence families, it is clear that we still need to determine many more structures if we are to understand biology at the molecular level. However, analysis of recently deposited structural data is very revealing. Figure 4a illustrates the distribution of 2159 new structural domains classified in the 10 months from June 1997 to March 1998. A large proportion of these (79%) were clearly homologous (≥30% identity) to proteins of known structure.
Of the remaining 443 structures (Fig. 4b), corresponding to 'new' sequences, we found only 8% were novel folds, the remainder resembling a previously determined structure. Many of these, 199 (45%), could be identified as clear homologues by having significant structure and sequence similarity (SSAP ≥80 and ≥20% sequence identity). A further 169 (38%) were probable homologues: although the sequence identity was below 20%, they had functional similarity and/or gave significant scores using sequence search methods designed to detect very distant homologues (PSI-BLAST) (12). There remained a further 40 (9%) proteins which were analogous, i.e. they had the same fold as a previous entry but neither the sequence nor the function gave definite evidence of a common ancestor.
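The counts and percentages quoted in the two paragraphs above can be cross-checked with a little arithmetic. The sketch below is only a consistency check on the numbers in the text; the count of novel folds is not stated in the notes and is derived here as the remainder, which is an assumption.

```python
total_new_domains = 2159    # domains classified June 1997 to March 1998
new_sequence_domains = 443  # those not clearly homologous by sequence

counts = {
    "clear homologues": 199,
    "probable homologues": 169,
    "analogues": 40,
}
# Assumption: everything not counted above is a novel fold (the text gives only a percentage).
counts["novel folds"] = new_sequence_domains - sum(counts.values())

# Percentage of the 443 'new sequence' structures in each category, rounded as in the text.
percentages = {name: round(100 * n / new_sequence_domains) for name, n in counts.items()}

# Share of the 443 that structure assigns to a known homologous family.
homologue_share = round(
    100 * (counts["clear homologues"] + counts["probable homologues"]) / new_sequence_domains
)

# Share of all 2159 domains that were clearly homologous by sequence.
clearly_homologous_share = round(
    100 * (total_new_domains - new_sequence_domains) / total_new_domains
)
```

Under these assumptions the categories come out at 45%, 38%, 9% and 8%, the combined homologue share at 83%, and the sequence-recognisable share of all new domains at 79%, which agrees with the figures quoted in the text.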
At the homologous superfamily level in CATH, a more detailed analysis of enzyme functions showed that the majority of homologous enzyme families in CATH (>90%) contained proteins for which the first three EC identifiers were the same. Considering those families where homologues have significant sequence identity (≥20%) after structural alignment, 95% were found to have a single EC identifier, whilst for families where proteins have more than 30% sequence similarity we observed that 98% had a single EC code.
Although assigning function on the basis of homology is common practice, it is clear that some caution should be exercised, particularly where there is little or no sequence similarity. There are also some clear examples where homologues with significant sequence similarity perform different functions. The role of 'gene recruitment' is especially clear in the eye lens proteins, which function as enzymes in other cellular environments but which are used as structural proteins in this context (17). The extent of such 'gene recruitment' and context-sensitive function is really not known at this time. For enzymes it is clear that catalytic function can change and evolve, usually to act on a different but related substrate. Similarly, within the lipocalin family (CATH id 24013010) several proteins are found with very similar structures which bind different fatty acids in the same region at the base of the β-barrel (e.g. retinol, bilin, biotin).
Nearly half of the homologous families where two or more different EC numbers were observed belong to the superfolds. This suggests that if a new protein is assigned to a superfold family, more caution should be used when inheriting functional information, as there appears to be greater tolerance to changes in sequence, and ultimately function, for these families. However, it is interesting to note that many of these were TIM barrel or Rossmann folds. These are superfolds in which the substrate or ligand commonly binds in the same place: in the base of the β-barrel for the TIMs, and at the crossover of the polypeptide chain for the doubly wound Rossmann structures.
Assignment of Function Through Structure
One of the reasons for determining structures is to derive more information to facilitate the assignment of function. From our analysis of proteins in CATH we suggest that structural data can help to assign function in several ways:
i. The structural data allow recognition of more distant homologues compared with sequence data. In our analysis, 83% of structures with novel sequences could be assigned as homologues in this way (note that such assignment of function is again subject to the caveats imposed by 'gene recruitment' discussed above).
ii. The structural data allow detailed inspection of the functional site, to suggest if and how the function may have evolved. For example, if an enzyme has evolved to act on a different substrate, the binding site may reveal, or at least suggest, possible changes in the substrate.
iii. For the superfolds, similarity of structure does not necessarily mean similarity of function. However, the active site/binding sites are often conserved, e.g. in the TIM barrel or Rossmann fold structures the ligand always binds at the same end of the barrel or sheet.
iv. Some methods have already been developed, and will increasingly be the focus of attention over the next few years, which aim to predict function ab initio from structure. For example, enzymes can often be identified by the presence of a major cleft, which also locates the active site (18). Similarly, critical surface patches which are used for molecular recognition in binding other proteins or ligands may be identified using knowledge-based approaches (19,20).
In summary, extrapolating the data from Figure 4 to a new genome, we can expect that, of the 54-70% of sequences which currently have no obvious sequence matches in the PDB, we will find nearly 80-90% to be homologous to a known family using the structural data alone. For the singlet folds this will almost certainly reveal some clues to the function. For the superfolds, some folds will reveal information on the functional class (e.g. enzyme for TIM barrels) or the location of the active site, if not the specific function. Only 10-20% will be expected to be 'novel' folds. For these, the ab initio methods referred to above may provide some clues to guide experiments. Therefore it is clear that determining structures, as part of a 'structural genomics' initiative for example, will make a major contribution to interpreting genome data.
Jail
Just another interface library
Interfaces of macromolecules are a valuable basis to analyse the process of molecular recognition. JAIL classifies not only the interfaces between domain architectures but also those between protein chains and those between proteins and nucleic acids.
Of course not
Gt-alphaGi-alpha chimera (PDB-ID 1GOT) Interfaces of 1GOT
Interacting proteins are difficult to crystallize and are rarely present within the Protein Data Bank. Nevertheless, it is essential to analyse the interacting parts of the proteins to understand the process of protein-protein docking.
To overcome this problem we have built up the JAIL database. Since interacting domains exhibit structural features similar to those of proteins, all known interfaces between interacting domains of the SCOP database were extracted and classified in JAIL.
Only a part of all protein structures is included in SCOP. In particular, new PDB entries are not yet annotated. To overcome this problem, all interfaces between protein chains were additionally calculated and included in the database. This type of interface also comprises the interacting parts of the assumed biological units. The last important type of interface provided here is composed of the interacting parts between proteins and nucleic acids. Overall, the data set consists of about 180 000 interfaces.
JAIL is a convenient tool to browse through the interface library and to analyse single interfaces. However, more general questions require large-scale analysis. For this purpose, a detailed form enables the compilation of comprehensive non-redundant data sets for download.
How is an interface defined?
A complete residue is part of an interface if at least one atom of the amino acid is located within 4.5 Ångström of any atom of the interacting domain or chain. One part of an interface must consist of at least 5 C-alpha atoms in the case of protein chains. In the case of nucleic acids, the relevant atoms are calculated on the basis of the phosphorus atoms of the RNA/DNA backbone.
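The distance criterion above can be illustrated with a minimal Python sketch. It reads the cutoff as 4.5 Å and treats each chain as a plain dictionary mapping a residue number to a list of atom coordinates; that data layout, the function names and the toy coordinates are assumptions made for this example, not the JAIL implementation.

```python
import math


def interface_residues(chain_a, chain_b, cutoff=4.5):
    """Return residues of chain_a with at least one atom within `cutoff` Å of any chain_b atom."""
    atoms_b = [atom for atoms in chain_b.values() for atom in atoms]
    selected = []
    for residue_id, atoms in chain_a.items():
        # The whole residue is included as soon as one of its atoms is close enough.
        if any(math.dist(a, b) <= cutoff for a in atoms for b in atoms_b):
            selected.append(residue_id)
    return selected


def has_enough_residues(residues, minimum=5):
    """One side of a protein interface needs at least 5 C-alpha atoms (one per residue)."""
    return len(residues) >= minimum


# Toy coordinates in Å, invented for illustration.
chain_a = {
    1: [(0.0, 0.0, 0.0)],                   # 4.0 Å from the partner atom
    2: [(3.0, 0.0, 0.0), (1.0, 0.0, 0.0)],  # first atom 5.0 Å away, second about 4.1 Å
    3: [(10.0, 0.0, 0.0)],                  # far away
}
chain_b = {1: [(0.0, 4.0, 0.0)]}

side_a = interface_residues(chain_a, chain_b)
```

The all-against-all loop is quadratic in the number of atoms, which is acceptable for a sketch; a large-scale calculation would normally use a spatial index such as a k-d tree.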
What are biounits?
The primary coordinate file deposited in the PDB generally contains one asymmetric unit. The asymmetric unit is the smallest portion of a crystal structure to which crystallographic symmetry can be applied to generate one unit cell. The biological molecule (biounit) is believed to be the functional unit of the protein. Frequently those units can be assumed or calculated when additional information is available. The biological units of many proteins are deposited in a separate section of the PDB database and can be used for interface calculations.
In which way are redundant interfaces excluded in the download section?
Redundancy is excluded in two different ways: by structure and by sequence. The sequence-based clustering uses the CD-HIT program. The structural clustering is defined by the protein families and superfamilies of the SCOP classification. The database classifies proteins by domain architecture.
Which settings in the download section are best for my own research?
The selection of the datasets depends on the type of interactions (protein-protein or protein-nucleic acid) and the level of diversity that is desired. A maximum sequence identity of 50% results in a higher diversity than a setting of 95%. The default settings include interfaces of domain-domain interactions as well as interfaces between interacting chains. All interfaces of chains that are already covered by the SCOP domain interfaces are excluded by default. This procedure results in a high number of interfaces that are still diverse enough for statistical analysis.
What is meant by 'show conservation' in Jmol?
The conservation of protein sequences is defined by the mutation rates at each amino acid position. For JAIL this information was retrieved from ConSurf. ConSurf is a derived database merging structural and sequence information.
Database scheme
Search options
Search for: PDB-Id (e.g. 1aay), SCOP-Id (e.g. d1az0a_), EC-Number (e.g. 311), Accession number (e.g. P03697), Protein name (e.g. capsid protein)
Search in the following interface types: Domain/Domain (SCOP), Chain/Chain, Protein/Nucleic, Biounit/Biounit, None
Further searches: full-text keyword search, SCOP text keyword search
Consider only interfaces having the following location: intra, inter, don't care
MMDB
Experimentally resolved structures of proteins, RNA and DNA, derived from the Protein Data Bank (PDB), with value-added features such as explicit chemical graphs, computationally identified 3D domains (compact substructures) that are used to identify similar 3D structures, as well as links to literature, similar sequences, information about chemicals bound to the structures, and more. These connections make it possible, for example, to find 3D structures for homologs of a protein sequence of interest, then interactively view the sequence-structure relationships, active sites, bound chemicals, journal articles and more.
Three-dimensional structures are now known within many protein families, and it is quite likely, in searching a sequence database, that one will encounter a homolog with known structure. The goal of Entrez's 3D-structure database is to make this information, and the functional annotation it can provide, easily accessible to molecular biologists. To this end Entrez's search engine provides three powerful features: (i) Sequence and structure neighbors: one may select all sequences similar to one of interest, for example, and link to any known 3D structures. (ii) Links between databases: one may search by term matching in MEDLINE, for example, and link to 3D structures reported in these articles. (iii) Sequence and structure visualization: identifying a homolog with known structure, one may view molecular-graphic and alignment displays to infer approximate 3D structure. In this article we focus on two features of Entrez's Molecular Modeling Database (MMDB) not described previously: links from individual biopolymer chains within 3D structures to a systematic taxonomy of organisms represented in molecular databases, and links from individual chains (and compact 3D domains within them) to structure neighbors, other chains (and 3D domains) with similar 3D structure. MMDB may be accessed at http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Structure
SUPERFAMILY is a database of structural and functional annotation for all proteins and genomes [1][2][3][4][5]. The SUPERFAMILY annotation is based on a collection of hidden Markov models which represent structural protein domains at the SCOP superfamily level [6]. A superfamily groups together domains which have an evolutionary relationship. The annotation is produced by scanning protein sequences from completely sequenced genomes against the hidden Markov models.
For each protein you can
Submit sequences for SCOP classification
View domain organisation sequence alignments and protein sequence details
For each genome you can
Examine superfamily assignments phylogenetic trees domain organisation lists and
networks
Check for over- and under-represented superfamilies within a genome
For each superfamily you can
Inspect SCOP classification, functional annotation, Gene Ontology annotation, InterPro abstract and genome assignments
Explore taxonomic distribution of a superfamily across the tree of life
All annotation models and the database dump are freely available for download to everyone
Purpose
SUPERFAMILY classifies amino acid sequences into known structural domains, especially into SCOP superfamilies. The superfamilies are groups of proteins which have structural evidence to support a common evolutionary ancestor but may not have detectable sequence homology.
Major Features
Sequence search: Submit your protein or DNA sequence for SCOP superfamily and family level classification.
Keyword search: Search for superfamily, family or species names, plus sequence, SCOP, PDB or hidden Markov model IDs.
Domain assignments: Domain assignments, alignments and architectures for completely sequenced eukaryotic and prokaryotic organisms, plus sequence collections.
Comparative genomics tools: Browse unusual (over- and under-represented) superfamilies and families, adjacent domain pair lists and graphs, unique domain pairs, domain combinations, domain architecture co-occurrence networks and domain distribution across taxonomic kingdoms for each organism.
Genome statistics: For each genome: number of sequences, number of sequences with assignment, percentage of sequences with assignment, percentage total sequence coverage, number of domains assigned, number of superfamilies assigned, number of families assigned, average superfamily size, percentage produced by duplication, average sequence length, average length matched, number of domain pairs and number of unique domain architectures.
Gene Ontology: Domain-centric Gene Ontology (GO), automatically annotated; by Hai Fang.
Phenotype Ontology: Domain-centric phenotype/anatomy ontology, including Disease Ontology, Human Phenotype, Mouse Phenotype, Worm Phenotype, Yeast Phenotype, Fly Phenotype, Fly Anatomy, Zebrafish Anatomy, Xenopus Anatomy and Arabidopsis Plant.
Superfamily annotation: InterPro abstracts for 1052 superfamilies and Gene Ontology (GO) annotation for 763 superfamilies.
Functional annotation: Functional annotation of SCOP 1.73 superfamilies by Christine Vogel.
Phylogenetic trees: Trees are generated using heuristic parsimony methods and are based on protein domain architecture data for all genomes in SUPERFAMILY. Genome combinations or specific clades can be displayed as individual trees.
Similar domain architectures: Find the 10 domain architectures which are most similar to a domain architecture of interest.
Hidden Markov models: Produce SCOP domain assignments for your sequences using the SUPERFAMILY models. HMM visualisation by Martin Madera, e.g. model 0045110.
Profile comparison: Find remote domain matches when the HMM search fails to find a significant match. Profile comparison (PRC) for aligning and scoring two profile hidden Markov models, by Martin Madera.
Web services: Distributed Annotation Server and linking to SUPERFAMILY.
Downloads: Sequences, assignments, models, MySQL database and scripts, updated weekly.
The SUPERFAMILY database provides protein domain assignments, at the SCOP superfamily level, for the predicted protein sequences in over 400 completed genomes. A superfamily groups together domains of different families which have a common evolutionary ancestor based on structural, functional and sequence data. SUPERFAMILY domain assignments are generated using an expert-curated set of profile hidden Markov models. All models and structural assignments are available for browsing and download from http://supfam.org. The web interface includes services such as domain architectures and alignment details for all protein assignments, searchable domain combinations, domain occurrence network visualization, detection of over- or under-represented superfamilies for a given genome by comparison with other genomes, assignment of manually submitted sequences and keyword searches. In this update we describe the SUPERFAMILY database and outline two major developments: (i) incorporation of family level assignments and (ii) a superfamily-level functional annotation. The SUPERFAMILY database can be used for general protein evolution and superfamily-specific studies, genomic annotation, and structural genomics target suggestion and assessment.
HMMER3 reports a posterior probability for each residue that matches a match or insert state in the profile HMM. A high posterior probability shows that the alignment of the amino acid to the match/insert state is likely to be correct, whereas a low posterior probability indicates that there is alignment uncertainty. This is indicated on a scale running from 10, the highest certainty, down to 1, complete uncertainty. Within Pfam we display this information as a heat map view, where green residues indicate high posterior probability and red ones indicate a lower posterior probability.
Repeat
A short unit which is unstable in isolation but forms a stable structure when
multiple copies are present
Seed alignment
An alignment of a set of representative sequences for a Pfam entry. We use this alignment to construct the HMMs for the Pfam entry.
Sequence score
The total score of a sequence aligned to an HMM. If there is more than one domain, the sequence score is the sum of all the domain scores for that Pfam entry. If there is only a single domain, the sequence score and the domain score for the protein will be identical. We use the sequence score to determine whether a sequence belongs to the full alignment of a particular Pfam entry.
Trusted cutoff (TC)
The bit score of the lowest-scoring match in the full alignment.
SMART (a Simple Modular Architecture Research Tool) allows the identification and annotation of genetically mobile domains and the analysis of domain architectures. More than 500 domain families found in signalling, extracellular and chromatin-associated proteins are detectable. These domains are extensively annotated with respect to phyletic distributions, functional class, tertiary structures and functionally important residues. Each domain found in a non-redundant protein database, as well as search parameters and taxonomic information, are stored in a relational database system. User interfaces to this database allow searches for proteins containing specific combinations of domains in defined taxa.
Features
You can use SMART in two different modes: normal or genomic. The main difference is in the underlying
protein database used. In Normal SMART, the database contains Swiss-Prot, SP-TrEMBL and stable
Ensembl proteomes. In Genomic SMART, only the proteomes of completely sequenced genomes are used:
Ensembl for metazoans and Swiss-Prot for the rest.
The protein database in Normal SMART has significant redundancy, even though identical proteins are
removed. If you use SMART to explore domain architectures, or want to find exact domain counts in various
genomes, consider switching to Genomic mode. The numbers in the domain annotation pages will be more
accurate, and there will not be many protein fragments corresponding to the same gene in the architecture
query results. We should remember, though, that we are then exploring a limited set of genomes.
Different color schemes are used so that the current mode is easy to identify.
Normal mode Genomic mode
Alignment
Representation of a prediction of the amino acids in tertiary structures of homologues that overlay in three dimensions. Alignments held by SMART are mostly based on published observations (see domain annotations for details), but are updated and edited manually.
Alignment block
Ungapped alignments that usually represent a single secondary structure
Bits scores
Alignment scores are reported by HMMer and BLAST as bits scores. The likelihood that the query sequence is a bona fide homologue of the database sequence is compared to the likelihood that the sequence was instead generated by a random model. Taking the logarithm (to base 2) of this likelihood ratio gives the bits score.
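The definition above can be illustrated with a minimal sketch. The function name and the two likelihood arguments are hypothetical; real tools such as HMMer or BLAST compute these likelihoods internally from their own models, so this only shows the final log-odds step.

```python
import math


def bits_score(likelihood_homologue: float, likelihood_random: float) -> float:
    """Return log2 of the likelihood ratio (homology model vs. random model)."""
    if likelihood_homologue <= 0 or likelihood_random <= 0:
        raise ValueError("likelihoods must be positive")
    return math.log2(likelihood_homologue / likelihood_random)


# A sequence that is 2**10 times more likely under the homology model
# than under the random model scores 10 bits.
score = bits_score(2.0 ** -20, 2.0 ** -30)
```

A positive score means the homology model explains the sequence better than the random model; a negative score means the opposite.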
BLAST (Basic Local Alignment Search Tool)
An excellent database searching tool developed at the National Center for Biotechnology Information (NCBI) ([1] [2] [3] [4]). SMART uses NCBI-BLAST for detection of outlier homologues and homologues of known structure. WU-BLAST is used for nrdb searches with user-supplied sequences.
Cellular Role
Chromatin-associated: Chromatin is the tangled fibrous complex of DNA and protein within a eukaryotic nucleus.
Interaction (with the environment): Molecules that sense cellular environmental change, such as osmolarity, light flux, acidity, ion concentration, etc.
Metabolic: Enzymes that catalyze reactions in living cells that transform organic molecules.
Replication: The process of making an identical copy of a section of duplex (double-stranded) DNA, using existing DNA as a template for the synthesis of new DNA strands.
Signalling: Proteins that participate in a pathway that is initiated by a molecular stimulus and terminated by a cellular response.
Transport: Transporters (permeases) are proteins that straddle the cell membrane and carry specific nutrients, ions, etc. across the membrane.
Translation: The process in which the genetic code carried by messenger RNA directs the synthesis of proteins from amino acids.
Transcription: The synthesis of an RNA copy from a sequence of DNA (a gene); the first step in gene expression.
Coiled coils
Intimately-associated bundles of long alpha-helices ([1] [2] [3]). Coiled coils are detected in SMART using the method of Lupas et al. ([4], COILS home). Coiled coil predictions are indicated on the second line in SMART's graphical output.
Domain
Conserved structural entities with distinctive secondary structure content and a hydrophobic core. In small disulphide-rich and Zn2+-binding or Ca2+-binding domains, the hydrophobic core may be
provided by cystines and metal ions, respectively. Homologous domains with common functions
usually show sequence similarities.
Domain composition
Proteins with the same domain composition have at least one copy of each of the domains of the query.
Domain organisation
Proteins having all the domains of the query in the same order (additional domains are allowed).
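The two definitions above differ only in whether order matters. A minimal sketch of one possible reading follows; the function names and the representation of a protein as an ordered list of domain names are assumptions made for illustration, not SMART's actual implementation.

```python
def same_composition(query: list[str], protein: list[str]) -> bool:
    """True if the protein has at least one copy of every query domain."""
    return set(query) <= set(protein)


def same_organisation(query: list[str], protein: list[str]) -> bool:
    """True if the query domains occur in the protein in the same order.

    Extra domains in the protein are allowed, so this is a subsequence test.
    """
    remaining = iter(protein)
    # `domain in remaining` advances the iterator, so each query domain
    # must be found after the previous one.
    return all(domain in remaining for domain in query)


query = ["SH3", "SH2", "TyrKc"]
with_extra_domain = ["SH3", "PH", "SH2", "TyrKc"]
reordered = ["SH2", "SH3", "TyrKc"]
```

Here `with_extra_domain` satisfies both tests, while `reordered` has the same composition but a different organisation.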
Entrez
A WWW-based system that allows easy retrieval of sequence, structure, molecular biology and literature data (Entrez). SMART's domain annotation pages contain links to the Entrez system, thereby providing extensive literature, structure and sequence information.
E-value
This represents the number of sequences with a score greater than or equal to X expected absolutely by chance. The E-value connects the score (X) of an alignment between a user-supplied sequence and a database sequence, generated by any algorithm, with how many alignments with similar or greater scores would be expected from a search of a random sequence database of equivalent size. Since version 2.0, E-values are calculated using Hidden Markov Models, leading to more accurate estimates than before.
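To make the link between score and database size concrete, here is a minimal sketch using the standard BLAST-style (Karlin-Altschul) relation between a normalised bit score and the expected number of chance hits, E = m · n · 2^(−S). This is an illustration only: SMART's own E-values come from its HMM searches, and the function and parameter names below are hypothetical.

```python
import math


def expected_chance_hits(bit_score: float, query_length: int, database_length: int) -> float:
    """BLAST-style E-value: search space size times 2**(-bit score)."""
    return query_length * database_length * 2.0 ** (-bit_score)


def p_value_from_e_value(e_value: float) -> float:
    """Probability of at least one chance hit, assuming Poisson-distributed hits."""
    return -math.expm1(-e_value)


# With a search space of 2**10 * 2**10 positions, a 20-bit score is
# expected once by chance; each extra bit halves that expectation.
e_20 = expected_chance_hits(20, 2 ** 10, 2 ** 10)
e_21 = expected_chance_hits(21, 2 ** 10, 2 ** 10)
```

The same score is therefore less significant in a larger database, which is why E-values are reported relative to database size.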
Extracellular Domains
Domain families that are most prevalent in proteins outside of the cytoplasm and the nucleus
Gap
A position in an alignment that represents a deletion within one sequence relative to another. Gap penalties are requirements for alignment algorithms in order to reduce excessively-gapped regions. Gaps in alignments represent insertions that usually occur in protruding loops or beta-bulges within protein structures.
Genomic database
Protein database used in SMART's Genomic mode. It contains data from completely sequenced genomes only. Ensembl data is used for Metazoan genomes and Swiss-Prot for others. A complete list of genomes in the database is available.
HMM (Hidden Markov model)
HMMs are statistical models of the sequence consensus of a homologous family (see the docu). A particular class of HMMs has been shown to be equivalent to generalised profiles (8867839).
Applications of HMMs to sequence analysis are nicely provided by HMMer and SAM.
HMM consensus
The HMM consensus is a one-line summary of the corresponding HMM. The amino acid shown for the consensus is the highest probability amino acid at that position according to the HMM. Capital letters mean highly conserved residues (probability > 0.5 for protein models) (modified from the HMMer User's Guide).
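The rule above is simple enough to sketch directly. The input format (one dictionary of emission probabilities per match position) and the function name are assumptions for illustration; HMMer derives the consensus from its own model files.

```python
def hmm_consensus(match_emissions: list[dict[str, float]], threshold: float = 0.5) -> str:
    """Build a one-line consensus from per-position emission probabilities.

    The most probable residue is reported at each position, in upper case
    when its probability exceeds the threshold and in lower case otherwise.
    """
    letters = []
    for probabilities in match_emissions:
        residue, probability = max(probabilities.items(), key=lambda item: item[1])
        letters.append(residue.upper() if probability > threshold else residue.lower())
    return "".join(letters)


# Three toy positions: strongly conserved A, weakly preferred L, conserved K.
toy_model = [
    {"A": 0.70, "G": 0.30},
    {"L": 0.40, "I": 0.35, "V": 0.25},
    {"K": 0.51, "R": 0.49},
]
consensus = hmm_consensus(toy_model)
```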
HMMer
The HMMer package ([1] [2]) provides multiple alignment and database searching capabilities. There are several programs in the package (see the docu), including one (hmmfs) that searches databases for non-overlapping LOCAL similarities (i.e. that match across at least part of the HMM) and another (hmmls) that searches databases for non-overlapping GLOBAL similarities (i.e. that match across the full HMM). These correspond approximately to profile-based searches using negative and positive profiles, respectively (see WiseTools). Database searches using hmmls or hmmfs provide alignment scores as bits scores.
Homology
Evolutionary descent from a common ancestor due to gene duplication
Intracellular Domains
Domain families that are most prevalent in proteins within the cytoplasm
Localisation
Numbers of domains that are thought, from SwissProt annotations, to be present in different cellular compartments (cytoplasm, extracellular space, nucleus and membrane-associated) are shown in annotation pages.
Motif
Sequence motifs are short conserved regions of polypeptides. Sets of sequence motifs need not necessarily represent homologues.
NRDB (non-redundant database)
A database that contains no identical pairs of sequences. It can contain multiple sequences originating from the same gene (fragments, alternative splicing products). SMART in Standard mode uses such a database.
ORF
Open reading frame
Outlier homologues
These are often difficult to detect using HMM methodology. A complementary approach to their
detection is to query a database of sequences taken from multiple sequence alignments using BLAST. Selecting this option will also activate searches against sequence databases derived from proteins of known structure. A simple BLAST search of the PDB is performed, together with a search of RPS-Blast profiles derived from SCOP. These profiles were kindly provided by Steffen Schmidt (see Schmidt et al., J Chem Inf Comput Sci 2002 (42) 405-7).
P-value
This represents a probability that, given a database of a particular size, random sequences score higher than a value X. P-values are generated by the BLAST algorithm that has been integrated into SMART.
PDB (protein data bank)
PDB is an archive of experimentally-determined three-dimensional structures (Brookhaven Natl. Labs or EBI). Domain families represented in SMART and in the PDB are annotated as being of known structure; links are provided in SMART to the PDB via PDBsum and MMDB. PDBsum links can be used to access a variety of sequence-based and structure-based tools, whereas MMDB provides access to literature information and structure similarities.
PFAM
Pfam is a database of protein domain families represented as (i) multiple alignments and (ii) HMM-profiles ([1] [2]). Pfam WWW servers allow comparison of user-supplied sequences with the Pfam database (Sanger Center and Washington Univ). SMART contains a facility to search the Pfam collection using HMMer.
Prokaryotic domains
SMART now also searches for domains found in two-component regulatory systems. These can be found mainly in prokaryotes, but a few were also found in eukaryotes like yeast and plants.
Profile
A profile is a table of position-specific scores and gap penalties, representing a homologous family, that may be used to search sequence databases (Ref [1] [2] [3]). In CLUSTAL-W-derived profiles, those sequences that are more distantly related are assigned higher weights ([4] [5] [6]). Issues in profile-based database searching are discussed in Bork & Gibson (1996) [7].
ProfileScan
An excellent WWW server that allows a user to compare a protein or DNA sequence against a database of profiles (located at the ISREC).
PROSITE
This is a dictionary of protein sites and motif patterns. Some SMART domain annotations contain links to PROSITE.
Schnipsel database (domain sequence database)
Schnipsel is a German word meaning snippet or fragment. The schnipsel database consists of the sequences of all domains found with SMART in NRDB. Outliers of a family often cannot be detected by a profile, yet are detectable by pairwise similarity to one or more established members of a sequence family. Searching against the schnipsel database therefore gives complementary information to the profile searches.
Searched domains
In the first version of SMART, only eukaryotic signalling domains could be searched. In 1998 we extended this set with prokaryotic signalling and extracellular domains. On the input page you can choose whether you want to search only cytoplasmic (prokaryotic + eukaryotic), extracellular or all domains.
Secondary Literature
The secondary literature is derived by the following procedure. For each of the hand-selected papers referenced by a domain, 100 neighbouring papers are retrieved using Medline. If one of these neighbouring papers is referenced from more than two original papers, it is included in the secondary literature list.
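The selection rule in this procedure can be sketched as a simple counting step. The input mapping (hand-selected paper to its list of neighbouring papers) stands in for the Medline neighbour lookup, which is not reproduced here; the function name and identifiers are hypothetical.

```python
from collections import Counter


def secondary_literature(
    neighbours_by_paper: dict[str, list[str]],
    max_neighbours: int = 100,
    min_sources: int = 3,
) -> list[str]:
    """Return neighbouring papers linked from more than two original papers."""
    counts: Counter[str] = Counter()
    for neighbours in neighbours_by_paper.values():
        # Count each neighbour at most once per original paper.
        counts.update(set(neighbours[:max_neighbours]))
    return sorted(paper for paper, n in counts.items() if n >= min_sources)


# Toy identifiers: only "a" is a neighbour of three hand-selected papers.
toy_neighbours = {
    "p1": ["a", "b", "c"],
    "p2": ["a", "b"],
    "p3": ["a", "c"],
}
selected = secondary_literature(toy_neighbours)
```

"More than two" is read here as "at least three", which is why `min_sources` defaults to 3.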
Seed Alignment
Alignment that contains only one of each pair of homologues that are represented in a CLUSTALW-derived phylogenetic tree linked by a branch of length less than a distance of 0.2 (see the related article).
SEG
A program of Wootton & Federhen [1] that detects regions of the query sequence that have low compositional complexity [2].
Sequence ID or ACC
Sequence identifiers or accession codes may be entered via the SMART homepage to initiate a query. You can use either Uniprot or Ensembl sequence identifiers.
Signalling Domains
The original set of domains used in SMART were collected as those that satisfied one or both of two criteria:
1. cytoplasmic domains that possess kinase, phosphatase, ubiquitin ligase or phospholipase enzymatic activities, or those that stimulate GTPase-activation or guanine nucleotide exchange
2. cytoplasmic domains that occur in at least two proteins with different domain
organisations, of which one also contains a domain that satisfies criterion 1
These domains mostly mediate or regulate the transduction of an extracellular signal towards the nucleus, resulting in the initiation of a cellular response. More recently, prokaryotic two-component signalling domains have been added to the SMART set.
SignalP
This program predicts the presence and location of signal peptide cleavage sites in amino acid sequences (SignalP home page).
SMART (Simple Modular Architecture Research Tool)
Our main goal in providing this tool is to allow automatic identification and annotation of domains in user-supplied protein sequences.
Species
Numbers of domains present in a variety of selected taxa (animal, archaea, bacteria, fungi, plants and protozoa) are shown in annotation pages.
SwissProt
The SwissProt database is an extensively annotated and non-redundant collection of protein sequences. SwissProt annotations have been mined for SMART-derived annotations of alignments.
TMHMM2
This program predicts the location and topology of transmembrane helices (TMHMM2).
Thresholds
For each of the domains found by SMART, a combination of thresholds is used to distinguish between true and false hits. The different thresholds are described in the SMART paper.
WiseTools
A package that is based on database searches using profiles. Profiles may be generated using PairWise and then compared with sequence databases using SearchWise. Scores are generated for alignments that match the whole of the profile (using a positive profile) or else that match at least part of the profile (using a negative profile). Only the top-scoring optimal alignment of each
sequence is reported; hence SMART relies on re-iterating the search for new repeats until none are reported that score above threshold. Score thresholds have been set manually that are considered to represent a score just above the top-scoring true negative. Additional thresholds have been estimated for domains that are repeated in single polypeptides. A more recent package allows comparison of DNA sequences at the level of their conceptual translations, regardless of sequence error and introns (see Wise2).
What's new
Changes from version 6.0 to 7.0
Full text search engine
Perform a full text search of SMART and Pfam domain annotations, plus the complete protein descriptions for Uniprot and Ensembl proteins.
metaSMART
Explore and compare domain architectures in various publicly available metagenomics datasets.
iTOL export and visualization
Domain architecture analysis results can be exported and visualized in interactive Tree Of Life. The new option can be found in the protein list function select list.
User interface cleanup
Various small changes to the UI, resulting in faster and easier navigation.
Changes from version 5.1 to 6.0
Metabolic pathways information
SMART domains and proteins available in the genomic mode now have basic metabolic pathways information. It is generated by mapping our genomic mode protein database to the KEGG orthologous groups.
Updated genomic mode protein database
The greatly expanded genomic mode database now includes sequences from 630 completely sequenced species.
Changes from version 5.0 to 5.1
SMART webservice
You can access SMART using our webservice. Check the WSDL file for details. The webservice is still under active development and only works for sequences and IDs which are in our database (complete Uniref100 and stable Ensembl genomes).
SMART DAS server
SMART DAS server is available at URL http://smart.embl.de/smart/das. It provides all the protein features from our database (SMART domains, signal peptides, transmembrane regions and coiled coils) for all Uniref100 and Ensembl proteins.
If you need help with these services, or have questions or feedback, please contact us.
Changes from version 4.1 to 5.0
New protein database in Normal mode
SMART now uses Uniprot as the main source of protein sequences. All Ensembl proteomes (except pre-releases) are also included. To lower redundancy in the database, the following procedure is used:
o only one copy of 100% identical proteins is kept (different IDs are still available)
o each species' proteins are separated
o CD-HIT clustering with 96% identity cutoff is performed on each species separately
o longest member of each protein cluster is used as the representative
o only representative cluster members and single proteins (i.e. proteins which are not
members of any clusters) are used in all domain architecture queries and for domain counts in the annotation pages
Even though the number of proteins in the database is almost doubled (current version has around 2.9 million proteins), the redundancy should be minimal.
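Two steps of the redundancy-reduction procedure lend themselves to a short sketch: collapsing identical sequences while keeping their IDs, and choosing the longest member of each cluster as the representative. The clustering itself (CD-HIT in the text) is not reproduced; clusters are taken as given input, and all names below are hypothetical.

```python
def collapse_identical(records: list[tuple[str, str]]) -> dict[str, list[str]]:
    """Map each distinct sequence to all protein IDs that share it."""
    ids_by_sequence: dict[str, list[str]] = {}
    for protein_id, sequence in records:
        ids_by_sequence.setdefault(sequence, []).append(protein_id)
    return ids_by_sequence


def cluster_representatives(clusters: list[list[tuple[str, str]]]) -> list[str]:
    """Return the ID of the longest sequence in each cluster.

    A cluster with a single member is its own representative.
    """
    return [max(cluster, key=lambda member: len(member[1]))[0] for cluster in clusters]


records = [("P1", "MKTAYIAK"), ("P2", "MKTAYIAK"), ("P3", "MKT")]
unique = collapse_identical(records)

# Clusters as they might come from an external per-species clustering run.
clusters = [[("P1", "MKTAYIAK"), ("P3", "MKT")], [("P9", "MSE")]]
representatives = cluster_representatives(clusters)
```

Keeping the ID list per sequence mirrors the note that different IDs remain available after identical copies are merged.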
Domain architecture invention dating
As a further step from the single domain to the understanding of multi-domain proteins, SMART now predicts the taxonomic class where the concept of a protein, that is, its domain architecture, was invented. The domain architecture is defined as the linear order of all SMART domains in the protein sequence. To derive the point of its invention, all proteins with the same domain architecture are mapped onto NCBI's taxonomy. The last common ancestor of all organisms containing at least one protein with the domain architecture is defined as the point of its origin.
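The last-common-ancestor step can be sketched as a longest-common-prefix computation over root-to-species lineages. The lineages below are heavily simplified and purely illustrative, and the function name is hypothetical; a real implementation would read lineages from the NCBI taxonomy.

```python
from typing import Optional


def last_common_ancestor(lineages: list[list[str]]) -> Optional[str]:
    """Return the deepest taxon shared by all root-to-leaf lineages."""
    if not lineages:
        raise ValueError("at least one lineage is required")
    shared = None
    for taxa_at_depth in zip(*lineages):
        if all(taxon == taxa_at_depth[0] for taxon in taxa_at_depth):
            shared = taxa_at_depth[0]
        else:
            break
    return shared


# Simplified, illustrative lineages for three organisms that carry
# a protein with the same domain architecture.
human = ["Eukaryota", "Opisthokonta", "Metazoa", "Chordata", "Homo sapiens"]
fly = ["Eukaryota", "Opisthokonta", "Metazoa", "Arthropoda", "Drosophila melanogaster"]
yeast = ["Eukaryota", "Opisthokonta", "Fungi", "Saccharomyces cerevisiae"]

origin_animals = last_common_ancestor([human, fly])
origin_all = last_common_ancestor([human, fly, yeast])
```

Adding a more distantly related organism can only move the inferred point of origin towards the root, never away from it.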
Changes from version 4.0 to 4.1
Two modes of operation: Normal and Genomic
For more details visit the change mode page.
Intrinsic protein disorder prediction
You can now include DisEMBL prediction of protein disorder in your search parameters. The results are displayed as blue regions in protein schematics. DisEMBL's HOTLOOPS and REM465 methods are currently used. Visit the DisEMBL page for more information on the method.
Catalytic activity check
SMART now includes data on specific requirements for catalytic activity for some of our domains (50 catalytic domains at the moment, full list here). If the required amino acids are not present in the predicted domain, it will be marked as Inactive. The domain annotation page will show you the details on which amino acids are missing and the links to relevant literature (Pubmed link).
Taxonomic trees
Architecture query results are now displayed as simple taxonomic trees. In addition to individual proteins, you can select any taxonomic node (or multiple nodes) and display all the proteins in those nodes. The Evolution section of domain annotation pages is now also represented as a taxonomic tree.
The basic tree contains only several hand-picked representative species, with a link to the full tree.
User interface redesigned
SMART has been completely rewritten and all pages conform to the XHTML 1.0 Strict and CSS Level 2 standards. We recommend a modern standards-compliant browser for the best experience. Mozilla Firefox and Opera are our favorites.
Changes from version 3.5 to 4.0
Intron positions are shown in protein schematics
For proteins that match any of the Ensembl predictions, SMART will show intron positions as vertical coloured lines in graphical representations (see example). This information is retrieved from a pre-calculated mapping of Ensembl gene structures to protein sequences.
The vertical line at the end of the protein is not an actual intron, but a mark to show that intron mapping was performed. If that is the only line, there are no introns annotated. If there is no line at all, there is no data available in Ensembl for that particular sequence.
You can switch off intron display on your SMART preferences page.
Alternative splicing information
Since SMART now incorporates Ensembl genomes, the Additional information page shows a list of alternative splices of the gene encoding the analyzed protein (if there are any). It is possible either to display SMART protein annotation for any of the alternative splices, or to get a graphical multiple sequence alignment of all of them.
Orthology information
There are 2 separate sets of orthologs for each Ensembl protein: 1:1 reciprocal best matches in other genomes, and orthologous groups with reciprocal best hits from all genomes analyzed (i.e. each of these proteins has exactly one ortholog in all 6 genomes). This data is displayed on the Additional information page.
Graphical multiple sequence alignments
Orthologous groups and different alternative splices can be displayed as graphical multiple
sequence alignments. Proteins are aligned using ClustalW. Domains, intrinsic features and introns are mapped onto the alignment, with their positions adjusted according to gaps (black boxes).
Changes from version 3.4 to 3.5
Features of all Ensembl genomes are stored in SMART
You can use standard Ensembl protein identifiers (for example ENSP00000264122) and sequences in all queries.
Batch access
Quickly display results for hundreds of sequences. Currently limited to 500 sequences per access, 2000 per day.
Smarter handling of identical sequences
If there are multiple IDs associated with a particular sequence, an extra table will be displayed
showing all of them.
Changes from version 3.3 to 3.4
Search structure based profiles using RPS-Blast
Clicking on the search schnipsel and structures checkbox will now also initiate a search of profiles based on SCOP domain families using RPS-Blast. These profiles were kindly provided by Steffen Schmidt (see Schmidt et al., J Chem Inf Comput Sci 2002 (42) 405-7).
Improved Architecture analysis
In addition to standard Domain selection querying, it is now possible to do queries based on GO (Gene Ontology, click here for more info) terms associated with domains. In the first step, you get a list of domains matching the GO terms entered. After selecting the domains of interest from the list, proteins containing those domains are displayed. Use the Taxonomic selection box to limit the results based on taxonomic ranges.
Pfam domains are stored in the database
SMART database now contains precomputed results for all Pfam domains. To use Pfam domains in the architecture queries, prepend the domain name with Pfam: (for example TyrKc AND Pfam:Fz AND TRANS).
Try our SMART Toolbar for Mozilla web browser. Click here for more info.
Changes from version 3.2 to 3.3
Fantastic new protein picture generator
Proteins are now displayed as dynamically generated PNG images. This means that you can download the entire protein representation as a single image. You have our permission to do this and use these diagrams in any way that you like; do acknowledge us, though. All domain bubbles have been script-generated using The Gimp and its Perl-Fu extension. The script is GPL'd and is available for download.
Fancy colour alignments
All alignments we generate are coloured in using Leo Goodstadt's excellent CHROMA program. CHROMA is available from here and it's great. You may experience some problems if you're using a klunky browser. This will be fixed when you change your browser.
Excellent transmembrane domains
These are now calculated using the fine TMHMM2 program (the main web site is here), kindly provided by Anders Krogh and co-workers. You can read about the method here.
Selective SMART
We now store intrinsic features such as transmembrane domains and signal peptides in our database. This means that these can be queried for (e.g. SIGNAL AND TyrKc). The feature names are: signal peptide, SIGNAL; transmembrane domain, TRANS; coiled coil, COIL.
Technical changes
SMART code has been modified to run under the Apache mod_perl module. SMART is now using Apache::DBI for persistent database connections. The database engine has been updated to PostgreSQL 7.1. These changes resulted in significant speed improvements.
Bug fixes, as usual
The show_many_proteins script now uses the POST method, so there is no longer a limit on the number of proteins you can display. Fixed a display problem with proteins having thousands of different representations (if you try to display those, make sure you have a good browser (like Mozilla or Opera :-) and a bunch of RAM).
Changes from version 3.1 to 3.2
Start page
There is an indicator of the current database status in the header now. If the database is down or is being updated, you'll know immediately.
Literature
Changed literature identifiers to PMID. New secondary literature generator that parses all neighbouring papers, not just the first 100.
Numerous small bug fixes and improvements
Changes from version 3.0 to 3.1
Startup page
The start page now includes selective SMART and allows searching for keywords in the annotation of domains.
Schnipsel Blast
The results of a schnipsel Blast search are included in the bubble diagram. Additionally, PDB is searched.
Taxonomic breakdown
When selecting multiple proteins, e.g. via selective SMART or the annotation pages, an overview of the taxonomy of all species (tax break) is offered.
Links
Links don't contain version numbers. This allows stable links from external sources.
Selective SMART
Allows searching for multiple copies of domains and is case insensitive. You can now search for e.g. Sh3 AND sH3 AND sh3.
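The behaviour described for Selective SMART (repeated terms request multiple copies, and names are matched case-insensitively) can be sketched as follows. This is an assumed, simplified reading that supports only the AND operator; the function names and the representation of a protein as a list of feature names are hypothetical.

```python
import re
from collections import Counter


def parse_query(query: str) -> Counter:
    """Split an 'X AND Y AND X' query into required copy counts per domain."""
    terms = [term.strip().lower() for term in re.split(r"\s+AND\s+", query.strip())]
    return Counter(term for term in terms if term)


def protein_matches(query: str, features: list[str]) -> bool:
    """True if the protein has at least the requested number of each domain."""
    required = parse_query(query)
    present = Counter(feature.lower() for feature in features)
    return all(present[name] >= copies for name, copies in required.items())


three_sh3 = ["SH3", "SH2", "SH3", "SH3"]
two_sh3 = ["SH3", "SH2", "SH3"]
```

With these toy proteins, the query `Sh3 AND sH3 AND sh3` is read as "at least three SH3 domains", so it matches `three_sh3` but not `two_sh3`.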
Annotation
You can align your query sequence to the SMART alignment using hmmalign.
Update of underlying database
SMART now uses PostgreSQL 6.5.2.
Changes from version 2.0 to 3.0
Digest output
SMART now only produces a single diagram representing a best interpretation of all the
annotation that has been performed. A comprehensive summary of the results is also provided in table format.
Selective SMART
Selective SMART allows you to look for proteins with combinations of specific domains in different species or taxonomic ranges.
Alert SMART
The SMART database gets updated about once a week. If you are interested in specific domains, or combinations of them, in specific taxonomic ranges, you can use the SMART alerting service. This provides the identities of newly-deposited proteins that match your query.
Domain queries
You can ask for proteins having the same domain order or composition as your query protein.
SMARTed Genomes
You can get the result of SMARTing the genomes of Caenorhabditis elegans and Saccharomyces cerevisiae via the annotation pages.
Faster PFAM searches
The PFAM searches now run on a PVM cluster.
Changes from version 1.03 to 2.0
Domain coverage
The original set of signalling domains has now been extended to include extracellular domains.
Default search method is now HMMer
The previously-used underlying methodology of SMART (i.e. SWise) tended to over-extend gaps, leading to problems in defining domain borders. Although a SWise search of the SMART database remains possible (using the Wise searching SMART database option available from the Home Page), the default searching method is now HMMer2. This now allows improved statistical estimates (Expectation- or E-values) of the significance of a domain hit.
Complete rewrite of the annotation pages
As with the previous version (1.03), annotation pages provide information concerning domain functions and include hyperlinks to PubMed. However, we now offer automatically-derived data showing the taxonomic range and the predicted cellular localisation of proteins containing the domain in question.
Literature database
In addition to the manually derived literature sources given in version 1.03, we now provide automatically-derived secondary literature from the annotation pages. These are extracted from PubMed and are cross-linked to additional abstracts.
The CATH domain structure database: new protocols and
classification levels give a more comprehensive resource for exploring evolution
Lesley H. Greene, Tony E. Lewis, Sarah Addou, Alison Cuff, Tim Dallman, Mark Dibley, Oliver Redfern, Frances Pearl, Rekha Nambudiry, Adam Reid, Ian Sillitoe, Corin Yeats, Janet M. Thornton and Christine A. Orengo
WHAT'S NEW
We report the latest release (version 3.0) of the CATH protein domain
database (http://www.cathdb.info). There has been a 20% increase in the
number of structural domains classified in CATH, up to 86 151 domains.
Release 3.0 comprises 1110 fold groups and 2147 homologous
superfamilies. To cope with the increases in diverse structural
homologues being determined by the structural genomics initiatives,
more sensitive methods have been developed for identifying boundaries
in multi-domain proteins and for recognising homologues. The CATH classification update is now being driven by an integrated pipeline that links these automated procedures with validation steps that have been made easier by the provision of information-rich web pages summarising comparison scores and relevant links to external sites for each domain being classified. An analysis of the population of domains in the CATH hierarchy and several domain characteristics are presented for version 3.0. We also report an update of the CATH Dictionary
of homologous structures (CATH-DHS), which now contains multiple
structural alignments, consensus information and functional
annotations for 1459 well populated superfamilies in CATH. CATH is
directly linked to the Gene3D database, which is a projection of CATH
structural data onto ~2 million sequences in completed genomes and UniProt.
GENERAL INTRODUCTION
The number of new structures being deposited in the Protein Data Bank
(PDB) continues to grow at a considerable rate. In addition, structures
being targeted by worldwide structural genomics initiatives are more
likely to be novel or only very remotely related to domains previously
classified. Only 2% of structures currently solved by conventional
crystallography or NMR are likely to adopt novel folds (see Figures 1 and 2).
A higher proportion of new folds are expected to be solved by
structural genomics structures. Although the influx of more diverse
structures and subsequent analysis will inform our understanding of
how domains evolve, it has resulted in increasing lags between the
numbers of structures being deposited and classified. In response to this situation, we have significantly improved our automated and manual protocols for domain boundary assignment and homologue recognition.
Figure 1. Annual decrease in the percentage of new structures classified in CATH which are observed to possess a novel fold. The raw data for years 1972–2005 was fit to a single exponential equation by nonlinear regression using Sigma Plot (SPSS Version
Figure 2. Annual proportion of protein structures deposited in the PDB which are classified in CATH, rejected or pending classification. The colour scheme reflects different categories of PDB chains. Black: not accepted by the CATH criteria. Red: unprocessed chains
Significant changes have been implemented in the CATH classification protocol to achieve a more highly automated system. A seamless flow of structures between the constituent programs has been achieved by building a pipeline which integrates web services for each major comparison stage in the classification (see Figure 3). Secondly, completely automatic decisions are now being made for new protein chains with close relatives already assigned in the CATH database. There are two situations that preclude the CATH update process from being fully automated. We rely on expert manual curation for particularly challenging protein domain boundary assignments (DomChop stage) and also for classifications of remote folds and homologues (HomCheck stage). These two manual stages will remain an integral part of the system (Figure 3).
Figure 3. Flow diagram of the CATH classification pipeline. This schematic illustrates the processes involved in classifying newly determined structures in CATH. The CATH update protocol workflow, from new chain to assigned domain, is split into two main processes
In this paper we report our ongoing development of the automated procedures. These critical new features should better enable CATH to keep pace with the PDB (3)
and facilitate its development. Key statistics on the domain structure populations and characteristics are also presented.
A REVISED CATH CLASSIFICATION HIERARCHY: CATHSOLID
In order to provide information on the sequence diversity between superfamily members, we have introduced additional levels into the CATH hierarchy. The CATH hierarchical classification scheme now consists of nine levels. Class is derived from secondary structure content, and Architecture describes the gross orientation of secondary structures, independent of connectivity. The Topology level clusters structures into fold groups according to their topological connections and numbers of secondary structures. The Homologous superfamilies cluster proteins with highly similar structures, sequences and/or functions (4,5).
The new extension of the CATH classification system now includes five ‘SOLID’ sequence levels. S, O, L and I further divide domains within the H-level using multi-linkage clustering based on similarities in sequence identity (35, 60, 95 and 100%) (see Table 1). The D-level acts as a counter within the I-level and is appended to the classification hierarchy to ensure that every domain in CATH has a unique CATHsolid identification code (see Table 1). Specific details on the nature of the SOLID-levels can be found in the ‘General Information’ section of the CATH website, http://www.cathdb.info.
CATH only includes experimentally determined protein structures with a 4 Å resolution or better, 40 residues in length or longer, and having 70% or more side chains resolved.
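The nine-level code and the inclusion criteria described above can be illustrated with a minimal Python sketch. The dot-separated layout of the identifier, the example code string and both function names are assumptions made for illustration; the text only states that there are nine levels (C, A, T, H, S, O, L, I, D) and three inclusion thresholds.

```python
from typing import Dict

# The nine levels named in the text, in hierarchical order.
LEVELS = ("C", "A", "T", "H", "S", "O", "L", "I", "D")


def parse_cathsolid(code: str) -> Dict[str, int]:
    """Split a dot-separated nine-field code into named levels.

    The dotted layout is an assumption; the text does not spell out
    the exact string format of a CATHsolid identifier.
    """
    parts = code.split(".")
    if len(parts) != len(LEVELS):
        raise ValueError(f"expected {len(LEVELS)} fields, got {len(parts)}")
    return {level: int(value) for level, value in zip(LEVELS, parts)}


def meets_inclusion_criteria(resolution_angstrom: float,
                             n_residues: int,
                             fraction_side_chains_resolved: float) -> bool:
    """Apply the three thresholds quoted in the text:
    resolution of 4 A or better, at least 40 residues,
    and at least 70% of side chains resolved."""
    return (resolution_angstrom <= 4.0
            and n_residues >= 40
            and fraction_side_chains_resolved >= 0.70)


if __name__ == "__main__":
    # Hypothetical identifier, for illustration only.
    print(parse_cathsolid("3.40.50.200.1.1.1.1.1"))
    print(meets_inclusion_criteria(2.5, 120, 0.95))
```

Keeping the levels in a single ordered tuple means a shorter prefix (for example only C.A.T.H) could be handled by the same approach if needed.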
Table 1. CATH version 3.0 statistics
DOMAIN BOUNDARY ASSIGNMENTS
We have further improved our automated domain boundary prediction method, CATHEDRAL (4). This is used to search a newly determined multi-domain structure against a library of representative structures from different fold groups in the CATH database to recognise constituent domains. CATHEDRAL performs an initial rapid
secondary structure comparison between structures, using graph theory to identify putative fold matches, which are then more carefully aligned using a slower, more accurate dynamic programming method. A new scoring scheme has been implemented which combines information on the class of the domains being compared, their sizes, similarity in the structural environments and the number of equivalent residues. A support vector machine is used to combine the different scores and select the best fold match for each putative domain in a new multi-domain protein.
Benchmarking against a set of 964 ‘difficult’ multi-domain chains, whose 1593 constituent domains were remotely related to folds in CATH (<35% sequence identity) and originated from 245 distinct fold groups and 462 superfamilies, showed that 90% of domains within these chains could be assigned to the correct fold group, and for 78% of them the domain boundaries were within ±15 residues of boundaries assigned by careful manual validation. Larger variations in domain boundaries are often due to the fact that, in many families, significant structural variation can occur during evolution, so that distant relatives vary considerably in size. If no close relative has been classified in CATH, it is likely that the only CATHEDRAL match will be to a relative with significant structural embellishments, thus making it harder to determine the correct boundaries.
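The ±15 residue agreement criterion used in the benchmark can be expressed as a short check. This is a sketch only: the function name, the representation of boundaries as lists of residue positions, and the pairing of boundaries in sorted order are assumptions, not details given in the text.

```python
from typing import Sequence


def boundaries_within_tolerance(predicted: Sequence[int],
                                reference: Sequence[int],
                                tolerance: int = 15) -> bool:
    """Return True if every predicted boundary lies within `tolerance`
    residues of the corresponding manually assigned boundary.

    Boundaries are paired in sorted order; a different number of
    boundaries counts as disagreement.
    """
    if len(predicted) != len(reference):
        return False
    return all(abs(p - r) <= tolerance
               for p, r in zip(sorted(predicted), sorted(reference)))


if __name__ == "__main__":
    # Hypothetical boundary positions for a three-domain chain.
    print(boundaries_within_tolerance([120, 250], [110, 262]))
    print(boundaries_within_tolerance([120, 250], [110, 270]))
```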
Since domain boundary assignment of remote homologues is one of the most time-consuming stages in the classification, we combine multiple sources of information for each new structure on a web page to guide manual curation. Pages display scores from a range of algorithms, which include structure-based methods, CATHEDRAL (4), SSAP (6,7), DETECTIVE (8), PUU (9) and DOMAK (10), sequence-based methods such as hidden Markov models (HMMs) (11), and relevant literature. These pages are now viewable for information on putative boundaries for new multi-domain structures currently
being classified in CATH (e.g. http://www.cathdb.info/cgi-bin/cath/Chain.pl?chain_id=2g3aA).
For protein chains which are closely related to chains that are already chopped in CATH, an automated protocol has been developed (ChopClose). ChopClose identifies
any previously chopped chains that have sufficiently high sequence identity and overlap with the query chain. Using SSAP (6), the query is aligned against each of these chains in turn, and in each case the domain boundaries are inherited across the alignment. The process of inheriting the boundaries often requires some adjustments to be made to account for insertions, deletions or unresolved residues. If the inheritance from one of the chains meets various criteria (SSAP score ≥80, sequence identity >80%, RMSD ≤6.0 Å, longest end extension ≤10 residues, etc.), then the resulting boundaries are used to chop the chain automatically (AutoChop). For cases where ChopClose's best result does not meet all the criteria for automatic chopping, it is provided as support information for a manual domain boundary assignment.
Refer to Figure 3 for the location of AutoChop/ChopClose within the CATH update protocol.
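The decision between automatic chopping and manual review can be sketched as a simple threshold test. The class and function names below are hypothetical, and only the four criteria listed explicitly in the text are checked; the source says "etc", so the real protocol applies further conditions that are not reproduced here.

```python
from dataclasses import dataclass


@dataclass
class InheritedChop:
    """Summary of boundaries inherited from one previously chopped chain.
    Field names are illustrative, not taken from the CATH code base."""
    ssap_score: float              # SSAP structural similarity score
    sequence_identity: float       # percent identity to the query chain
    rmsd: float                    # RMSD in angstroms
    longest_end_extension: int     # residues


def passes_autochop(chop: InheritedChop) -> bool:
    """Check the four criteria named in the text.
    Additional criteria ("etc" in the source) are not modelled."""
    return (chop.ssap_score >= 80
            and chop.sequence_identity > 80
            and chop.rmsd <= 6.0
            and chop.longest_end_extension <= 10)


if __name__ == "__main__":
    candidate = InheritedChop(ssap_score=91.2, sequence_identity=96.0,
                              rmsd=1.4, longest_end_extension=3)
    # A passing candidate would be chopped automatically; a failing one
    # would be passed on as support information for manual assignment.
    print("AutoChop" if passes_autochop(candidate) else "manual review")
```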
NEW HOMOLOGUE RECOGNITION METHODS
We have assessed a number of HMM-based protocols for improving homologue recognition. A new protocol (Samosa), exploiting models built using multiple structure alignments to improve accuracy, gives some improvements in sensitivity (4–5%). However, a protocol exploiting an 8-fold expanded HMM library, based on sequence relatives of structural domains, gives an increase of nearly 10% in sensitivity (12). In addition, HMM–HMM-based approaches have been implemented using the PRC protocol of Madera and co-workers (http://supfam.org/PRC). These allow recognition of extremely remote homologues, some of which are not easily detected by the structure comparison methods (discussed further below). HMM-based database scans developed for the CATH classification protocol are collectively referred to as HMMscan below.
For some very remotely related homologues, confidence in an assignment can be improved by combining information from multiple prediction methods. We have
investigated the benefits of using machine learning methods to do this automatically. A neural network was trained using a dataset of 14 000 diverse homologues (<35% sequence identity) and 14 000 non-homologous pairs, with data from different homologue comparison methods, including structure comparison (CATHEDRAL, SSAP), sequence comparison (HMM–HMM) and information on functional similarity. The latter was obtained by comparing EC classification codes between close relatives of the distant homologues and using a semantic similarity scoring
scheme for comparing GO terms, based on a method developed by Lord et al. (13). On a separate validation set of 14 000 homologous pairs and 14 000 non-homologous pairs, 97% of the homologues can be recognised at an error rate of <4%.
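A figure such as "97% recognised at an error rate below 4%" pairs a sensitivity with an error rate measured at one score threshold. The sketch below shows one way to compute both from classifier outputs on a balanced validation set. The text does not define "error rate"; here it is assumed to be the fraction of non-homologous pairs scored above the threshold, and the function name and score lists are illustrative.

```python
from typing import Sequence, Tuple


def sensitivity_and_error_rate(homologue_scores: Sequence[float],
                               non_homologue_scores: Sequence[float],
                               threshold: float) -> Tuple[float, float]:
    """Return (sensitivity, error rate) at a given score threshold.

    sensitivity = fraction of true homologous pairs scoring >= threshold
    error rate  = fraction of non-homologous pairs scoring >= threshold
                  (an assumed definition; the source does not give one)
    """
    true_pos = sum(s >= threshold for s in homologue_scores)
    false_pos = sum(s >= threshold for s in non_homologue_scores)
    return (true_pos / len(homologue_scores),
            false_pos / len(non_homologue_scores))


if __name__ == "__main__":
    # Toy scores standing in for neural network outputs.
    hom = [0.9, 0.8, 0.4, 0.95]
    non = [0.1, 0.2, 0.85, 0.3]
    print(sensitivity_and_error_rate(hom, non, threshold=0.5))
```

Sweeping the threshold trades sensitivity against error rate, which is how a single operating point like the one quoted above would be chosen.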
NEW UPDATE PROTOCOL
Automatic methods
Previously, CATH data was generated using a group of independent
programs and flat files. Over the past two years we have developed an
update protocol for CATH that is driven by a suite of programs with a
central library and a PostgreSQL database system. A classification
pipeline has been established which links, in a completely automated
fashion, the different programs that analyse the sequences and structures
of both protein chains and domains. The CATH update protocol can
essentially be divided into two parts: domain boundary assignment and
domain homology classification (see Figure 3). The aim of the protocol is
to minimise manual assignment and provide as much support as
possible when manual validation is necessary.
Processing of both parts of the classification protocol is similar, requiring related meta-data and the triggering of the same automated algorithms. Methods include pairwise sequence similarity comparisons and scans by other homologue detection or
fold recognition algorithms, such as HMMscan and CATHEDRAL, that provide data for either manual or automated assignment. Many of the automated steps in the protocol have been established as a web service, and the pipeline integrates automated steps together with ‘holding stages’, in which domains are held prior to processing and await the completion of manual validation of predictions (see below).
Web pages to support manual validation
For each manual stage in the classification (domain boundary assignment, DomChop, and homologue recognition, HomCheck) we have developed a suite of web pages bringing together all available meta-data from prediction algorithms (e.g. DBS,
CATHEDRAL and HMMscan for DomChop; CATHEDRAL and HMMscan for HomCheck), information from the literature, and information from other family classifications with relevant data (e.g. Pfam). For each protein or domain shown on the pages, information on the statistical significance of matches is presented. The web pages will shortly be made viewable and will provide interim data on protein chains and domains not fully classified in CATH, for biologists interested in any entries pending classification.
OVERVIEW OF THE CURRENT RELEASE (VERSION 3.0)
Assigning domain boundaries and relationships between protein structures is
computationally challenging. Since the last CATH release (version 2.6), the number of domains in the CATH database has increased by 20% in version 3.0 and now totals 86 151. This is a more than 10-fold increase in the number of domains classified in CATH since its creation. Improvements in automation, and also in the web-based resources used to aid manual validation, have allowed us to increase the proportion of hard-to-classify structures processed in CATH, and this is reflected in a significant increase in the number of folds in the database, now more than 1000. The detailed breakdown of numbers of domains in the nine CATH levels is given in Table 1. We conducted an analysis of the domains in version 3.0 and have derived statistics for several fundamental features.
Percentage of new topologies
An analysis of the percentage of new folds arising from the early 1970s to the present is shown in Figure 1. The number of new folds has been decreasing over time with respect to the number of new structures being deposited, and it can be seen that currently approximately 2% of new structures classified in CATH are observed to be novel folds. For comparison, the number of domain structures solved over time is also graphically represented in Figure 1.
Number of domains within a protein chain
Integral to the construction of the CATH database is designating domain boundaries. We conducted an analysis of the number of chains versus the number of domains in a chain. It is interesting to note that 64% of all protein structures currently solved and classified in CATH are single-domain chains (data not shown). The next most prevalent are two-domain chains (27%), and following this we find that the number of chains containing three or more domains rapidly decreases. The average size of the single-domain chains is 159 residues.
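The chains-versus-domains analysis is a simple tally. A minimal sketch follows, assuming a mapping from chain identifier to its number of assigned domains; the mapping, the chain identifiers and the function name are hypothetical, and the toy numbers are not CATH data.

```python
from collections import Counter
from typing import Dict, Mapping


def domain_count_distribution(domains_per_chain: Mapping[str, int]) -> Dict[int, float]:
    """Return the percentage of chains having each number of domains."""
    counts = Counter(domains_per_chain.values())
    total = sum(counts.values())
    return {n: 100.0 * c / total for n, c in sorted(counts.items())}


if __name__ == "__main__":
    # Toy input: four chains, three single-domain and one two-domain.
    example = {"chainA": 1, "chainB": 1, "chainC": 2, "chainD": 1}
    print(domain_count_distribution(example))
```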
The CATH Dictionary of Homologous Superfamilies (CATH-DHS)
The CATH-DHS has also been recently updated. Data on structural similarity and superfamily variability is presented as a significant update to the Dictionary of Homologous Superfamilies (DHS) web resource (14). The DHS also provides functional annotations of domains within each H-level (superfamily) in CATH v2.5.1.
For each superfamily, pair-wise structural similarity scores between relatives, measured by SSAP, are presented. The DHS now contains 3307 multiple structural alignments for 1459 superfamilies. For each superfamily, multiple alignments are generated for all the relatives and also for subgroups of structurally similar relatives and sequence-similar relatives. Alignments are performed using the residue-based CORA algorithm (15) and presented both as CORAPLOTS (14) and in the form of a 2DSEC diagram (16), alongside co-ordinate data of the superposed structures in PDB format. Sequence representations of the alignments are available to download in FASTA format. In the CORAPLOT images of the multiple alignment, residues in each domain are coloured according to ligand binding and residue type. EquivSEC plots are also shown that describe the variability in orientation and packing between equivalent secondary structures (16).
To identify sequence relatives for CATH superfamilies, sequences from UniProt (17) were scanned against HMMs of all CATH domains (12). Homologous sequences were
identified as those hits with an E-value <0.01 and a 60% residue overlap with the CATH domain. This protocol recognised over one million domain sequences in UniProt which could be integrated in the CATH-DHS. The harvested sequences in each superfamily were compared against other relatives by BLAST (18) to determine the pair-wise sequence identity, and then clustered at appropriate levels of sequence identity (35% and 95%) using multi-linkage clustering. Information and links to other functional databases, ENZYME (19), GO (Gene Ontology Consortium, 2000), KEGG (20), COG (21) and SWISS-PROT (22), are also included by BLASTing the sequences from each superfamily against sequences provided by these resources. Only hits with 95% sequence identity and an 80% residue overlap were used to annotate sequences.
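The hit selection step is a filter on two quantities, E-value and residue overlap. The sketch below keeps both cut-offs as parameters, because the exact values have to be read from damaged text: the defaults (E-value below 0.01, overlap of at least 60%) are an interpretation, and the `Hit` record and function name are illustrative rather than part of any CATH tool.

```python
from typing import List, NamedTuple


class Hit(NamedTuple):
    """One HMM match between a UniProt sequence and a CATH domain model."""
    sequence_id: str
    e_value: float
    overlap_fraction: float  # fraction of the CATH domain covered (0 to 1)


def select_homologues(hits: List[Hit],
                      max_e_value: float = 0.01,
                      min_overlap: float = 0.60) -> List[Hit]:
    """Keep hits that are both significant and cover enough of the domain.
    Default thresholds are assumed values, not confirmed by the source."""
    return [h for h in hits
            if h.e_value < max_e_value and h.overlap_fraction >= min_overlap]


if __name__ == "__main__":
    # Hypothetical sequence identifiers and scores.
    hits = [
        Hit("seq1", 1e-20, 0.95),   # kept
        Hit("seq2", 0.5, 0.90),     # rejected: not significant
        Hit("seq3", 1e-8, 0.40),    # rejected: overlap too small
    ]
    print([h.sequence_id for h in select_homologues(hits)])
```

The same function could serve the stricter annotation step by passing different cut-offs.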
Recent analysis of structural and functional divergence in highly populated CATH superfamilies (>5 structural relatives with <35% sequence identity) has been undertaken using data from the DHS. The 2DSEC algorithm was used to analyse multiple structural alignments of families and identify highly conserved structural cores and secondary structure embellishments, or decorations, to the common core. In some large superfamilies, extensive embellishments were observed outside the core, and although these secondary structure insertions were frequently discontinuous in the protein chain, they were often co-located in 3D space (16). In many cases, manual inspection revealed that the embellishment had aggregated to form a larger structural feature that was modifying the active site of the domain or creating new surfaces for domain or protein interactions. Data collected in the DHS clearly shows a relationship between structural divergence within a superfamily, sequence divergence of this superfamily amongst predicted domains in the genomes, and the number of distinct functional groups that can be identified for the superfamily (see Figure 4).
Figure 4. Relationship between sequence variability, structural variability and functional diversity in CATH superfamilies. Structural variation in a CATH superfamily, as measured by the number of diverse structural subgroups (SSAP score <80 between groups)
LATERAL LINKS ACROSS THE CATH HIERARCHY TO CAPTURE EVOLUTIONARY DIVERGENCE AND EXPLORE THE STRUCTURAL CONTINUUM
Our analysis of structural divergence in CATH superfamilies (16) has revealed families where significant changes in the structures had occurred. In some cases 5-fold differences in the sizes of domains were identified, and sometimes it was apparent that the ‘folds’ of these very diverse relatives had effectively changed. Therefore, in these superfamilies more than one fold group can be identified, effectively breaking the hierarchical nature of the CATH classification, which implies that each relative within a CATH homologous superfamily should belong to the same CAT fold group.
In addition, an ‘all versus all’ HMM–HMM scan between all superfamily
representatives revealed several cases of extremely remote homologues which had been classified into separate superfamilies and yet match with significant E-values. Structure comparison had failed to detect the relationship between these superfamilies because the structural divergence of the relatives was so extreme, sometimes constituting a change in architecture as well as fold group. In these cases homology was only suggested by the HMM-based scans and then manually validated by considering functional information and detailed evidence from the literature. In order to capture information on these distant homologies, links have been created between the superfamilies, both on our web pages and in the CATH database. The data can
now be found as a link from the CATH homepage (http://www.cathdb.info).
In the near future we also plan to provide web pages presenting cases of significant structural overlaps between superfamilies or fold groups. For these cases we are not currently able to find any additional evidence to support a distant evolutionary relationship, and these examples highlight the recurrence of large structural motifs between some folds and the existence of a structural continuum in some regions of fold space.
FEATURES
The CATH database can be accessed at http://www.cathdb.info. The web interface
may be browsed or, alternatively, searched with PDB codes or CATH domain identifiers. There is also a facility for keyword searches. With the version 3.0 release we now make the raw and processed data files available, which include, for example, CATH domain PDB files, sequences and DSSP files; they can be accessed through the CATH database main page. The Gene3D resource can be accessed through the CATH database or directly at http://www.cathdb.info/Gene3D. The DHS can be accessed through the CATH database or directly at http://www.cathdb.info/bsm/dhs.
Structural similarity is assessed using an automatic method (SSAP) (3,4), which scores 100 for
identical proteins and generally returns scores above 80 for homologous proteins. More distantly
related folds generally give scores above 70 (Topology or fold level), though in the absence of any
sequence or functional similarity this may simply represent examples of convergent evolution, reinforcing the hypothesis that there exists a limited number of folds in nature (5,6).
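The SSAP score bands described above can be summarised as a small lookup. This is a rough reading aid, not part of SSAP itself: the function name and the category labels are invented for illustration, and the text describes the bands as general tendencies rather than strict cut-offs.

```python
def interpret_ssap(score: float) -> str:
    """Map an SSAP score (0 to 100) to the rough bands given in the text.
    Real classification also weighs sequence and functional similarity."""
    if not 0 <= score <= 100:
        raise ValueError("SSAP scores lie between 0 and 100")
    if score == 100:
        return "identical"
    if score > 80:
        return "likely homologous"
    if score > 70:
        return "similar fold (topology level)"
    return "no clear structural similarity"


if __name__ == "__main__":
    for s in (100, 86.5, 74.0, 55.0):
        print(s, "->", interpret_ssap(s))
```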
Abstract
We report the latest release (version 1.4) of the CATH protein domains database
(http://www.biochem.ucl.ac.uk/bsm/cath). This is a hierarchical classification of 13 359 protein domain structures into evolutionary families and structural groupings. We currently identify 827
homologous families in which the proteins have both structural similarity and sequence and/or
functional similarity. These can be further clustered into 593 fold groups and 32 distinct architectures.
Using our structural classification and associated data on protein functions stored in the database (EC
identifiers, SWISS-PROT keywords and information from the Enzyme database and literature), we have been able to analyse the correlation between the 3D structure and function. More than 96% of
folds in the PDB are associated with a single homologous family. However, within the superfolds,
three or more different functions are observed. Considering enzyme functions, more than 95% of clearly homologous families exhibit either single or closely related functions, as demonstrated by the
EC identifiers of their relatives. Our analysis supports the view that determining structures, for
example as part of a ‘structural genomics’ initiative, will make a major contribution to interpreting genome data.
Introduction FOR CATH
The CATH classification of protein domain structures was established in 1993 (1) as a
hierarchical clustering of protein domain structures into evolutionary families and structural
groupings, depending on sequence and structure similarity. There are four major levels,
corresponding to protein class, architecture, topology or fold, and homologous family (Fig. 1).
Since 1995, information about these structural groups and protein families has been accessible
over the Web (http://www.biochem.ucl.ac.uk/bsm/cath), together with summary information
about each individual protein structure (PDBsum) (2).
CATH consists of both phylogenetic and phenetic descriptors for protein domain relationships.
At the lowest levels in the hierarchy, proteins are grouped into evolutionary families
(Homologous families) for having either significant sequence similarity (≥35% identity) or high
structural similarity and some sequence similarity (≥20% identity).
The CATH database of protein structures contains approximately 18 000 domains
organized according to their (C)lass, (A)rchitecture, (T)opology and (H)omologous
superfamily. Relationships between evolutionarily related structures (homologues) within the
database have been used to test the sensitivity of various sequence search methods in order to
identify relatives in GenBank and other sequence databases. Subsequent application of the most
sensitive and efficient algorithms, gapped BLAST and the profile-based method Position-Specific
Iterated Basic Local Alignment Search Tool (PSI-BLAST), could be used to assign structural data to
between 22% and 36% of microbial genomes in order to improve functional annotation and
enhance understanding of biological mechanism. However, on a cautionary note, an analysis of
functional conservation within fold groups and homologous superfamilies in the CATH database
revealed that, whilst function was conserved in nearly 55% of enzyme families, function had
diverged considerably in some highly populated families. In these families functional properties
should be inherited far more cautiously, and the probable effects of substitutions in key functional
residues must be carefully assessed.
Figure 1
Schematic representation of the (C)lass, (A)rchitecture and (T)opology/fold levels in the CATH
database.
Figure 2
Snapshot of a web page showing data available in the CATH dictionary of homologous superfamilies
for the subtilisin family (CATH id 3.40.50.200). Tables display the PDB codes for non-identical relatives in the family, together with EC identifier codes and information about the enzyme reactions.
The multiple structural alignment shown has been coloured according to secondary structure
assignments (red for helix, blue for strands).
The Architecture level in CATH groups proteins whose folds have similar 3D arrangements of secondary structures (e.g. barrel, sandwich or propeller), regardless of their connectivity, whilst the
top level, Class, simply reflects the proportion of α-helix or β-strand secondary structures. Three
major classes are recognised, mainly-α, mainly-β and α–β, since analysis revealed considerable overlap between the α+β and alternating α/β classes originally described by Levitt and Chothia (7).
Before classification, multidomain proteins are first separated into their constituent folds using a
consensus method which seeks agreement between three independent algorithms (8). Whilst the protocol for updating CATH is largely automatic (9), several stages require manual validation, in
particular establishing domain boundaries in proteins for which no consensus could be reached, and checking the relationships of very distant homologues and proteins having borderline fold similarity.
Although there are plans to assign the more regular architectures automatically, all architecture
groupings are currently assigned manually.
A homologous family Dictionary is now available within CATH, which contains functional data, where available, for each protein within a homologous family. This includes EC identifiers, SWISS-PROT keywords and information from the Enzyme database or the literature (Fig. 2). Multiple
structure-based alignments are also available, coloured according to secondary structure assignments
or residue properties, and there are schematic plots showing domain representations annotated by protein-ligand interactions (DOMPLOTS) (A.E. Todd, C.A. Orengo and J.M. Thornton, submitted
to Protein Engng). The topology of each domain is illustrated by schematic TOPS diagrams
(http://www3.ebi.ac.uk/tops; 10).
Figure 3
CATH wheel plot showing the population of homologous families in different fold groups, architectures and classes. The wheel is coloured according to protein class (red, mainly-α; green,
mainly-β; yellow, αβ; blue, few secondary structures). The size of the outer wheel represents the
number of homologous families in CATH, whilst each band in the outer wheel corresponds to a single
fold family. The size of each fold ‘band’ therefore reflects the number of homologous families having
that fold. It can be seen that most fold families contain a single homologous family. The superfold families are shown as paler bands containing many homologous families. The inner wheel shows the
population of homologous families in the different architectures.
We have also recently set up a Web Server (11) which enables the user to scan the CATH database with a newly determined protein structure and identify possible fold similarities or evolutionary relationships. There are also plans to incorporate sequence searches (using BLAST or PSI-BLAST)
(12) to identify a probable fold for a new sequence.
The latest release of CATH (version 1.4, April 1998) contains 9342 protein chains from the PDB (13),
which divide into 13 359 domain folds. Currently 32 different architectures are recognised. Since the
last release, three new architectures have been described, including the five-bladed α-β propeller. Grouping proteins on the basis of sequence, structure and functional similarity gives 827 evolutionary
homologous families (H-level), whilst recognising more distant structural similarity with no
accompanying sequence or function similarity gives rise to 593 different fold groups (T-level).
The population of the different levels in the CATH hierarchy is illustrated by the CATH wheel shown
in Figure 3. It can be seen that several highly populated fold families, which we describe as superfolds (6) as they support a diverse range of sequences and more than three different functions, still account
for nearly 30% of non-homologous structures.
Implications for Structural Genomics
As the sequence databases grow rapidly, the need to interpret these sequences and assign functions to
specific genes becomes increasingly important. Many techniques exist for matching protein sequences and thereby inheriting functional information. However, for very distant homologues there is often no
detectable sequence similarity, despite conservation of 3D structure and function. For these cases,
evolutionary relationships, and thereby functions, can only be assigned by comparing the structures.
Therefore a number of structural genomics initiatives are being proposed (14) which aim to identify
all the folds in nature, with the ultimate goal of being able to predict the function of a new protein from
its known or probable structure. The important questions to ask are: how many more folds do we need
to determine before we have the complete set, and how confident can we be in assigning function between proteins having similar structures?
In the current genomes, on average only 30–46% of sequences can be assigned to a structural family
by recognising sequence similarity to a protein of known structure (15,16). With only ~600 unique
structures currently in the PDB, compared with ~20 000 sequence families, it is clear that we still
need to determine many more structures if we are to understand biology at the molecular level.
However, analysis of recently deposited structural data is very revealing. Figure 4a illustrates the
distribution of 2159 new structural domains classified in the 10 months from June 1997 to March
1998. A large proportion of these (79%) were clearly homologous (≥30% identity) to proteins of known structure.
Of the remaining 443 structures (Fig. 4b), corresponding to ‘new’ sequences, we found only 8% were
novel folds, the remainder resembling a previously determined structure. Many of these, 199 (45%),
could be identified as clear homologues by having significant structure and sequence similarity (SSAP
≥80 and ≥20% sequence identity). A further 169 (38%) were probable homologues as, although the sequence identity was below 20%, they had functional similarity and/or gave significant scores using
sequence search methods designed to detect very distant homologues (PSI-BLAST) (12). There
remained a further 40 (9%) proteins which were analogous, i.e. they had the same fold as a previous
entry but neither the sequence nor the function gave definite evidence of a common ancestor.
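The percentages quoted for the 443 structures with new sequences can be checked with a few lines of arithmetic. The counts 199, 169 and 40 come from the text; the count of novel folds (35) is not stated and is derived here on the assumption that the four categories are exhaustive.

```python
# Counts taken from the text for the 443 structures with new sequences.
total_new = 443
categories = {
    "clear homologues": 199,
    "probable homologues": 169,
    "analogues": 40,
}

# Assumption: whatever is left over corresponds to the novel folds.
novel = total_new - sum(categories.values())
categories["novel folds"] = novel

percentages = {name: round(100 * count / total_new)
               for name, count in categories.items()}

# Share recognisable as homologues from structural data (clear + probable).
homologue_share = round(100 * (199 + 169) / total_new)

if __name__ == "__main__":
    print(novel)            # 35
    print(percentages)      # 45, 38, 9 and 8 percent respectively
    print(homologue_share)  # 83
```

The derived 8% for novel folds and the combined 83% for homologues agree with the figures used later in the discussion of function assignment.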
At the homologous superfamily level in CATH, a more detailed analysis of enzyme functions showed
that the majority of homologous enzyme families in CATH (>90%) contained proteins for which the
first three EC identifiers were the same. Considering those families where homologues have
significant sequence identity (≥20%) after structural alignment, 95% were found to have a single EC
identifier, whilst for families where proteins have more than 30% sequence similarity we observed
that 98% had a single EC code.
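Comparing "the first three EC identifiers" means comparing the leading fields of the four-field EC number. A minimal sketch follows; both function names are hypothetical, a `-` field is treated as unknown and therefore as ending the match, and the EC numbers in the usage example are illustrative only, not a CATH family.

```python
from typing import Iterable


def ec_levels_shared(ec_a: str, ec_b: str) -> int:
    """Count how many leading EC fields two EC numbers have in common.
    A '-' field (unspecified level) ends the comparison."""
    shared = 0
    for a, b in zip(ec_a.split("."), ec_b.split(".")):
        if a != b or a == "-":
            break
        shared += 1
    return shared


def family_shares_first_three(ec_numbers: Iterable[str]) -> bool:
    """True if every member agrees with the first on the first three fields."""
    ecs = list(ec_numbers)
    if not ecs:
        return True
    return all(ec_levels_shared(ecs[0], ec) >= 3 for ec in ecs[1:])


if __name__ == "__main__":
    print(ec_levels_shared("3.4.21.62", "3.4.21.4"))
    print(family_shares_first_three(["3.4.21.62", "3.4.21.4", "3.4.21.1"]))
```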
Although assigning function on the basis of homology is common practice, it is clear that some
caution should be exercised, particularly where there is little or no sequence similarity. There are also
some clear examples where homologues with significant sequence similarity perform different
functions. The role of ‘gene recruitment’ is especially clear in the eye lens proteins, which function as
enzymes in other cellular environments but which are used as structural proteins in this context (17).
The extent of such ‘gene recruitment’ and context-sensitive function is really not known at this time.
For enzymes it is clear that catalytic function can change and evolve, usually to act on a different but
related substrate. Similarly, within the lipocalin family (CATH id 24013010) several proteins are found with very similar structures which bind different fatty acids in the same region at the base of
the β-barrel (e.g. retinol, bilin, biotin).
Nearly half of the homologous families where two or more different EC numbers were observed
belong to the superfolds. This suggests that if a new protein is assigned to a superfold family, more
caution should be used when inheriting functional information, as there appears to be greater tolerance to changes in sequence, and ultimately function, for these families. However, it is interesting to note
that many of these were TIM barrel or Rossmann folds. These are superfolds in which the substrate or
ligand commonly binds in the same place: in the base of the β-barrel for the TIMs and at the
crossover of the polypeptide chain for the doubly wound Rossmann structures.
Assignment of Function Through Structure
One of the reasons for determining structures is to derive more information to facilitate the assignment of function. From our analysis of proteins in CATH, we suggest that structural data can help to assign
function in several ways:
i. The structural data allow recognition of more distant homologues compared with sequence data. In our analysis, 83% of structures with novel sequences could be assigned as homologues in this way (note that such assignment of function is again subject to the caveats
imposed by ‘gene recruitment’ discussed above).
ii. The structural data allow detailed inspection of the functional site, to suggest if and
how the function may have evolved. For example, if an enzyme has evolved to act on a
different substrate, the binding site may reveal, or at least suggest, possible changes in the
substrate.
iii. For the superfolds, similarity of structure does not necessarily mean similarity of
function. However, the active site/binding sites are often conserved, e.g. in the TIM barrel or
Rossmann fold structures the ligand always binds at the same end of the barrel or sheet.
iv. Some methods have already been developed, and will increasingly be the focus of attention over the next few years, which aim to predict function ab initio from structure. For
example, enzymes can often be identified by the presence of a major cleft, which also locates
the active site (18). Similarly, critical surface patches which are used for molecular
recognition in binding other proteins or ligands may be identified using knowledge-based approaches (19,20).
In summary, extrapolating the data from Figure 4 to a new genome, we can expect that of the 54-70%
of sequences which currently have no obvious sequence matches in the PDB, we will find nearly 80-90% to be homologous to a known family using the structural data alone. For the singlet folds this
will almost certainly reveal some clues to the function. For the superfolds, some folds will reveal
information on the functional class (e.g. enzyme for TIM barrels) or the location of the active site, if
not the specific function. Only 10-20% will be expected to be 'novel' folds. For these, the ab
initio methods referred to above may provide some clues to guide experiments. Therefore it is clear
that determining structures, as part of a 'structural genomics' initiative for example, will make a
major contribution to interpreting genome data.
JAIL
Just Another Interface Library
Interfaces of macromolecules are a valuable basis for analysing the process of molecular recognition. JAIL classifies not only the interfaces between domain architectures but also those between protein chains and those between proteins and nucleic acids.
[Figure: Gt-alpha/Gi-alpha chimera (PDB-ID 1GOT) and the interfaces of 1GOT]
Interacting proteins are difficult to crystallize and are rarely present in the Protein Data Bank (PDB).
Nevertheless, it is essential to analyse the interacting parts of proteins to understand the
process of protein-protein docking.
To overcome this problem the JAIL database was built. Since interacting domains
exhibit structural features similar to those of interacting proteins, all known interfaces between interacting
domains of the SCOP database were extracted and classified in JAIL.
Only a part of all protein structures is included in SCOP. In particular, new PDB entries are
not yet annotated. To overcome this problem, all interfaces between protein
chains were additionally calculated and included in the database. This type of interface also comprises
the interacting parts of the assumed biological units. The last important type of interface
provided here is composed of the interacting parts between proteins and nucleic acids.
Overall the data set consists of about 180,000 interfaces.
JAIL is a convenient tool for browsing the interface library and analysing single
interfaces. However, more general questions require large-scale analysis. For this purpose
a detailed form enables the compilation of comprehensive non-redundant data sets for
download.
How is an interface defined?
A complete residue is part of an interface if at least one atom of the amino acid is located
within 4.5 Ångström of any atom of the interacting domain or chain. One part of
an interface must consist of at least 5 C-alpha atoms in the case of protein chains. In the case of nucleic acids the relevant atoms are calculated on the basis of the phosphorus atoms
of the RNA/DNA backbone.
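The distance criterion above can be illustrated with a minimal Python sketch. This is not JAIL's own code: the data layout (a dictionary mapping a residue identifier to a list of atom coordinates) and the function names are assumptions made for illustration, and the size check simply counts residues, since each amino acid contributes one C-alpha atom.

```python
import math

CUTOFF = 4.5        # distance cutoff in Angstrom, as stated in the text
MIN_CA_ATOMS = 5    # minimum size of one side of a protein interface


def interface_residues(chain_a, chain_b, cutoff=CUTOFF):
    """Return the residues of chain_a that belong to the interface with chain_b.

    Each chain is a dict: residue id -> list of (x, y, z) atom coordinates
    (a hypothetical layout chosen for this sketch).
    A whole residue counts as soon as any one of its atoms lies within
    `cutoff` of any atom of the partner chain.
    """
    partner_atoms = [atom for atoms in chain_b.values() for atom in atoms]
    hits = set()
    for res_id, atoms in chain_a.items():
        if any(math.dist(a, b) <= cutoff for a in atoms for b in partner_atoms):
            hits.add(res_id)
    return hits


def is_large_enough(residues, min_ca=MIN_CA_ATOMS):
    """One C-alpha per residue, so the residue count equals the C-alpha count."""
    return len(residues) >= min_ca


chain_a = {1: [(0.0, 0.0, 0.0)], 2: [(10.0, 0.0, 0.0)]}
chain_b = {1: [(3.0, 0.0, 0.0)]}
side_a = interface_residues(chain_a, chain_b)
print(side_a, is_large_enough(side_a))
```

The all-against-all distance loop is quadratic in the number of atoms; for whole-database calculations a spatial index would normally be used instead.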
What are biounits?
The primary coordinate file deposited in the PDB generally contains one asymmetric unit.
The asymmetric unit is the smallest portion of a crystal structure to which crystallographic
symmetry can be applied to generate one unit cell. The biological molecule (biounit) is believed
to be the functional unit of the protein. Frequently those units can be assumed or calculated when additional information is available. The biological units of many proteins are deposited
in a separate section of the PDB database and can be used for interface calculations.
How are redundant interfaces excluded in the download section?
Redundancy is removed in two different ways: by structure and by sequence. The
sequence clustering is based on the CD-HIT program. The structural clustering is defined by
the protein families and superfamilies of the SCOP classification, which classifies
proteins by domain architecture.
Which settings in the download section are best for my own research?
The selection of the datasets depends on the type of interaction (protein-protein or protein-nucleic
acid) and the level of diversity that is desired. A maximum sequence identity of 50%
results in a higher diversity than a setting of 95%. The default settings include interfaces of
domain-domain interactions as well as interfaces between interacting chains. All interfaces of
chains that are already covered by the SCOP domain interfaces are excluded by default.
This procedure results in a high number of interfaces that are still diverse enough for
statistical analysis.
What is meant by "show conservation" in Jmol?
The conservation of protein sequences is defined by the mutation rates at each amino acid
position. For JAIL this information was retrieved from ConSurf. ConSurf is a derived
database merging structural and sequence information.
Search for
PDB-Id, e.g. 1aay
SCOP-Id, e.g. d1az0a_
EC-Number, e.g. 3.1.1
Accession number, e.g. P03697
Protein name, e.g. capsid protein
The search can be restricted to the following interface types: Domain/Domain (SCOP), Chain/Chain, Protein/Nucleic and Biounit/Biounit.
A full-text keyword search and a SCOP text search are also available. Results can be limited to interfaces having a particular location: intra, inter, or don't care.
MMDB
MMDB contains experimentally resolved structures of proteins, RNA and DNA, derived from the Protein Data Bank (PDB), with value-added features such as explicit chemical graphs, computationally
identified 3D domains (compact substructures) that are used to identify similar 3D structures, as well as links to literature, similar sequences, information about chemicals bound to the structures, and more. These connections make it possible, for example, to find 3D structures for homologs of a protein sequence of interest, then interactively view the sequence-structure relationships, active sites, bound chemicals, journal articles, and more.
Three-dimensional structures are now known within many protein families, and it is
quite likely, in searching a sequence database, that one will encounter a homolog
with known structure. The goal of Entrez's 3D-structure database is to make this
information, and the functional annotation it can provide, easily accessible to molecular biologists. To this end Entrez's search engine provides three powerful
features: (i) Sequence and structure neighbors: one may select all sequences similar
to one of interest, for example, and link to any known 3D structures. (ii) Links
between databases: one may search by term matching in MEDLINE, for example, and
link to 3D structures reported in these articles. (iii) Sequence and structure
visualization: identifying a homolog with known structure, one may view molecular-graphic
and alignment displays to infer approximate 3D structure. In this article we
focus on two features of Entrez's Molecular Modeling Database (MMDB) not
described previously: links from individual biopolymer chains within 3D structures to a systematic taxonomy of organisms represented in molecular databases, and
links from individual chains (and compact 3D domains within them) to structure
neighbors, other chains (and 3D domains) with similar 3D structure. MMDB may be
accessed at http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Structure
SUPERFAMILY is a database of structural and functional annotation for all
proteins and genomes.[1][2][3][4][5]
The SUPERFAMILY annotation is based on a collection of hidden Markov models which
represent structural protein domains at the SCOP superfamily level.[6] A superfamily groups
together domains which have an evolutionary relationship. The annotation is produced by
scanning protein sequences from completely sequenced genomes against the hidden Markov
models.
For each protein you can:
Submit sequences for SCOP classification
View domain organisation, sequence alignments and protein sequence details
For each genome you can:
Examine superfamily assignments, phylogenetic trees, domain organisation lists and
networks
Check for over- and under-represented superfamilies within a genome
For each superfamily you can:
Inspect SCOP classification, functional annotation, Gene Ontology annotation, InterPro
abstract and genome assignments
Explore the taxonomic distribution of a superfamily across the tree of life
All annotation, models and the database dump are freely available for download to everyone.
Purpose
SUPERFAMILY classifies amino acid sequences into known structural domains, especially
into SCOP superfamilies. The superfamilies are groups of proteins which have structural
evidence to support a common evolutionary ancestor but may not have detectable
sequence homology.
Major Features
Sequence search: Submit your protein or DNA sequence for SCOP superfamily and family level classification.
Keyword search: Search for superfamily, family or species names, plus sequence, SCOP, PDB or hidden Markov model IDs.
Domain assignments: Domain assignments, alignments and architectures for completely sequenced eukaryotic and prokaryotic organisms, plus sequence collections.
Comparative genomics tools: Browse unusual (over- and under-represented) superfamilies and families, adjacent domain pair lists and graphs, unique domain pairs, domain combinations, domain architecture co-occurrence networks and domain distribution across taxonomic kingdoms for each organism.
Genome statistics: For each genome: number of sequences, number of sequences with assignment, percentage of sequences with assignment, percentage total sequence coverage, number of domains assigned, number of superfamilies assigned, number of families assigned, average superfamily size, percentage produced by duplication, average sequence length, average length matched, number of domain pairs and number of unique domain architectures.
Gene Ontology: Domain-centric Gene Ontology (GO), automatically annotated, by Hai Fang.
Phenotype Ontology: Domain-centric phenotype/anatomy ontology including Disease Ontology, Human Phenotype, Mouse Phenotype, Worm Phenotype, Yeast Phenotype, Fly Phenotype, Fly Anatomy, Zebrafish Anatomy, Xenopus Anatomy and Arabidopsis Plant.
Superfamily annotation: InterPro abstracts for 1052 superfamilies and Gene Ontology (GO) annotation for 763 superfamilies.
Functional annotation: Functional annotation of SCOP 1.73 superfamilies, by Christine Vogel.
Phylogenetic trees: Trees are generated using heuristic parsimony methods and are based on protein domain architecture data for all genomes in SUPERFAMILY. Genome combinations or specific clades can be displayed as individual trees.
Similar domain architectures: Find the 10 domain architectures which are most similar to a domain architecture of interest.
Hidden Markov models: Produce SCOP domain assignments for your sequences using the SUPERFAMILY models. HMM visualisation by Martin Madera, e.g. model 0045110.
Profile comparison: Find remote domain matches when the HMM search fails to find a significant match. Profile comparison (PRC), for aligning and scoring two profile hidden Markov models, by Martin Madera.
Web services: Distributed Annotation Server and linking to SUPERFAMILY.
Downloads: Sequences, assignments, models, MySQL database and scripts, updated weekly.
The SUPERFAMILY database provides protein domain assignments, at the SCOP superfamily
level, for the predicted protein sequences in over 400 completed genomes. A superfamily groups
together domains of different families which have a common evolutionary ancestor based on
structural, functional and sequence data. SUPERFAMILY domain assignments are generated
using an expert curated set of profile hidden Markov models. All models and structural
assignments are available for browsing and download from http://supfam.org. The web interface
includes services such as domain architectures and alignment details for all protein assignments,
searchable domain combinations, domain occurrence network visualization, detection of over- or
under-represented superfamilies for a given genome by comparison with other genomes,
assignment of manually submitted sequences and keyword searches. In this update we describe
the SUPERFAMILY database and outline two major developments: (i) incorporation of family
level assignments and (ii) a superfamily-level functional annotation. The SUPERFAMILY
database can be used for general protein evolution and superfamily-specific studies, genomic
annotation, and structural genomics target suggestion and assessment.
The protein database in Normal SMART has significant redundancy, even though identical proteins are
removed. If you use SMART to explore domain architectures, or want to find exact domain counts in various
genomes, consider switching to Genomic mode. The numbers in the domain annotation pages will be more
accurate, and there will not be many protein fragments corresponding to the same gene in the architecture
query results. Remember, though, that Genomic mode explores a limited set of genomes.
Different colour schemes are used to make the current mode (Normal or Genomic) easy to identify.
Alignment
Representation of a prediction of the amino acids in tertiary structures of homologues that overlay in three dimensions. Alignments held by SMART are mostly based on published observations (see domain annotations for details), but are updated and edited manually.
Alignment block
Ungapped alignments that usually represent a single secondary structure.
Bits scores
Alignment scores are reported by HMMer and BLAST as bits scores. The likelihood that the query sequence is a bona fide homologue of the database sequence is compared to the likelihood that the sequence was instead generated by a random model. Taking the logarithm (to base 2) of this likelihood ratio gives the bits score.
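The definition of the bits score can be written out as a short Python sketch. The function name and the example likelihoods are illustrative assumptions only; real tools work with summed log-probabilities rather than raw likelihoods.

```python
import math


def bits_score(likelihood_homologue, likelihood_random):
    """Bits score = log2 of the likelihood ratio (homologue model vs random model)."""
    return math.log2(likelihood_homologue / likelihood_random)


# A sequence that is 4 times more likely under the homologue model scores 2 bits;
# a sequence that is more likely under the random model gets a negative score.
print(bits_score(0.5, 0.125))
print(bits_score(0.125, 0.5))
```

Each additional bit therefore corresponds to a doubling of the likelihood ratio in favour of homology.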
BLAST (Basic Local Alignment Search Tool)
An excellent database searching tool developed at the National Center for Biotechnology Information (NCBI) ([1] [2] [3] [4]). SMART uses NCBI-BLAST for detection of outlier homologues and homologues of known structure. WU-BLAST is used for nrdb searches with user-supplied sequences.
Cellular Role
Chromatin-associated: Chromatin is the tangled fibrous complex of DNA and protein within a eukaryotic nucleus.
Interaction (with the environment): Molecules that sense cellular environmental change such as osmolarity, light flux, acidity, ion concentration etc.
Metabolic: Enzymes that catalyze reactions in living cells that transform organic molecules.
Replication: The process of making an identical copy of a section of duplex (double-stranded) DNA, using existing DNA as a template for the synthesis of new DNA strands.
Signalling: Proteins that participate in a pathway that is initiated by a molecular stimulus and terminated by a cellular response.
Transport: Transporters (permeases) are proteins that straddle the cell membrane and carry specific nutrients, ions etc. across the membrane.
Translation: The process in which the genetic code carried by messenger RNA directs the synthesis of proteins from amino acids.
Transcription: The synthesis of an RNA copy from a sequence of DNA (a gene); the first step in gene expression.
Coiled coils
Intimately associated bundles of long alpha-helices ([1] [2] [3]). Coiled coils are detected in SMART using the method of Lupas et al. ([4], COILS home). Coiled coil predictions are indicated on the second line in SMART's graphical output.
Domain
Conserved structural entities with distinctive secondary structure content and a hydrophobic core. In small disulphide-rich and Zn2+-binding or Ca2+-binding domains the hydrophobic core may be
provided by cystines and metal ions, respectively. Homologous domains with common functions
usually show sequence similarities.
Domain composition
Proteins with the same domain composition have at least one copy of each of the domains of the query.
Domain organisation
Proteins having all the domains of the query, in the same order. (Additional domains are allowed.)
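The difference between the two query types can be made concrete with a small Python sketch. It assumes, purely for illustration, that a protein's architecture is given as a list of domain names in N- to C-terminal order; the function names and example domains are hypothetical and do not reflect how SMART implements these queries.

```python
def same_composition(query, target):
    """True if the target has at least one copy of every domain in the query."""
    return set(query) <= set(target)


def same_organisation(query, target):
    """True if the query domains occur in the target in the same order.

    Extra domains in the target are allowed, so this is a subsequence test:
    `domain in remaining` advances the iterator past each match.
    """
    remaining = iter(target)
    return all(domain in remaining for domain in query)


query = ["SH3", "SH2", "TyrKc"]
shuffled = ["SH2", "SH3", "PH", "TyrKc"]         # same domains, different order
extended = ["PH", "SH3", "SH2", "C2", "TyrKc"]   # same order, extra domains

print(same_composition(query, shuffled), same_organisation(query, shuffled))
print(same_composition(query, extended), same_organisation(query, extended))
```

Every protein that matches the organisation query also matches the composition query, but not the other way round.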
Entrez
A WWW-based system that allows easy retrieval of sequence, structure, molecular biology and literature data (Entrez). SMART's domain annotation pages contain links to the Entrez system, thereby providing extensive literature, structure and sequence information.
E-value
This represents the number of sequences with a score greater than or equal to X expected purely by chance. The E-value connects the score (X) of an alignment between a user-supplied sequence and a database sequence, generated by any algorithm, with the number of alignments with similar or greater scores that would be expected from a search of a random sequence database of equivalent size. Since version 2.0, E-values are calculated using Hidden Markov Models, leading to more accurate estimates than before.
Extracellular Domains
Domain families that are most prevalent in proteins outside of the cytoplasm and the nucleus.
Gap
A position in an alignment that represents a deletion within one sequence relative to another. Gap penalties are requirements for alignment algorithms in order to reduce excessively gapped regions. Gaps in alignments represent insertions that usually occur in protruding loops or beta-bulges within protein structures.
Genomic database
Protein database used in SMART's Genomic mode. It contains data from completely sequenced genomes only. Ensembl data is used for Metazoan genomes and Swiss-Prot for others. A complete list of genomes in the database is available.
HMM (Hidden Markov model)
HMMs are statistical models of the sequence consensus of a homologous family (see the documentation). A particular class of HMMs has been shown to be equivalent to generalised profiles (8867839).
Applications of HMMs to sequence analysis are provided by HMMer and SAM.
HMM consensus
The HMM consensus is a one-line summary of the corresponding HMM. The amino acid shown for the consensus is the highest probability amino acid at that position according to the HMM. Capital letters mean highly conserved residues (probability > 0.5 for protein models) (modified from the HMMer User's Guide).
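The consensus rule can be sketched in a few lines of Python. The input layout (one dictionary of residue probabilities per match position) and the function name are assumptions for illustration; this is not HMMer's implementation, only the rule as described above.

```python
def hmm_consensus(columns, threshold=0.5):
    """Build a one-line consensus from per-position residue probabilities.

    `columns` is a list of dicts mapping residue letter -> emission probability
    (a hypothetical layout for this sketch). The most probable residue is shown
    at each position, in upper case only if its probability exceeds `threshold`.
    """
    letters = []
    for probs in columns:
        residue, p = max(probs.items(), key=lambda item: item[1])
        letters.append(residue.upper() if p > threshold else residue.lower())
    return "".join(letters)


columns = [
    {"A": 0.7, "G": 0.3},               # highly conserved -> capital
    {"L": 0.4, "V": 0.35, "I": 0.25},   # best residue below 0.5 -> lower case
    {"K": 0.9, "R": 0.1},               # highly conserved -> capital
]
print(hmm_consensus(columns))
```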
HMMer
The HMMer package ([1] [2]) provides multiple alignment and database searching capabilities. There are several programs in the package (see the documentation), including one (hmmfs) that searches databases for non-overlapping LOCAL similarities (i.e. that match across at least part of the HMM) and another (hmmls) that searches databases for non-overlapping GLOBAL similarities (i.e. that match across the full HMM). These correspond approximately to profile-based searches using negative and positive profiles, respectively (see WiseTools). Database searches using hmmls or hmmfs provide alignment scores as bits scores.
Homology
Evolutionary descent from a common ancestor due to gene duplication.
Intracellular Domains
Domain families that are most prevalent in proteins within the cytoplasm.
Localisation
Numbers of domains that are thought, from SwissProt annotations, to be present in different cellular compartments (cytoplasm, extracellular space, nucleus, and membrane-associated) are shown in annotation pages.
Motif
Sequence motifs are short conserved regions of polypeptides. Sets of sequence motifs need not necessarily represent homologues.
NRDB (non-redundant database)
A database that contains no identical pairs of sequences. It can contain multiple sequences originating from the same gene (fragments, alternative splicing products). SMART in Normal mode uses such a database.
ORF
Open reading frame.
Outlier homologues
These are often difficult to detect using HMM methodology. A complementary approach to their
detection is to query a database of sequences taken from multiple sequence alignments using BLAST. Selecting this option will also activate searches against sequence databases derived from proteins of known structure. A simple BLAST search of the PDB is performed, together with a search of RPS-BLAST profiles derived from SCOP. These profiles were kindly provided by Steffen Schmidt (see Schmidt et al., J Chem Inf Comput Sci 2002 (42) 405-7).
P-value
This represents the probability that, given a database of a particular size, random sequences score higher than a value X. P-values are generated by the BLAST algorithm that has been integrated into SMART.
PDB (Protein Data Bank)
PDB is an archive of experimentally determined three-dimensional structures (Brookhaven National Laboratory or EBI). Domain families represented in SMART and in the PDB are annotated as being of known structure; links are provided in SMART to the PDB via PDBsum and MMDB. PDBsum links can be used to access a variety of sequence-based and structure-based tools, whereas MMDB provides access to literature information and structure similarities.
PFAM
Pfam is a database of protein domain families represented as (i) multiple alignments and (ii) HMM profiles ([1] [2]). Pfam WWW servers allow comparison of user-supplied sequences with the Pfam database (Sanger Centre and Washington University). SMART contains a facility to search the Pfam collection using HMMer.
Prokaryotic domains
SMART now also searches for domains found in two-component regulatory systems. These are found mainly in prokaryotes, but a few were also found in eukaryotes such as yeast and plants.
Profile
A profile is a table of position-specific scores and gap penalties, representing a homologous family, that may be used to search sequence databases (Ref. [1] [2] [3]). In CLUSTAL-W-derived profiles those sequences that are more distantly related are assigned higher weights ([4] [5] [6]). Issues in profile-based database searching are discussed in Bork & Gibson (1996) [7].
ProfileScan
An excellent WWW server that allows a user to compare a protein or DNA sequence against a database of profiles (located at the ISREC).
PROSITE
This is a dictionary of protein sites and motif patterns. Some SMART domain annotations contain links to PROSITE.
Schnipsel database (domain sequence database)
Schnipsel is a German word meaning snippet or fragment. The schnipsel database consists of the sequences of all domains found with SMART in NRDB. Outliers of a family often cannot be detected by a profile, yet are detectable by pairwise similarity to one or more established members of a sequence family. So searching against the schnipsel database gives complementary information to the profile searches.
Searched domains
In the first version of SMART only eukaryotic signalling domains could be searched. In 1998 this set was extended by prokaryotic signalling and extracellular domains. On the input page you can choose whether you want to search only cytoplasmic (prokaryotic + eukaryotic) domains, only extracellular domains, or all domains.
Secondary Literature
The secondary literature is derived by the following procedure: for each of the hand-selected papers referenced by a domain, 100 neighbouring papers are retrieved using Medline. If one of these neighbouring papers is referenced from more than two original papers, it is included in the secondary literature list.
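The selection rule for secondary literature amounts to a simple counting step, sketched below in Python. The retrieval of the 100 Medline neighbours is not shown; the input mapping, the identifiers and the function name are hypothetical and serve only to illustrate the "more than two original papers" criterion.

```python
from collections import Counter


def secondary_literature(neighbours_by_paper, more_than=2):
    """Select neighbouring papers linked to more than `more_than` original papers.

    `neighbours_by_paper` maps each hand-selected paper ID to the list of its
    neighbouring paper IDs (assumed to have been retrieved beforehand).
    """
    counts = Counter()
    for neighbours in neighbours_by_paper.values():
        counts.update(set(neighbours))  # count each original paper only once
    return sorted(paper for paper, n in counts.items() if n > more_than)


neighbours_by_paper = {
    "orig1": ["n1", "n2"],
    "orig2": ["n1", "n3"],
    "orig3": ["n1", "n2"],
}
print(secondary_literature(neighbours_by_paper))
```

In this toy example only "n1" is a neighbour of three original papers; "n2" is linked from two and is therefore not included.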
Seed Alignment
Alignment that contains only one of each pair of homologues that are represented in a CLUSTALW-derived phylogenetic tree linked by a branch of length less than a distance of 0.2 (see the related article).
SEG
A program of Wootton & Federhen [1] that detects regions of the query sequence that have low compositional complexity [2].
Sequence ID or ACC
Sequence identifiers or accession codes may be entered via the SMART homepage to initiate a query. You can use either Uniprot or Ensembl sequence identifiers.
Signalling Domains
The original set of domains used in SMART were collected as those that satisfied one or both of two criteria:
1. cytoplasmic domains that possess kinase, phosphatase, ubiquitin ligase or phospholipase enzymatic activities, or those that stimulate GTPase activation or guanine nucleotide exchange
2. cytoplasmic domains that occur in at least two proteins with different domain
organisations, of which one also contains a domain that satisfies criterion 1
These domains mostly mediate or regulate the transduction of an extracellular signal towards the nucleus, resulting in the initiation of a cellular response. More recently, prokaryotic two-component signalling domains have been added to the SMART set.
SignalP
This program predicts the presence and location of signal peptide cleavage sites in amino acid sequences (SignalP home page).
SMART (Simple Modular Architecture Research Tool)
The main goal in providing this tool is to allow automatic identification and annotation of domains in user-supplied protein sequences.
Species
Numbers of domains present in a variety of selected taxa (animal, archaea, bacteria, fungi, plants and protozoa) are shown in annotation pages.
SwissProt
The SwissProt database is an extensively annotated and non-redundant collection of protein sequences. SwissProt annotations have been mined for SMART-derived annotations of alignments.
TMHMM2
This program predicts the location and topology of transmembrane helices (TMHMM2).
Thresholds
For each of the domains found by SMART, a combination of thresholds is used to distinguish between true and false hits. The different thresholds are described in the SMART paper.
WiseTools
A package that is based on database searches using profiles. Profiles may be generated using PairWise and then compared with sequence databases using SearchWise. Scores are generated for alignments that match the whole of the profile (using a positive profile) or else that match at least part of the profile (using a negative profile). Only the top-scoring optimal alignment of each
sequence is reported, hence SMART relies on re-iterating the search for new repeats until none are reported that score above threshold. Score thresholds have been set manually that are considered to represent a score just above the top-scoring true negative. Additional thresholds have been estimated for domains that are repeated in single polypeptides. A more recent package allows comparison of DNA sequences at the level of their conceptual translations, regardless of sequence error and introns (see Wise2).
What's new
Changes from version 6.0 to 7.0
Full text search engine
Perform a full text search of SMART and Pfam domain annotations, plus the complete protein descriptions for Uniprot and Ensembl proteins.
metaSMART
Explore and compare domain architectures in various publicly available metagenomics datasets.
iTOL export and visualization
Domain architecture analysis results can be exported and visualized in the interactive Tree Of
Life. The new option can be found in the protein list function select list.
User interface cleanup
Various small changes to the UI, resulting in faster and easier navigation.
Changes from version 5.1 to 6.0
Metabolic pathways information
SMART domains and proteins available in the genomic mode now have basic metabolic pathways information. It is generated by mapping the genomic mode protein database to the KEGG orthologous groups.
Updated genomic mode protein database
The greatly expanded genomic mode database now includes sequences from 630 completely sequenced species.
Changes from version 5.0 to 5.1
SMART webservice
You can access SMART using the webservice. Check the WSDL file for details. The webservice is still under active development and only works for sequences and IDs which are in the SMART database (complete Uniref100 and stable Ensembl genomes).
SMART DAS server
The SMART DAS server is available at http://smart.embl.de/smart/das. It provides all the protein features from the SMART database (SMART domains, signal peptides, transmembrane regions and coiled coils) for all Uniref100 and Ensembl proteins.
If you need help with these services, or have questions or feedback, please contact the SMART team.
Changes from version 4.1 to 5.0
New protein database in Normal mode
SMART now uses Uniprot as the main source of protein sequences. All Ensembl proteomes (except pre-releases) are also included. To lower redundancy in the database, the following procedure is used:
o only one copy of 100% identical proteins is kept (different IDs are still available)
o each species' proteins are separated
o CD-HIT clustering with a 96% identity cutoff is performed on each species separately
o the longest member of each protein cluster is used as the representative
o only representative cluster members and single proteins (i.e. proteins which are not
members of any clusters) are used in all domain architecture queries and for domain counts in the annotation pages
Even though the number of proteins in the database is almost doubled (current version has around 29 million proteins), the redundancy should be minimal.
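Two of the steps above, dropping identical sequences and picking the longest member of each cluster, can be sketched in Python as follows. The clustering itself (CD-HIT) is not reimplemented here: the clusters are assumed to be given as lists of protein IDs, and all names and sequences are hypothetical examples rather than SMART's actual pipeline.

```python
def drop_identical(proteins):
    """Keep one entry per distinct sequence (the first ID seen is retained).

    `proteins` is a list of (protein_id, sequence) pairs from one species.
    """
    first_id_for_sequence = {}
    for protein_id, sequence in proteins:
        first_id_for_sequence.setdefault(sequence, protein_id)
    return {pid: seq for seq, pid in first_id_for_sequence.items()}


def cluster_representatives(clusters, sequences):
    """Return the longest member of each cluster as its representative.

    `clusters` is assumed to come from an external clustering run;
    single proteins are simply clusters of size one.
    """
    return [max(cluster, key=lambda pid: len(sequences[pid])) for cluster in clusters]


proteins = [("a", "MKT"), ("b", "MKT"), ("c", "MKTAYIAK"), ("d", "MKTAYI")]
unique = drop_identical(proteins)          # "b" duplicates "a" and is dropped
clusters = [["c", "d"], ["a"]]             # assumed clustering result
print(cluster_representatives(clusters, unique))
```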
Domain architecture invention dating
As a further step from the single domain to the understanding of multi-domain proteins, SMART now predicts the taxonomic class where the concept of a protein, that is its domain architecture, was invented. The domain architecture is defined as the linear order of all SMART domains in the protein sequence. To derive the point of its invention, all proteins with the same domain architecture are mapped onto NCBI's taxonomy. The last common ancestor of all organisms containing at least one protein with the domain architecture is defined as the point of its origin.
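The last-common-ancestor step can be illustrated with a minimal Python sketch. It assumes each organism's taxonomy is available as a root-first list of taxon names; the lineages shown are abbreviated examples, and the function is an illustration of the idea rather than SMART's implementation.

```python
def last_common_ancestor(lineages):
    """Return the deepest taxon shared by all lineages, or None if there is none.

    Each lineage is a root-first list of taxon names for one organism that
    contains at least one protein with the architecture of interest.
    """
    ancestor = None
    for taxa_at_depth in zip(*lineages):   # walk down all lineages in parallel
        if len(set(taxa_at_depth)) != 1:   # lineages diverge at this depth
            break
        ancestor = taxa_at_depth[0]
    return ancestor


lineages = [
    ["cellular organisms", "Eukaryota", "Metazoa", "Chordata"],
    ["cellular organisms", "Eukaryota", "Metazoa", "Arthropoda"],
    ["cellular organisms", "Eukaryota", "Fungi"],
]
print(last_common_ancestor(lineages))
```

Adding a single bacterial lineage to the list would move the predicted point of origin up to the root shared with the eukaryotes, which shows how sensitive the estimate is to individual outlying proteins.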
Changes from version 4.0 to 4.1
Two modes of operation: Normal and Genomic
For more details visit the change mode page.
Intrinsic protein disorder prediction
You can now include DisEMBL prediction of protein disorder in your search parameters. The results are displayed as blue regions in protein schematics. DisEMBL's HOTLOOPS and REM465 methods are currently used. Visit the DisEMBL page for more information on the method.
Catalytic activity check
SMART now includes data on specific requirements for catalytic activity for some domains (50 catalytic domains at the moment). If the required amino acids are not present in the predicted domain, it will be marked as Inactive. The domain annotation page will show details on which amino acids are missing, and links to the relevant literature.
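The idea behind the catalytic activity check can be sketched in Python. The data model here is an assumption made for illustration: required residues are given as 0-based positions within the predicted domain sequence together with the allowed amino acids, whereas a real implementation would map positions through the domain alignment. The names and the example sequences are hypothetical.

```python
def missing_catalytic_residues(domain_sequence, required):
    """List the required positions whose residue is absent or not allowed.

    `required` maps a 0-based position in the domain to the set of amino
    acids compatible with catalytic activity (hypothetical layout).
    """
    missing = []
    for position, allowed in sorted(required.items()):
        if position >= len(domain_sequence) or domain_sequence[position] not in allowed:
            missing.append(position)
    return missing


def activity_label(domain_sequence, required):
    """Mark the domain as Inactive if any required residue is missing."""
    return "Inactive" if missing_catalytic_residues(domain_sequence, required) else "Active"


required = {1: {"H"}, 2: {"D"}, 3: {"S"}}   # an invented catalytic triad
print(activity_label("GHDSK", required))     # all required residues present
print(activity_label("GHASK", required), missing_catalytic_residues("GHASK", required))
```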
Taxonomic trees
Architecture query results are now displayed as simple taxonomic trees. In addition to individual proteins, you can select any taxonomic node (or multiple nodes) and display all the proteins in those nodes. The Evolution section of domain annotation pages is now also represented as a taxonomic tree. The basic tree contains only several hand-picked representative species, with a link to the full tree.
User interface redesigned
SMART has been completely rewritten and all pages conform to the XHTML 1.0 Strict and CSS Level 2 standards. We recommend a modern standards-compliant browser for the best experience. Mozilla Firefox and Opera are our favorites.
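The catalytic activity check in this release can be pictured as a lookup of required residues at given positions. The sketch below is an assumption-laden illustration, not SMART's implementation: positions are taken as 1-based offsets within the predicted domain (the real mapping presumably goes through the domain alignment), and the function names and the 'Active' label are invented here; only 'Inactive' appears in the release notes.

```python
from typing import Dict, List, Set


def missing_catalytic_residues(domain_seq: str,
                               required: Dict[int, Set[str]]) -> List[int]:
    """Return the 1-based domain positions lacking an allowed residue."""
    missing: List[int] = []
    for pos, allowed in sorted(required.items()):
        # A position beyond the end of the domain counts as missing.
        residue = domain_seq[pos - 1] if 1 <= pos <= len(domain_seq) else None
        if residue not in allowed:
            missing.append(pos)
    return missing


def activity_label(domain_seq: str, required: Dict[int, Set[str]]) -> str:
    """'Inactive' if any required residue is absent, otherwise 'Active'."""
    return "Inactive" if missing_catalytic_residues(domain_seq, required) else "Active"
```

The list of missing positions corresponds to the detail the annotation page is said to show for an inactive domain.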
Changes from version 3.5 to 4.0
Intron positions are shown in protein schematics
For proteins that match any of the Ensembl predictions, SMART will show intron positions as vertical coloured lines in graphical representations. This information is retrieved from a pre-calculated mapping of Ensembl gene structures to protein sequences.
The vertical line at the end of the protein is not an actual intron but a mark to show that intron mapping was performed. If that is the only line, there are no introns annotated. If there is no line at all, there is no data available in Ensembl for that particular sequence.
You can switch off intron display on your SMART preferences page.
Alternative splicing information
Since SMART now incorporates Ensembl genomes, the Additional information page shows a list of alternative splices of the gene encoding the analyzed protein (if there are any). It is possible to either display SMART protein annotation for any of the alternative splices or get a graphical multiple sequence alignment of all of them.
Orthology information
There are 2 separate sets of orthologs for each Ensembl protein: 1:1 reciprocal best matches in other genomes, and orthologous groups with reciprocal best hits from all genomes analyzed (i.e. each of these proteins has exactly one ortholog in all 6 genomes). This data is displayed on the Additional information page.
Graphical multiple sequence alignments
Orthologous groups and different alternative splices can be displayed as graphical multiple sequence alignments. Proteins are aligned using ClustalW. Domains, intrinsic features and introns are mapped onto the alignment, with their positions adjusted according to gaps (black boxes).
Changes from version 3.4 to 3.5
Features of all Ensembl genomes are stored in SMART
You can use standard Ensembl protein identifiers (for example ENSP00000264122) and sequences in all queries.
Batch access
Quickly display results for hundreds of sequences. Currently limited to 500 sequences per access and 2000 per day.
Smarter handling of identical sequences
If there are multiple IDs associated with a particular sequence, an extra table will be displayed showing all of them.
Changes from version 3.3 to 3.4
Search structure based profiles using RPS-Blast
Clicking on the 'search schnipsel and structures' checkbox will now also initiate a search of profiles based on SCOP domain families using RPS-Blast. These profiles were kindly provided by Steffen Schmidt (see Schmidt et al., J Chem Inf Comput Sci 2002 (42) 405-7).
Improved Architecture analysis
In addition to standard Domain selection querying, it is now possible to do queries based on GO (Gene Ontology) terms associated with domains. In the first step, you get a list of domains matching the GO terms entered. After selecting the domains of interest from the list, proteins containing those domains are displayed. Use the Taxonomic selection box to limit the results based on taxonomic ranges.
Pfam domains are stored in the database
The SMART database now contains precomputed results for all Pfam domains. To use Pfam domains in the architecture queries, prepend the domain name with 'Pfam:' (for example, TyrKc AND Pfam:Fz AND TRANS).
Try our SMART Toolbar for the Mozilla web browser.
Changes from version 3.2 to 3.3
Fantastic new protein picture generator
Proteins are now displayed as dynamically generated PNG images. This means that you can download the entire protein representation as a single image. You have our permission to do this and use these diagrams in any way that you like; do acknowledge us though. All domain bubbles have been script-generated using The Gimp and its Perl-Fu extension. The script is GPL'd and is available for download.
Fancy colour alignments
All alignments we generate are coloured in using Leo Goodstadt's excellent CHROMA program. CHROMA is available for download, and it's great. You may experience some problems if you're using a klunky browser. This will be fixed when you change your browser.
Excellent transmembrane domains
These are now calculated using the fine TMHMM2 program, kindly provided by Anders Krogh and co-workers.
Selective SMART
We now store intrinsic features such as transmembrane domains and signal peptides in our database. This means that these can be queried for (e.g. SIGNAL AND TyrKc). The feature names are: signal peptide, SIGNAL; transmembrane domain, TRANS; coiled-coil, COIL.
Technical changes
SMART code has been modified to run under the Apache mod_perl module. SMART is now using Apache::DBI for persistent database connections. The database engine has been updated to PostgreSQL 7.1. These changes resulted in significant speed improvements.
Bug fixes, as usual
The show_many_proteins script now uses the POST method, so there is no longer a limit on the number of proteins you can display. Fixed a display problem with proteins having thousands of different representations (if you try to display those, make sure you have a good browser, like Mozilla or Opera, and a bunch of RAM).
Changes from version 3.1 to 3.2
Start page
There is now an indicator of the current database status in the header. If the database is down or is being updated, you'll know immediately.
Literature
Changed literature identifiers to PMID. New secondary literature generator that parses all neighbouring papers, not just the first 100.
Numerous small bug fixes and improvements.
Changes from version 3.0 to 3.1
Startup page
The start page now includes selective SMART and allows searching for keywords in the annotation of domains.
Schnipsel Blast
The results of a schnipsel Blast search are included in the bubble diagram. Additionally, PDB is searched.
Taxonomic breakdown
When selecting multiple proteins, e.g. via selective SMART or the annotation pages, an overview of the taxonomy of all species (tax break) is offered.
Links
Links don't contain version numbers. This allows stable links from external sources.
Selective SMART
Allows searching for multiple copies of domains and is case insensitive. You can now search for e.g. Sh3 AND sH3 AND sh3.
Annotation
You can align your query sequence to the SMART alignment using hmmalign.
Update of underlying database
SMART now uses PostgreSQL 6.5.2.
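The selective SMART behaviour noted in this release (multiple copies of a domain, case-insensitive names) can be illustrated with a small matcher. This is a minimal sketch, assuming a query is simply a list of names joined by AND and that repeating a name asks for that many copies; the function names are invented for the example and say nothing about how SMART evaluates queries internally.

```python
import re
from collections import Counter
from typing import Iterable


def parse_and_query(query: str) -> Counter:
    """Count how many copies of each (lower-cased) domain name a query asks for."""
    terms = re.split(r"\s+AND\s+", query.strip())
    return Counter(t.strip().lower() for t in terms if t.strip())


def matches(architecture: Iterable[str], query: str) -> bool:
    """True if the protein has at least the requested number of each domain."""
    have = Counter(name.lower() for name in architecture)
    need = parse_and_query(query)
    return all(have[name] >= count for name, count in need.items())
```

With this reading, "Sh3 AND sH3 AND sh3" is satisfied only by proteins carrying three or more SH3 domains, whatever the capitalisation used in the query.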
Changes from version 2.0 to 3.0
Digest output
SMART now only produces a single diagram representing a best interpretation of all the annotation that has been performed. A comprehensive summary of the results is also provided in table format.
Selective SMART
Selective SMART allows you to look for proteins with combinations of specific domains in different species or taxonomic ranges.
Alert SMART
The SMART database gets updated about once a week. If you are interested in specific domains, or combinations of them, in specific taxonomic ranges, you can use the SMART alerting service. This provides the identities of newly deposited proteins that match your query.
Domain queries
You can ask for proteins having the same domain order/composition as your query protein.
SMARTed Genomes
You can get the result of SMARTing the genomes of Caenorhabditis elegans and Saccharomyces cerevisiae via the annotation pages.
Faster PFAM searches
The PFAM searches now run on a PVM cluster.
Changes from version 1.03 to 2.0
Domain coverage
The original set of signalling domains has now been extended to include extracellular domains.
Default search method is now HMMer
The previously used underlying methodology of SMART (i.e. SWise) tended to over-extend gaps, leading to problems in defining domain borders. Although a SWise search of the SMART database remains possible (using the 'Wise searching SMART database' option available from the Home Page), the default searching method is now HMMer2. This allows improved statistical estimates (Expectation- or E-values) of the significance of a domain hit.
Complete rewrite of the annotation pages
As with the previous version (1.03), annotation pages provide information concerning domain functions and include hyperlinks to PubMed. However, we now offer automatically derived data showing the taxonomic range and the predicted cellular localisation of proteins containing the domain in question.
Literature database
In addition to the manually derived literature sources given in version 1.03, we now provide automatically derived secondary literature from the annotation pages. These are extracted from PubMed and are cross-linked to additional abstracts.
The CATH domain structure database: new protocols and classification levels give a more comprehensive resource for exploring evolution
Lesley H. Greene, Tony E. Lewis, Sarah Addou, Alison Cuff, Tim Dallman, Mark Dibley, Oliver Redfern, Frances Pearl, Rekha Nambudiry, Adam Reid, Ian Sillitoe, Corin Yeats, Janet M. Thornton and Christine A. Orengo
WHAT'S NEW
We report the latest release (version 3.0) of the CATH protein domain database (http://www.cathdb.info). There has been a 20% increase in the number of structural domains classified in CATH, up to 86 151 domains. Release 3.0 comprises 1110 fold groups and 2147 homologous superfamilies. To cope with the increases in diverse structural homologues being determined by the structural genomics initiatives, more sensitive methods have been developed for identifying boundaries in multi-domain proteins and for recognising homologues. The CATH classification update is now being driven by an integrated pipeline that links these automated procedures with validation steps that have been made easier by the provision of information-rich web pages summarising comparison scores and relevant links to external sites for each domain being classified. An analysis of the population of domains in the CATH hierarchy and several domain characteristics are presented for version 3.0. We also report an update of the CATH Dictionary of homologous structures (CATH-DHS), which now contains multiple structural alignments, consensus information and functional annotations for 1459 well-populated superfamilies in CATH. CATH is directly linked to the Gene3D database, which is a projection of CATH structural data onto ~2 million sequences in completed genomes and UniProt.
GENERAL INTRODUCTION
The number of new structures being deposited in the Protein Data Bank (PDB) continues to grow at a considerable rate. In addition, structures being targeted by worldwide structural genomics initiatives are more likely to be novel or only very remotely related to domains previously classified. Only 2% of structures currently solved by conventional crystallography or NMR are likely to adopt novel folds (see Figures 1 and 2). A higher proportion of new folds is expected among structures solved by structural genomics. Although the influx of more diverse structures and subsequent analysis will inform our understanding of how domains evolve, it has resulted in increasing lags between the numbers of structures being deposited and classified. In response to this situation, we have significantly improved our automated and manual protocols for domain boundary assignment and homologue recognition.
Figure 1. Annual decrease in the percentage of new structures classified in CATH which are observed to possess a novel fold. The raw data for years 1972–2005 was fit to a single exponential equation by nonlinear regression using Sigma Plot (SPSS Version
Figure 2. Annual proportion of protein structures deposited in the PDB which are classified in CATH, rejected or pending classification. The colour scheme reflects different categories of PDB chains. Black: not accepted by the CATH criteria. Red: unprocessed chains
Significant changes have been implemented in the CATH classification protocol to achieve a more highly automated system. First, a seamless flow of structures between the constituent programs has been achieved by building a pipeline which integrates web services for each major comparison stage in the classification (see Figure 3). Secondly, completely automatic decisions are now being made for new protein chains with close relatives already assigned in the CATH database. There are two situations that preclude the CATH update process from being fully automated. We rely on expert manual curation for particularly challenging protein domain boundary assignments (DomChop stage) and also for classifications of remote folds and homologues (HomCheck stage). These two manual stages will remain an integral part of the system (Figure 3).
Figure 3. Flow diagram of the CATH classification pipeline. This schematic illustrates the processes involved in classifying newly determined structures in CATH. The CATH update protocol workflow from new chain to assigned domain is split into two main processes
In this paper, we report our ongoing development of the automated procedures. These critical new features should better enable CATH to keep pace with the PDB (3) and facilitate its development. Key statistics on the domain structure populations and characteristics are also presented.
A REVISED CATH CLASSIFICATION HIERARCHY: CATHSOLID
In order to provide information on the sequence diversity between superfamily members, we have introduced additional levels into the CATH hierarchy. The CATH hierarchical classification scheme now consists of nine levels. Class is derived from secondary structure content, and Architecture describes the gross orientation of secondary structures, independent of connectivity. The Topology level clusters structures into fold groups according to their topological connections and numbers of secondary structures. The Homologous superfamilies cluster proteins with highly similar structures, sequences and/or functions (4,5). The new extension of the CATH classification system now includes five 'SOLID' sequence levels. S, O, L and I further divide domains within the H-level using multi-linkage clustering based on similarities in sequence identity (35, 60, 95 and 100%) (see Table 1). The D-level acts as a counter within the I-level and is appended to the classification hierarchy to ensure that every domain in CATH has a unique CATHsolid identification code (see Table 1). Specific details on the nature of the SOLID-levels can be found in the 'General Information' section of the CATH website, http://www.cathdb.info. CATH only includes experimentally determined protein structures with a 4 Å resolution or better, 40 residues in length or longer and having 70% or more side chains resolved.
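To make the nine-level identifier concrete, the sketch below splits a CATHsolid code into its levels and tests whether two domains share a classification down to a chosen level. It assumes the code is written as nine dot-separated integers in C.A.T.H.S.O.L.I.D order; the example codes are made up for illustration and the function names are not part of any CATH software.

```python
from typing import Dict

# Level letters in hierarchy order, as described in the text.
LEVELS = ("C", "A", "T", "H", "S", "O", "L", "I", "D")


def parse_cathsolid(code: str) -> Dict[str, int]:
    """Split a dot-separated CATHsolid code into a level -> number mapping."""
    parts = code.split(".")
    if len(parts) != len(LEVELS):
        raise ValueError(f"expected {len(LEVELS)} fields, got {len(parts)}")
    return dict(zip(LEVELS, (int(p) for p in parts)))


def share_level(code_a: str, code_b: str, level: str) -> bool:
    """True if both codes agree on every level down to and including `level`."""
    depth = LEVELS.index(level) + 1
    a, b = parse_cathsolid(code_a), parse_cathsolid(code_b)
    return all(a[name] == b[name] for name in LEVELS[:depth])
```

Two domains that agree at H but differ at S would, on this reading, be superfamily relatives that fall into different 35% sequence identity clusters.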
Table 1. CATH version 3.0 statistics
DOMAIN BOUNDARY ASSIGNMENTS
We have further improved our automated domain boundary prediction method, CATHEDRAL (4). This is used to search a newly determined multi-domain structure against a library of representative structures from different fold groups in the CATH database to recognise constituent domains. CATHEDRAL performs an initial rapid secondary structure comparison between structures, using graph theory to identify putative fold matches, which are then more carefully aligned using a slower, more accurate dynamic programming method. A new scoring scheme has been implemented which combines information on the class of the domains being compared, their sizes, similarity in the structural environments and the number of equivalent residues. A support vector machine is used to combine the different scores and select the best fold match for each putative domain in a new multi-domain protein.
Benchmarking against a set of 964 'difficult' multi-domain chains, whose 1593 constituent domains were remotely related to folds in CATH (<35% sequence identity) and originated from 245 distinct fold groups and 462 superfamilies, showed that 90% of domains within these chains could be assigned to the correct fold group, and for 78% of them the domain boundaries were within ±15 residues of boundaries assigned by careful manual validation. Larger variations in domain boundaries are often due to the fact that in many families significant structural variation can occur during evolution, so that distant relatives vary considerably in size. If no close relative has been classified in CATH, it is likely that the only CATHEDRAL match will be to a relative with significant structural embellishments, thus making it harder to determine the correct boundaries.
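The ±15 residue criterion used in the benchmark can be expressed as a simple comparison of predicted and manually assigned boundary positions. The sketch below is one possible reading only: it treats a domain's boundaries as a flat list of residue numbers and pairs them in sorted order, and how the benchmark actually handled discontinuous domains is not stated in the text.

```python
from typing import Sequence


def boundaries_agree(predicted: Sequence[int],
                     reference: Sequence[int],
                     tolerance: int = 15) -> bool:
    """True if every predicted boundary lies within `tolerance` residues
    of the corresponding reference boundary (paired in sorted order)."""
    if len(predicted) != len(reference):
        return False
    return all(abs(p - r) <= tolerance
               for p, r in zip(sorted(predicted), sorted(reference)))
```

A predicted domain spanning residues 5 to 130 would agree with a manual assignment of 1 to 120, while a prediction of 1 to 150 would not.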
Since domain boundary assignment of remote homologues is one of the most time-consuming stages in the classification, we combine multiple sources of information for each new structure on a web page to guide manual curation. Pages display scores from a range of algorithms, which include structure-based methods, CATHEDRAL (4), SSAP (6,7), DETECTIVE (8), PUU (9) and DOMAK (10); sequence-based methods such as hidden Markov models (HMMs) (11); and relevant literature. These pages are now viewable for information on putative boundaries for new multi-domain structures currently
being classified in CATH (e.g. http://www.cathdb.info/cgi-bin/cath/Chain.pl?chain_id=2g3aA).
For protein chains which are closely related to chains that are already chopped in CATH, an automated protocol has been developed (ChopClose). ChopClose identifies any previously chopped chains that have sufficiently high sequence identity and overlap with the query chain. Using SSAP (6), the query is aligned against each of these chains in turn, and in each case the domain boundaries are inherited across the alignment. The process of inheriting the boundaries often requires some adjustments to be made to account for insertions, deletions or unresolved residues. If the inheritance from one of the chains meets various criteria (SSAP score ≥80, sequence identity >80%, RMSD ≤6.0 Å, longest end extension ≤10 residues, etc.), then the resulting boundaries are used to chop the chain automatically (AutoChop). For cases where ChopClose's best result does not meet all the criteria for automatic chopping, it is provided as support information for a manual domain boundary assignment.
Refer to Figure 3 for the location of AutoChop/ChopClose within the CATH update protocol.
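The AutoChop decision described above amounts to a threshold check followed by a fallback to manual curation. The sketch below encodes only the four criteria that the text lists explicitly (the text ends the list with "etc.", so the real protocol applies more); the record type, the chain identifiers and the choice to rank candidates by SSAP score are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class InheritedChop:
    template_chain: str          # previously chopped chain (hypothetical ID)
    ssap_score: float            # SSAP score of the alignment
    sequence_identity: float     # percent
    rmsd: float                  # in Angstroms
    longest_end_extension: int   # residues


def meets_autochop_criteria(c: InheritedChop) -> bool:
    """Apply the four thresholds quoted in the text."""
    return (c.ssap_score >= 80
            and c.sequence_identity > 80
            and c.rmsd <= 6.0
            and c.longest_end_extension <= 10)


def decide(candidates: List[InheritedChop]) -> Tuple[str, InheritedChop]:
    """Return ('AutoChop', result) if any candidate passes, otherwise
    ('manual', best result) to be shown as support for manual chopping."""
    if not candidates:
        raise ValueError("no previously chopped relatives to inherit from")
    passing = [c for c in candidates if meets_autochop_criteria(c)]
    pool = passing or candidates
    # Assumption: the 'best' result is the one with the highest SSAP score.
    best = max(pool, key=lambda c: c.ssap_score)
    return ("AutoChop" if passing else "manual", best)
```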
NEW HOMOLOGUE RECOGNITION METHODS
We have assessed a number of HMM-based protocols for improving homologue recognition. A new protocol (Samosa), exploiting models built using multiple structure alignments to improve accuracy, gives some improvements in sensitivity (4–5%). However, a protocol exploiting an 8-fold expanded HMM library, based on sequence relatives of structural domains, gives an increase of nearly 10% in sensitivity (12). In addition, HMM–HMM-based approaches have been implemented using the PRC protocol of Madera and co-workers (http://supfam.org/PRC). These allow recognition of extremely remote homologues, some of which are not easily detected by the structure comparison methods (discussed further below). HMM-based database scans developed for the CATH classification protocol are collectively referred to as HMMscan below.
For some very remotely related homologues, confidence in an assignment can be improved by combining information from multiple prediction methods. We have investigated the benefits of using machine learning methods to do this automatically. A neural network was trained using a dataset of 14 000 diverse homologues (<35% sequence identity) and 14 000 non-homologous pairs, with data from different homologue comparison methods including structure comparison (CATHEDRAL, SSAP), sequence comparison (HMM–HMM) and information on functional similarity. The latter was obtained by comparing EC classification codes between close relatives of the distant homologues and using a semantic similarity scoring
scheme for comparing GO terms based on a method developed by Lord et al. (13). On a separate validation set of 14 000 homologous pairs and 14 000 non-homologous pairs, 97% of the homologues can be recognised at an error rate of <4%.
NEW UPDATE PROTOCOL
Automatic methods
Previously, CATH data was generated using a group of independent programs and flat files. Over the past two years, we have developed an update protocol for CATH that is driven by a suite of programs with a central library and a PostgreSQL database system. A classification pipeline has been established which links, in a completely automated fashion, the different programs that analyse the sequences and structures of both protein chains and domains. The CATH update protocol can essentially be divided into two parts: domain boundary assignment and domain homology classification (see Figure 3). The aim of the protocol is to minimise manual assignment and provide as much support as possible when manual validation is necessary.
Processing in both parts of the classification protocol is similar, requiring related meta-data and the triggering of the same automated algorithms. Methods include pairwise sequence similarity comparisons and scans by other homologue detection or fold recognition algorithms, such as HMMscan and CATHEDRAL, that provide data for either manual or automated assignment. Many of the automated steps in the protocol have been established as web services, and the pipeline integrates automated steps together with 'holding stages' in which domains are held prior to processing and await the completion of manual validation of predictions (see below).
Web pages to support manual validation
For each manual stage in the classification (domain boundary assignment, DomChop, and homologue recognition, HomCheck), we have developed a suite of web pages bringing together all available meta-data from prediction algorithms (e.g. DBS, CATHEDRAL and HMMscan for DomChop; CATHEDRAL and HMMscan for HomCheck) and information from the literature and from other family classifications with relevant data (e.g. Pfam). For each protein or domain shown on the pages, information on the statistical significance of matches is presented. The web pages will shortly be made viewable and will provide interim data on protein chains and domains not fully classified in CATH for biologists interested in any entries pending classification.
OVERVIEW OF THE CURRENT RELEASE (VERSION 3.0)
Assigning domain boundaries and relationships between protein structures is computationally challenging. Since the last CATH release (version 2.6), the number of domains in the CATH database has increased by 20% in version 3.0 and now totals 86 151. This is a more than 10-fold increase in the number of domains classified in CATH since its creation. Improvements in automation, and also in the web-based resources used to aid manual validation, have allowed us to increase the proportion of hard-to-classify structures processed in CATH, and this is reflected in a significant increase in the proportion of new folds in the database (now more than 1000). The detailed breakdown of numbers of domains in the nine CATH levels is given in Table 1. We conducted an analysis of the domains in version 3.0 and have derived statistics for several fundamental features.
Percentage of new topologies
An analysis of the percentage of new folds arising from the early 1970s to the present is shown in Figure 1. The number of new folds has been decreasing over time with respect to the number of new structures being deposited, and it can be seen that currently approximately 2% of new structures classified in CATH are observed to be novel folds. For comparison, the number of domain structures solved over time is also graphically represented in Figure 1.
Number of domains within a protein chain
Integral to the construction of the CATH database is designating domain boundaries. We conducted an analysis of the number of chains versus the number of domains in a chain. It is interesting to note that 64% of all protein structures currently solved and classified in CATH are single-domain chains (data not shown). The next most prevalent are two-domain chains (27%), and following this we find that the number of chains containing three or more domains rapidly decreases. The average size of the single-domain chains is 159 residues.
The CATH Dictionary of Homologous Superfamilies (CATH-DHS)
The CATH-DHS has also been recently updated. Data on structural similarity and superfamily variability is presented as a significant update to the Dictionary of Homologous Superfamilies (DHS) web resource (14). The DHS also provides functional annotations of domains within each H-level (superfamily) in CATH v2.5.1.
For each superfamily, pair-wise structural similarity scores between relatives, measured by SSAP, are presented. The DHS now contains 3307 multiple structural alignments for 1459 superfamilies. For each superfamily, multiple alignments are generated for all the relatives and also for subgroups of structurally similar relatives and sequence-similar relatives. Alignments are performed using the residue-based CORA algorithm (15) and presented both as CORAPLOTS (14) and in the form of a 2DSEC diagram (16), alongside co-ordinate data of the superposed structures in PDB format. Sequence representations of the alignments are available to download in FASTA format. In the CORAPLOT images of the multiple alignment, residues in each domain are coloured according to ligand binding and residue type. EquivSEC plots are also shown that describe the variability in orientation and packing between equivalent secondary structures (16).
To identify sequence relatives for CATH superfamilies, sequences from UniProt (17) were scanned against HMMs of all CATH domains (12). Homologous sequences were identified as those hits with an E-value <0.01 and a 60% residue overlap with the CATH domain. This protocol recognised over one million domain sequences in UniProt which could be integrated in the CATH-DHS. The harvested sequences in each superfamily were compared against other relatives by BLAST (18) to determine the pair-wise sequence identity and then clustered at appropriate levels of sequence identity (35 and 95%) using multi-linkage clustering. Information and links to other functional databases, ENZYME (19), GO (Gene Ontology Consortium, 2000), KEGG (20), COG (21) and SWISS-PROT (22), are also included by BLASTing the sequences from each superfamily against sequences provided by these resources. Only hits with 95% sequence identity and an 80% residue overlap were used to annotate sequences.
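The hit-selection rule for harvesting sequence relatives can be written as a small filter. This is a sketch under assumptions: the E-value cutoff is read as 0.01 from the damaged source text, 'residue overlap' is taken to mean the fraction of the CATH domain covered by the aligned region, and the record fields and identifiers are invented for the example.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class HmmHit:
    sequence_id: str        # UniProt-style accession (hypothetical values in tests)
    evalue: float
    aligned_residues: int   # domain residues covered by the hit
    domain_length: int      # length of the CATH domain behind the HMM


def is_homologue(hit: HmmHit,
                 max_evalue: float = 0.01,
                 min_overlap: float = 0.60) -> bool:
    """Keep hits that are both significant and cover enough of the domain."""
    overlap = hit.aligned_residues / hit.domain_length
    return hit.evalue < max_evalue and overlap >= min_overlap


def harvest(hits: Iterable[HmmHit]) -> List[str]:
    """Return the sequence IDs of the hits accepted as homologues."""
    return [h.sequence_id for h in hits if is_homologue(h)]
```

The same shape of filter, with 95% identity and 80% overlap as the thresholds, would describe the annotation-transfer step mentioned at the end of the paragraph.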
Recent analysis of structural and functional divergence in highly populated CATH superfamilies (>5 structural relatives with <35% sequence identity) has been undertaken using data from the DHS. The 2DSEC algorithm was used to analyse multiple structural alignments of families and identify highly conserved structural cores and secondary structure embellishments, or decorations, to the common core. In some large superfamilies, extensive embellishments were observed outside the core, and although these secondary structure insertions were frequently discontinuous in the protein chain, they were often co-located in 3D space (16). In many cases, manual inspection revealed that the embellishment had aggregated to form a larger structural feature that was modifying the active site of the domain or creating new surfaces for domain or protein interactions. Data collected in the DHS clearly shows a relationship between structural divergence within a superfamily, sequence divergence of this superfamily amongst predicted domains in the genomes, and the number of distinct functional groups that can be identified for the superfamily (see Figure 4).
Figure 4. Relationship between sequence variability, structural variability and functional diversity in CATH superfamilies. Structural variation in a CATH superfamily, as measured by the number of diverse structural subgroups (SSAP score <80 between groups)
LATERAL LINKS ACROSS THE CATH HIERARCHY TO CAPTURE EVOLUTIONARY DIVERGENCE AND EXPLORE THE STRUCTURAL CONTINUUM
Our analysis of structural divergence in CATH superfamilies (16) has revealed families where significant changes in the structures had occurred. In some cases, 5-fold differences in the sizes of domains were identified, and sometimes it was apparent that the 'folds' of these very diverse relatives had effectively changed. Therefore, in these superfamilies more than one fold group can be identified, effectively breaking the hierarchical nature of the CATH classification, which implies that each relative within a CATH homologous superfamily should belong to the same CAT fold group.
In addition, an 'all versus all' HMM–HMM scan between all superfamily representatives revealed several cases of extremely remote homologues which had been classified into separate superfamilies and yet match with significant E-values. Structure comparison had failed to detect the relationship between these superfamilies because the structural divergence of the relatives was so extreme, sometimes constituting a change in architecture as well as fold group. In these cases, homology was only suggested by the HMM-based scans and then manually validated by considering functional information and detailed evidence from literature. In order to capture information on these distant homologies, links have been created between the superfamilies, both on our web pages and in the CATH database. The data can now be found as a link from the CATH homepage (http://www.cathdb.info).
In the near future, we also plan to provide web pages presenting cases of significant structural overlaps between superfamilies or fold groups. For these cases, we are not currently able to find any additional evidence to support a distant evolutionary relationship, and these examples highlight the recurrence of large structural motifs between some folds and the existence of a structural continuum in some regions of fold space.
FEATURES
The CATH database can be accessed at http://www.cathdb.info. The web interface may be browsed or alternatively searched with PDB codes or CATH domain identifiers. There is also a facility for keyword searches. With the version 3.0 release, we now make the raw and processed data files available, which include, for example, CATH domain PDB files, sequences and dssp files, and they can be accessed through the CATH database main page. The Gene3D resource can be accessed through the CATH database or directly at http://www.cathdb.info/Gene3D. The DHS can be accessed through the CATH database or directly at http://www.cathdb.info/bsm/dhs.
Structural similarity is assessed using an automatic method (SSAP) (3,4), which scores 100 for identical proteins and generally returns scores above 80 for homologous proteins. More distantly related folds generally give scores above 70 (Topology or fold level), though in the absence of any sequence or functional similarity this may simply represent examples of convergent evolution, reinforcing the hypothesis that there exists a limited number of folds in nature (5,6).
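The SSAP score ranges quoted above can be turned into a rough lookup. The sketch below simply encodes those three bands; the labels are this example's own wording, and the text itself stresses that the bands hold only 'generally', so the output is a hint rather than a classification.

```python
def interpret_ssap(score: float) -> str:
    """Map an SSAP score (0-100) onto the bands described in the text."""
    if not 0 <= score <= 100:
        raise ValueError("SSAP scores range from 0 to 100")
    if score == 100:
        return "identical"
    if score > 80:
        return "likely homologous"
    if score > 70:
        return "similar fold"
    return "no clear structural similarity"
```

A score in the 'similar fold' band without supporting sequence or functional similarity may, as the text notes, reflect convergent evolution rather than common ancestry.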
Abstract
We report the latest release (version 1.4) of the CATH protein domains database (http://www.biochem.ucl.ac.uk/bsm/cath). This is a hierarchical classification of 13 359 protein domain structures into evolutionary families and structural groupings. We currently identify 827 homologous families in which the proteins have both structural similarity and sequence and/or functional similarity. These can be further clustered into 593 fold groups and 32 distinct architectures. Using our structural classification and associated data on protein functions stored in the database (EC identifiers, SWISS-PROT keywords and information from the Enzyme database and literature), we have been able to analyse the correlation between the 3D structure and function. More than 96% of folds in the PDB are associated with a single homologous family. However, within the superfolds, three or more different functions are observed. Considering enzyme functions, more than 95% of clearly homologous families exhibit either single or closely related functions, as demonstrated by the EC identifiers of their relatives. Our analysis supports the view that determining structures, for example as part of a 'structural genomics' initiative, will make a major contribution to interpreting genome data.
Introduction FOR CATH
The CATH classification of protein domain structures was established in 1993 (1) as a hierarchical clustering of protein domain structures into evolutionary families and structural groupings, depending on sequence and structure similarity. There are four major levels, corresponding to protein class, architecture, topology or fold, and homologous family (Fig. 1). Since 1995, information about these structural groups and protein families has been accessible over the Web (http://www.biochem.ucl.ac.uk/bsm/cath), together with summary information about each individual protein structure (PDBsum) (2).
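As a small illustration of the four levels, a CATH identifier can be split into its class, architecture, topology and homologous family numbers. This is only a sketch: it assumes identifiers are written as four dot-separated integers (for example 3.40.50.200), and the names `CathId`, `parse_cath_id` and `shared_level` are hypothetical helpers invented here, not part of any CATH software.

```python
from typing import NamedTuple


class CathId(NamedTuple):
    """The four major CATH levels, from most general to most specific."""
    class_: int
    architecture: int
    topology: int
    homologous_family: int


def parse_cath_id(text: str) -> CathId:
    """Split a dotted identifier such as '3.40.50.200' into its four levels."""
    parts = text.strip().split(".")
    if len(parts) != 4:
        raise ValueError(f"expected four dot-separated numbers, got {text!r}")
    return CathId(*(int(p) for p in parts))


def shared_level(a: CathId, b: CathId) -> str:
    """Name the deepest level at which two domains fall in the same group."""
    names = ["none", "class", "architecture", "topology", "homologous family"]
    depth = 0
    for x, y in zip(a, b):
        if x != y:
            break
        depth += 1
    return names[depth]
```

Two domains that agree on the first three numbers but differ in the fourth share a fold (topology) but are not placed in the same homologous family.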
CATH consists of both phylogenetic and phenetic descriptors for protein domain relationships. At the lowest levels in the hierarchy, proteins are grouped into evolutionary families (Homologous families) for having either significant sequence similarity (≥35% identity) or high structural similarity and some sequence similarity (≥20% identity).
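The grouping criteria can be written down as a simple decision rule. The sketch below uses only the thresholds quoted in these notes (35% and 20% sequence identity, SSAP scores of 80 and 70); the function name `classify_pair` is hypothetical, and the real CATH protocol also draws on functional evidence and manual validation, which this rule ignores.

```python
def classify_pair(seq_identity: float, ssap_score: float) -> str:
    """Classify a pair of domains from percent identity and SSAP score.

    seq_identity is a percentage (0-100); ssap_score is the SSAP
    structural similarity score (0-100). Thresholds are those quoted
    in the text, applied here as inclusive lower bounds (an assumption).
    """
    if seq_identity >= 35:
        # Significant sequence similarity on its own
        return "homologous family"
    if ssap_score >= 80 and seq_identity >= 20:
        # High structural similarity plus some sequence similarity
        return "homologous family"
    if ssap_score >= 70:
        # Structural similarity only: same fold (topology) level
        return "same fold"
    return "unrelated"
```

For instance, a pair with 25% identity and an SSAP score of 85 falls into the same homologous family, whereas 10% identity with an SSAP score of 75 only supports a shared fold.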
The CATH database of protein structures contains approximately 18 000 domains organized according to their (C)lass, (A)rchitecture, (T)opology and (H)omologous superfamily. Relationships between evolutionarily related structures (homologues) within the database have been used to test the sensitivity of various sequence search methods in order to identify relatives in GenBank and other sequence databases. Subsequent application of the most sensitive and efficient algorithms, gapped BLAST and the profile-based method Position-Specific Iterated Basic Local Alignment Search Tool (PSI-BLAST), could be used to assign structural data to between 22% and 36% of microbial genomes in order to improve functional annotation and enhance understanding of biological mechanism. However, on a cautionary note, an analysis of functional conservation within fold groups and homologous superfamilies in the CATH database revealed that, whilst function was conserved in nearly 55% of enzyme families, function had diverged considerably in some highly populated families. In these families functional properties should be inherited far more cautiously, and the probable effects of substitutions in key functional residues must be carefully assessed.
Figure 1
Schematic representation of the (C)lass, (A)rchitecture and (T)opology/fold levels in the CATH database.
Figure 2
Snapshot of a web page showing data available in the CATH dictionary of homologous superfamilies for the subtilisin family (CATH id 3.40.50.200). Tables display the PDB codes for non-identical relatives in the family together with EC identifier codes and information about the enzyme reactions. The multiple structural alignment shown has been coloured according to secondary structure assignments (red for helix, blue for strands).
The Architecture level in CATH groups proteins whose folds have similar 3D arrangements of secondary structures (e.g. barrel, sandwich or propellor), regardless of their connectivity, whilst the top level, Class, simply reflects the proportion of α-helix or β-strand secondary structures. Three major classes are recognised: mainly-α, mainly-β and α-β, since analysis revealed considerable overlap between the α+β and alternating α/β classes originally described by Levitt and Chothia (7).
Before classification, multidomain proteins are first separated into their constituent folds using a consensus method which seeks agreement between three independent algorithms (8). Whilst the protocol for updating CATH is largely automatic (9), several stages require manual validation, in particular establishing domain boundaries in proteins for which no consensus could be reached and checking the relationships of very distant homologues and proteins having borderline fold similarity. Although there are plans to assign the more regular architectures automatically, all architecture groupings are currently assigned manually.
A homologous family Dictionary is now available within CATH, which contains functional data, where available, for each protein within a homologous family. This includes EC identifiers, SWISS-PROT keywords and information from the Enzyme database or the literature (Fig. 2). Multiple structure-based alignments are also available, coloured according to secondary structure assignments or residue properties, and there are schematic plots showing domain representations annotated by protein-ligand interactions (DOMPLOTS) (A.E. Todd, C.A. Orengo and J.M. Thornton, submitted to Protein Engng). The topology of each domain is illustrated by schematic TOPS diagrams (http://www3.ebi.ac.uk/tops; 10).
Figure 3
CATH wheel plot showing the population of homologous families in different fold groups, architectures and classes. The wheel is coloured according to protein class (red, mainly-α; green, mainly-β; yellow, α-β; blue, few secondary structures). The size of the outer wheel represents the number of homologous families in CATH, whilst each band in the outer wheel corresponds to a single fold family. The size of each fold 'band' therefore reflects the number of homologous families having that fold. It can be seen that most fold families contain a single homologous family. The superfold families are shown as paler bands containing many homologous families. The inner wheel shows the population of homologous families in the different architectures.
We have also recently set up a Web Server (11) which enables the user to scan the CATH database with a newly determined protein structure and identify possible fold similarities or evolutionary relationships. There are also plans to incorporate sequence searches (using BLAST or PSI-BLAST) (12) to identify a probable fold for a new sequence.
The latest release of CATH (version 1.4, April 1998) contains 9342 protein chains from the PDB (13), which divide into 13 359 domain folds. Currently 32 different architectures are recognised. Since the last release three new architectures have been described, including the five-bladed α-β propellor. Grouping proteins on the basis of sequence, structure and functional similarity gives 827 evolutionary homologous families (H-level), whilst recognising more distant structural similarity with no accompanying sequence or function similarity gives rise to 593 different fold groups (T-level).
The population of the different levels in the CATH hierarchy is illustrated by the CATH wheel shown in Figure 3. It can be seen that several highly populated fold families, which we describe as superfolds (6) as they support a diverse range of sequences and more than three different functions, still account for nearly 30% of non-homologous structures.
Implications for Structural Genomics
As the sequence databases grow rapidly, the need to interpret these sequences and assign functions to specific genes becomes increasingly important. Many techniques exist for matching protein sequences and thereby inheriting functional information. However, for very distant homologues there is often no detectable sequence similarity despite conservation of 3D structure and function. For these cases evolutionary relationships, and thereby functions, can only be assigned by comparing the structures. Therefore a number of structural genomics initiatives are being proposed (14) which aim to identify all the folds in nature, with the ultimate goal of being able to predict the function of a new protein from its known or probable structure. The important questions to ask are: how many more folds do we need to determine before we have the complete set, and how confident can we be in assigning function between proteins having similar structures?
In the current genomes, on average only 30-46% of sequences can be assigned to a structural family by recognising sequence similarity to a protein of known structure (15,16). With only ~600 unique structures currently in the PDB compared with ~20 000 sequence families, it is clear that we still need to determine many more structures if we are to understand biology at the molecular level.
However, analysis of recently deposited structural data is very revealing. Figure 4a illustrates the distribution of 2159 new structural domains classified in the 10 months from June 1997 to March 1998. A large proportion of these (79%) were clearly homologous (≥30% identity) to proteins of known structure.
Of the remaining 443 structures (Fig. 4b), corresponding to 'new' sequences, we found only 8% were novel folds, the remainder resembling a previously determined structure. Many of these, 199 (45%), could be identified as clear homologues by having significant structure and sequence similarity (SSAP ≥80 and ≥20% sequence identity). A further 169 (38%) were probable homologues as, although the sequence identity was below 20%, they had functional similarity and/or gave significant scores using sequence search methods designed to detect very distant homologues (PSI-BLAST) (12). There remained a further 40 (9%) proteins which were analogous, i.e. they had the same fold as a previous entry but neither the sequence nor the function gave definite evidence of a common ancestor.
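The percentages quoted for Figure 4 can be checked from the counts given in the text. In the sketch below the count of novel folds is not stated in the source; it is derived here as the remainder of the 443 structures, and the variable and function names are illustrative only.

```python
new_domains = 2159    # domains classified June 1997 to March 1998
new_sequences = 443   # those not clearly homologous by sequence alone

clear_homologues = 199
probable_homologues = 169
analogues = 40
# Derived, not quoted in the text: whatever is left must be novel folds.
novel_folds = new_sequences - (clear_homologues + probable_homologues + analogues)


def percent(part: int, whole: int) -> int:
    """Percentage rounded to the nearest whole number."""
    return round(100 * part / whole)


summary = {
    "homologous by sequence": percent(new_domains - new_sequences, new_domains),
    "clear homologues": percent(clear_homologues, new_sequences),
    "probable homologues": percent(probable_homologues, new_sequences),
    "analogues": percent(analogues, new_sequences),
    "novel folds": percent(novel_folds, new_sequences),
    "assigned as homologues": percent(
        clear_homologues + probable_homologues, new_sequences
    ),
}
```

The combined share of clear and probable homologues (368 of 443) is the fraction of structures with new sequences that could be assigned to a known family using structural data.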
At the homologous superfamily level in CATH, a more detailed analysis of enzyme functions showed that the majority of homologous enzyme families in CATH (>90%) contained proteins for which the first three EC identifiers were the same. Considering those families where homologues have significant sequence identity (≥20%) after structural alignment, 95% were found to have a single EC identifier, whilst for families where proteins have more than 30% sequence similarity we observed that 98% had a single EC code.
Although assigning function on the basis of homology is common practice, it is clear that some caution should be exercised, particularly where there is little or no sequence similarity. There are also some clear examples where homologues with significant sequence similarity perform different functions. The role of 'gene recruitment' is especially clear in the eye lens proteins, which function as enzymes in other cellular environments but which are used as structural proteins in this context (17). The extent of such 'gene recruitment' and context-sensitive function is really not known at this time. For enzymes it is clear that catalytic function can change and evolve, usually to act on a different but related substrate. Similarly, within the lipocalin family (CATH id 2.40.130.10) several proteins are found with very similar structures which bind different fatty acids in the same region at the base of the β-barrel (e.g. retinol, bilin, biotin).
Nearly half of the homologous families where two or more different EC numbers were observed belong to the superfolds. This suggests that if a new protein is assigned to a superfold family, more caution should be used when inheriting functional information, as there appears to be greater tolerance to changes in sequence, and ultimately function, for these families. However, it is interesting to note that many of these were TIM barrel or Rossmann folds. These are superfolds in which the substrate or ligand commonly binds in the same place: in the base of the β-barrel for the TIMs and at the crossover of the polypeptide chain for the doubly wound Rossmann structures.
Assignment of Function Through Structure
One of the reasons for determining structures is to derive more information to facilitate the assignment of function. From our analysis of proteins in CATH we suggest that structural data can help to assign function in several ways:
i. The structural data allow recognition of more distant homologues compared with sequence data; in our analysis 83% of structures with novel sequences could be assigned as homologues in this way (note that such assignment of function is again subject to the caveats imposed by 'gene recruitment' discussed above).
ii. The structural data allow detailed inspection of the functional site, to suggest if and how the function may have evolved. For example, if an enzyme has evolved to act on a different substrate, the binding site may reveal or at least suggest possible changes in the substrate.
iii. For the superfolds, similarity of structure does not necessarily mean similarity of function. However, the active site/binding sites are often conserved, e.g. in the TIM barrel or Rossmann fold structures the ligand always binds at the same end of the barrel or sheet.
iv. Some methods have already been developed, and will increasingly be the focus of attention over the next few years, which aim to predict function ab initio from structure. For example, enzymes can often be identified by the presence of a major cleft which also locates the active site (18). Similarly, critical surface patches which are used for molecular recognition in binding other proteins or ligands may be identified using knowledge-based approaches (19,20).
In summary, extrapolating the data from Figure 4 to a new genome, we can expect that of the 54-70% of sequences which currently have no obvious sequence matches in the PDB, we will find nearly 80-90% to be homologous to a known family using the structural data alone. For the singlet folds this will almost certainly reveal some clues to the function. For the superfolds, some folds will reveal information on the functional class (e.g. enzyme for TIM barrels) or the location of the active site, if not the specific function. Only 10-20% will be expected to be 'novel' folds. For these the ab initio methods referred to above may provide some clues to guide experiments. Therefore it is clear that determining structures, as part of a 'structural genomics' initiative for example, will make a major contribution to interpreting genome data.
JAIL
Just Another Interface Library
Interfaces of macromolecules are a valuable basis to analyse the process of molecular recognition. JAIL classifies not only the interfaces between domain architectures but also those between protein chains and those between proteins and nucleic acids.
Figure: Gt-alpha/Gi-alpha chimera (PDB-ID 1GOT) and the interfaces of 1GOT
Interacting proteins are difficult to crystallize and rarely present within the Protein Data Bank. Nevertheless it is essential to analyse the interacting parts of the proteins to understand the process of protein-protein docking.
To overcome this problem we have built up the JAIL database. Since interacting domains exhibit structural features similar to those of interacting proteins, all known interfaces between interacting domains of the SCOP database were extracted and classified in JAIL.
Only a part of all protein structures is included in SCOP. In particular, new PDB entries are not yet annotated. To overcome this problem, all interfaces between protein chains were additionally calculated and included in the database. This type of interface also comprises the interacting parts of the assumed biological units. The last important type of interface provided here is composed of the interacting parts between proteins and nucleic acids. Overall the data set consists of about 180 000 interfaces.
JAIL is a convenient tool for browsing the interface library and analysing single interfaces. However, more general questions require large-scale analysis. For this purpose a detailed form enables the compilation of comprehensive non-redundant data sets for download.
How is an interface defined?
A complete residue is part of an interface if at least one atom of the amino acid is located within a range of 4.5 Ångström of any atom of the interacting domain or chain. One part of an interface must consist of at least 5 C-alpha atoms in the case of protein chains. In the case of nucleic acids the relevant atoms are calculated on the basis of the phosphorus atoms of the RNA/DNA backbone.
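The distance rule for protein chains can be sketched in a few lines. The code below is not JAIL's implementation: it assumes a cutoff of 4.5 Ångström, represents each chain as a plain dictionary from a residue label to that residue's atom coordinates, and applies the minimum-size rule by assuming one C-alpha atom per residue. The names `interface_residues` and `large_enough` are hypothetical.

```python
from math import dist
from typing import Dict, List, Tuple

Coord = Tuple[float, float, float]
# A chain maps a residue label to the coordinates of that residue's atoms.
Chain = Dict[str, List[Coord]]

CUTOFF = 4.5  # Angstroms; assumed value of the distance criterion


def interface_residues(chain: Chain, partner: Chain,
                       cutoff: float = CUTOFF) -> List[str]:
    """Residues of `chain` with at least one atom within `cutoff`
    of any atom of `partner`. The whole residue is selected."""
    partner_atoms = [atom for atoms in partner.values() for atom in atoms]
    selected = []
    for residue, atoms in chain.items():
        if any(dist(a, b) <= cutoff for a in atoms for b in partner_atoms):
            selected.append(residue)
    return selected


def large_enough(residues: List[str], minimum: int = 5) -> bool:
    """Minimum-size rule, assuming one C-alpha atom per residue."""
    return len(residues) >= minimum
```

This all-against-all comparison is quadratic in the number of atoms; a production tool would normally use a spatial index, which is omitted here for brevity.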
What are biounits?
The primary coordinate file deposited in the PDB generally contains one asymmetric unit. The asymmetric unit is the smallest portion of a crystal structure to which crystallographic symmetry can be applied to generate one unit cell. The biological molecule (biounit) is believed to be the functional unit of the protein. Frequently those units can be assumed or calculated when additional information is available. The biological units of many proteins are deposited in a separate section of the PDB database and can be used for interface calculations.
In which way are redundant interfaces excluded in the download section?
Redundancy is excluded in two different ways: by structure and by sequence. The sequential clustering is based on the CD-HIT program. The structural clustering is defined by the protein families and superfamilies of the SCOP classification. The database classifies proteins by domain architecture.
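To make the idea of sequence-based redundancy removal concrete, here is a toy version of greedy incremental clustering, the general strategy CD-HIT follows: sequences are processed from longest to shortest, and each one either joins the first representative it resembles closely enough or becomes a new representative. The `identity` function is a deliberately crude stand-in (ungapped, position-by-position) and is not how CD-HIT measures identity; both function names are hypothetical.

```python
from typing import Dict, List


def identity(a: str, b: str) -> float:
    """Crude ungapped identity: matching positions over the shorter length.

    An assumption for illustration only; real tools align the sequences.
    """
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / n


def greedy_cluster(seqs: List[str], threshold: float) -> Dict[str, List[str]]:
    """Greedy incremental clustering keyed by representative sequence."""
    clusters: Dict[str, List[str]] = {}
    for seq in sorted(seqs, key=len, reverse=True):
        for rep in clusters:
            if identity(seq, rep) >= threshold:
                clusters[rep].append(seq)
                break
        else:
            # No representative is similar enough: start a new cluster.
            clusters[seq] = [seq]
    return clusters
```

Keeping one representative per cluster yields the non-redundant set. Lowering the threshold merges more sequences into each cluster, so fewer but more diverse representatives remain.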
Which settings in the download section are best for my own research?
The selection of the datasets depends on the type of interactions (protein-protein or protein-nucleic acid) and the level of diversity that is desired. A maximum sequence identity of 50% results in a higher diversity than a setting of 95%. The default settings include interfaces of domain-domain interactions as well as interfaces between interacting chains. All interfaces of chains that were already treated by the SCOP domain interfaces are excluded by default. This procedure results in a high number of interfaces that are still diverse enough for statistical analysis.
What is meant by 'show conservation' in Jmol?
The conservation of protein sequences is defined by the mutation rates at each amino acid position. For JAIL this information was retrieved from ConSurf. ConSurf is a derived database merging structural and sequence information.
Database scheme
Search for
PDB-Id, e.g. 1aay
SCOP-Id, e.g. d1az0a_
EC-Number, e.g. 3.1.1
Accession number, e.g. P03697
Protein name, e.g. capsid protein
Search in the following interface types
Domain/Domain (SCOP)
Chain/Chain
Protein/Nucleic
Biounit/Biounit
None
Full-text search by keyword
SCOP text search by keyword
Consider only interfaces having the following location
intra, inter, don't care
MMDB
Experimentally resolved structures of proteins, RNA and DNA, derived from the Protein Data Bank (PDB), with value-added features such as explicit chemical graphs, computationally identified 3D domains (compact substructures) that are used to identify similar 3D structures, as well as links to literature, similar sequences, information about chemicals bound to the structures, and more. These connections make it possible, for example, to find 3D structures for homologs of a protein sequence of interest, then interactively view the sequence-structure relationships, active sites, bound chemicals, journal articles, and more.
Three-dimensional structures are now known within many protein families and it is quite likely, in searching a sequence database, that one will encounter a homolog with known structure. The goal of Entrez's 3D-structure database is to make this information, and the functional annotation it can provide, easily accessible to molecular biologists. To this end Entrez's search engine provides three powerful features: (i) Sequence and structure neighbors: one may select all sequences similar to one of interest, for example, and link to any known 3D structures. (ii) Links between databases: one may search by term matching in MEDLINE, for example, and link to 3D structures reported in these articles. (iii) Sequence and structure visualization: identifying a homolog with known structure, one may view molecular-graphic and alignment displays to infer approximate 3D structure. In this article we focus on two features of Entrez's Molecular Modeling Database (MMDB) not described previously: links from individual biopolymer chains within 3D structures to a systematic taxonomy of organisms represented in molecular databases, and links from individual chains (and compact 3D domains within them) to structure neighbors, other chains (and 3D domains) with similar 3D structure. MMDB may be accessed at http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Structure.
SUPERFAMILY is a database of structural and functional annotation for all proteins and genomes.[1][2][3][4][5] The SUPERFAMILY annotation is based on a collection of hidden Markov models which represent structural protein domains at the SCOP superfamily level.[6] A superfamily groups together domains which have an evolutionary relationship. The annotation is produced by scanning protein sequences from completely sequenced genomes against the hidden Markov models.
For each protein you can:
Submit sequences for SCOP classification
View domain organisation, sequence alignments and protein sequence details
For each genome you can:
Examine superfamily assignments, phylogenetic trees, domain organisation lists and networks
Check for over- and under-represented superfamilies within a genome
For each superfamily you can:
Inspect SCOP classification, functional annotation, Gene Ontology annotation, InterPro abstract and genome assignments
Explore taxonomic distribution of a superfamily across the tree of life
All annotation, models and the database dump are freely available for download to everyone.
Purpose
SUPERFAMILY classifies amino acid sequences into known structural domains, especially into SCOP superfamilies. The superfamilies are groups of proteins which have structural evidence to support a common evolutionary ancestor but may not have detectable sequence homology.
Major Features
Sequence search: Submit your protein or DNA sequence for SCOP superfamily and family level classification.
Keyword search: Search for superfamily, family or species names plus sequence, SCOP, PDB or hidden Markov model IDs.
Domain assignments: Domain assignments, alignments and architectures for completely sequenced eukaryotic and prokaryotic organisms, plus sequence collections.
Comparative genomics tools: Browse unusual (over- and under-represented) superfamilies and families, adjacent domain pair lists and graphs, unique domain pairs, domain combinations, domain architecture co-occurrence networks and domain distribution across taxonomic kingdoms for each organism.
Genome statistics: For each genome: number of sequences, number of sequences with assignment, percentage of sequences with assignment, percentage total sequence coverage, number of domains assigned, number of superfamilies assigned, number of families assigned, average superfamily size, percentage produced by duplication, average sequence length, average length matched, number of domain pairs and number of unique domain architectures.
Gene Ontology: Domain-centric Gene Ontology (GO) automatically annotated by Hai Fang.
Phenotype Ontology: Domain-centric phenotype/anatomy ontology including Disease Ontology, Human Phenotype, Mouse Phenotype, Worm Phenotype, Yeast Phenotype, Fly Phenotype, Fly Anatomy, Zebrafish Anatomy, Xenopus Anatomy and Arabidopsis Plant.
Superfamily annotation: InterPro abstracts for 1052 superfamilies and Gene Ontology (GO) annotation for 763 superfamilies.
Functional annotation: Functional annotation of SCOP 1.73 superfamilies by Christine Vogel.
Phylogenetic trees: Trees are generated using heuristic parsimony methods and are based on protein domain architecture data for all genomes in SUPERFAMILY. Genome combinations or specific clades can be displayed as individual trees.
Similar domain architectures: Find the 10 domain architectures which are most similar to a domain architecture of interest.
Hidden Markov models: Produce SCOP domain assignments for your sequences using the SUPERFAMILY models. HMM visualisation by Martin Madera, e.g. model 0045110.
Profile comparison: Find remote domain matches when the HMM search fails to find a significant match. Profile comparison (PRC) for aligning and scoring two profile hidden Markov models, by Martin Madera.
Web services: Distributed Annotation Server and linking to SUPERFAMILY.
Downloads: Sequences, assignments, models, MySQL database and scripts, updated weekly.
The SUPERFAMILY database provides protein domain assignments, at the SCOP superfamily level, for the predicted protein sequences in over 400 completed genomes. A superfamily groups together domains of different families which have a common evolutionary ancestor based on structural, functional and sequence data. SUPERFAMILY domain assignments are generated using an expert-curated set of profile hidden Markov models. All models and structural assignments are available for browsing and download from http://supfam.org. The web interface includes services such as domain architectures and alignment details for all protein assignments, searchable domain combinations, domain occurrence network visualization, detection of over- or under-represented superfamilies for a given genome by comparison with other genomes, assignment of manually submitted sequences, and keyword searches. In this update we describe the SUPERFAMILY database and outline two major developments: (i) incorporation of family level assignments and (ii) a superfamily-level functional annotation. The SUPERFAMILY database can be used for general protein evolution and superfamily-specific studies, genomic annotation, and structural genomics target suggestion and assessment.
Signalling
Proteins that participate in a pathway that is initiated by a molecular stimulus and terminated by a cellular response.
Transport
Transporters (permeases) are proteins that straddle the cell membrane and carry specific nutrients, ions, etc. across the membrane.
Translation
The process in which the genetic code carried by messenger RNA directs the synthesis of proteins from amino acids.
Transcription
The synthesis of an RNA copy from a sequence of DNA (a gene); the first step in gene expression.
Coiled coils
Intimately-associated bundles of long alpha-helices ([1] [2] [3]). Coiled coils are detected in SMART using the method of Lupas et al. ([4], COILS home). Coiled coil predictions are indicated on the second line in SMART's graphical output.
Domain
Conserved structural entities with distinctive secondary structure content and a hydrophobic core. In small disulphide-rich and Zn2+-binding or Ca2+-binding domains the hydrophobic core may be provided by cystines and metal ions, respectively. Homologous domains with common functions usually show sequence similarities.
Domain composition
Proteins with the same domain composition have at least one copy of each of the domains of the query.
Domain organisation
Proteins having all the domains of the query in the same order. (Additional domains are allowed.)
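The difference between the two definitions is easy to see in code. In this sketch a protein is just an ordered list of domain names; composition ignores order, while organisation requires the query domains to appear in order. It assumes that the additional domains allowed by the definition may occur anywhere, including between query domains, and the function names and domain labels are illustrative only.

```python
from typing import Sequence


def same_composition(query: Sequence[str], protein: Sequence[str]) -> bool:
    """True if the protein has at least one copy of every query domain."""
    return set(query) <= set(protein)


def same_organisation(query: Sequence[str], protein: Sequence[str]) -> bool:
    """True if the query domains occur in the protein in the same order.

    Extra domains in the protein are permitted. The iterator is consumed
    as matches are found, so each query domain must appear after the
    previous one.
    """
    remaining = iter(protein)
    return all(domain in remaining for domain in query)
```

A protein with domains Kinase, SH2, SH3 has the same composition as the query SH3, SH2, Kinase but not the same organisation, because the order differs.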
Entrez
A WWW-based system that allows easy retrieval of sequence, structure, molecular biology and literature data (Entrez). SMART's domain annotation pages contain links to the Entrez system, thereby providing extensive literature, structure and sequence information.
E-value
This represents the number of sequences with a score greater than or equal to X expected absolutely by chance. The E-value connects the score (X) of an alignment between a user-supplied sequence and a database sequence, generated by any algorithm, with how many alignments with similar or greater scores would be expected from a search of a random sequence database of equivalent size. Since version 2.0, E-values are calculated using Hidden Markov Models, leading to more accurate estimates than before.
Extracellular Domains
Domain families that are most prevalent in proteins outside of the cytoplasm and the nucleus
Gap
A position in an alignment that represents a deletion within one sequence relative to another. Gap penalties are requirements for alignment algorithms in order to reduce excessively-gapped regions. Gaps in alignments represent insertions that usually occur in protruding loops or beta-bulges within protein structures.
Genomic database
Protein database used in SMART's Genomic mode. It contains data from completely sequenced genomes only. Ensembl data is used for Metazoan genomes and Swiss-Prot for others. A complete list of genomes in the database is available.
HMM Hidden Markov model
HMMs are statistical models of the sequence consensus of a homologous family (see the documentation). A particular class of HMMs has been shown to be equivalent to generalised profiles (8867839). Applications of HMMs to sequence analysis are nicely provided by HMMer and SAM.
HMM consensus
The HMM consensus is a one-line summary of the corresponding HMM. The amino acid shown for the consensus is the highest probability amino acid at that position according to the HMM. Capital letters mean highly conserved residues (probability > 0.5 for protein models) (modified from the HMMer User's Guide).
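The consensus rule can be sketched directly from this description. The code below is an illustration rather than HMMer's own code: it assumes each match position of the model is given as a dictionary of residue emission probabilities and uses 0.5 as the capitalisation threshold for protein models; `hmm_consensus` is a hypothetical name.

```python
from typing import Dict, List


def hmm_consensus(emissions: List[Dict[str, float]],
                  threshold: float = 0.5) -> str:
    """One-line consensus: the most probable residue at each position,
    upper case if its probability exceeds the threshold, else lower case."""
    letters = []
    for column in emissions:
        residue, prob = max(column.items(), key=lambda item: item[1])
        letters.append(residue.upper() if prob > threshold else residue.lower())
    return "".join(letters)
```

For a three-position model whose top residues are A (0.9), L (0.4) and K (0.6), the consensus reads "AlK": the middle position is the most likely residue there but is not highly conserved.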
HMMer
The HMMer package ([1] [2]) provides multiple alignment and database searching capabilities. There are several programs in the package (see the documentation), including one (hmmfs) that searches databases for non-overlapping LOCAL similarities (i.e. that match across at least part of the HMM) and another (hmmls) that searches databases for non-overlapping GLOBAL similarities (i.e. that match across the full HMM). These correspond approximately to profile-based searches using negative and positive profiles, respectively (see WiseTools). Database searches using hmmls or hmmfs provide alignment scores as bit scores.
Homology
Evolutionary descent from a common ancestor due to gene duplication
Intracellular Domains
Domain families that are most prevalent in proteins within the cytoplasm
Localisation
Numbers of domains that are thought, from SwissProt annotations, to be present in different cellular compartments (cytoplasm, extracellular space, nucleus and membrane-associated) are shown in annotation pages.
Motif
Sequence motifs are short conserved regions of polypeptides. Sets of sequence motifs need not necessarily represent homologues.
NRDB non-redundant database
A database that contains no identical pairs of sequences. It can contain multiple sequences originating from the same gene (fragments, alternative splicing products). SMART in Standard mode uses such a database.
ORF
Open reading frame
Outlier homologues
These are often difficult to detect using HMM methodology. A complementary approach to their detection is to query a database of sequences taken from multiple sequence alignments using BLAST. Selecting this option will also activate searches against sequence databases derived from proteins of known structure. A simple BLAST search of the PDB is performed, together with a search of RPS-Blast profiles derived from SCOP. These profiles were kindly provided by Steffen Schmidt (see Schmidt et al., J Chem Inf Comput Sci 2002 (42) 405-7).
P-value
This represents the probability that, given a database of a particular size, random sequences score higher than a value X. P-values are generated by the BLAST algorithm that has been integrated into SMART.
PDB protein data bank
PDB is an archive of experimentally determined three-dimensional structures (Brookhaven Natl. Labs or EBI). Domain families represented in SMART and in the PDB are annotated as being of known structure; links are provided in SMART to the PDB via PDBsum and MMDB. PDBsum links can be used to access a variety of sequence-based and structure-based tools, whereas MMDB provides access to literature information and structure similarities.
PFAM
Pfam is a database of protein domain families represented as (i) multiple alignments and (ii) HMM-profiles ([1], [2]). Pfam WWW servers allow comparison of user-supplied sequences with the Pfam database (Sanger Center and Washington Univ.). SMART contains a facility to search the Pfam collection using HMMer.
Prokaryotic domains
SMART now also searches for domains found in two-component regulatory systems. These are found mainly in prokaryotes, but a few were also found in eukaryotes such as yeast and plants.
Profile
A profile is a table of position-specific scores and gap penalties, representing a homologous family, that may be used to search sequence databases (Ref. [1], [2], [3]). In CLUSTAL-W-derived profiles, those sequences that are more distantly related are assigned higher weights ([4], [5], [6]). Issues in profile-based database searching are discussed in Bork & Gibson (1996) [7].
ProfileScan
An excellent WWW server that allows a user to compare a protein or DNA sequence against a database of profiles (located at the ISREC).
PROSITE
This is a dictionary of protein sites and motif patterns. Some SMART domain annotations contain links to PROSITE.
Schnipsel database (domain sequence database)
Schnipsel is a German word meaning snippet or fragment. The schnipsel database consists of the sequences of all domains found with SMART in NRDB. Outliers of a family often cannot be detected by a profile, yet are detectable by pairwise similarity to one or more established members of a sequence family. Searching against the schnipsel database therefore gives information complementary to the profile searches.
Searched domains
In the first version of SMART, only eukaryotic signalling domains could be searched. In 1998 we extended this set with prokaryotic signalling and extracellular domains. On the input page you can choose whether you want to search only cytoplasmic (prokaryotic + eukaryotic), extracellular or all domains.
Secondary Literature
The secondary literature is derived by the following procedure. For each of the hand-selected papers referenced by a domain, 100 neighbouring papers are retrieved using Medline. If one of these neighbouring papers is referenced from more than two original papers, it is included in the secondary literature list.
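The selection rule can be illustrated with a small counting sketch. It is not SMART's actual code: the function name, the dictionary input (hand-selected paper mapped to its retrieved neighbours) and the decision to leave the hand-selected papers themselves out of the result are all assumptions made for this example.

```python
from collections import Counter


def secondary_literature(neighbours_by_paper, min_citing=3):
    """Return neighbouring papers shared by more than two original papers.

    neighbours_by_paper: dict mapping each hand-selected paper id to the list
    of neighbouring paper ids retrieved for it (hypothetical layout).
    min_citing=3 encodes "referenced from more than two original papers".
    """
    originals = set(neighbours_by_paper)
    counts = Counter()
    for neighbours in neighbours_by_paper.values():
        # Count each neighbour once per original paper; skip the originals
        # themselves (an assumption of this sketch).
        counts.update(set(neighbours) - originals)
    return sorted(paper for paper, n in counts.items() if n >= min_citing)


neighbours = {
    "p1": ["n1", "n2", "n3"],
    "p2": ["n1", "n2"],
    "p3": ["n1", "n3"],
    "p4": ["n2"],
}
print(secondary_literature(neighbours))  # ['n1', 'n2']
```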
Seed Alignment
Alignment that contains only one of each pair of homologues that are represented in a CLUSTALW-derived phylogenetic tree linked by a branch of length less than a distance of 0.2 (see the related article).
SEG
A program of Wootton & Federhen [1] that detects regions of the query sequence that have low compositional complexity [2].
Sequence ID or ACC
Sequence identifiers or accession codes may be entered via the SMART homepage to initiate a query. You can use either Uniprot or Ensembl sequence identifiers.
Signalling Domains
The original set of domains used in SMART were collected as those that satisfied one or both of two criteria:
1. cytoplasmic domains that possess kinase, phosphatase, ubiquitin ligase or phospholipase enzymatic activities, or those that stimulate GTPase-activation or guanine nucleotide exchange
2. cytoplasmic domains that occur in at least two proteins with different domain organisations, of which one also contains a domain that satisfies criterion 1
These domains mostly mediate or regulate the transduction of an extracellular signal towards the nucleus, resulting in the initiation of a cellular response. More recently, prokaryotic two-component signalling domains have been added to the SMART set.
SignalP
This program predicts the presence and location of signal peptide cleavage sites in amino acid sequences (SignalP home page).
SMART (Simple Modular Architecture Research Tool)
Our main goal in providing this tool is to allow automatic identification and annotation of domains in user-supplied protein sequences.
Species
Numbers of domains present in a variety of selected taxa (animal, archaea, bacteria, fungi, plants and protozoa) are shown in annotation pages.
SwissProt
The SwissProt database is an extensively annotated and non-redundant collection of protein sequences. SwissProt annotations have been mined for SMART-derived annotations of alignments.
TMHMM2
This program predicts the location and topology of transmembrane helices (TMHMM2)
Thresholds
For each of the domains found by SMART, a combination of thresholds is used to distinguish between true and false hits. The different thresholds are described in the SMART paper.
WiseTools
A package that is based on database searches using profiles. Profiles may be generated using PairWise and then compared with sequence databases using SearchWise. Scores are generated for alignments that match the whole of the profile (using a positive profile) or else that match at least part of the profile (using a negative profile). Only the top-scoring optimal alignment of each sequence is reported; hence SMART relies on re-iterating the search for new repeats until none are reported that score above threshold. Score thresholds have been set manually to a value considered to lie just above the top-scoring true negative. Additional thresholds have been estimated for domains that are repeated in single polypeptides. A more recent package allows comparison of DNA sequences at the level of their conceptual translations, regardless of sequence errors and introns (see Wise2).
What's new
Changes from version 6.0 to 7.0
Full text search engine
Perform a full text search of SMART and Pfam domain annotations, plus the complete protein descriptions for Uniprot and Ensembl proteins.
metaSMART
Explore and compare domain architectures in various publicly available metagenomics datasets.
iTOL export and visualization
Domain architecture analysis results can be exported and visualized in interactive Tree Of Life. The new option can be found in the protein list function select list.
User interface cleanup
Various small changes to the UI, resulting in faster and easier navigation.
Changes from version 5.1 to 6.0
Metabolic pathways information
SMART domains and proteins available in the genomic mode now have basic metabolic pathways information. It is generated by mapping our genomic mode protein database to the KEGG orthologous groups.
Updated genomic mode protein database
The greatly expanded genomic mode database now includes sequences from 630 completely sequenced species.
Changes from version 5.0 to 5.1
SMART webservice
You can access SMART using our webservice. Check the WSDL file for details. The webservice is still under active development and only works for sequences and IDs which are in our database (complete Uniref100 and stable Ensembl genomes).
SMART DAS server
The SMART DAS server is available at URL http://smart.embl.de/smart/das. It provides all the protein features from our database (SMART domains, signal peptides, transmembrane regions and coiled coils) for all Uniref100 and Ensembl proteins.
If you need help with these services or have questions or feedback, please contact us.
Changes from version 4.1 to 5.0
New protein database in Normal mode
SMART now uses Uniprot as the main source of protein sequences. All Ensembl proteomes (except pre-releases) are also included. To lower redundancy in the database, the following procedure is used:
o only one copy of 100% identical proteins is kept (different IDs are still available)
o each species' proteins are separated
o CD-HIT clustering with a 96% identity cutoff is performed on each species separately
o the longest member of each protein cluster is used as the representative
o only representative cluster members and single proteins (i.e. proteins which are not members of any clusters) are used in all domain architecture queries and for domain counts in the annotation pages
Even though the number of proteins in the database has almost doubled (the current version has around 2.9 million proteins), the redundancy should be minimal.
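The representative-selection step of this procedure can be sketched as follows. The clustering itself is assumed to have been done already (for instance by CD-HIT at the 96% cutoff) and is passed in as a plain dictionary; the function name and both input layouts are hypothetical and chosen only for this example.

```python
def pick_representatives(clusters, sequences):
    """Choose the longest member of each cluster as its representative.

    clusters: dict mapping a cluster id to the list of protein ids in it
    (assumed to be the parsed result of an external clustering run).
    sequences: dict mapping a protein id to its amino acid sequence.
    Proteins that belong to no cluster can be passed as one-member clusters.
    """
    return {
        cluster_id: max(members, key=lambda pid: len(sequences[pid]))
        for cluster_id, members in clusters.items()
    }


sequences = {"P1": "MKVLA", "P2": "MKVLAAG", "P3": "MSTT"}
clusters = {"c1": ["P1", "P2"], "c2": ["P3"]}
print(pick_representatives(clusters, sequences))  # {'c1': 'P2', 'c2': 'P3'}
```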
Domain architecture invention dating
As a further step from the single domain to the understanding of multi-domain proteins, SMART now predicts the taxonomic class where the concept of a protein, that is, its domain architecture, was invented. The domain architecture is defined as the linear order of all SMART domains in the protein sequence. To derive the point of its invention, all proteins with the same domain architecture are mapped onto NCBI's taxonomy. The last common ancestor of all organisms containing at least one protein with the domain architecture is defined as the point of its origin.
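The last-common-ancestor step can be sketched on root-to-leaf lineages. This is an illustration only: the function name is hypothetical, and the lineages below are heavily simplified stand-ins for real NCBI taxonomy lineages.

```python
def last_common_ancestor(lineages):
    """Return the deepest taxon shared by all lineages, or None.

    lineages: list of root-to-leaf lists of taxon names, one per organism
    that has a protein with the domain architecture of interest.
    """
    if not lineages:
        return None
    common = None
    # Walk all lineages in parallel from the root and stop at the first split.
    for taxa in zip(*lineages):
        if all(taxon == taxa[0] for taxon in taxa):
            common = taxa[0]
        else:
            break
    return common


# Simplified, illustrative lineages (not complete NCBI lineages).
human = ["cellular organisms", "Eukaryota", "Metazoa", "Chordata", "Mammalia"]
fly = ["cellular organisms", "Eukaryota", "Metazoa", "Arthropoda", "Insecta"]
yeast = ["cellular organisms", "Eukaryota", "Fungi", "Ascomycota"]

print(last_common_ancestor([human, fly]))         # Metazoa
print(last_common_ancestor([human, fly, yeast]))  # Eukaryota
```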
Changes from version 4.0 to 4.1
Two modes of operation: Normal and Genomic
For more details, visit the change mode page.
Intrinsic protein disorder prediction
You can now include DisEMBL prediction of protein disorder in your search parameters. The results are displayed as blue regions in protein schematics. DisEMBL's HOTLOOPS and REM465 methods are currently used. Visit the DisEMBL page for more information on the method.
Catalytic activity check
SMART now includes data on specific requirements for catalytic activity for some of our domains (50 catalytic domains at the moment; full list here). If the required amino acids are not present in the predicted domain, it will be marked as Inactive. The domain annotation page will show you the details on which amino acids are missing and the links to relevant literature.
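A check of this kind can be sketched as below. How SMART actually stores and maps the required residues is not described here, so everything in the example is an assumption: the function name, the 1-based positions given directly on the predicted domain sequence, and the toy requirements.

```python
def check_catalytic_residues(domain_seq, required):
    """Report whether all required residues are present in a predicted domain.

    domain_seq: amino acid sequence of the predicted domain.
    required: dict mapping a 1-based position within the domain to the set of
    residues allowed there (hypothetical representation for this sketch).
    Returns a (status, missing) pair, where missing lists the failed positions.
    """
    missing = {
        pos: allowed
        for pos, allowed in required.items()
        if pos > len(domain_seq) or domain_seq[pos - 1] not in allowed
    }
    status = "Inactive" if missing else "Active"
    return status, missing


# Toy requirements, not taken from any real domain.
required = {2: {"H"}, 3: {"D"}, 4: {"S"}}
print(check_catalytic_residues("GHDSK", required))  # ('Active', {})
print(check_catalytic_residues("GHASK", required))  # ('Inactive', {3: {'D'}})
```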
Taxonomic trees
Architecture query results are now displayed as simple taxonomic trees. In addition to individual proteins, you can select any taxonomic node (or multiple nodes) and display all the proteins in those nodes. The Evolution section of domain annotation pages is now also represented as a taxonomic tree. The basic tree contains only several hand-picked representative species, with a link to the full tree.
User interface redesigned
SMART has been completely rewritten, and all pages conform to the XHTML 1.0 Strict and CSS Level 2 standards. We recommend a modern standards-compliant browser for the best experience. Mozilla Firefox and Opera are our favorites.
Changes from version 3.5 to 4.0
Intron positions are shown in protein schematics
For proteins that match any of the Ensembl predictions, SMART will show intron positions as vertical coloured lines in graphical representations (see example). This information is retrieved from a pre-calculated mapping of Ensembl gene structures to protein sequences.
The vertical line at the end of the protein is not an actual intron but a mark to show that intron mapping was performed. If that is the only line, there are no introns annotated. If there is no line at all, there is no data available in Ensembl for that particular sequence.
You can switch off intron display on your SMART preferences page
Alternative splicing information
Since SMART now incorporates Ensembl genomes, the Additional information page shows a list of alternative splices of the gene encoding the analyzed protein (if there are any). It is possible either to display the SMART protein annotation for any of the alternative splices or to get a graphical multiple sequence alignment of all of them.
Orthology information
There are 2 separate sets of orthologs for each Ensembl protein: 1:1 reciprocal best matches in other genomes, and orthologous groups with reciprocal best hits from all genomes analyzed (i.e. each of these proteins has exactly one ortholog in all 6 genomes). This data is displayed on the Additional information page.
Graphical multiple sequence alignments
Orthologous groups and different alternative splices can be displayed as graphical multiple sequence alignments. Proteins are aligned using ClustalW. Domains, intrinsic features and introns are mapped onto the alignment, with their positions adjusted according to gaps (black boxes).
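Adjusting feature positions for gaps amounts to translating residue numbers into alignment columns. The sketch below shows one way to do that; the function names and the use of "-" as the gap character are assumptions for the example, not a description of SMART's implementation.

```python
def residue_columns(aligned_seq, gap="-"):
    """Return the 1-based alignment column of every residue, in sequence order."""
    return [col for col, ch in enumerate(aligned_seq, start=1) if ch != gap]


def map_feature(aligned_seq, start, end):
    """Translate a feature given in 1-based residue numbers to alignment columns."""
    columns = residue_columns(aligned_seq)
    return columns[start - 1], columns[end - 1]


aligned = "MK--VLA-G"  # ungapped sequence is MKVLAG
print(map_feature(aligned, 2, 4))  # (2, 6)
print(map_feature(aligned, 5, 6))  # (7, 9)
```

A feature spanning residues 2 to 4 (K to L) is drawn from column 2 to column 6 because two gap columns lie inside it.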
Changes from version 3.4 to 3.5
Features of all Ensembl genomes are stored in SMART
You can use standard Ensembl protein identifiers (for example ENSP00000264122) and sequences in all queries.
Batch access
Quickly display results for hundreds of sequences. Currently limited to 500 sequences per access, 2000 per day.
Smarter handling of identical sequences
If there are multiple IDs associated with a particular sequence, an extra table will be displayed showing all of them.
Changes from version 3.3 to 3.4
Search structure based profiles using RPS-Blast
Clicking on the search schnipsel and structures checkbox will now also initiate a search of profiles based on SCOP domain families using RPS-Blast. These profiles were kindly provided by Steffen Schmidt (see Schmidt et al., J Chem Inf Comput Sci 2002 (42) 405-7).
Improved Architecture analysis
In addition to standard Domain selection querying, it is now possible to do queries based on GO (Gene Ontology; click here for more info) terms associated with domains. In the first step you get a list of domains matching the GO terms entered. After selecting the domains of interest from the list, proteins containing those domains are displayed. Use the Taxonomic selection box to limit the results based on taxonomic ranges.
Pfam domains are stored in the database
The SMART database now contains precomputed results for all Pfam domains. To use Pfam domains in the architecture queries, prepend the domain name with Pfam: (for example, TyrKc AND Pfam:Fz AND TRANS).
Try our SMART Toolbar for Mozilla web browser Click here for more info
Changes from version 3.2 to 3.3
Fantastic new protein picture generator
Proteins are now displayed as dynamically generated PNG images. This means that you can download the entire protein representation as a single image. You have our permission to do this and use these diagrams in any way that you like; do acknowledge us, though. All domain bubbles have been script-generated using The Gimp and its Perl-Fu extension. The script is GPL'd and is available for download.
Fancy colour alignments
All alignments we generate are coloured in using Leo Goodstadt's excellent CHROMA program. CHROMA is available from here, and it's great. You may experience some problems if you're using a klunky browser. This will be fixed when you change your browser.
Excellent transmembrane domains
These are now calculated using the fine TMHMM2 program (the main web site is here), kindly provided by Anders Krogh and co-workers. You can read about the method here.
Selective SMART
We now store intrinsic features such as transmembrane domains and signal peptides in our database. This means that these can be queried for (e.g. SIGNAL AND TyrKc). The feature names are: signal peptide, SIGNAL; transmembrane domain, TRANS; coiled coil, COIL.
Technical changes
SMART code has been modified to run under the Apache mod_perl module. SMART is now using Apache::DBI for persistent database connections. The database engine has been updated to PostgreSQL 7.1. These changes resulted in significant speed improvements.
Bug fixes as usual
The show_many_proteins script now uses the POST method, so there is no longer a limit on the number of proteins you can display. Fixed a display problem with proteins having thousands of different representations (if you try to display those, make sure you have a good browser (like Mozilla or Opera :-) and a bunch of RAM).
Changes from version 3.1 to 3.2
Start page
There is now an indicator of the current database status in the header. If the database is down or is being updated, you'll know immediately.
Literature
Changed literature identifiers to PMID. New secondary literature generator that parses all neighbouring papers, not just the first 100.
Numerous small bug fixes and improvements
Changes from version 3.0 to 3.1
Startup page
The start page now includes selective SMART and allows searching for keywords in the annotation of domains.
Schnipsel Blast
The results of a schnipsel Blast search are included in the bubble diagram. Additionally, the PDB is searched.
Taxonomic breakdown
When selecting multiple proteins, e.g. via selective SMART or the annotation pages, an overview of the taxonomy of all species (tax break) is offered.
Links
Links don't contain version numbers. This allows stable links from external sources.
Selective SMART
Allows searching for multiple copies of domains and is case insensitive. You can now search for e.g. Sh3 AND sH3 AND sh3.
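The behaviour described here (case-insensitive names, with a repeated name asking for that many copies) can be illustrated with a small matcher. This is a sketch of the idea only, not SMART's query engine: the function name is hypothetical, and only the AND operator is handled.

```python
from collections import Counter


def matches_query(query, features):
    """Check whether a protein's features satisfy an AND-only query.

    query: string such as "Sh3 AND sH3 AND sh3"; repeating a name requires
    that many copies of the domain.
    features: list of domain and intrinsic feature names found in the protein.
    """
    required = Counter(term.strip().lower() for term in query.split(" AND "))
    present = Counter(name.lower() for name in features)
    return all(present[name] >= copies for name, copies in required.items())


print(matches_query("Sh3 AND sH3 AND sh3", ["SH3", "SH2", "SH3", "SH3"]))  # True
print(matches_query("Sh3 AND sH3 AND sh3", ["SH3", "SH2", "SH3"]))         # False
print(matches_query("SIGNAL AND TyrKc", ["SIGNAL", "TRANS", "TyrKc"]))     # True
```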
Annotation
You can align your query sequence to the SMART alignment using hmmalign.
Update of underlying database
SMART now uses PostgreSQL 6.5.2.
Changes from version 2.0 to 3.0
Digest output
SMART now only produces a single diagram representing a best interpretation of all the annotation that has been performed. A comprehensive summary of the results is also provided in table format.
selective SMART
Selective SMART allows you to look for proteins with combinations of specific domains in different species or taxonomic ranges.
alert SMART
The SMART database gets updated about once a week. If you are interested in specific domains, or combinations of them, in specific taxonomic ranges, you can use the SMART alerting service. This provides the identities of newly deposited proteins that match your query.
Domain queries
You can ask for proteins having the same domain order or composition as your query protein.
SMARTed Genomes
You can get the result of SMARTing the genomes of Caenorhabditis elegans and Saccharomyces cerevisiae via the annotation pages.
Faster PFAM searches
The PFAM searches now run on a PVM cluster.
Changes from version 1.03 to 2.0
Domain coverage
The original set of signalling domains has now been extended to include extracellular domains.
Default search method is now HMMer
The previously used underlying methodology of SMART (i.e. SWise) tended to over-extend gaps, leading to problems in defining domain borders. Although a SWise search of the SMART database remains possible (using the Wise searching SMART database option available from the Home Page), the default searching method is now HMMer2. This now allows improved statistical estimates (Expectation- or E-values) of the significance of a domain hit.
Complete rewrite of the annotation pages
As with the previous version (1.03), annotation pages provide information concerning domain functions and include hyperlinks to PubMed. However, we now offer automatically derived data showing the taxonomic range and the predicted cellular localisation of proteins containing the domain in question.
Literature database
In addition to the manually derived literature sources given in version 1.03, we now provide automatically derived secondary literature from the annotation pages. These are extracted from PubMed and are cross-linked to additional abstracts.
The CATH domain structure database: new protocols and classification levels give a more comprehensive resource for exploring evolution
Lesley H Greene, Tony E Lewis, Sarah Addou, Alison Cuff, Tim Dallman, Mark Dibley, Oliver Redfern, Frances Pearl, Rekha Nambudiry, Adam Reid, Ian Sillitoe, Corin Yeats, Janet M Thornton and Christine A Orengo
WHAT'S NEW
We report the latest release (version 3.0) of the CATH protein domain database (http://www.cathdb.info). There has been a 20% increase in the number of structural domains classified in CATH, up to 86 151 domains. Release 3.0 comprises 1110 fold groups and 2147 homologous superfamilies. To cope with the increases in diverse structural homologues being determined by the structural genomics initiatives, more sensitive methods have been developed for identifying boundaries in multi-domain proteins and for recognising homologues. The CATH classification update is now being driven by an integrated pipeline that links these automated procedures with validation steps that have been made easier by the provision of information-rich web pages summarising comparison scores and relevant links to external sites for each domain being classified. An analysis of the population of domains in the CATH hierarchy and several domain characteristics are presented for version 3.0. We also report an update of the CATH Dictionary of homologous structures (CATH-DHS), which now contains multiple structural alignments, consensus information and functional annotations for 1459 well populated superfamilies in CATH. CATH is directly linked to the Gene3D database, which is a projection of CATH structural data onto ∼2 million sequences in completed genomes and UniProt.
GENERAL INTRODUCTION
The number of new structures being deposited in the Protein Data Bank (PDB) continues to grow at a considerable rate. In addition, structures being targeted by worldwide structural genomics initiatives are more likely to be novel or only very remotely related to domains previously classified. Only ∼2% of structures currently solved by conventional crystallography or NMR are likely to adopt novel folds (see Figures 1 and 2). A higher proportion of new folds are expected to be solved by structural genomics. Although the influx of more diverse structures and subsequent analysis will inform our understanding of how domains evolve, it has resulted in increasing lags between the numbers of structures being deposited and classified. In response to this situation we have significantly improved our automated and manual protocols for domain boundary assignment and homologue recognition.
Figure 1. Annual decrease in the percentage of new structures classified in CATH which are observed to possess a novel fold. The raw data for years 1972–2005 was fit to a single exponential equation by nonlinear regression using Sigma Plot (SPSS Version
Figure 2. Annual proportion of protein structures deposited in the PDB which are classified in CATH, rejected or pending classification. The colour scheme reflects different categories of PDB chains. Black, not accepted by the CATH criteria. Red, unprocessed chains
Significant changes have been implemented in the CATH classification protocol to achieve a more highly automated system. A seamless flow of structures between the constituent programs has been achieved by building a pipeline which integrates web services for each major comparison stage in the classification (see Figure 3). Secondly, completely automatic decisions are now being made for new protein chains with close relatives already assigned in the CATH database. There are two situations that preclude the CATH update process from being fully automated. We rely on expert manual curation for particularly challenging protein domain boundary assignments (DomChop stage) and also for classifications of remote folds and homologues (HomCheck stage). These two manual stages will remain an integral part of the system (Figure 3).
Figure 3. Flow diagram of the CATH classification pipeline. This schematic illustrates the processes involved in classifying newly determined structures in CATH. The CATH update protocol workflow from new chain to assigned domain is split into two main processes
In this paper we report our ongoing development of the automated procedures. These critical new features should better enable CATH to keep pace with the PDB (3) and facilitate its development. Key statistics on the domain structure populations and characteristics are also presented.
A REVISED CATH CLASSIFICATION HIERARCHY: CATHSOLID
In order to provide information on the sequence diversity between superfamily members, we have introduced additional levels into the CATH hierarchy. The CATH hierarchical classification scheme now consists of nine levels. Class is derived from secondary structure content, and Architecture describes the gross orientation of secondary structures, independent of connectivity. The Topology level clusters structures into fold groups according to their topological connections and numbers of secondary structures. The Homologous superfamilies cluster proteins with highly similar structures, sequences and/or functions (4,5). The new extension of the CATH classification system now includes five 'SOLID' sequence levels. S, O, L and I further divide domains within the H-level using multi-linkage clustering based on similarities in sequence identity (35, 60, 95 and 100%) (see Table 1). The D-level acts as a counter within the I-level and is appended to the classification hierarchy to ensure that every domain in CATH has a unique CATHsolid identification code (see Table 1). Specific details on the nature of the SOLID-levels can be found in the 'General Information' section of the CATH website, http://www.cathdb.info. CATH only includes experimentally determined protein structures with a 4 Å resolution or better, 40 residues in length or longer and having 70% or more side chains resolved.
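The three inclusion criteria in the last sentence can be written as a simple filter. The sketch below is illustrative only: the function name and argument names are hypothetical, and it does not address structures for which no resolution is reported.

```python
def meets_cath_criteria(resolution_angstrom, length, fraction_side_chains_resolved):
    """Apply the three inclusion criteria quoted in the text.

    resolution_angstrom: reported resolution in Å (lower is better).
    length: number of residues.
    fraction_side_chains_resolved: value between 0 and 1.
    """
    return (
        resolution_angstrom <= 4.0                 # 4 Å resolution or better
        and length >= 40                           # 40 residues or longer
        and fraction_side_chains_resolved >= 0.70  # 70% or more side chains
    )


print(meets_cath_criteria(2.1, 150, 0.95))  # True
print(meets_cath_criteria(4.5, 150, 0.95))  # False: resolution too poor
print(meets_cath_criteria(2.1, 35, 0.95))   # False: too short
```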
Table 1. CATH version 3.0 statistics
DOMAIN BOUNDARY ASSIGNMENTS
We have further improved our automated domain boundary prediction method, CATHEDRAL (4). This is used to search a newly determined multi-domain structure against a library of representative structures from different fold groups in the CATH database to recognise constituent domains. CATHEDRAL performs an initial rapid secondary structure comparison between structures, using graph theory to identify putative fold matches, which are then more carefully aligned using a slower, more accurate dynamic programming method. A new scoring scheme has been implemented which combines information on the class of the domains being compared, their sizes, similarity in the structural environments and the number of equivalent residues. A support vector machine is used to combine the different scores and select the best fold match for each putative domain in a new multi-domain protein.
Benchmarking against a set of 964 'difficult' multi-domain chains, whose 1593 constituent domains were remotely related to folds in CATH (<35% sequence identity) and originated from 245 distinct fold groups and 462 superfamilies, showed that 90% of domains within these chains could be assigned to the correct fold group, and for 78% of them the domain boundaries were within ±15 residues of boundaries assigned by careful manual validation. Larger variations in domain boundaries are often due to the fact that in many families significant structural variation can occur during evolution, so that distant relatives vary considerably in size. If no close relative has been classified in CATH, it is likely that the only CATHEDRAL match will be to a relative with significant structural embellishments, thus making it harder to determine the correct boundaries.
Since domain boundary assignment of remote homologues is one of the most time-consuming stages in the classification, we combine multiple sources of information for each new structure on a web page to guide manual curation. Pages display scores from a range of algorithms, which include structure-based methods: CATHEDRAL (4), SSAP (6,7), DETECTIVE (8), PUU (9), DOMAK (10); sequence-based methods such as hidden Markov models (HMMs) (11); and relevant literature. These pages are now viewable for information on putative boundaries for new multi-domain structures currently being classified in CATH (e.g. http://www.cathdb.info/cgi-bin/cath/Chain.pl?chain_id=2g3aA).
For protein chains which are closely related to chains that are already chopped in CATH, an automated protocol has been developed (ChopClose). ChopClose identifies any previously chopped chains that have sufficiently high sequence identity and overlap with the query chain. Using SSAP (6), the query is aligned against each of these chains in turn, and in each case the domain boundaries are inherited across the alignment. The process of inheriting the boundaries often requires some adjustments to be made to account for insertions, deletions or unresolved residues. If the inheritance from one of the chains meets various criteria (SSAP score ≥80, sequence identity >80%, RMSD ≤6.0 Å, longest end extension ≤10 residues, etc.), then the resulting boundaries are used to chop the chain automatically (AutoChop). For cases where ChopClose's best result does not meet all the criteria for automatic chopping, it is provided as support information for a manual domain boundary assignment.
Refer to Figure 3 for the location of AutoChop/ChopClose within the CATH update protocol.
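The AutoChop decision can be sketched as a threshold check over the four criteria quoted above. The text ends its list with "etc.", so the real protocol applies further criteria that are not shown; the class and function names here are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class InheritedChopping:
    """Scores for boundaries inherited from one previously chopped chain."""
    ssap_score: float
    sequence_identity: float      # percent
    rmsd: float                   # in Å
    longest_end_extension: int    # in residues


def can_autochop(result):
    """Return True if the four criteria listed in the text are all met.

    Only the listed thresholds are checked; the unlisted "etc." criteria
    of the actual protocol are outside this sketch.
    """
    return (
        result.ssap_score >= 80
        and result.sequence_identity > 80
        and result.rmsd <= 6.0
        and result.longest_end_extension <= 10
    )


close_relative = InheritedChopping(92.5, 97.0, 1.8, 3)
distant_relative = InheritedChopping(84.0, 62.0, 3.1, 4)
print(can_autochop(close_relative))    # True: boundaries used automatically
print(can_autochop(distant_relative))  # False: passed on as support for manual chopping
```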
NEW HOMOLOGUE RECOGNITION METHODS
We have assessed a number of HMM-based protocols for improving homologue recognition. A new protocol (Samosa), exploiting models built using multiple structure alignments to improve accuracy, gives some improvements in sensitivity (4–5%). However, a protocol exploiting an 8-fold expanded HMM library based on sequence relatives of structural domains gives an increase of nearly 10% in sensitivity (12). In addition, HMM–HMM-based approaches have been implemented using the PRC protocol of Madera and co-workers (http://supfam.org/PRC). These allow recognition of extremely remote homologues, some of which are not easily detected by the structure comparison methods (discussed further below). HMM-based database scans developed for the CATH classification protocol are collectively referred to as HMMscan below.
For some very remotely related homologues, confidence in an assignment can be improved by combining information from multiple prediction methods. We have investigated the benefits of using machine learning methods to do this automatically. A neural network was trained using a dataset of 14 000 diverse homologues (<35% sequence identity) and 14 000 non-homologous pairs, with data from different homologue comparison methods including structure comparison (CATHEDRAL, SSAP), sequence comparison (HMM–HMM) and information on functional similarity. The latter was obtained by comparing EC classification codes between close relatives of the distant homologues and using a semantic similarity scoring scheme for comparing GO terms based on a method developed by Lord et al. (13). On a separate validation set of 14 000 homologous pairs and 14 000 non-homologous pairs, 97% of the homologues can be recognised at an error rate of <4%.
NEW UPDATE PROTOCOL
Automatic methods
Previously, CATH data was generated using a group of independent
programs and flat files. Over the past two years we have developed an
update protocol for CATH that is driven by a suite of programs with a
central library and a PostgreSQL database system. A classification
pipeline has been established which links, in a completely automated
fashion, the different programs that analyse the sequences and structures
of both protein chains and domains. The CATH update protocol can
essentially be divided into two parts: domain boundary assignment and
domain homology classification (see Figure 3). The aim of the protocol is
to minimise manual assignment and provide as much support as
possible when manual validation is necessary.
Processing in both parts of the classification protocol is similar, requiring related meta-data and the triggering of the same automated algorithms. Methods include pairwise sequence similarity comparisons and scans by other homologue detection or fold recognition algorithms, such as HMMscan and CATHEDRAL, that provide data for either manual or automated assignment. Many of the automated steps in the protocol have been established as a web service, and the pipeline integrates the automated steps with 'holding stages' in which domains are held prior to processing and await the completion of manual validation of predictions (see below).
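The combination of automated steps and holding stages can be pictured with a small sketch. This is not the CATH implementation: the `Domain` class, the step functions and the `validated` flag are hypothetical names introduced only to illustrate how a domain might collect meta-data automatically and then wait for a curator.

```python
from dataclasses import dataclass, field


@dataclass
class Domain:
    name: str
    predictions: dict = field(default_factory=dict)  # meta-data from automated scans
    validated: bool = False                           # set once a curator has checked it


def run_automated_steps(domain, steps):
    """Run every automated step and store its output as meta-data on the domain."""
    for step_name, step in steps.items():
        domain.predictions[step_name] = step(domain.name)
    return domain


def release_from_holding(holding):
    """Split held domains into those ready to proceed and those still awaiting validation."""
    ready = [d for d in holding if d.validated]
    waiting = [d for d in holding if not d.validated]
    return ready, waiting
```

In this picture, each scan (for example an HMM-based scan or a structure comparison) would be one entry in `steps`, and a domain leaves the holding stage only after `validated` has been set by hand.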
Web pages to support manual validation
For each manual stage in the classification (domain boundary assignment, DomChop, and homologue recognition, HomCheck) we have developed a suite of web pages bringing together all available meta-data from prediction algorithms (e.g. DBS, CATHEDRAL and HMMscan for DomChop; CATHEDRAL and HMMscan for HomCheck) and information from the literature and from other family classifications with relevant data (e.g. Pfam). For each protein or domain shown on the pages, information on the statistical significance of matches is presented. The web pages will shortly be made viewable and will provide interim data on protein chains and domains not fully classified in CATH, for biologists interested in any entries pending classification.
OVERVIEW OF THE CURRENT RELEASE (VERSION 3.0)
Assigning domain boundaries and relationships between protein structures is computationally challenging. Since the last CATH release (version 2.6), the number of domains in the CATH database has increased by 20% in version 3.0 and now totals 86 151. This is a more than 10-fold increase in the number of domains classified in CATH since its creation. Improvements in automation, and also in the web-based resources used to aid manual validation, have allowed us to increase the proportion of hard-to-classify structures processed in CATH, and this is reflected in a significant increase in the proportion of new folds in the database, now more than 1000. The detailed breakdown of numbers of domains in the nine CATH levels is given in Table 1. We conducted an analysis of the domains in version 3.0 and have derived statistics for several fundamental features.
Percentage of new topologies
An analysis of the percentage of new folds arising from the early 1970s to the present is shown in Figure 1. The number of new folds has been decreasing over time with respect to the number of new structures being deposited, and it can be seen that currently approximately 2% of new structures classified in CATH are observed to be novel folds. For comparison, the number of domain structures solved over time is also graphically represented in Figure 1.
Number of domains within a protein chain
Integral to the construction of the CATH database is designating domain boundaries. We conducted an analysis of the number of chains versus the number of domains in a chain. It is interesting to note that 64% of all protein structures currently solved and classified in CATH are single-domain chains (data not shown). The next most prevalent are two-domain chains (27%), and beyond this the number of chains containing three or more domains rapidly decreases. The average size of the single-domain chains is 159 residues.
The CATH Dictionary of Homologous Superfamilies (CATH-DHS)
The CATH-DHS has also been recently updated. Data on structural similarity and superfamily variability is presented as a significant update to the Dictionary of Homologous Superfamilies (DHS) web resource (14). The DHS also provides functional annotations of domains within each H-level (superfamily) in CATH v2.5.1.
For each superfamily, pair-wise structural similarity scores between relatives, measured by SSAP, are presented. The DHS now contains 3307 multiple structural alignments for 1459 superfamilies. For each superfamily, multiple alignments are generated for all the relatives and also for subgroups of structurally similar relatives and sequence-similar relatives. Alignments are performed using the residue-based CORA algorithm (15) and presented both as CORAPLOTS (14) and in the form of a 2DSEC diagram (16), alongside co-ordinate data of the superposed structures in PDB format. Sequence representations of the alignments are available to download in FASTA format. In the CORAPLOT images of the multiple alignment, residues in each domain are coloured according to ligand binding and residue type. EquivSEC plots are also shown that describe the variability in orientation and packing between equivalent secondary structures (16).
To identify sequence relatives for CATH superfamilies, sequences from UniProt (17) were scanned against HMMs of all CATH domains (12). Homologous sequences were identified as those hits with an E-value < 0.01 and a 60% residue overlap with the CATH domain. This protocol recognised over one million domain sequences in UniProt which could be integrated in the CATH-DHS. The harvested sequences in each superfamily were compared against other relatives by BLAST (18) to determine the pair-wise sequence identity and then clustered at appropriate levels of sequence identity (35% and 95%) using multi-linkage clustering. Information and links to other functional databases, ENZYME (19), GO (Gene Ontology Consortium 2000), KEGG (20), COG (21) and SWISSPROT (22), are also included by BLASTing the sequences from each superfamily against sequences provided by these resources. Only hits with 95% sequence identity and an 80% residue overlap were used to annotate sequences.
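The acceptance rule for HMM hits described above (an E-value cutoff plus a minimum overlap with the CATH domain) can be written as a simple filter. The sketch below is illustrative only: the `Hit` record and its field names are hypothetical, overlap is assumed to be the fraction of the CATH domain covered by the match, and the 60% threshold is treated as inclusive.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    sequence_id: str
    superfamily: str
    e_value: float
    aligned_residues: int  # residues of the CATH domain covered by the match (assumed)
    domain_length: int     # length of the CATH domain


def is_homologous(hit, max_e_value=0.01, min_overlap=0.60):
    """Apply the E-value and residue-overlap thresholds to a single hit."""
    overlap = hit.aligned_residues / hit.domain_length
    return hit.e_value < max_e_value and overlap >= min_overlap


def harvest(hits):
    """Group accepted sequence ids by superfamily."""
    accepted = {}
    for hit in hits:
        if is_homologous(hit):
            accepted.setdefault(hit.superfamily, []).append(hit.sequence_id)
    return accepted
```

The same shape of filter, with the thresholds changed to 95% identity and 80% overlap, would describe the stricter rule used when transferring annotations from the functional databases.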
Recent analysis of structural and functional divergence in highly populated CATH superfamilies (>5 structural relatives with <35% sequence identity) has been undertaken using data from the DHS. The 2DSEC algorithm was used to analyse multiple structural alignments of families and identify highly conserved structural cores and secondary structure embellishments, or decorations, to the common core. In some large superfamilies, extensive embellishments were observed outside the core, and although these secondary structure insertions were frequently discontinuous in the protein chain, they were often co-located in 3D space (16). In many cases, manual inspection revealed that the embellishment had aggregated to form a larger structural feature that was modifying the active site of the domain or creating new surfaces for domain or protein interactions. Data collected in the DHS clearly shows a relationship between structural divergence within a superfamily, sequence divergence of this superfamily amongst predicted domains in the genomes, and the number of distinct functional groups that can be identified for the superfamily (see Figure 4).
Figure 4. Relationship between sequence variability, structural variability and functional diversity in CATH superfamilies. Structural variation in a CATH superfamily as measured by the number of diverse structural subgroups (SSAP score <80 between groups)
LATERAL LINKS ACROSS THE CATH HIERARCHY TO CAPTURE EVOLUTIONARY DIVERGENCE AND EXPLORE THE STRUCTURAL CONTINUUM
Our analysis of structural divergence in CATH superfamilies (16) has revealed families where significant changes in the structures had occurred. In some cases 5-fold differences in the sizes of domains were identified, and sometimes it was apparent that the 'folds' of these very diverse relatives had effectively changed. Therefore, in these superfamilies more than one fold group can be identified, effectively breaking the hierarchical nature of the CATH classification, which implies that each relative within a CATH homologous superfamily should belong to the same CAT fold group.
In addition, an 'all versus all' HMM–HMM scan between all superfamily representatives revealed several cases of extremely remote homologues which had been classified into separate superfamilies and yet match with significant E-values. Structure comparison had failed to detect the relationship between these superfamilies because the structural divergence of the relatives was so extreme, sometimes constituting a change in architecture as well as fold group. In these cases homology was only suggested by the HMM-based scans and then manually validated by considering functional information and detailed evidence from the literature. In order to capture information on these distant homologies, links have been created between the superfamilies, both on our web pages and in the CATH database. The data can now be found as a link from the CATH homepage (http://www.cathdb.info).
In the near future we also plan to provide web pages presenting cases of significant structural overlaps between superfamilies or fold groups. For these cases we are not currently able to find any additional evidence to support a distant evolutionary relationship, and these examples highlight the recurrence of large structural motifs between some folds and the existence of a structural continuum in some regions of fold space.
FEATURES
The CATH database can be accessed at http://www.cathdb.info. The web interface may be browsed or, alternatively, searched with PDB codes or CATH domain identifiers. There is also a facility for keyword searches. With the version 3.0 release we now make the raw and processed data files available (for example CATH domain PDB files, sequences and dssp files), and they can be accessed through the CATH database main page. The Gene3D resource can be accessed through the CATH database or directly at http://www.cathdb.info/Gene3D. The DHS can be accessed through the CATH database or directly at http://www.cathdb.info/bsm/dhs.
Structural similarity is assessed using an automatic method (SSAP) (3,4), which scores 100 for identical proteins and generally returns scores above 80 for homologous proteins. More distantly related folds generally give scores above 70 (Topology or fold level), though in the absence of any sequence or functional similarity this may simply represent examples of convergent evolution, reinforcing the hypothesis that there exists a limited number of folds in nature (5,6).
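As a reading aid, the SSAP score bands mentioned above can be turned into a small lookup. This is only a sketch of the rule of thumb in the text: the function name and the wording of the returned labels are made up for illustration, and a score band on its own does not establish homology.

```python
def interpret_ssap(score):
    """Map an SSAP score (0-100) onto the rough bands described in the text."""
    if not 0 <= score <= 100:
        raise ValueError("SSAP scores are expected to lie between 0 and 100")
    if score == 100:
        return "identical structures"
    if score > 80:
        return "typical of homologous proteins"
    if score > 70:
        return "similar fold (topology level); homology not implied"
    return "no clear structural similarity"
```

For example, a pair scoring 75 would fall in the fold-level band, where sequence or functional evidence is still needed before inferring common ancestry.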
Abstract
We report the latest release (version 1.4) of the CATH protein domains database (http://www.biochem.ucl.ac.uk/bsm/cath). This is a hierarchical classification of 13 359 protein domain structures into evolutionary families and structural groupings. We currently identify 827 homologous families in which the proteins have both structural similarity and sequence and/or functional similarity. These can be further clustered into 593 fold groups and 32 distinct architectures. Using our structural classification and associated data on protein functions stored in the database (EC identifiers, SWISS-PROT keywords and information from the Enzyme database and literature) we have been able to analyse the correlation between 3D structure and function. More than 96% of folds in the PDB are associated with a single homologous family. However, within the superfolds, three or more different functions are observed. Considering enzyme functions, more than 95% of clearly homologous families exhibit either single or closely related functions, as demonstrated by the EC identifiers of their relatives. Our analysis supports the view that determining structures, for example as part of a 'structural genomics' initiative, will make a major contribution to interpreting genome data.
Introduction FOR CATH
The CATH classification of protein domain structures was established in 1993 (1) as a hierarchical clustering of protein domain structures into evolutionary families and structural groupings, depending on sequence and structure similarity. There are four major levels, corresponding to protein class, architecture, topology or fold, and homologous family (Fig. 1). Since 1995, information about these structural groups and protein families has been accessible over the Web (http://www.biochem.ucl.ac.uk/bsm/cath), together with summary information about each individual protein structure (PDBsum) (2).
CATH consists of both phylogenetic and phenetic descriptors for protein domain relationships. At the lowest levels in the hierarchy, proteins are grouped into evolutionary families (homologous families) for having either significant sequence similarity (≥35% identity) or high structural similarity and some sequence similarity (≥20% identity).
The CATH database of protein structures contains approximately 18000 domains organized according to their (C)lass, (A)rchitecture, (T)opology and (H)omologous superfamily. Relationships between evolutionarily related structures (homologues) within the database have been used to test the sensitivity of various sequence search methods in order to identify relatives in GenBank and other sequence databases. Subsequent application of the most sensitive and efficient algorithms, gapped BLAST and the profile-based method Position-Specific Iterated Basic Local Alignment Search Tool (PSI-BLAST), could be used to assign structural data to between 22 and 36% of microbial genomes in order to improve functional annotation and enhance understanding of biological mechanism. However, on a cautionary note, an analysis of functional conservation within fold groups and homologous superfamilies in the CATH database revealed that whilst function was conserved in nearly 55% of enzyme families, function had diverged considerably in some highly populated families. In these families, functional properties should be inherited far more cautiously, and the probable effects of substitutions in key functional residues must be carefully assessed.
Figure 1
Schematic representation of the (C)lass, (A)rchitecture and (T)opology/fold levels in the CATH database
Figure 2
Snapshot of a web page showing data available in the CATH dictionary of homologous superfamilies for the subtilisin family (CATH id 3.40.50.200). Tables display the PDB codes for non-identical relatives in the family, together with EC identifier codes and information about the enzyme reactions. The multiple structural alignment shown has been coloured according to secondary structure assignments (red for helix, blue for strands).
The Architecture level in CATH groups proteins whose folds have similar 3D arrangements of secondary structures (e.g. barrel, sandwich or propeller), regardless of their connectivity, whilst the top level, Class, simply reflects the proportion of α-helix or β-strand secondary structures. Three major classes are recognised (mainly-α, mainly-β and α-β), since analysis revealed considerable overlap between the α+β and alternating α/β classes originally described by Levitt and Chothia (7).
Before classification, multidomain proteins are first separated into their constituent folds using a consensus method which seeks agreement between three independent algorithms (8). Whilst the protocol for updating CATH is largely automatic (9), several stages require manual validation, in particular establishing domain boundaries in proteins for which no consensus could be reached and checking the relationships of very distant homologues and proteins having borderline fold similarity. Although there are plans to assign the more regular architectures automatically, all architecture groupings are currently assigned manually.
A homologous family Dictionary is now available within CATH, which contains functional data, where available, for each protein within a homologous family. This includes EC identifiers, SWISS-PROT keywords and information from the Enzyme database or the literature (Fig. 2). Multiple structure-based alignments are also available, coloured according to secondary structure assignments or residue properties, and there are schematic plots showing domain representations annotated by protein-ligand interactions (DOMPLOTS) (A.E. Todd, C.A. Orengo and J.M. Thornton, submitted to Protein Engng). The topology of each domain is illustrated by schematic TOPS diagrams (http://www3.ebi.ac.uk/tops; 10).
Figure 3
CATH wheel plot showing the population of homologous families in different fold groups, architectures and classes. The wheel is coloured according to protein class (red, mainly-α; green, mainly-β; yellow, α-β; blue, few secondary structures). The size of the outer wheel represents the number of homologous families in CATH, whilst each band in the outer wheel corresponds to a single fold family. The size of each fold 'band' therefore reflects the number of homologous families having that fold. It can be seen that most fold families contain a single homologous family. The superfold families are shown as paler bands containing many homologous families. The inner wheel shows the population of homologous families in the different architectures.
We have also recently set up a Web Server (11) which enables the user to scan the CATH database with a newly determined protein structure and identify possible fold similarities or evolutionary relationships. There are also plans to incorporate sequence searches (using BLAST or PSI-BLAST) (12) to identify a probable fold for a new sequence.
The latest release of CATH (version 1.4, April 1998) contains 9342 protein chains from the PDB (13), which divide into 13 359 domain folds. Currently 32 different architectures are recognised. Since the last release, three new architectures have been described, including the five-bladed α-β propeller. Grouping proteins on the basis of sequence, structure and functional similarity gives 827 evolutionary homologous families (H-level), whilst recognising more distant structural similarity with no accompanying sequence or function similarity gives rise to 593 different fold groups (T-level).
The population of the different levels in the CATH hierarchy is illustrated by the CATH wheel shown in Figure 3. It can be seen that several highly populated fold families, which we describe as superfolds (6) as they support a diverse range of sequences and more than three different functions, still account for nearly 30% of non-homologous structures.
Implications for Structural Genomics
As the sequence databases grow rapidly, the need to interpret these sequences and assign functions to specific genes becomes increasingly important. Many techniques exist for matching protein sequences and thereby inheriting functional information. However, for very distant homologues there is often no detectable sequence similarity despite conservation of 3D structure and function. For these cases, evolutionary relationships, and thereby functions, can only be assigned by comparing the structures. Therefore a number of structural genomics initiatives are being proposed (14) which aim to identify all the folds in nature, with the ultimate goal of being able to predict the function of a new protein from its known or probable structure. The important questions to ask are: how many more folds do we need to determine before we have the complete set, and how confident can we be in assigning function between proteins having similar structures?
In the current genomes, on average only 30–46% of sequences can be assigned to a structural family by recognising sequence similarity to a protein of known structure (15,16). With only ~600 unique structures currently in the PDB, compared with ~20 000 sequence families, it is clear that we still need to determine many more structures if we are to understand biology at the molecular level. However, analysis of recently deposited structural data is very revealing. Figure 4a illustrates the distribution of 2159 new structural domains classified in the 10 months from June 1997 to March 1998. A large proportion of these (79%) were clearly homologous (≥30% identity) to proteins of known structure.
Of the remaining 443 structures (Fig. 4b), corresponding to 'new' sequences, we found only 8% were novel folds, the remainder resembling a previously determined structure. Many of these, 199 (45%), could be identified as clear homologues by having significant structure and sequence similarity (SSAP ≥80 and ≥20% sequence identity). A further 169 (38%) were probable homologues: although the sequence identity was below 20%, they had functional similarity and/or gave significant scores using sequence search methods designed to detect very distant homologues (PSI-BLAST) (12). There remained a further 40 (9%) proteins which were analogous, i.e. they had the same fold as a previous entry but neither the sequence nor the function gave definite evidence of a common ancestor.
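The counts and percentages quoted for the 2159 new domains can be cross-checked with a few lines of arithmetic. In the sketch below the number of novel folds is not given in the text; it is derived on the assumption that the four categories together account for all 443 'new' sequences.

```python
total_new_domains = 2159
new_sequences = 443  # domains not clearly homologous to a known structure

counts = {
    "clear homologues": 199,
    "probable homologues": 169,
    "analogues": 40,
}
# Derived, not stated in the text: whatever is left over is taken to be novel folds.
counts["novel folds"] = new_sequences - sum(counts.values())

percentages = {name: round(100 * n / new_sequences) for name, n in counts.items()}
clearly_homologous_pct = round(100 * (total_new_domains - new_sequences) / total_new_domains)

print(counts["novel folds"])   # 35
print(percentages)             # 45, 38, 9 and 8 percent respectively
print(clearly_homologous_pct)  # 79
```

The rounded values agree with the figures in the text (79%, 45%, 38%, 9% and 8%), and the clear plus probable homologues together give the 83% of new sequences that structure allows to be assigned to a known family.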
At the homologous superfamily level in CATH, a more detailed analysis of enzyme functions showed that the majority of homologous enzyme families in CATH (>90%) contained proteins for which the first three EC identifiers were the same. Considering those families where homologues have significant sequence identity (≥20%) after structural alignment, 95% were found to have a single EC identifier, whilst for families where proteins have more than 30% sequence similarity we observed that 98% had a single EC code.
Although assigning function on the basis of homology is common practice, it is clear that some caution should be exercised, particularly where there is little or no sequence similarity. There are also some clear examples where homologues with significant sequence similarity perform different functions. The role of 'gene recruitment' is especially clear in the eye lens proteins, which function as enzymes in other cellular environments but are used as structural proteins in this context (17). The extent of such 'gene recruitment' and context-sensitive function is really not known at this time. For enzymes it is clear that catalytic function can change and evolve, usually to act on a different but related substrate. Similarly, within the lipocalin family (CATH id 2.40.130.10), several proteins are found with very similar structures which bind different fatty acids in the same region at the base of the β-barrel (e.g. retinol, bilin, biotin).
Nearly half of the homologous families where two or more different EC numbers were observed belong to the superfolds. This suggests that if a new protein is assigned to a superfold family, more caution should be used when inheriting functional information, as there appears to be greater tolerance to changes in sequence and ultimately function for these families. However, it is interesting to note that many of these were TIM barrel or Rossmann folds. These are superfolds in which the substrate or ligand commonly binds in the same place: in the base of the β-barrel for the TIMs and at the crossover of the polypeptide chain for the doubly wound Rossmann structures.
Assignment of Function Through Structure
One of the reasons for determining structures is to derive more information to facilitate the assignment of function. From our analysis of proteins in CATH, we suggest that structural data can help to assign function in several ways.
i. The structural data allow recognition of more distant homologues compared with sequence data. In our analysis, 83% of structures with novel sequences could be assigned as homologues in this way (note that such assignment of function is again subject to the caveats imposed by 'gene recruitment' discussed above).
ii. The structural data allow detailed inspection of the functional site, to suggest if and how the function may have evolved. For example, if an enzyme has evolved to act on a different substrate, the binding site may reveal or at least suggest possible changes in the substrate.
iii. For the superfolds, similarity of structure does not necessarily mean similarity of function. However, the active site/binding sites are often conserved, e.g. in the TIM barrel or Rossmann fold structures the ligand always binds at the same end of the barrel or sheet.
iv. Some methods have already been developed, and will increasingly be the focus of attention over the next few years, which aim to predict function ab initio from structure. For example, enzymes can often be identified by the presence of a major cleft which also locates the active site (18). Similarly, critical surface patches which are used for molecular recognition in binding other proteins or ligands may be identified using knowledge-based approaches (19,20).
In summary, extrapolating the data from Figure 4 to a new genome, we can expect that of the 54–70% of sequences which currently have no obvious sequence matches in the PDB, we will find nearly 80–90% to be homologous to a known family using the structural data alone. For the singlet folds this will almost certainly reveal some clues to the function. For the superfolds, some folds will reveal information on the functional class (e.g. enzyme for TIM barrels) or the location of the active site, if not the specific function. Only 10–20% will be expected to be 'novel' folds. For these, the ab initio methods referred to above may provide some clues to guide experiments. Therefore it is clear that determining structures, as part of a 'structural genomics' initiative for example, will make a major contribution to interpreting genome data.
Jail
Just another interface library
Interfaces of macromolecules are a valuable basis to analyse the process of molecular recognition. JAIL classifies not only the interfaces between domain architectures but also those between protein chains and those between proteins and nucleic acids.
Of course not
Gt-alpha/Gi-alpha chimera (PDB-ID 1GOT); interfaces of 1GOT
Interacting proteins are difficult to crystallize and are rarely present in the Protein Data Bank. Nevertheless, it is essential to analyse the interacting parts of the proteins to understand the process of protein-protein docking.
To overcome this problem we have built up the JAIL database. Since interacting domains exhibit structural features similar to those of proteins, all known interfaces between interacting domains of the SCOP database were extracted and classified in JAIL.
Only a part of all protein structures is included in SCOP. In particular, new PDB entries are not yet annotated. To overcome this problem, all interfaces between protein chains were additionally calculated and included in the database. This type of interface also comprises the interacting parts of the assumed biological units. The last important type of interface provided here is composed of the interacting parts between proteins and nucleic acids. Overall, the data set consists of about 180000 interfaces.
JAIL is a convenient tool for browsing through the interface library and analysing single interfaces. However, more general questions require large-scale analysis. For this purpose, a detailed form enables the compiling of comprehensive non-redundant data sets for download.
How is an interface defined?
A complete residue is part of an interface if at least one atom of the amino acid is located within 4.5 Å of any atom of the interacting domain or chain. One part of an interface must consist of at least 5 C-alpha atoms in the case of protein chains. In the case of nucleic acids, the relevant atoms are calculated on the basis of the phosphorus atoms of the RNA/DNA backbone.
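The distance criterion in this definition is easy to express in code. The following is a minimal sketch, not the JAIL implementation: the data layout (a dictionary from residue id to a list of atom coordinates), the function names and the brute-force all-against-all distance check are assumptions made for illustration.

```python
import math


def interface_residues(side_a, side_b, cutoff=4.5):
    """Return residues of side_a with at least one atom within `cutoff` angstroms of side_b.

    side_a, side_b: dict mapping a residue id to a list of (x, y, z) atom coordinates.
    """
    atoms_b = [atom for atoms in side_b.values() for atom in atoms]
    selected = []
    for residue_id, atoms in side_a.items():
        # The whole residue is taken as soon as any one of its atoms is close enough.
        if any(math.dist(a, b) <= cutoff for a in atoms for b in atoms_b):
            selected.append(residue_id)
    return selected


def is_large_enough(residues, minimum=5):
    """Each amino acid has one C-alpha atom, so count residues against the minimum."""
    return len(residues) >= minimum
```

A real calculation over whole PDB entries would normally use a spatial index rather than comparing every atom pair, but the selection rule stays the same.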
What are biounits?
The primary coordinate file deposited in the PDB generally contains one asymmetric unit. The asymmetric unit is the smallest portion of a crystal structure to which crystallographic symmetry can be applied to generate one unit cell. The biological molecule (biounit) is believed to be the functional unit of the protein. Frequently, those units can be assumed or calculated when additional information is available. The biological units of many proteins are deposited in a separate section of the PDB database and can be used for interface calculations.
In which way are redundant interfaces excluded in the download section?
Redundancy is excluded in two different ways: by structure and by sequence. The sequence clustering is based on the Cd-hit program. The structural clustering is defined by the protein families and superfamilies of the SCOP classification. The database classifies proteins by domain architecture.
Which settings in the download section are best for my own research?
The selection of the datasets depends on the type of interactions (protein-protein or protein-nucleic acid) and the level of diversity that is desired. A maximum sequence identity of 50% results in a higher diversity than a setting of 95%. The default settings include interfaces of domain-domain interactions as well as interfaces between interacting chains. All interfaces of chains that were already treated by the SCOP domain interfaces are excluded by default. This procedure results in a high number of interfaces that are still diverse enough for statistical analysis.
What is meant by 'show conservation' in Jmol?
The conservation of protein sequences is defined by the mutation rates at each amino acid position. For JAIL, this information was retrieved from ConSurf. ConSurf is a derived database merging structural and sequence information.
Database scheme
Search for:
PDB-Id, e.g. 1aay
SCOP-Id, e.g. d1az0a_
EC-Number, e.g. 311
Accession number, e.g. P03697
Protein name, e.g. capsid protein
Search in the following interface types:
Domain/Domain (SCOP)
Chain/Chain
Protein/Nucleic
Biounit/Biounit
None
Full-text search and SCOP text search: by keyword
Consider only interfaces having the following location: intra, inter, don't care
MMDB
Experimentally resolved structures of proteins, RNA and DNA, derived from the Protein Data Bank (PDB), with value-added features such as explicit chemical graphs, computationally identified 3D domains (compact substructures) that are used to identify similar 3D structures, as well as links to literature, similar sequences, information about chemicals bound to the structures, and more. These connections make it possible, for example, to find 3D structures for homologs of a protein sequence of interest, then interactively view the sequence-structure relationships, active sites, bound chemicals, journal articles, and more.
Three-dimensional structures are now known within many protein families, and it is quite likely, in searching a sequence database, that one will encounter a homolog with known structure. The goal of Entrez's 3D-structure database is to make this information, and the functional annotation it can provide, easily accessible to molecular biologists. To this end, Entrez's search engine provides three powerful features: (i) Sequence and structure neighbors; one may select all sequences similar to one of interest, for example, and link to any known 3D structures. (ii) Links between databases; one may search by term matching in MEDLINE, for example, and link to 3D structures reported in these articles. (iii) Sequence and structure visualization; identifying a homolog with known structure, one may view molecular-graphic and alignment displays to infer approximate 3D structure. In this article we focus on two features of Entrez's Molecular Modeling Database (MMDB) not described previously: links from individual biopolymer chains within 3D structures to a systematic taxonomy of organisms represented in molecular databases, and links from individual chains (and compact 3D domains within them) to structure neighbors, other chains (and 3D domains) with similar 3D structure. MMDB may be accessed at http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Structure
SUPERFAMILY is a database of structural and functional annotation for all
proteins and genomes.[1][2][3][4][5]
The SUPERFAMILY annotation is based on a collection of hidden Markov models which
represent structural protein domains at the SCOP superfamily level.[6]
A superfamily groups together domains which have an evolutionary relationship. The
annotation is produced by scanning protein sequences from completely sequenced
genomes against the hidden Markov models.
For each protein you can:
Submit sequences for SCOP classification
View domain organisation, sequence alignments and protein sequence details
For each genome you can:
Examine superfamily assignments, phylogenetic trees, domain organisation lists and networks
Check for over- and under-represented superfamilies within a genome
For each superfamily you can:
Inspect SCOP classification, functional annotation, Gene Ontology annotation, InterPro abstract and genome assignments
Explore taxonomic distribution of a superfamily across the tree of life
All annotation, models and the database dump are freely available for download to everyone.
Purpose
SUPERFAMILY classifies amino acid sequences into known structural domains, especially
into SCOP superfamilies. The superfamilies are groups of proteins which have structural
evidence to support a common evolutionary ancestor but may not have detectable
sequence homology.
Major Features
Sequence search: Submit your protein or DNA sequence for SCOP superfamily and family level classification.
Keyword search: Search for superfamily, family or species names, plus sequence, SCOP, PDB or hidden Markov model IDs.
Domain assignments: Domain assignments, alignments and architectures for completely sequenced eukaryotic and prokaryotic organisms, plus sequence collections.
Comparative genomics tools: Browse unusual (over- and under-represented) superfamilies and families, adjacent domain pair lists and graphs, unique domain pairs, domain combinations, domain architecture co-occurrence networks and domain distribution across taxonomic kingdoms for each organism.
Genome statistics: For each genome: number of sequences, number of sequences with assignment, percentage of sequences with assignment, percentage total sequence coverage, number of domains assigned, number of superfamilies assigned, number of families assigned, average superfamily size, percentage produced by duplication, average sequence length, average length matched, number of domain pairs and number of unique domain architectures.
Gene Ontology: Domain-centric Gene Ontology (GO), automatically annotated, by Hai Fang.
Phenotype Ontology: Domain-centric phenotype/anatomy ontology including Disease Ontology, Human Phenotype, Mouse Phenotype, Worm Phenotype, Yeast Phenotype, Fly Phenotype, Fly Anatomy, Zebrafish Anatomy, Xenopus Anatomy, Arabidopsis Plant.
Superfamily annotation: InterPro abstracts for 1052 superfamilies and Gene Ontology (GO) annotation for 763 superfamilies.
Functional annotation: Functional annotation of SCOP 1.73 superfamilies by Christine Vogel.
Phylogenetic trees: Trees are generated using heuristic parsimony methods and are based on protein domain architecture data for all genomes in SUPERFAMILY. Genome combinations or specific clades can be displayed as individual trees.
Similar domain architectures: Find the 10 domain architectures which are most similar to a domain architecture of interest.
Hidden Markov models: Produce SCOP domain assignments for your sequences using the SUPERFAMILY models. HMM visualisation by Martin Madera, e.g. model 0045110.
Profile comparison: Find remote domain matches when the HMM search fails to find a significant match. Profile comparison (PRC), for aligning and scoring two profile hidden Markov models, by Martin Madera.
Web services: Distributed Annotation Server and linking to SUPERFAMILY.
Downloads: Sequences, assignments, models, MySQL database and scripts, updated weekly.
Recent news
3rd June 2013 genetrainer being launched at LeWeb conference
The SUPERFAMILY database provides protein domain assignments, at the SCOP superfamily
level, for the predicted protein sequences in over 400 completed genomes. A superfamily groups
together domains of different families which have a common evolutionary ancestor based on
structural, functional and sequence data. SUPERFAMILY domain assignments are generated
using an expert curated set of profile hidden Markov models. All models and structural
assignments are available for browsing and download from http://supfam.org. The web interface
includes services such as domain architectures and alignment details for all protein assignments,
searchable domain combinations, domain occurrence network visualization, detection of over- or
under-represented superfamilies for a given genome by comparison with other genomes,
assignment of manually submitted sequences and keyword searches. In this update we describe
the SUPERFAMILY database and outline two major developments: (i) incorporation of family
level assignments and (ii) a superfamily-level functional annotation. The SUPERFAMILY
database can be used for general protein evolution and superfamily-specific studies, genomic
annotation, and structural genomics target suggestion and assessment.
HMM Hidden Markov model
HMMs are statistical models of the sequence consensus of a homologous family (see the docu). A particular class of HMMs has been shown to be equivalent to generalised profiles (8867839).
Applications of HMMs to sequence analysis are nicely provided by HMMer and SAM.
HMM consensus
The HMM consensus is a one-line summary of the corresponding HMM. The amino acid shown for the consensus is the highest probability amino acid at that position according to the HMM. Capital letters mean highly conserved residues (probability > 0.5 for protein models) (modified from the HMMer User's Guide).
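The consensus rule described above can be illustrated with a small sketch. The input format here (one dictionary of emission probabilities per match position) and the function name hmm_consensus are assumptions made for illustration; they are not the HMMer file format or API.

```python
def hmm_consensus(emissions, threshold=0.5):
    """Return a one-line consensus string for a profile.

    emissions: list of dicts, one per match position, mapping an
    amino-acid letter to its emission probability (hypothetical
    input format chosen for this sketch).
    """
    consensus = []
    for position in emissions:
        # Pick the highest-probability residue at this position.
        residue, prob = max(position.items(), key=lambda item: item[1])
        # Upper case marks a highly conserved residue (probability > threshold).
        consensus.append(residue.upper() if prob > threshold else residue.lower())
    return "".join(consensus)


example = [
    {"G": 0.9, "A": 0.1},
    {"L": 0.4, "I": 0.35, "V": 0.25},
    {"K": 0.6, "R": 0.4},
]
print(hmm_consensus(example))  # prints "GlK"
```

The second position is lower case because no residue there exceeds the 0.5 threshold.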
HMMer
The HMMer package ([1], [2]) provides multiple alignment and database searching capabilities. There are several programs in the package (see the docu), including one (hmmfs) that searches databases for non-overlapping LOCAL similarities (i.e. that match across at least part of the HMM) and another (hmmls) that searches databases for non-overlapping GLOBAL similarities (i.e. that match across the full HMM). These correspond approximately to profile-based searches using negative and positive profiles, respectively (see WiseTools). Database searches using hmmls or hmmfs provide alignment scores as bit scores.
Homology
Evolutionary descent from a common ancestor due to gene duplication.
Intracellular Domains
Domain families that are most prevalent in proteins within the cytoplasm.
Localisation
Numbers of domains that are thought, from SwissProt annotations, to be present in different cellular compartments (cytoplasm, extracellular space, nucleus, and membrane-associated) are shown in annotation pages.
Motif
Sequence motifs are short conserved regions of polypeptides. Sets of sequence motifs need not necessarily represent homologues.
NRDB non-redundant database
A database that contains no identical pairs of sequences. It can contain multiple sequences originating from the same gene (fragments, alternative splicing products). SMART in Standard mode uses such a database.
ORF
Open reading frame.
Outlier homologues
These are often difficult to detect using HMM methodology. A complementary approach to their
detection is to query a database of sequences taken from multiple sequence alignments using BLAST. Selecting this option will also activate searches against sequence databases derived from proteins of known structure. A simple BLAST search of the PDB is performed, together with a search of RPS-Blast profiles derived from SCOP. These profiles were kindly provided by Steffen Schmidt (see Schmidt et al., J. Chem. Inf. Comput. Sci. 2002 (42) 405-7).
P-value
This represents a probability that, given a database of a particular size, random sequences score higher than a value X. P-values are generated by the BLAST algorithm that has been integrated into SMART.
PDB protein data bank
PDB is an archive of experimentally determined three-dimensional structures (Brookhaven Nat. Labs or EBI). Domain families represented in SMART and in the PDB are annotated as being of known structure; links are provided in SMART to the PDB via PDBsum and MMDB. PDBsum links can be used to access a variety of sequence-based and structure-based tools, whereas MMDB provides access to literature information and structure similarities.
PFAM
Pfam is a database of protein domain families represented as (i) multiple alignments and (ii) HMM-profiles ([1], [2]). Pfam WWW servers allow comparison of user-supplied sequences with the Pfam database (Sanger Center and Washington Univ.). SMART contains a facility to search the Pfam collection using HMMer.
Prokaryotic domains
SMART now also searches for domains found in two-component regulatory systems. These can be found mainly in prokaryotes, but a few were also found in eukaryotes like yeast and plants.
Profile
A profile is a table of position-specific scores and gap penalties, representing a homologous family, that may be used to search sequence databases (Ref. [1], [2], [3]). In CLUSTAL-W-derived profiles, those sequences that are more distantly related are assigned higher weights ([4], [5], [6]). Issues in profile-based database searching are discussed in Bork & Gibson (1996) [7].
ProfileScan
An excellent WWW server that allows a user to compare a protein or DNA sequence against a database of profiles (located at the ISREC).
PROSITE
This is a dictionary of protein sites and motif patterns. Some SMART domain annotations contain links to PROSITE.
Schnipsel database domain sequence database
Schnipsel is a German word meaning snippet or fragment. The schnipsel database consists of the sequences of all domains found with SMART in NRDB. Outliers of a family often cannot be detected by a profile, yet are detectable by pairwise similarity to one or more established members of a sequence family. So searching against the schnipsel database gives complementary information to the profile searches.
Searched domains
In the first version of SMART only eukaryotic signalling domains could be searched. In 1998 we extended this set by prokaryotic signalling and extracellular domains. In the input page you can choose whether you only want to search cytoplasmic (prokaryotic + eukaryotic), extracellular, or all domains.
Secondary Literature
The secondary literature is derived by the following procedure. For each of the hand-selected papers referenced by a domain, 100 neighbouring papers are retrieved using Medline. If one of these neighbouring papers is referenced from more than two original papers, it is included in the secondary literature list.
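The selection rule for secondary literature can be sketched as a simple counting step. This is an illustration only: the neighbour lists are assumed to be fetched beforehand, and the function name secondary_literature and the paper IDs are hypothetical.

```python
from collections import Counter


def secondary_literature(neighbours_by_paper, min_sources=3):
    """Select neighbouring papers shared by enough original papers.

    neighbours_by_paper: dict mapping each hand-selected paper ID to the
    list of its neighbouring paper IDs (hypothetical, pre-fetched data).
    "More than two original papers" is read as at least three.
    """
    counts = Counter()
    for neighbours in neighbours_by_paper.values():
        # Count each neighbour at most once per original paper.
        counts.update(set(neighbours))
    return sorted(paper for paper, n in counts.items() if n >= min_sources)


neighbours = {
    "A": ["x", "y"],
    "B": ["x", "z"],
    "C": ["x", "y"],
    "D": ["y", "z"],
}
print(secondary_literature(neighbours))  # prints ['x', 'y']
```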
Seed Alignment
Alignment that contains only one of each pair of homologues that are represented in a CLUSTALW-derived phylogenetic tree linked by a branch of length less than a distance of 0.2 (see the related article).
SEG
A program of Wootton & Federhen [1] that detects regions of the query sequence that have low compositional complexity [2].
Sequence ID or ACC
Sequence identifiers or accession codes may be entered via the SMART homepage to initiate a query. You can use either Uniprot or Ensembl sequence identifiers.
Signalling Domains
The original set of domains used in SMART were collected as those that satisfied one or both of two criteria:
1 cytoplasmic domains that possess kinase, phosphatase, ubiquitin ligase or phospholipase enzymatic activities, or those that stimulate GTPase-activation or guanine nucleotide exchange
2 cytoplasmic domains that occur in at least two proteins with different domain
organisations, of which one also contains a domain that satisfies criterion 1
These domains mostly mediate or regulate the transduction of an extracellular signal towards the nucleus, resulting in the initiation of a cellular response. More recently, prokaryotic two-component signalling domains have been added to the SMART set.
SignalP
This program predicts the presence and location of signal peptide cleavage sites in amino acid sequences (SignalP home page).
SMART Simple Modular Architecture Research Tool
Our main goal in providing this tool is to allow automatic identification and annotation of domains in user-supplied protein sequences.
Species
Numbers of domains present in a variety of selected taxa (animal, archaea, bacteria, fungi, plants and protozoa) are shown in annotation pages.
SwissProt
The SwissProt database is an extensively annotated and non-redundant collection of protein sequences. SwissProt annotations have been mined for SMART-derived annotations of alignments.
TMHMM2
This program predicts the location and topology of transmembrane helices (TMHMM2).
Thresholds
For each of the domains found by SMART, a combination of thresholds is used to distinguish between true and false hits. The different thresholds are described in the SMART paper.
WiseTools
A package that is based on database searches using profiles. Profiles may be generated using PairWise and then compared with sequence databases using SearchWise. Scores are generated for alignments that match the whole of the profile (using a positive profile) or else that match at least part of the profile (using a negative profile). Only the top-scoring optimal alignment of each sequence is reported, hence SMART relies on re-iterating the search for new repeats until none are reported that score above threshold. Score thresholds have been set manually that are considered to represent a score just above the top-scoring true negative. Additional thresholds have been estimated for domains that are repeated in single polypeptides. A more recent package allows comparison of DNA sequences at the level of their conceptual translations, regardless of sequence error and introns (see Wise2).
What's new
Changes from version 6.0 to 7.0
Full text search engine
Perform a full text search of SMART and Pfam domain annotations, plus the complete protein descriptions for Uniprot and Ensembl proteins.
metaSMART
Explore and compare domain architectures in various publicly available metagenomics datasets.
iTOL export and visualization
Domain architecture analysis results can be exported and visualized in interactive Tree Of Life. The new option can be found in the protein list function select list.
User interface cleanup
Various small changes to the UI, resulting in faster and easier navigation.
Changes from version 5.1 to 6.0
Metabolic pathways information
SMART domains and proteins available in the genomic mode now have basic metabolic pathways information. It is generated by mapping our genomic mode protein database to the KEGG orthologous groups.
Updated genomic mode protein database
Greatly expanded genomic mode database now includes sequences from 630 completely sequenced species.
Changes from version 5.0 to 5.1
SMART webservice
You can access SMART using our webservice. Check the WSDL file for details. The webservice is still under active development and only works for sequences and IDs which are in our database (complete Uniref100 and stable Ensembl genomes).
SMART DAS server
SMART DAS server is available at URL http://smart.embl.de/smart/das. It provides all the protein features from our database (SMART domains, signal peptides, transmembrane regions and coiled coils) for all Uniref100 and Ensembl proteins.
If you need help with these services or have questions/feedback, please contact us.
Changes from version 4.1 to 5.0
New protein database in Normal mode
SMART now uses Uniprot as the main source of protein sequences. All Ensembl proteomes (except pre-releases) are also included. To lower redundancy in the database, the following procedure is used:
o only one copy of 100% identical proteins is kept (different IDs are still available)
o each species' proteins are separated
o CD-HIT clustering with 96% identity cutoff is performed on each species separately
o longest member of each protein cluster is used as the representative
o only representative cluster members and single proteins (i.e. proteins which are not members of any clusters) are used in all domain architecture queries and for domain counts in the annotation pages
Even though the number of proteins in the database is almost doubled (current version has around 2.9 million proteins), the redundancy should be minimal.
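Two steps of the procedure above, collapsing identical sequences and choosing the longest cluster member, can be sketched as follows. The clustering itself is assumed to come from an external tool such as CD-HIT and is not reproduced here; the function names, data layout and protein IDs are hypothetical.

```python
from collections import defaultdict


def collapse_identical(proteins):
    """Group protein IDs that share a 100% identical sequence.

    proteins: list of (protein_id, sequence) tuples.
    Returns a dict mapping each distinct sequence to all of its IDs,
    so one copy is kept while every ID stays available.
    """
    ids_by_sequence = defaultdict(list)
    for protein_id, sequence in proteins:
        ids_by_sequence[sequence].append(protein_id)
    return dict(ids_by_sequence)


def pick_representatives(clusters, sequences):
    """Return the longest member of every cluster.

    clusters: dict mapping a cluster name to a list of protein IDs
    (assumed output of an external clustering step, not computed here).
    sequences: dict mapping a protein ID to its sequence.
    """
    return {
        name: max(members, key=lambda pid: len(sequences[pid]))
        for name, members in clusters.items()
    }


sequences = {"p1": "MKT", "p2": "MKTAY", "p3": "MA"}
clusters = {"c1": ["p1", "p2"], "c2": ["p3"]}
print(pick_representatives(clusters, sequences))
```

Here cluster c1 is represented by p2, its longest member, and the singleton p3 represents itself.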
Domain architecture invention dating
As a further step from the single domain to the understanding of multi-domain proteins, SMART now predicts the taxonomic class where the concept of a protein, that is, its domain architecture, was invented. The domain architecture is defined as the linear order of all SMART domains in the protein sequence. To derive the point of its invention, all proteins with the same domain architecture are mapped onto NCBI's taxonomy. The last common ancestor of all organisms containing at least one protein with the domain architecture is defined as the point of its origin.
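The last-common-ancestor step can be sketched as the longest shared prefix of root-to-leaf lineages. This is a simplified illustration, not SMART's implementation; the lineages below are abbreviated and the function name last_common_ancestor is hypothetical.

```python
def last_common_ancestor(lineages):
    """Return the deepest taxon shared by all lineages.

    lineages: list of root-to-leaf taxonomy paths, one per organism that
    contains a protein with the domain architecture of interest.
    """
    ancestor = None
    # zip stops at the shortest lineage, so ranks are compared level by level.
    for ranks in zip(*lineages):
        if all(rank == ranks[0] for rank in ranks):
            ancestor = ranks[0]
        else:
            break
    return ancestor


lineages = [
    ["cellular organisms", "Eukaryota", "Metazoa", "Chordata"],
    ["cellular organisms", "Eukaryota", "Metazoa", "Arthropoda"],
    ["cellular organisms", "Eukaryota", "Fungi"],
]
print(last_common_ancestor(lineages))  # prints "Eukaryota"
```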
Changes from version 4.0 to 4.1
Two modes of operation: Normal and Genomic
For more details visit the change mode page.
Intrinsic protein disorder prediction
You can now include DisEMBL prediction of protein disorder in your search parameters. The results are displayed as blue regions in protein schematics. DisEMBL's HOTLOOPS and REM465 methods are currently used. Visit the DisEMBL page for more information on the method.
Catalytic activity check
SMART now includes data on specific requirements for catalytic activity for some of our domains (50 catalytic domains at the moment, full list here). If the required amino acids are not present in the predicted domain, it will be marked as Inactive. The domain annotation page will show you the details on which amino acids are missing and the links to relevant literature (Pubmed link).
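The idea behind the catalytic activity check can be illustrated with a minimal sketch. It assumes that required residues are given as positions within the predicted domain sequence; how SMART actually maps required residues onto a domain is not described in the text, and the function name and example data are hypothetical.

```python
def check_catalytic_activity(domain_sequence, required_residues):
    """Report whether all required residues are present.

    required_residues: dict mapping a 1-based position within the domain
    to the set of amino acids accepted there (hypothetical data format).
    Returns (status, missing), where missing lists (position, found residue).
    """
    missing = []
    for position, allowed in sorted(required_residues.items()):
        in_range = 1 <= position <= len(domain_sequence)
        found = domain_sequence[position - 1] if in_range else None
        if found not in allowed:
            missing.append((position, found))
    status = "Active" if not missing else "Inactive"
    return status, missing


required = {3: {"D"}, 5: {"K"}}
print(check_catalytic_activity("HRDLKPEN", required))
print(check_catalytic_activity("HRNLKPEN", required))
```

The first call reports an active domain; the second reports an inactive one because position 3 carries N instead of the required D.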
Taxonomic trees
Architecture query results are now displayed as simple taxonomic trees. In addition to individual proteins, you can select any taxonomic node (or multiple nodes) and display all the proteins in those nodes. The Evolution section of domain annotation pages is now also represented as a taxonomic tree.
The basic tree contains only several hand-picked representative species, with a link to the full tree.
User interface redesigned
SMART has been completely rewritten and all pages conform to the XHTML 1.0 Strict and CSS Level 2 standards. We recommend a modern standards-compliant browser for the best experience. Mozilla Firefox and Opera are our favorites.
Changes from version 3.5 to 4.0
Intron positions are shown in protein schematics
For proteins that match any of the Ensembl predictions, SMART will show intron positions as vertical coloured lines in graphical representations (see example). This information is retrieved from a pre-calculated mapping of Ensembl gene structures to protein sequences.
The vertical line at the end of the protein is not an actual intron, but a mark to show that intron mapping was performed. If that is the only line, there are no introns annotated. If there is no line at all, there is no data available in Ensembl for that particular sequence.
You can switch off intron display on your SMART preferences page.
Alternative splicing information
Since SMART now incorporates Ensembl genomes, the Additional information page shows a list of alternative splices of the gene encoding the analyzed protein (if there are any). It is possible to either display SMART protein annotation for any of the alternative splices or get a graphical multiple sequence alignment of all of them.
Orthology information
There are 2 separate sets of orthologs for each Ensembl protein: 1:1 reciprocal best matches in other genomes, and orthologous groups with reciprocal best hits from all genomes analyzed (i.e. each of these proteins has exactly one ortholog in all 6 genomes). This data is displayed on the Additional information page.
Graphical multiple sequence alignments
Orthologous groups and different alternative splices can be displayed as graphical multiple
sequence alignments. Proteins are aligned using ClustalW. Domains, intrinsic features and introns are mapped onto the alignment, with their positions adjusted according to gaps (black boxes).
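Adjusting a feature position according to gaps amounts to mapping ungapped sequence coordinates onto alignment columns. The sketch below illustrates that mapping under simple assumptions (1-based inclusive coordinates, "-" as the gap character); the function name and example row are hypothetical.

```python
def to_alignment_columns(aligned_sequence, start, end, gap="-"):
    """Map a feature from sequence coordinates to alignment columns.

    start, end: 1-based, inclusive positions on the ungapped sequence.
    Returns the 1-based columns of those residues in the gapped row.
    """
    # Column numbers of every non-gap character, in sequence order.
    columns = [
        column
        for column, char in enumerate(aligned_sequence, start=1)
        if char != gap
    ]
    return columns[start - 1], columns[end - 1]


row = "MK--TAYI-AK"
# Residues 3 to 5 (T, A, Y) sit in alignment columns 5 to 7.
print(to_alignment_columns(row, 3, 5))  # prints (5, 7)
```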
Changes from version 3.4 to 3.5
Features of all Ensembl genomes are stored in SMART
You can use standard Ensembl protein identifiers (for example ENSP00000264122) and sequences in all queries.
Batch access
Quickly display results for hundreds of sequences. Currently limited to 500 sequences per access, 2000 per day.
Smarter handling of identical sequences
If there are multiple IDs associated with a particular sequence, an extra table will be displayed
showing all of them.
Changes from version 3.3 to 3.4
Search structure based profiles using RPS-Blast
Clicking on the search schnipsel and structures checkbox will now also initiate a search of profiles based on SCOP domain families using RPS-Blast. These profiles were kindly provided by Steffen Schmidt (see Schmidt et al., J. Chem. Inf. Comput. Sci. 2002 (42) 405-7).
Improved Architecture analysis
In addition to standard Domain selection querying, it is now possible to do queries based on GO (Gene Ontology, click here for more info) terms associated with domains. In the first step you get a list of domains matching the GO terms entered. After selecting the domains of interest from the list, proteins containing those domains are displayed. Use the Taxonomic selection box to limit the results based on taxonomic ranges.
Pfam domains are stored in the database
SMART database now contains precomputed results for all Pfam domains. To use Pfam domains in the architecture queries, prepend the domain name with Pfam: (for example TyrKc AND Pfam:Fz AND TRANS).
Try our SMART Toolbar for Mozilla web browser. Click here for more info.
Changes from version 3.2 to 3.3
Fantastic new protein picture generator
Proteins are now displayed as dynamically generated PNG images. This means that you can download the entire protein representation as a single image. You have our permission to do this and use these diagrams in any way that you like; do acknowledge us, though. All domain bubbles have been script-generated using The Gimp and its Perl-Fu extension. The script is GPL'd and is available for download.
Fancy colour alignments
All alignments we generate are coloured in using Leo Goodstadt's excellent CHROMA program. CHROMA is available from here and it's great. You may experience some problems if you're using a klunky browser. This will be fixed when you change your browser.
Excellent transmembrane domains
These are now calculated using the fine TMHMM2 program (the main web site is here), kindly provided by Anders Krogh and co-workers. You can read about the method here.
Selective SMART
We now store intrinsic features, such as transmembrane domains and signal peptides, in our database. This means that these can be queried for (e.g. SIGNAL AND TyrKc). The feature names are: Signal peptide, SIGNAL; transmembrane domain, TRANS; coiled-coil, COIL.
Technical changes
SMART code has been modified to run under the Apache mod_perl module. SMART is now using Apache::DBI for persistent database connections. Database engine has been updated to PostgreSQL 7.1. These changes resulted in significant speed improvements.
Bug fixes
As usual. The show_many_proteins script now uses the POST method, so there is no longer a limit on the number of proteins you can display. Fixed a display problem with proteins having thousands of different representations (if you try to display those, make sure you have a good browser (like Mozilla or Opera :-) and a bunch of RAM).
Changes from version 3.1 to 3.2
Start page
There is an indicator of the current database status in the header now. If the database is down or is being updated, you'll know immediately.
Literature
Changed literature identifiers to PMID. New secondary literature generator that parses all
neighbouring papers, not just the first 100. Numerous small bug fixes and improvements.
Changes from version 3.0 to 3.1
Startup page
The start page now includes selective SMART and allows searching for keywords in the annotation of domains.
Schnipsel Blast
The results of a schnipsel Blast search are included in the bubble diagram. Additionally, PDB is searched.
Taxonomic breakdown
When selecting multiple proteins, e.g. via selective SMART or the annotation pages, an overview of the taxonomy of all species (tax break) is offered.
Links
Links don't contain version numbers. This allows stable links from external sources.
Selective SMART
Allows searching for multiple copies of domains and is case insensitive. You can now search for e.g. Sh3 AND sH3 AND sh3.
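The behaviour described for selective SMART queries (AND-combined terms, repeated terms meaning multiple copies, case-insensitive names) can be sketched as below. This is an illustration of the matching logic only, not SMART's query engine; the function name matches_query and the example architectures are hypothetical, and only the AND operator is handled.

```python
from collections import Counter


def matches_query(query, architecture):
    """Check a protein's domain list against an AND-only query.

    query: string such as "Sh3 AND sH3 AND sh3"; a term repeated n times
    requires at least n copies of that domain.
    architecture: list of domain or feature names found in one protein.
    """
    required = Counter(term.strip().lower() for term in query.split(" AND "))
    present = Counter(name.lower() for name in architecture)
    return all(present[name] >= count for name, count in required.items())


print(matches_query("Sh3 AND sH3 AND sh3", ["SH3", "SH2", "SH3", "SH3"]))
print(matches_query("Sh3 AND sH3 AND sh3", ["SH3", "SH2", "SH3"]))
print(matches_query("SIGNAL AND TyrKc", ["SIGNAL", "TyrKc", "TRANS"]))
```

The first and third calls match; the second does not, because the protein has only two SH3 copies.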
Annotation
You can align your query sequence to the SMART alignment using hmmalign.
Update of underlying database
SMART now uses PostgreSQL 6.5.2.
Changes from version 2.0 to 3.0
Digest output
SMART now only produces a single diagram representing a best interpretation of all the
annotation that has been performed. A comprehensive summary of the results is also provided in table format.
Selective SMART
Selective SMART allows you to look for proteins with combinations of specific domains in different species or taxonomic ranges.
Alert SMART
The SMART database gets updated about once a week. If you are interested in specific domains, or combinations of them, in specific taxonomic ranges, you can use the SMART alerting service. This provides the identities of newly deposited proteins that match your query.
Domain queries
You can ask for proteins having the same domain order and composition as your query protein.
SMARTed Genomes
You can get the result of SMARTing the genomes of Caenorhabditis elegans and Saccharomyces cerevisiae via the annotation pages.
Faster PFAM searches
The PFAM searches now run on a PVM cluster.
Changes from version 1.03 to 2.0
Domain coverage
The original set of signalling domains has now been extended to include extracellular domains.
Default search method is now HMMer
The previously used underlying methodology of SMART (i.e. SWise) tended to over-extend gaps, leading to problems in defining domain borders. Although a SWise search of the SMART database remains possible (using the Wise searching SMART database option available from the Home Page), the default searching method is now HMMer2. This now allows improved statistical estimates (Expectation- or E-values) of the significance of a domain hit.
Complete rewrite of the annotation pages
As with the previous version (1.03), annotation pages provide information concerning domain functions and include hyperlinks to PubMed. However, we now offer automatically derived data showing the taxonomic range and the predicted cellular localisation of proteins containing the domain in question.
Literature database
In addition to the manually derived literature sources given in version 1.03, we now provide automatically derived secondary literature from the annotation pages. These are extracted from PubMed and are cross-linked to additional abstracts.
The CATH domain structure database: new protocols and
classification levels give a more comprehensive resource for exploring evolution
Lesley H. Greene, Tony E. Lewis, Sarah Addou, Alison Cuff, Tim Dallman, Mark Dibley, Oliver Redfern, Frances Pearl, Rekha Nambudiry, Adam Reid, Ian Sillitoe, Corin Yeats, Janet M. Thornton and Christine A. Orengo
WHAT'S NEW
We report the latest release (version 3.0) of the CATH protein domain
database (http://www.cathdb.info). There has been a 20% increase in the
number of structural domains classified in CATH, up to 86 151 domains.
Release 3.0 comprises 1110 fold groups and 2147 homologous
superfamilies. To cope with the increases in diverse structural
homologues being determined by the structural genomics initiatives,
more sensitive methods have been developed for identifying boundaries
in multi-domain proteins and for recognising homologues. The CATH classification update is now being driven by an integrated pipeline that links these automated procedures with validation steps, which have been made easier by the provision of information-rich web pages summarising comparison scores and relevant links to external sites for each domain being classified. An analysis of the population of domains in the CATH hierarchy and several domain characteristics are presented for version 3.0. We also report an update of the CATH Dictionary
of homologous structures (CATH-DHS), which now contains multiple
structural alignments, consensus information and functional
annotations for 1459 well populated superfamilies in CATH. CATH is
directly linked to the Gene3D database, which is a projection of CATH
structural data onto ~2 million sequences in completed genomes and UniProt.
GENERAL INTRODUCTION
The number of new structures being deposited in the Protein Data Bank
(PDB) continues to grow at a considerable rate. In addition, structures
being targeted by worldwide structural genomics initiatives are more
likely to be novel or only very remotely related to domains previously
classified. Only 2% of structures currently solved by conventional
crystallography or NMR are likely to adopt novel folds (see Figures 1 and 2).
A higher proportion of new folds is expected among structures solved by
structural genomics. Although the influx of more diverse
structures and subsequent analysis will inform our understanding of
how domains evolve, it has resulted in increasing lags between the
numbers of structures being deposited and classified. In response to this situation, we have significantly improved our automated and manual protocols for domain boundary assignment and homologue recognition.
Figure 1. Annual decrease in the percentage of new structures classified in CATH which are observed to possess a novel fold. The raw data for years 1972-2005 was fit to a single exponential equation by nonlinear regression using Sigma Plot (SPSS Version
Figure 2. Annual proportion of protein structures deposited in the PDB which are classified in CATH, rejected or pending classification. The colour scheme reflects different categories of PDB chains. Black: not accepted by the CATH criteria. Red: unprocessed chains
Significant changes have been implemented in the CATH classification protocol to achieve a more highly automated system. First, a seamless flow of structures between the constituent programs has been achieved by building a pipeline which integrates web services for each major comparison stage in the classification (see Figure 3). Secondly, completely automatic decisions are now being made for new protein chains with close relatives already assigned in the CATH database. There are two situations that preclude the CATH update process from being fully automated. We rely on expert manual curation for particularly challenging protein domain boundary assignments (DomChop stage) and also for classifications of remote folds and homologues (HomCheck stage). These two manual stages will remain an integral part of the system (Figure 3).
Figure 3. Flow diagram of the CATH classification pipeline. This schematic illustrates the processes involved in classifying newly determined structures in CATH. The CATH update protocol workflow, from new chain to assigned domain, is split into two main processes.
In this paper we report our ongoing development of the automated procedures. These critical new features should better enable CATH to keep pace with the PDB (3) and facilitate its development. Key statistics on the domain structure populations and characteristics are also presented.
A REVISED CATH CLASSIFICATION HIERARCHY: CATHSOLID
In order to provide information on the sequence diversity between superfamily members, we have introduced additional levels into the CATH hierarchy. The CATH hierarchical classification scheme now consists of nine levels. Class is derived from secondary structure content, and Architecture describes the gross orientation of secondary structures, independent of connectivity. The Topology level clusters structures into fold groups according to their topological connections and numbers of secondary structures. The Homologous superfamilies cluster proteins with highly similar structures, sequences and/or functions (4,5). The new extension of the CATH classification system now includes five 'SOLID' sequence levels. S, O, L and I further divide domains within the H-level using multi-linkage clustering based on similarities in sequence identity (35%, 60%, 95% and 100%) (see Table 1). The D-level acts as a counter within the I-level and is appended to the classification hierarchy to ensure that every domain in CATH has a unique CATHsolid identification code (see Table 1). Specific details on the nature of the SOLID-levels can be found in the 'General Information' section of the CATH website, http://www.cathdb.info. CATH only includes experimentally determined protein structures with a 4 Å resolution or better, 40 residues in length or longer, and having 70% or more side chains resolved.
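The nine-level code and the inclusion criteria described above can be illustrated with a minimal Python sketch. It assumes that a CATHsolid code is written as nine dot-separated integers in the order C, A, T, H, S, O, L, I, D; the function names and the example code are hypothetical and are not taken from the CATH software.

```python
# Illustrative sketch only: the dot-separated layout and all names are assumptions.
LEVELS = ("C", "A", "T", "H", "S", "O", "L", "I", "D")


def parse_cathsolid(code):
    """Split a nine-level CATHsolid code into a {level: number} mapping."""
    parts = code.split(".")
    if len(parts) != len(LEVELS):
        raise ValueError(f"expected {len(LEVELS)} levels, got {len(parts)}")
    return dict(zip(LEVELS, (int(p) for p in parts)))


def superfamily_id(code):
    """Return the first four levels (C.A.T.H), i.e. the homologous superfamily."""
    return ".".join(code.split(".")[:4])


def meets_inclusion_criteria(resolution_angstrom, n_residues, fraction_side_chains):
    """Criteria quoted in the text: 4 A or better, >= 40 residues, >= 70% side chains."""
    return (
        resolution_angstrom <= 4.0
        and n_residues >= 40
        and fraction_side_chains >= 0.70
    )


if __name__ == "__main__":
    example = "3.40.50.200.1.1.1.1.2"  # hypothetical CATHsolid code
    print(parse_cathsolid(example))
    print(superfamily_id(example))
    print(meets_inclusion_criteria(2.1, 159, 0.95))
```

Keeping the first four levels as a separate key makes it easy to group domains by superfamily before looking at the sequence-based S, O, L, I clusters.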
Table 1. CATH version 3.0 statistics
DOMAIN BOUNDARY ASSIGNMENTS
We have further improved our automated domain boundary prediction method, CATHEDRAL (4). This is used to search a newly determined multi-domain structure against a library of representative structures from different fold groups in the CATH database to recognise constituent domains. CATHEDRAL performs an initial rapid secondary structure comparison between structures, using graph theory to identify putative fold matches, which are then more carefully aligned using a slower, more accurate dynamic programming method. A new scoring scheme has been implemented which combines information on the class of the domains being compared, their sizes, similarity in the structural environments and the number of equivalent residues. A support vector machine is used to combine the different scores and select the best fold match for each putative domain in a new multi-domain protein.
Benchmarking against a set of 964 'difficult' multi-domain chains, whose 1593 constituent domains were remotely related to folds in CATH (<35% sequence identity) and originated from 245 distinct fold groups and 462 superfamilies, showed that 90% of domains within these chains could be assigned to the correct fold group, and for 78% of them the domain boundaries were within ±15 residues of boundaries assigned by careful manual validation. Larger variations in domain boundaries are often due to the fact that in many families significant structural variation can occur during evolution, so that distant relatives vary considerably in size. If no close relative has been classified in CATH, it is likely that the only CATHEDRAL match will be to a relative with significant structural embellishments, thus making it harder to determine the correct boundaries.
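The ±15 residue benchmark criterion can be sketched in a few lines of Python. This is a hedged illustration, not the benchmarking code used for CATHEDRAL: it assumes each assignment is represented simply as a list of residue positions at which the chain is cut, and the function name is hypothetical.

```python
# Sketch under stated assumptions: boundaries are cut positions along the chain.
def boundaries_agree(predicted, reference, tolerance=15):
    """True if both assignments have the same number of cut points and every
    predicted cut lies within `tolerance` residues of the matching manual cut."""
    if len(predicted) != len(reference):
        return False
    pairs = zip(sorted(predicted), sorted(reference))
    return all(abs(p - r) <= tolerance for p, r in pairs)


if __name__ == "__main__":
    print(boundaries_agree([120, 250], [110, 262]))  # both cuts within 15 residues
    print(boundaries_agree([120], [140]))            # cut is 20 residues away
```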
Since domain boundary assignment of remote homologues is one of the most time-consuming stages in the classification, we combine multiple sources of information for each new structure on a web page to guide manual curation. Pages display scores from a range of algorithms, which include structure-based methods, CATHEDRAL (4), SSAP (6,7), DETECTIVE (8), PUU (9) and DOMAK (10); sequence-based methods such as hidden Markov models (HMMs) (11); and relevant literature. These pages are now viewable for information on putative boundaries for new multi-domain structures currently being classified in CATH (e.g. http://www.cathdb.info/cgi-bin/cath/Chain.pl?chain_id=2g3aA).
For protein chains which are closely related to chains that are already chopped in CATH, an automated protocol has been developed (ChopClose). ChopClose identifies any previously chopped chains that have sufficiently high sequence identity and overlap with the query chain. Using SSAP (6), the query is aligned against each of these chains in turn, and in each case the domain boundaries are inherited across the alignment. The process of inheriting the boundaries often requires some adjustments to be made to account for insertions, deletions or unresolved residues. If the inheritance from one of the chains meets various criteria (SSAP score ≥80, sequence identity >80%, RMSD ≤6.0 Å, longest end extension ≤10 residues, etc.), then the resulting boundaries are used to chop the chain automatically (AutoChop). For cases where ChopClose's best result does not meet all the criteria for automatic chopping, it is provided as support information for a manual domain boundary assignment.
Refer to Figure 3 for the location of AutoChop/ChopClose within the CATH update protocol.
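The AutoChop decision can be pictured as a simple threshold check. The sketch below encodes only the four criteria quoted above; the text ends its list with "etc.", so the real protocol applies further checks that are not shown here. The class and function names are hypothetical.

```python
# Minimal sketch: only the four thresholds quoted in the text are encoded.
from dataclasses import dataclass


@dataclass
class InheritedChopping:
    ssap_score: float             # SSAP structural similarity score (0-100)
    sequence_identity: float      # percent identity to the already chopped chain
    rmsd: float                   # RMSD of the alignment, in angstroms
    longest_end_extension: int    # residues extending past the inherited boundaries


def passes_autochop(result):
    """Return True if the inherited boundaries may be accepted automatically."""
    return (
        result.ssap_score >= 80
        and result.sequence_identity > 80
        and result.rmsd <= 6.0
        and result.longest_end_extension <= 10
    )


if __name__ == "__main__":
    good = InheritedChopping(92.5, 97.0, 1.8, 3)
    weak = InheritedChopping(92.5, 80.0, 1.8, 3)  # identity must be strictly above 80
    print(passes_autochop(good), passes_autochop(weak))
```

A result that fails the check is not discarded: as described above, it is passed on as supporting information for manual assignment.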
NEW HOMOLOGUE RECOGNITION METHODS
We have assessed a number of HMM-based protocols for improving homologue recognition. A new protocol (Samosa), exploiting models built using multiple structure alignments to improve accuracy, gives some improvements in sensitivity (4–5%). However, a protocol exploiting an 8-fold expanded HMM library based on sequence relatives of structural domains gives an increase of nearly 10% in sensitivity (12). In addition, HMM–HMM-based approaches have been implemented using the PRC protocol of Madera and co-workers (http://supfam.org/PRC). These allow recognition of extremely remote homologues, some of which are not easily detected by the structure comparison methods (discussed further below). HMM-based database scans developed for the CATH classification protocol are collectively referred to as HMMscan below.
For some very remotely related homologues, confidence in an assignment can be improved by combining information from multiple prediction methods. We have investigated the benefits of using machine learning methods to do this automatically. A neural network was trained using a dataset of 14 000 diverse homologues (<35% sequence identity) and 14 000 non-homologous pairs, with data from different homologue comparison methods including structure comparison (CATHEDRAL, SSAP), sequence comparison (HMM–HMM) and information on functional similarity. The latter was obtained by comparing EC classification codes between close relatives of the distant homologues and using a semantic similarity scoring scheme for comparing GO terms, based on a method developed by Lord et al. (13). On a separate validation set of 14 000 homologous pairs and 14 000 non-homologous pairs, 97% of the homologues can be recognised at an error rate of <4%.
NEW UPDATE PROTOCOL
Automatic methods
Previously, CATH data was generated using a group of independent
programs and flat files. Over the past two years we have developed an
update protocol for CATH that is driven by a suite of programs with a
central library and a PostgreSQL database system. A classification
pipeline has been established which links, in a completely automated
fashion, the different programs that analyse the sequences and structures
of both protein chains and domains. The CATH update protocol can
essentially be divided into two parts: domain boundary assignment and
domain homology classification (see Figure 3). The aim of the protocol is
to minimise manual assignment and provide as much support as
possible when manual validation is necessary.
Processing in both parts of the classification protocol is similar, requiring related meta-data and the triggering of the same automated algorithms. Methods include pairwise sequence similarity comparisons and scans by other homologue detection or fold recognition algorithms, such as HMMscan and CATHEDRAL, that provide data for either manual or automated assignment. Many of the automated steps in the protocol have been established as a web service, and the pipeline integrates automated steps together with 'holding stages' in which domains are held prior to processing and await the completion of manual validation of predictions (see below).
Web pages to support manual validation
For each manual stage in the classification (domain boundary assignment, DomChop, and homologue recognition, HomCheck) we have developed a suite of web pages bringing together all available meta-data from prediction algorithms (e.g. DBS, CATHEDRAL and HMMscan for DomChop; CATHEDRAL and HMMscan for HomCheck), information from the literature and information from other family classifications with relevant data (e.g. Pfam). For each protein or domain shown on the pages, information on the statistical significance of matches is presented. The web pages will shortly be made viewable and will provide interim data on protein chains and domains not fully classified in CATH, for biologists interested in any entries pending classification.
OVERVIEW OF THE CURRENT RELEASE (VERSION 3.0)
Assigning domain boundaries and relationships between protein structures is computationally challenging. Since the last CATH release (version 2.6), the number of domains in the CATH database has increased by 20% in version 3.0 and now totals 86 151. This is a more than 10-fold increase in the number of domains classified in CATH since its creation. Improvements in automation, and also in the web-based resources used to aid manual validation, have allowed us to increase the proportion of hard-to-classify structures processed in CATH, and this is reflected in a significant increase in new folds in the database, now more than 1000. The detailed breakdown of numbers of domains in the nine CATH levels is given in Table 1. We conducted an analysis of the domains in version 3.0 and have derived statistics for several fundamental features.
Percentage of new topologies
An analysis of the percentage of new folds arising from the early 1970s to the present is shown in Figure 1. The number of new folds has been decreasing over time with respect to the number of new structures being deposited, and it can be seen that currently approximately 2% of new structures classified in CATH are observed to be novel folds. For comparison, the number of domain structures solved over time is also graphically represented in Figure 1.
Number of domains within a protein chain
Integral to the construction of the CATH database is designating domain boundaries. We conducted an analysis of the number of chains versus the number of domains in a chain. It is interesting to note that 64% of all protein structures currently solved and classified in CATH are single-domain chains (data not shown). The next most prevalent are two-domain chains (27%), and following this we find that the number of chains containing three or more domains rapidly decreases. The average size of the single-domain chains is 159 residues.
The CATH Dictionary of Homologous Superfamilies (CATH-DHS)
The CATH-DHS has also been recently updated. Data on structural similarity and superfamily variability is presented as a significant update to the Dictionary of Homologous Superfamilies (DHS) web resource (14). The DHS also provides functional annotations of domains within each H-level (superfamily) in CATH v2.5.1. For each superfamily, pair-wise structural similarity scores between relatives, measured by SSAP, are presented. The DHS now contains 3307 multiple structural alignments for 1459 superfamilies. For each superfamily, multiple alignments are generated for all the relatives and also for subgroups of structurally similar relatives and sequence-similar relatives. Alignments are performed using the residue-based CORA algorithm (15) and presented both as CORAPLOTS (14) and in the form of a 2DSEC diagram (16), alongside co-ordinate data of the superposed structures in PDB format. Sequence representations of the alignments are available to download in FASTA format. In the CORAPLOT images of the multiple alignment, residues in each domain are coloured according to ligand binding and residue type. EquivSEC plots are also shown that describe the variability in orientation and packing between equivalent secondary structures (16).
To identify sequence relatives for CATH superfamilies, sequences from UniProt (17) were scanned against HMMs of all CATH domains (12). Homologous sequences were identified as those hits with an E-value < 0.01 and a 60% residue overlap with the CATH domain. This protocol recognised over one million domain sequences in UniProt which could be integrated in the CATH-DHS. The harvested sequences in each superfamily were compared against other relatives by BLAST (18) to determine the pair-wise sequence identity and then clustered at appropriate levels of sequence identity (35% and 95%) using multi-linkage clustering. Information and links to other functional databases, ENZYME (19), GO (Gene Ontology Consortium, 2000), KEGG (20), COG (21) and SWISSPROT (22), are also included by BLASTing the sequences from each superfamily against sequences provided by these resources. Only hits with 95% sequence identity and an 80% residue overlap were used to annotate sequences.
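The hit-acceptance rule for harvesting sequence relatives can be sketched as a small filter. This is an illustration under stated assumptions: the E-value cut-off is read as 0.01, "residue overlap" is taken to mean the fraction of the CATH domain covered by the aligned region, and the function name and its parameters are hypothetical.

```python
# Sketch only: the overlap definition and the 0.01 reading of the cut-off are assumptions.
def is_homologous_hit(e_value, aligned_residues, domain_length,
                      max_e_value=0.01, min_overlap=0.60):
    """Accept an HMM hit if its E-value is below the cut-off and the aligned
    region covers at least `min_overlap` of the CATH domain."""
    if domain_length <= 0:
        raise ValueError("domain_length must be positive")
    overlap = aligned_residues / domain_length
    return e_value < max_e_value and overlap >= min_overlap


if __name__ == "__main__":
    print(is_homologous_hit(1e-5, 120, 159))  # strong hit covering ~75% of the domain
    print(is_homologous_hit(1e-5, 60, 159))   # strong hit but only ~38% coverage
```

The same function can model the stricter annotation filter by changing the defaults, for example `min_overlap=0.80` combined with a separate 95% identity check.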
Recent analysis of structural and functional divergence in highly populated CATH superfamilies (>5 structural relatives with <35% sequence identity) has been undertaken using data from the DHS. The 2DSEC algorithm was used to analyse multiple structural alignments of families and identify highly conserved structural cores and secondary structure embellishments, or decorations, to the common core. In some large superfamilies extensive embellishments were observed outside the core, and although these secondary structure insertions were frequently discontinuous in the protein chain, they were often co-located in 3D space (16). In many cases manual inspection revealed that the embellishments had aggregated to form a larger structural feature that was modifying the active site of the domain or creating new surfaces for domain or protein interactions. Data collected in the DHS clearly shows a relationship between structural divergence within a superfamily, sequence divergence of this superfamily amongst predicted domains in the genomes, and the number of distinct functional groups that can be identified for the superfamily (see Figure 4).
Figure 4. Relationship between sequence variability, structural variability and functional diversity in CATH superfamilies. Structural variation in a CATH superfamily, as measured by the number of diverse structural subgroups (SSAP score <80 between groups).
LATERAL LINKS ACROSS THE CATH HIERARCHY TO CAPTURE EVOLUTIONARY DIVERGENCE AND EXPLORE THE STRUCTURAL CONTINUUM
Our analysis of structural divergence in CATH superfamilies (16) has revealed families where significant changes in the structures had occurred. In some cases 5-fold differences in the sizes of domains were identified, and sometimes it was apparent that the 'folds' of these very diverse relatives had effectively changed. Therefore in these superfamilies more than one fold group can be identified, effectively breaking the hierarchical nature of the CATH classification, which implies that each relative within a CATH homologous superfamily should belong to the same CAT fold group.
In addition, an 'all versus all' HMM–HMM scan between all superfamily representatives revealed several cases of extremely remote homologues which had been classified into separate superfamilies and yet match with significant E-values. Structure comparison had failed to detect the relationship between these superfamilies because the structural divergence of the relatives was so extreme, sometimes constituting a change in architecture as well as fold group. In these cases homology was only suggested by the HMM-based scans and then manually validated by considering functional information and detailed evidence from the literature. In order to capture information on these distant homologies, links have been created between the superfamilies, both on our web pages and in the CATH database. The data can now be found as a link from the CATH homepage (http://www.cathdb.info).
In the near future we also plan to provide web pages presenting cases of significant structural overlaps between superfamilies or fold groups. For these cases we are not currently able to find any additional evidence to support a distant evolutionary relationship, and these examples highlight the recurrence of large structural motifs between some folds and the existence of a structural continuum in some regions of fold space.
FEATURES
The CATH database can be accessed at http://www.cathdb.info. The web interface may be browsed or, alternatively, searched with PDB codes or CATH domain identifiers. There is also a facility for keyword searches. With the version 3.0 release we now make the raw and processed data files available, which include, for example, CATH domain PDB files, sequences and DSSP files, and they can be accessed through the CATH database main page. The Gene3D resource can be accessed through the CATH database or directly at http://www.cathdb.info/Gene3D. The DHS can be accessed through the CATH database or directly at http://www.cathdb.info/bsm/dhs.
Structural similarity is assessed using an automatic method (SSAP) (3,4), which scores 100 for
identical proteins and generally returns scores above 80 for homologous proteins. More distantly
related folds generally give scores above 70 (Topology or fold level), though in the absence of any
sequence or functional similarity this may simply represent examples of convergent evolution, reinforcing the hypothesis that there exists a limited number of folds in nature (5,6).
Abstract
We report the latest release (version 1.4) of the CATH protein domains database
(http://www.biochem.ucl.ac.uk/bsm/cath). This is a hierarchical classification of 13 359 protein domain structures into evolutionary families and structural groupings. We currently identify 827
homologous families in which the proteins have both structural similarity and sequence and/or
functional similarity. These can be further clustered into 593 fold groups and 32 distinct architectures.
Using our structural classification and associated data on protein functions stored in the database (EC
identifiers, SWISS-PROT keywords and information from the Enzyme database and literature), we have been able to analyse the correlation between the 3D structure and function. More than 96% of
folds in the PDB are associated with a single homologous family. However, within the superfolds,
three or more different functions are observed. Considering enzyme functions, more than 95% of clearly homologous families exhibit either single or closely related functions, as demonstrated by the
EC identifiers of their relatives. Our analysis supports the view that determining structures, for
example as part of a 'structural genomics' initiative, will make a major contribution to interpreting genome data.
Introduction FOR CATH
The CATH classification of protein domain structures was established in 1993 (1) as a
hierarchical clustering of protein domain structures into evolutionary families and structural
groupings, depending on sequence and structure similarity. There are four major levels,
corresponding to protein class, architecture, topology or fold, and homologous family (Fig 1).
Since 1995 information about these structural groups and protein families has been accessible
over the Web (http://www.biochem.ucl.ac.uk/bsm/cath), together with summary information
about each individual protein structure (PDBsum) (2).
CATH consists of both phylogenetic and phenetic descriptors for protein domain relationships.
At the lowest levels in the hierarchy, proteins are grouped into evolutionary families
(Homologous families) for having either significant sequence similarity (≥35% identity) or high
structural similarity and some sequence similarity (≥20% identity).
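The two routes into a homologous family can be written as a short rule. This is a hedged sketch: the text here does not quantify "high structural similarity", so the SSAP cut-off of 80 is an assumption borrowed from the SSAP ≥80 threshold quoted elsewhere in these notes, and the function name is hypothetical.

```python
# Sketch only: the SSAP >= 80 cut-off for "high structural similarity" is an assumption.
def same_homologous_family(seq_identity, ssap_score=None, ssap_threshold=80.0):
    """Apply the two criteria from the text.

    seq_identity: percent sequence identity between the two domains.
    ssap_score:   SSAP structural similarity score (0-100), if available.
    """
    if seq_identity >= 35:
        return True  # significant sequence similarity is sufficient on its own
    has_high_structural_similarity = (
        ssap_score is not None and ssap_score >= ssap_threshold
    )
    return has_high_structural_similarity and seq_identity >= 20


if __name__ == "__main__":
    print(same_homologous_family(40))        # sequence route
    print(same_homologous_family(25, 85))    # structure plus some sequence similarity
    print(same_homologous_family(15, 95))    # structure alone is not enough here
```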
The CATH database of protein structures contains approximately 18 000 domains
organized according to their (C)lass, (A)rchitecture, (T)opology and (H)omologous
superfamily. Relationships between evolutionarily related structures (homologues) within the
database have been used to test the sensitivity of various sequence search methods in order to
identify relatives in GenBank and other sequence databases. Subsequent application of the most
sensitive and efficient algorithms, gapped BLAST and the profile-based method Position-Specific
Iterated Basic Local Alignment Search Tool (PSI-BLAST), could be used to assign structural data to
between 22% and 36% of microbial genomes in order to improve functional annotation and
enhance understanding of biological mechanism. However, on a cautionary note, an analysis of
functional conservation within fold groups and homologous superfamilies in the CATH database
revealed that, whilst function was conserved in nearly 55% of enzyme families, function had
diverged considerably in some highly populated families. In these families functional properties
should be inherited far more cautiously, and the probable effects of substitutions in key functional
residues must be carefully assessed.
Figure 1
Schematic representation of the (C)lass, (A)rchitecture and (T)opology/fold levels in the CATH
database.
Figure 2
Snapshot of a web page showing data available in the CATH dictionary of homologous superfamilies
for the subtilisin family (CATH id 3.40.50.200). Tables display the PDB codes for non-identical relatives in the family, together with EC identifier codes and information about the enzyme reactions.
The multiple structural alignment shown has been coloured according to secondary structure
assignments (red for helix, blue for strands).
The Architecture level in CATH groups proteins whose folds have similar 3D arrangements of secondary structures (e.g. barrel, sandwich or propellor), regardless of their connectivity, whilst the
top level, Class, simply reflects the proportion of α-helix or β-strand secondary structures. Three
major classes are recognised, mainly-α, mainly-β and α–β, since analysis revealed considerable overlap between the α+β and alternating α/β classes originally described by Levitt and Chothia (7).
Before classification, multidomain proteins are first separated into their constituent folds using a
consensus method which seeks agreement between three independent algorithms (8). Whilst the protocol for updating CATH is largely automatic (9), several stages require manual validation, in
particular establishing domain boundaries in proteins for which no consensus could be reached, and checking the relationships of very distant homologues and proteins having borderline fold similarity.
Although there are plans to assign the more regular architectures automatically, all architecture
groupings are currently assigned manually.
A homologous family Dictionary is now available within CATH which contains functional data, where available, for each protein within a homologous family. This includes EC identifiers, SWISS-PROT keywords and information from the Enzyme database or the literature (Fig 2). Multiple
structure-based alignments are also available, coloured according to secondary structure assignments
or residue properties, and there are schematic plots showing domain representations annotated by protein-ligand interactions (DOMPLOTS) (A.E. Todd, C.A. Orengo and J.M. Thornton, submitted
to Protein Engng). The topology of each domain is illustrated by schematic TOPS diagrams
(http://www3.ebi.ac.uk/tops; 10).
Figure 3
CATH wheel plot showing the population of homologous families in different fold groups, architectures and classes. The wheel is coloured according to protein class (red: mainly-α; green:
mainly-β; yellow: α–β; blue: few secondary structures). The size of the outer wheel represents the
number of homologous families in CATH, whilst each band in the outer wheel corresponds to a single
fold family. The size of each fold 'band' therefore reflects the number of homologous families having
that fold. It can be seen that most fold families contain a single homologous family. The superfold families are shown as paler bands containing many homologous families. The inner wheel shows the
population of homologous families in the different architectures.
We have also recently set up a Web Server (11) which enables the user to scan the CATH database with a newly determined protein structure and identify possible fold similarities or evolutionary relationships. There are also plans to incorporate sequence searches (using BLAST or PSI-BLAST)
(12) to identify a probable fold for a new sequence.
The latest release of CATH (version 1.4, April 1998) contains 9342 protein chains from the PDB (13),
which divide into 13 359 domain folds. Currently 32 different architectures are recognised. Since the
last release three new architectures have been described, including the five-bladed α-β propellor. Grouping proteins on the basis of sequence, structure and functional similarity gives 827 evolutionary
homologous families (H-level), whilst recognising more distant structural similarity, with no
accompanying sequence or function similarity, gives rise to 593 different fold groups (T-level).
The population of the different levels in the CATH hierarchy is illustrated by the CATH wheel shown
in Figure 3. It can be seen that several highly populated fold families, which we describe as superfolds (6) as they support a diverse range of sequences and more than three different functions, still account
for nearly 30% of non-homologous structures.
Implications for Structural Genomics
As the sequence databases grow rapidly, the need to interpret these sequences and assign functions to
specific genes becomes increasingly important. Many techniques exist for matching protein sequences and thereby inheriting functional information. However, for very distant homologues there is often no
detectable sequence similarity, despite conservation of 3D structure and function. For these cases
evolutionary relationships, and thereby functions, can only be assigned by comparing the structures.
Therefore a number of structural genomics initiatives are being proposed (14) which aim to identify
all the folds in nature, with the ultimate goal of being able to predict the function of a new protein from
its known or probable structure. The important questions to ask are: how many more folds do we need
to determine before we have the complete set, and how confident can we be in assigning function between proteins having similar structures?
In the current genomes, on average only 30–46% of sequences can be assigned to a structural family
by recognising sequence similarity to a protein of known structure (15,16). With only ~600 unique
structures currently in the PDB, compared with ~20 000 sequence families, it is clear that we still
need to determine many more structures if we are to understand biology at the molecular level.
However, analysis of recently deposited structural data is very revealing. Figure 4a illustrates the
distribution of 2159 new structural domains classified in the 10 months from June 1997 to March
1998. A large proportion of these (79%) were clearly homologous (≥30% identity) to proteins of known structure.
Of the remaining 443 structures (Fig 4b), corresponding to 'new' sequences, we found only 8% were
novel folds, the remainder resembling a previously determined structure. Many of these, 199 (45%),
could be identified as clear homologues by having significant structure and sequence similarity (SSAP
≥80 and ≥20% sequence identity). A further 169 (38%) were probable homologues as, although the sequence identity was below 20%, they had functional similarity and/or gave significant scores using
sequence search methods designed to detect very distant homologues (PSI-BLAST) (12). There
remained a further 40 (9%) proteins which were analogous, i.e. they had the same fold as a previous
entry but neither the sequence nor the function gave definite evidence of a common ancestor.
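The percentages quoted for the 443 'new' structures can be checked with a few lines of arithmetic. In the sketch below the three counts (199, 169, 40) come from the text; the count of novel folds is not stated and is derived here as the remainder, which is an assumption of this check.

```python
# Worked check of the quoted breakdown; the novel-fold count is derived, not quoted.
total = 443
counts = {
    "clear homologues": 199,
    "probable homologues": 169,
    "analogues": 40,
}
counts["novel folds"] = total - sum(counts.values())  # remainder, not stated in the text

percentages = {name: round(100 * n / total) for name, n in counts.items()}

# Structures that could be assigned to a known family as homologues of either kind.
homologues_percent = percentages["clear homologues"] + percentages["probable homologues"]

if __name__ == "__main__":
    for name, n in counts.items():
        print(f"{name}: {n} ({percentages[name]}%)")
    print(f"homologues in total: {homologues_percent}%")
```

The rounded values reproduce the 45%, 38%, 9% and 8% given above, and the two homologue categories together account for 83% of the structures.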
At the homologous superfamily level in CATH, a more detailed analysis of enzyme functions showed
that the majority of homologous enzyme families in CATH (>90%) contained proteins for which the
first three EC identifiers were the same. Considering those families where homologues have
significant sequence identity (≥20%) after structural alignment, 95% were found to have a single EC
identifier, whilst for families where proteins have more than 30% sequence similarity we observed
that 98% had a single EC code.
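The comparison of EC identifiers can be illustrated with a small helper. This is a sketch, not the analysis code behind the figures above: it assumes EC numbers are written in the usual four-field dotted form, the function names are hypothetical, and the example EC numbers are illustrative values that do not come from the text.

```python
# Sketch only: names are hypothetical and the example EC numbers are illustrative.
def ec_prefix(ec_number, levels=3):
    """Return the first `levels` fields of a dotted EC number as a tuple."""
    fields = ec_number.split(".")
    if len(fields) != 4:
        raise ValueError(f"expected four EC fields, got {ec_number!r}")
    return tuple(fields[:levels])


def family_shares_ec(ec_numbers, levels=3):
    """True if every EC number in the family agrees on the first `levels` fields."""
    return len({ec_prefix(ec, levels) for ec in ec_numbers}) <= 1


if __name__ == "__main__":
    family = ["1.1.1.1", "1.1.1.27"]           # same first three fields
    print(family_shares_ec(family, levels=3))  # agree at three levels
    print(family_shares_ec(family, levels=4))  # differ in the fourth field
```

Calling the helper with `levels=4` corresponds to asking whether a family has a single EC code, the stricter test quoted above.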
Although assigning function on the basis of homology is common practice, it is clear that some
caution should be exercised, particularly where there is little or no sequence similarity. There are also
some clear examples where homologues with significant sequence similarity perform different
functions. The role of 'gene recruitment' is especially clear in the eye lens proteins, which function as
enzymes in other cellular environments but which are used as structural proteins in this context (17).
The extent of such 'gene recruitment' and context-sensitive function is really not known at this time.
For enzymes it is clear that catalytic function can change and evolve, usually to act on a different but
related substrate. Similarly, within the lipocalin family (CATH id 2.40.130.10) several proteins are found with very similar structures which bind different fatty acids in the same region at the base of
the β-barrel (e.g. retinol, bilin, biotin).
Nearly half of the homologous families where two or more different EC numbers were observed
belong to the superfolds. This suggests that if a new protein is assigned to a superfold family, more
caution should be used when inheriting functional information, as there appears to be greater tolerance to changes in sequence, and ultimately function, for these families. However, it is interesting to note
that many of these were TIM barrel or Rossmann folds. These are superfolds in which the substrate or
ligand commonly binds in the same place. This is in the base of the β-barrel for the TIMs and at the
crossover of the polypeptide chain for the doubly wound Rossmann structures.
Assignment of Function Through Structure
One of the reasons for determining structures is to derive more information to facilitate the assignment of function. From our analysis of proteins in CATH we suggest that structural data can help to assign
function in several ways.
i. The structural data allow recognition of more distant homologues compared with sequence data. In our analysis 83% of structures with novel sequences could be assigned as homologues in this way (note that such assignment of function is again subject to the caveats
imposed by 'gene recruitment' discussed above).
ii. The structural data allow detailed inspection of the functional site, to suggest if and
how the function may have evolved. For example, if an enzyme has evolved to act on a
different substrate, the binding site may reveal, or at least suggest, possible changes in the
substrate.
iii. For the superfolds, similarity of structure does not necessarily mean similarity of
function. However, the active site/binding sites are often conserved, e.g. in the TIM barrel or
Rossmann fold structures the ligand always binds at the same end of the barrel or sheet.
iv. Some methods which aim to predict function ab initio from structure have already been developed, and will increasingly be the focus of attention over the next few years. For
example, enzymes can often be identified by the presence of a major cleft, which also locates
the active site (18). Similarly, critical surface patches which are used for molecular
recognition in binding other proteins or ligands may be identified using knowledge-based approaches (19,20).
In summary, extrapolating the data from Figure 4 to a new genome, we can expect that, of the 54–70%
of sequences which currently have no obvious sequence matches in the PDB, we will find nearly 80–90% to be homologous to a known family using the structural data alone. For the singlet folds, this
will almost certainly reveal some clues to the function. For the superfolds, some folds will reveal
information on the functional class (e.g. enzyme for TIM barrels) or the location of the active site, if
not the specific function. Only 10–20% will be expected to be 'novel' folds. For these, the ab
initio methods referred to above may provide some clues to guide experiments. Therefore it is clear
that determining structures, as part of a 'structural genomics' initiative for example, will make a
major contribution to interpreting genome data.
JAIL
Just another interface library
Interfaces of macromolecules are a valuable basis for analysing the process of molecular recognition. JAIL classifies not only the interfaces between domain architectures, but also those between protein chains and those between proteins and nucleic acids.
Of course not
[Figure: Gt-alpha/Gi-alpha chimera (PDB-ID 1GOT); interfaces of 1GOT]
Interacting proteins are difficult to crystallize and are rarely present within the Protein Data Bank.
Nevertheless, it is essential to analyse the interacting parts of the proteins to understand the
process of protein-protein docking.
To overcome this problem we have built up the JAIL database. Since interacting domains
exhibit structural features similar to those of proteins, all known interfaces between interacting
domains of the SCOP database were extracted and classified in JAIL.
Only a part of all protein structures is included in SCOP. In particular, new PDB entries are
not yet annotated. To overcome this problem, all interfaces between protein
chains were additionally calculated and included in the database. This type of interface also comprises
the interacting parts of the assumed biological units. The last important type of interface
provided here is composed of the interacting parts between proteins and nucleic acids.
Overall, the data set consists of about 180000 interfaces.
JAIL is a convenient tool for browsing the interface library and analysing single
interfaces. However, more general questions require large-scale analysis. For this purpose,
a detailed form enables the compiling of comprehensive non-redundant data sets for
download.
How is an interface defined?
A complete residue is part of an interface if at least one atom of the amino acid is located
within 4.5 Ångström of any atom of the interacting domain or chain. One part of
an interface must consist of at least 5 C-alpha atoms in the case of protein chains. In the case of nucleic acids, the relevant atoms are calculated on the basis of the phosphorus atoms
of the RNA/DNA backbone.
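The distance criterion above can be sketched in a few lines of Python. This is only an illustration of the stated rule, not JAIL's own code: the function names and the input layout (a dictionary mapping a residue identifier to a list of atom coordinates) are assumptions made for the example.

```python
import math

def interface_residues(chain_a, chain_b, cutoff=4.5):
    """Return the residues of chain_a that lie in the interface with chain_b.

    chain_a, chain_b: dict mapping a residue id to a list of (x, y, z) atom
    coordinates in Angstrom (hypothetical input layout).
    A residue counts as a whole if any one of its atoms is within `cutoff`
    of any atom of the partner chain.
    """
    partner_atoms = [atom for atoms in chain_b.values() for atom in atoms]
    selected = []
    for residue_id, atoms in chain_a.items():
        if any(math.dist(a, b) <= cutoff for a in atoms for b in partner_atoms):
            selected.append(residue_id)
    return selected

def is_valid_interface_part(residue_ids, min_residues=5):
    """One C-alpha atom per residue, so 5 C-alpha atoms means 5 residues."""
    return len(residue_ids) >= min_residues

# Toy example: residue 1 is 4.0 Angstrom from the partner, residue 2 is far away.
chain_a = {1: [(0.0, 0.0, 0.0)], 2: [(10.0, 0.0, 0.0)]}
chain_b = {1: [(0.0, 0.0, 4.0)]}
part = interface_residues(chain_a, chain_b)
```

The all-against-all distance loop is quadratic in the number of atoms; for real structures a spatial index (for example a k-d tree) would normally replace it.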
What are biounits?
The primary coordinate file deposited in the PDB generally contains one asymmetric unit.
The asymmetric unit is the smallest portion of a crystal structure to which crystallographic
symmetry can be applied to generate one unit cell. The biological molecule (biounit) is believed
to be the functional unit of the protein. Frequently, those units can be assumed or calculated when additional information is available. The biological units of many proteins are deposited
in a separate section of the PDB database and can be used for interface calculations.
In which way are redundant interfaces excluded in the download section?
Redundancy is excluded in two different ways: by structure and by sequence. The
sequence clustering is based on the CD-HIT program. The structural clustering is defined by
the protein families and superfamilies of the SCOP classification. The database classifies
proteins by domain architecture.
Which settings in the download section are best for my own research?
The selection of the datasets depends on the type of interactions (protein-protein or protein-nucleic
acid) and the level of diversity that is desired. A maximum sequence identity of 50%
results in higher diversity than a setting of 95%. The default settings include interfaces of
domain-domain interactions as well as interfaces between interacting chains. All interfaces of
chains that were already covered by the SCOP domain interfaces are excluded by default.
This procedure results in a high number of interfaces that are still diverse enough for
statistical analysis.
What is meant by 'show conservation' in Jmol?
The conservation of protein sequences is defined by the mutation rates at each amino acid
position. For JAIL, this information was retrieved from ConSurf. ConSurf is a derived
database merging structural and sequence information.
Database scheme
The search form accepts the following identifiers:
PDB-Id, e.g. 1aay
SCOP-Id, e.g. d1az0a_
EC-Number, e.g. 3.1.1
Accession number, e.g. P03697
Protein name, e.g. capsid protein
The search can be restricted to the following interface types: Domain/Domain (SCOP), Chain/Chain, Protein/Nucleic, Biounit/Biounit, or none.
A full-text keyword search and a SCOP text keyword search are also available, and results can be limited by interface location: intra, inter, or don't care.
MMDB
Experimentally resolved structures of proteins, RNA and DNA, derived from the Protein Data Bank (PDB), with value-added features such as explicit chemical graphs, computationally
identified 3D domains (compact substructures) that are used to identify similar 3D structures, as well as links to literature, similar sequences, information about chemicals bound to the structures, and more. These connections make it possible, for example, to find 3D structures for homologs of a protein sequence of interest, then interactively view the sequence-structure relationships, active sites, bound chemicals, journal articles, and more.
Three-dimensional structures are now known within many protein families, and it is
quite likely, in searching a sequence database, that one will encounter a homolog
with known structure. The goal of Entrez's 3D-structure database is to make this
information, and the functional annotation it can provide, easily accessible to molecular biologists. To this end, Entrez's search engine provides three powerful
features: (i) Sequence and structure neighbors: one may select all sequences similar
to one of interest, for example, and link to any known 3D structures. (ii) Links
between databases: one may search by term matching in MEDLINE, for example, and
link to 3D structures reported in these articles. (iii) Sequence and structure
visualization: identifying a homolog with known structure, one may view molecular-graphic
and alignment displays to infer approximate 3D structure. In this article we
focus on two features of Entrez's Molecular Modeling Database (MMDB) not
described previously: links from individual biopolymer chains within 3D structures to a systematic taxonomy of organisms represented in molecular databases, and
links from individual chains (and compact 3D domains within them) to structure
neighbors, other chains (and 3D domains) with similar 3D structure. MMDB may be
accessed at http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Structure
SUPERFAMILY is a database of structural and functional annotation for all
proteins and genomes.[1][2][3][4][5]
The SUPERFAMILY annotation is based on a collection of hidden Markov models, which
represent structural protein domains at the SCOP superfamily level.[6]
A superfamily groups
together domains which have an evolutionary relationship. The annotation is produced by
scanning protein sequences from completely sequenced genomes against the hidden Markov
models.
For each protein you can:
Submit sequences for SCOP classification
View domain organisation, sequence alignments and protein sequence details
For each genome you can:
Examine superfamily assignments, phylogenetic trees, domain organisation lists and
networks
Check for over- and under-represented superfamilies within a genome
For each superfamily you can:
Inspect SCOP classification, functional annotation, Gene Ontology annotation, InterPro
abstract and genome assignments
Explore taxonomic distribution of a superfamily across the tree of life
All annotation, models and the database dump are freely available for download to everyone.
Purpose
SUPERFAMILY classifies amino acid sequences into known structural domains, especially
into SCOP superfamilies. The superfamilies are groups of proteins which have structural
evidence to support a common evolutionary ancestor but may not have detectable
sequence homology.
Major Features
Sequence search: Submit your protein or DNA sequence for SCOP superfamily and family level classification.
Keyword search: Search for superfamily, family or species names, plus sequence, SCOP, PDB or hidden Markov model IDs.
Domain assignments: Domain assignments, alignments and architectures for completely sequenced eukaryotic and prokaryotic organisms, plus sequence collections.
Comparative genomics tools: Browse unusual (over- and under-represented) superfamilies and families, adjacent domain pair lists and graphs, unique domain pairs, domain combinations, domain architecture co-occurrence networks and domain distribution across taxonomic kingdoms for each organism.
Genome statistics: For each genome: number of sequences, number of sequences with assignment, percentage of sequences with assignment, percentage total sequence coverage, number of domains assigned, number of superfamilies assigned, number of families assigned, average superfamily size, percentage produced by duplication, average sequence length, average length matched, number of domain pairs and number of unique domain architectures.
Gene Ontology: Domain-centric Gene Ontology (GO), automatically annotated, by Hai Fang.
Phenotype Ontology: Domain-centric phenotype/anatomy ontology, including Disease Ontology, Human Phenotype, Mouse Phenotype, Worm Phenotype, Yeast Phenotype, Fly Phenotype, Fly Anatomy, Zebrafish Anatomy, Xenopus Anatomy and Arabidopsis Plant.
Superfamily annotation: InterPro abstracts for 1052 superfamilies and Gene Ontology (GO) annotation for 763 superfamilies.
Functional annotation: Functional annotation of SCOP 1.73 superfamilies by Christine Vogel.
Phylogenetic trees: Trees are generated using heuristic parsimony methods and are based on protein domain architecture data for all genomes in SUPERFAMILY. Genome combinations or specific clades can be displayed as individual trees.
Similar domain architectures: Find the 10 domain architectures which are most similar to a domain architecture of interest.
Hidden Markov models: Produce SCOP domain assignments for your sequences using the SUPERFAMILY models. HMM visualisation by Martin Madera, e.g. model 0045110.
Profile comparison: Find remote domain matches when the HMM search fails to find a significant match. Profile comparison (PRC), for aligning and scoring two profile hidden Markov models, by Martin Madera.
Web services: Distributed Annotation Server and linking to SUPERFAMILY.
Downloads: Sequences, assignments, models, MySQL database and scripts, updated weekly.
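Several of the per-genome statistics listed above are simple counts over the domain assignments. As a rough sketch only (the function name and the input layout are assumptions for illustration, not the SUPERFAMILY database schema), they could be derived like this:

```python
def genome_statistics(assignments):
    """Summarise domain assignments for one genome.

    assignments: dict mapping a sequence id to the list of superfamily ids
    assigned to that sequence (an empty list if nothing was assigned).
    This layout is hypothetical and chosen only for the example.
    """
    total = len(assignments)
    assigned = sum(1 for domains in assignments.values() if domains)
    all_domains = [sf for domains in assignments.values() for sf in domains]
    return {
        "sequences": total,
        "sequences_with_assignment": assigned,
        "percent_with_assignment": 100.0 * assigned / total if total else 0.0,
        "domains_assigned": len(all_domains),
        "superfamilies_assigned": len(set(all_domains)),
    }

# Toy genome: two of four sequences carry at least one assigned domain.
stats = genome_statistics({"s1": ["sfA", "sfB"], "s2": [], "s3": ["sfA"], "s4": []})
```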
The SUPERFAMILY database provides protein domain assignments, at the SCOP superfamily
level, for the predicted protein sequences in over 400 completed genomes. A superfamily groups
together domains of different families which have a common evolutionary ancestor based on
structural, functional and sequence data. SUPERFAMILY domain assignments are generated
using an expert curated set of profile hidden Markov models. All models and structural
assignments are available for browsing and download from http://supfam.org. The web interface
includes services such as domain architectures and alignment details for all protein assignments,
searchable domain combinations, domain occurrence network visualization, detection of over- or
under-represented superfamilies for a given genome by comparison with other genomes,
assignment of manually submitted sequences and keyword searches. In this update we describe
the SUPERFAMILY database and outline two major developments: (i) incorporation of family
level assignments and (ii) a superfamily-level functional annotation. The SUPERFAMILY
database can be used for general protein evolution and superfamily-specific studies, genomic
annotation, and structural genomics target suggestion and assessment.
This represents a probability that, given a database of a particular size, random sequences score higher than a value X. P-values are generated by the BLAST algorithm that has been integrated into SMART.
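For orientation, BLAST statistics relate such a probability to the expected number of chance hits (the E-value) through P = 1 - exp(-E). The text does not say how SMART derives or reports the value, so the snippet below is only a sketch of that standard relation, with a hypothetical function name.

```python
import math

def p_value_from_e_value(e_value):
    """Convert an E-value into the probability of at least one chance hit.

    Uses P = 1 - exp(-E); expm1 keeps precision for very small E-values.
    """
    return -math.expm1(-e_value)

p_small = p_value_from_e_value(1e-10)  # P and E nearly coincide for small E
p_one = p_value_from_e_value(1.0)      # one expected chance hit
```

For small E-values the two quantities are practically identical, which is why they are often used interchangeably when judging a hit.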
PDB (Protein Data Bank)
PDB is an archive of experimentally determined three-dimensional structures (Brookhaven National Labs or EBI). Domain families represented in SMART and in the PDB are annotated as being of known structure; links are provided in SMART to the PDB via PDBsum and MMDB. PDBsum links can be used to access a variety of sequence-based and structure-based tools, whereas MMDB provides access to literature information and structure similarities.
PFAM
Pfam is a database of protein domain families represented as (i) multiple alignments and (ii) HMM-profiles ([1], [2]). Pfam WWW servers allow comparison of user-supplied sequences with the Pfam database (Sanger Center and Washington Univ.). SMART contains a facility to search the Pfam collection using HMMer.
Prokaryotic domains
SMART now also searches for domains found in two-component regulatory systems. These can be found mainly in prokaryotes, but a few were also found in eukaryotes such as yeast and plants.
Profile
A profile is a table of position-specific scores and gap penalties, representing a homologous family, that may be used to search sequence databases (Ref. [1], [2], [3]). In CLUSTAL-W-derived profiles, those sequences that are more distantly related are assigned higher weights ([4], [5], [6]). Issues in profile-based database searching are discussed in Bork & Gibson (1996) [7].
ProfileScan
An excellent WWW server that allows a user to compare a protein or DNA sequence against a database of profiles (located at the ISREC).
PROSITE
This is a dictionary of protein sites and motif patterns. Some SMART domain annotations contain links to PROSITE.
Schnipsel database (domain sequence database)
Schnipsel is a German word meaning snippet or fragment. The schnipsel database consists of the sequences of all domains found with SMART in NRDB. Outliers of a family often cannot be detected by a profile, yet are detectable by pairwise similarity to one or more established members of a sequence family. So searching against the schnipsel database gives complementary information to the profile searches.
Searched domains
In the first version of SMART, only eukaryotic signalling domains could be searched. In 1998 we extended this set with prokaryotic signalling and extracellular domains. On the input page you can choose whether you want to search only cytoplasmic (prokaryotic + eukaryotic), extracellular, or all domains.
Secondary Literature
The secondary literature is derived by the following procedure. For each of the hand-selected papers referenced by a domain, 100 neighbouring papers are retrieved using Medline. If one of these neighbouring papers is referenced from more than two original papers, it is included in the secondary literature list.
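The selection rule for secondary literature amounts to counting how many hand-selected papers share each neighbouring paper. A minimal sketch, assuming the Medline neighbour lists have already been retrieved (the function name and the identifiers in the example are made up):

```python
from collections import Counter

def secondary_literature(neighbours_by_paper, min_sources=3):
    """Pick neighbouring papers shared by more than two original papers.

    neighbours_by_paper: dict mapping each hand-selected paper id to the list
    of its neighbouring paper ids (up to 100 per paper in the description).
    "More than two original papers" is read here as at least three.
    """
    counts = Counter()
    for neighbours in neighbours_by_paper.values():
        counts.update(set(neighbours))  # count each neighbour once per paper
    return sorted(paper for paper, n in counts.items() if n >= min_sources)

# Hypothetical ids: N1 neighbours three selected papers, N2 only two.
selected = secondary_literature({
    "A": ["N1", "N2"],
    "B": ["N1", "N2"],
    "C": ["N1", "N3"],
})
```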
Seed Alignment
Alignment that contains only one of each pair of homologues that are represented in a CLUSTALW-derived phylogenetic tree linked by a branch of length less than a distance of 0.2 (see the related article).
SEG
A program of Wootton & Federhen [1] that detects regions of the query sequence that have low compositional complexity [2].
Sequence ID or ACC
Sequence identifiers or accession codes may be entered via the SMART homepage to initiate a query. You can use either Uniprot or Ensembl sequence identifiers.
Signalling Domains
The original set of domains used in SMART was collected as those that satisfied one or both of two criteria:
1. cytoplasmic domains that possess kinase, phosphatase, ubiquitin ligase or phospholipase enzymatic activities, or those that stimulate GTPase activation or guanine nucleotide exchange;
2. cytoplasmic domains that occur in at least two proteins with different domain
organisations, of which one also contains a domain that satisfies criterion 1.
These domains mostly mediate or regulate the transduction of an extracellular signal towards the nucleus, resulting in the initiation of a cellular response. More recently, prokaryotic two-component signalling domains have been added to the SMART set.
SignalP
This program predicts the presence and location of signal peptide cleavage sites in amino acid sequences (SignalP home page).
SMART: Simple Modular Architecture Research Tool
Our main goal in providing this tool is to allow automatic identification and annotation of domains in user-supplied protein sequences.
Species
Numbers of domains present in a variety of selected taxa (animal, archaea, bacteria, fungi, plants and protozoa) are shown in annotation pages.
SwissProt
The SwissProt database is an extensively annotated and non-redundant collection of protein sequences. SwissProt annotations have been mined for SMART-derived annotations of alignments.
TMHMM2
This program predicts the location and topology of transmembrane helices (TMHMM2).
Thresholds
For each of the domains found by SMART, a combination of thresholds is used to distinguish between true and false hits. The different thresholds are described in the SMART paper.
WiseTools
A package that is based on database searches using profiles. Profiles may be generated using PairWise and then compared with sequence databases using SearchWise. Scores are generated for alignments that match the whole of the profile (using a positive profile), or else that match at least part of the profile (using a negative profile). Only the top-scoring optimal alignment of each
sequence is reported, hence SMART relies on re-iterating the search for new repeats until none are reported that score above threshold. Score thresholds have been set manually and are considered to represent a score just above the top-scoring true negative. Additional thresholds have been estimated for domains that are repeated in single polypeptides. A more recent package allows comparison of DNA sequences at the level of their conceptual translations, regardless of sequence error and introns (see Wise2).
What's new
Changes from version 6.0 to 7.0
Full text search engine
Perform a full text search of SMART and Pfam domain annotations, plus the complete protein descriptions for Uniprot and Ensembl proteins.
metaSMART
Explore and compare domain architectures in various publicly available metagenomics datasets.
iTOL export and visualization
Domain architecture analysis results can be exported and visualized in interactive Tree Of Life. The new option can be found in the protein list function select list.
User interface cleanup
Various small changes to the UI, resulting in faster and easier navigation.
Changes from version 5.1 to 6.0
Metabolic pathways information
SMART domains and proteins available in the genomic mode now have basic metabolic pathways information. It is generated by mapping our genomic mode protein database to the KEGG orthologous groups.
Updated genomic mode protein database
The greatly expanded genomic mode database now includes sequences from 630 completely sequenced species.
Changes from version 5.0 to 5.1
SMART webservice
You can access SMART using our webservice. Check the WSDL file for details. The webservice is still under active development and only works for sequences and IDs which are in our database (complete Uniref100 and stable Ensembl genomes).
SMART DAS server
The SMART DAS server is available at the URL http://smart.embl.de/smart/das. It provides all the protein features from our database (SMART domains, signal peptides, transmembrane regions and coiled coils) for all Uniref100 and Ensembl proteins.
If you need help with these services, or have questions or feedback, please contact us.
Changes from version 4.1 to 5.0
New protein database in Normal mode
SMART now uses Uniprot as the main source of protein sequences. All Ensembl proteomes (except pre-releases) are also included. To lower redundancy in the database, the following procedure is used:
o only one copy of 100% identical proteins is kept (different IDs are still available)
o each species' proteins are separated
o CD-HIT clustering with a 96% identity cutoff is performed on each species separately
o the longest member of each protein cluster is used as the representative
o only representative cluster members and single proteins (i.e. proteins which are not
members of any clusters) are used in all domain architecture queries and for domain counts in the annotation pages
Even though the number of proteins in the database has almost doubled (the current version has around 29 million proteins), the redundancy should be minimal.
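Two steps of the redundancy procedure above are easy to illustrate: collapsing identical sequences while keeping all their IDs, and choosing the longest member of each cluster as its representative. The sketch below assumes the clusters come from a separate CD-HIT run (not reimplemented here); function names and the data layout are hypothetical.

```python
from collections import defaultdict

def collapse_identical(sequences):
    """Group protein ids that share exactly the same sequence.

    sequences: dict mapping protein id to amino acid sequence.
    Returns a dict mapping each distinct sequence to its sorted ids, so one
    copy is kept while the different IDs stay available.
    """
    ids_by_sequence = defaultdict(list)
    for protein_id, sequence in sequences.items():
        ids_by_sequence[sequence].append(protein_id)
    return {seq: sorted(ids) for seq, ids in ids_by_sequence.items()}

def choose_representatives(clusters, sequences):
    """Return the longest member of each cluster (first one wins on ties)."""
    return [max(cluster, key=lambda pid: len(sequences[pid])) for cluster in clusters]

# Toy data: P1 and P2 are identical; clusters stand in for CD-HIT output.
sequences = {"P1": "MKT", "P2": "MKT", "P3": "MKTAYI"}
collapsed = collapse_identical(sequences)
representatives = choose_representatives([["P1", "P3"], ["P2"]], sequences)
```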
Domain architecture invention dating
As a further step from the single domain to the understanding of multi-domain proteins, SMART now predicts the taxonomic class where the concept of a protein, that is, its domain architecture, was invented. The domain architecture is defined as the linear order of all SMART domains in the protein sequence. To derive the point of its invention, all proteins with the same domain architecture are mapped onto NCBI's taxonomy. The last common ancestor of all organisms containing at least one protein with the domain architecture is defined as the point of its origin.
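The last-common-ancestor step can be sketched as finding the deepest taxon shared by all lineages. The example assumes each organism's lineage is available as a root-to-leaf list of taxon names (in practice these would be taken from the NCBI taxonomy); the function name is illustrative.

```python
def last_common_ancestor(lineages):
    """Return the deepest taxon shared by all lineages, or None.

    lineages: list of root-to-leaf lists of taxon names, one per organism
    that has a protein with the domain architecture of interest.
    """
    ancestor = None
    # zip stops at the shortest lineage, so every level compared exists in all.
    for level in zip(*lineages):
        if all(taxon == level[0] for taxon in level):
            ancestor = level[0]
        else:
            break
    return ancestor

# An architecture seen in an animal and in a fungus dates back to Eukaryota.
origin = last_common_ancestor([
    ["cellular organisms", "Eukaryota", "Metazoa", "Chordata"],
    ["cellular organisms", "Eukaryota", "Fungi"],
])
```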
Changes from version 4.0 to 4.1
Two modes of operation: Normal and Genomic
For more details, visit the change mode page.
Intrinsic protein disorder prediction
You can now include DisEMBL prediction of protein disorder in your search parameters. The results are displayed as blue regions in protein schematics. DisEMBL's HOTLOOPS and REM465 methods are currently used. Visit the DisEMBL page for more information on the method.
Catalytic activity check
SMART now includes data on specific requirements for catalytic activity for some of our domains (50 catalytic domains at the moment). If the required amino acids are not present in the predicted domain, it will be marked as 'Inactive'. The domain annotation page will show you the details on which amino acids are missing and the links to relevant literature.
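The idea behind the catalytic activity check can be illustrated as a lookup of required residues at given positions of the predicted domain. How SMART actually stores and aligns these requirements is not described here, so the position-based layout and the function name below are assumptions.

```python
def missing_catalytic_residues(domain_sequence, required):
    """Report required positions whose residue is absent or different.

    domain_sequence: amino acid sequence of the predicted domain.
    required: dict mapping a 1-based position within the domain to the set of
    amino acids accepted there (hypothetical representation).
    Returns a dict of position -> observed residue (None if out of range);
    an empty dict means the domain passes the check.
    """
    missing = {}
    for position, allowed in required.items():
        in_range = 1 <= position <= len(domain_sequence)
        observed = domain_sequence[position - 1] if in_range else None
        if observed not in allowed:
            missing[position] = observed
    return missing

requirements = {2: {"D"}, 3: {"K", "R"}}
active = missing_catalytic_residues("ADKG", requirements)    # nothing missing
inactive = missing_catalytic_residues("ANKG", requirements)  # D replaced by N
```

A domain with a non-empty result would be the one flagged as inactive, with the returned positions indicating which amino acids are missing.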
Taxonomic trees
Architecture query results are now displayed as simple taxonomic trees. In addition to individual proteins, you can select any taxonomic node (or multiple nodes) and display all the proteins in those nodes. The Evolution section of domain annotation pages is now also represented as a taxonomic tree.
The basic tree contains only several hand-picked representative species, with a link to the full tree.
User interface redesigned
SMART has been completely rewritten and all pages conform to the XHTML 1.0 Strict and CSS Level 2 standards. We recommend a modern standards-compliant browser for the best experience. Mozilla Firefox and Opera are our favorites.
Changes from version 3.5 to 4.0
Intron positions are shown in protein schematics
For proteins that match any of the Ensembl predictions, SMART will show intron positions as vertical coloured lines in graphical representations. This information is retrieved from a pre-calculated mapping of Ensembl gene structures to protein sequences.
The vertical line at the end of the protein is not an actual intron, but a mark to show that intron mapping was performed. If that is the only line, there are no introns annotated. If there is no line at all, there is no data available in Ensembl for that particular sequence.
You can switch off intron display on your SMART preferences page.
Alternative splicing information
Since SMART now incorporates Ensembl genomes, the 'Additional information' page shows a list of alternative splices of the gene encoding the analyzed protein (if there are any). It is possible either to display the SMART protein annotation for any of the alternative splices, or to get a graphical multiple sequence alignment of all of them.
Orthology information
There are 2 separate sets of orthologs for each Ensembl protein: 1:1 reciprocal best matches in other genomes, and orthologous groups with reciprocal best hits from all genomes analyzed (i.e. each of these proteins has exactly one ortholog in all 6 genomes). This data is displayed on the 'Additional information' page.
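A 1:1 reciprocal best match means that protein a's best hit in the other genome is b, and b's best hit back is a. As a minimal sketch, assuming the best hits in each direction have already been computed by some similarity search (the function name and the identifiers are invented for the example):

```python
def reciprocal_best_hits(best_a_to_b, best_b_to_a):
    """Return sorted (a, b) pairs that are each other's best hit.

    best_a_to_b: dict mapping each protein of genome A to its best hit in B.
    best_b_to_a: dict mapping each protein of genome B to its best hit in A.
    """
    return sorted(
        (a, b) for a, b in best_a_to_b.items() if best_b_to_a.get(b) == a
    )

# a2 points to b1, but b1 prefers a1, so only (a1, b1) is reciprocal.
pairs = reciprocal_best_hits(
    {"a1": "b1", "a2": "b1", "a3": "b3"},
    {"b1": "a1", "b3": "a9"},
)
```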
Graphical multiple sequence alignments
Orthologous groups and different alternative splices can be displayed as graphical multiple
sequence alignments. Proteins are aligned using ClustalW. Domains, intrinsic features and introns are mapped onto the alignment, with their positions adjusted according to gaps (black boxes).
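Adjusting a feature position according to gaps means translating a residue number in the ungapped protein into a column of the aligned row. A small sketch of that mapping, assuming '-' as the gap character and 1-based numbering (the function name is illustrative):

```python
def alignment_column(aligned_row, position, gap="-"):
    """Map a 1-based residue position to its 1-based alignment column.

    aligned_row: one sequence row of the multiple alignment, gaps included.
    Raises ValueError if the row holds fewer residues than `position`.
    """
    residues_seen = 0
    for column, character in enumerate(aligned_row, start=1):
        if character != gap:
            residues_seen += 1
            if residues_seen == position:
                return column
    raise ValueError("position lies beyond the end of the sequence")

# Residue 3 (T) is pushed to column 5 by the two gap characters before it.
column = alignment_column("MK--TA", 3)
```

Domain start and end coordinates, or intron positions, can be shifted the same way before drawing them on the alignment.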
Changes from version 3.4 to 3.5
Features of all Ensembl genomes are stored in SMART
You can use standard Ensembl protein identifiers (for example ENSP00000264122) and sequences in all queries.
Batch access
Quickly display results for hundreds of sequences. Currently limited to 500 sequences per access, 2000 per day.
Smarter handling of identical sequences
If there are multiple IDs associated with a particular sequence, an extra table will be displayed
showing all of them.
Changes from version 3.3 to 3.4
Search structure based profiles using RPS-Blast
Clicking on the 'search schnipsel and structures' checkbox will now also initiate a search of profiles based on SCOP domain families using RPS-Blast. These profiles were kindly provided by Steffen Schmidt (see Schmidt et al., J Chem Inf Comput Sci 2002 (42) 405-7).
Improved Architecture analysis
In addition to standard 'Domain selection' querying, it is now possible to do queries based on GO (Gene Ontology) terms associated with domains. In the first step, you get a list of domains matching the GO terms entered. After selecting the domains of interest from the list, proteins containing those domains are displayed. Use the 'Taxonomic selection' box to limit the results based on taxonomic ranges.
Pfam domains are stored in the database
The SMART database now contains precomputed results for all Pfam domains. To use Pfam domains in the architecture queries, prepend the domain name with 'Pfam:' (for example, TyrKc AND Pfam:Fz AND TRANS).
Try our SMART Toolbar for the Mozilla web browser.
Changes from version 3.2 to 3.3
Fantastic new protein picture generator
Proteins are now displayed as dynamically generated PNG images. This means that you can download the entire protein representation as a single image. You have our permission to do this and use these diagrams in any way that you like; do acknowledge us, though. All domain bubbles have been script-generated using The Gimp and its Perl-Fu extension. The script is GPL'd and is available for download.
Fancy colour alignments
All alignments we generate are coloured in using Leo Goodstadt's excellent CHROMA program. CHROMA is available for download, and it's great. You may experience some problems if you're using a klunky browser. This will be fixed when you change your browser.
Excellent transmembrane domains
These are now calculated using the fine TMHMM2 program, kindly provided by Anders Krogh and co-workers.
Selective SMART
We now store intrinsic features such as transmembrane domains and signal peptides in our database. This means that these can be queried for (e.g. SIGNAL AND TyrKc). The feature names are: signal peptide, SIGNAL; transmembrane domain, TRANS; coiled coil, COIL.
Technical changes
SMART code has been modified to run under the Apache mod_perl module. SMART is now using Apache::DBI for persistent database connections. The database engine has been updated to PostgreSQL 7.1. These changes resulted in significant speed improvements.
Bug fixes, as usual
The show_many_proteins script now uses the POST method, so there is no longer a limit on the number of proteins you can display. Fixed a display problem with proteins having thousands of different representations (if you try to display those, make sure you have a good browser, like Mozilla or Opera, and a bunch of RAM).
Changes from version 3.1 to 3.2
Start page
There is now an indicator of the current database status in the header. If the database is down or is being updated, you'll know immediately.
Literature
Changed literature identifiers to PMID. New secondary literature generator that parses all
neighbouring papers, not just the first 100. Numerous small bug fixes and improvements.
Changes from version 3.0 to 3.1
Startup page
The start page now includes selective SMART and allows searching for keywords in the annotation of domains.
Schnipsel Blast
The results of a schnipsel Blast search are included in the bubble diagram. Additionally, PDB is searched.
Taxonomic breakdown
When selecting multiple proteins, e.g. via selective SMART or the annotation pages, an overview of the taxonomy of all species (tax break) is offered.
Links
Links don't contain version numbers. This allows stable links from external sources.
Selective SMART
Allows searching for multiple copies of domains and is case insensitive. You can now search for e.g. Sh3 AND sH3 AND sh3.
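The behaviour described for selective SMART queries (terms joined by AND, case-insensitive names, a repeated term asking for multiple copies) can be mimicked with a small matcher. This is only a sketch of those semantics, not SMART's query engine, and the function name is made up.

```python
import re
from collections import Counter

def matches_query(query, features):
    """Check whether a protein satisfies an AND-only domain query.

    query: terms joined by the keyword AND, e.g. "Sh3 AND sH3 AND sh3".
    features: domain and intrinsic feature names found in the protein,
    with one entry per occurrence.
    Names are compared case-insensitively; repeating a term n times requires
    at least n copies of that domain.
    """
    terms = re.split(r"\s+AND\s+", query.strip())
    required = Counter(term.lower() for term in terms)
    present = Counter(name.lower() for name in features)
    return all(present[name] >= count for name, count in required.items())

three_sh3 = matches_query("Sh3 AND sH3 AND sh3", ["SH3", "SH3", "SH3", "TyrKc"])
two_sh3 = matches_query("Sh3 AND sH3 AND sh3", ["SH3", "SH3", "TyrKc"])
signal_kinase = matches_query("SIGNAL AND TyrKc", ["SIGNAL", "TyrKc", "TRANS"])
```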
Annotation
You can align your query sequence to the SMART alignment using hmmalign.
Update of underlying database
SMART now uses PostgreSQL 6.5.2.
Changes from version 2.0 to 3.0
Digest output
SMART now only produces a single diagram representing a best interpretation of all the
annotation that has been performed. A comprehensive summary of the results is also provided in table format.
Selective SMART
Selective SMART allows you to look for proteins with combinations of specific domains in different species or taxonomic ranges.
Alert SMART
The SMART database gets updated about once a week. If you are interested in specific domains, or combinations of them, in specific taxonomic ranges, you can use the SMART alerting service. This provides the identities of newly deposited proteins that match your query.
Domain queries
You can ask for proteins having the same domain order or composition as your query protein.
SMARTed Genomes
You can get the result of SMARTing the genomes of Caenorhabditis elegans and Saccharomyces cerevisiae via the annotation pages.
Faster PFAM searches
The PFAM searches now run on a PVM cluster.
Changes from version 103 to 20
Domain coverage
The original set of signalling domains has now been extended to include extracellular domains.
Default search method is now HMMer: The previously used underlying methodology of SMART (i.e. SWise) tended to over-extend gaps, leading to problems in defining domain borders. Although a SWise search of the SMART database remains possible (using the 'Wise searching SMART database' option available from the Home Page), the default searching method is now HMMer2. This allows improved statistical estimates (Expectation- or E-values) of the significance of a domain hit.
Complete rewrite of the annotation pages: As with the previous version (1.03), annotation pages provide information concerning domain functions and include hyperlinks to PubMed. However, we now offer automatically derived data showing the taxonomic range and the predicted cellular localisation of proteins containing the domain in question.
Literature database
In addition to the manually derived literature sources given in version 1.03, we now provide automatically derived secondary literature from the annotation pages. These are extracted from PubMed and are cross-linked to additional abstracts.
The CATH domain structure database: new protocols and classification levels give a more comprehensive resource for exploring evolution
Lesley H Greene, Tony E Lewis, Sarah Addou, Alison Cuff, Tim Dallman, Mark Dibley, Oliver Redfern, Frances Pearl, Rekha Nambudiry, Adam Reid, Ian Sillitoe, Corin Yeats, Janet M Thornton and Christine A Orengo
WHAT'S NEW
We report the latest release (version 3.0) of the CATH protein domain database (http://www.cathdb.info). There has been a 20% increase in the number of structural domains classified in CATH, up to 86 151 domains. Release 3.0 comprises 1110 fold groups and 2147 homologous superfamilies. To cope with the increases in diverse structural homologues being determined by the structural genomics initiatives, more sensitive methods have been developed for identifying boundaries in multi-domain proteins and for recognising homologues. The CATH classification update is now being driven by an integrated pipeline that links these automated procedures with validation steps that have been made easier by the provision of information-rich web pages summarising comparison scores and relevant links to external sites for each domain being classified. An analysis of the population of domains in the CATH hierarchy and several domain characteristics are presented for version 3.0. We also report an update of the CATH Dictionary of Homologous Structures (CATH-DHS), which now contains multiple structural alignments, consensus information and functional annotations for 1459 well populated superfamilies in CATH. CATH is directly linked to the Gene3D database, which is a projection of CATH structural data onto ∼2 million sequences in completed genomes and UniProt.
GENERAL INTRODUCTION
The number of new structures being deposited in the Protein Data Bank (PDB) continues to grow at a considerable rate. In addition, structures being targeted by worldwide structural genomics initiatives are more likely to be novel or only very remotely related to domains previously classified. Only 2% of structures currently solved by conventional crystallography or NMR are likely to adopt novel folds (see Figures 1 and 2). A higher proportion of new folds is expected among structures solved by structural genomics. Although the influx of more diverse structures and subsequent analysis will inform our understanding of how domains evolve, it has resulted in increasing lags between the numbers of structures being deposited and classified. In response to this situation we have significantly improved our automated and manual protocols for domain boundary assignment and homologue recognition.
Figure 1. Annual decrease in the percentage of new structures classified in CATH which are observed to possess a novel fold. The raw data for years 1972–2005 was fit to a single exponential equation by nonlinear regression using Sigma Plot (SPSS Version
Figure 2. Annual proportion of protein structures deposited in the PDB which are classified in CATH, rejected or pending classification. The colour scheme reflects different categories of PDB chains. Black, not accepted by the CATH criteria. Red, unprocessed chains
Significant changes have been implemented in the CATH classification protocol to achieve a more highly automated system. First, a seamless flow of structures between the constituent programs has been achieved by building a pipeline which integrates web services for each major comparison stage in the classification (see Figure 3). Secondly, completely automatic decisions are now being made for new protein chains with close relatives already assigned in the CATH database. There are two situations that preclude the CATH update process from being fully automated. We rely on expert manual curation for particularly challenging protein domain boundary assignments (DomChop stage) and also for classifications of remote folds and homologues (HomCheck stage). These two manual stages will remain an integral part of the system (Figure 3).
Figure 3. Flow diagram of the CATH classification pipeline. This schematic illustrates the processes involved in classifying newly determined structures in CATH. The CATH update protocol workflow, from new chain to assigned domain, is split into two main processes
In this paper we report our ongoing development of the automated procedures. These critical new features should better enable CATH to keep pace with the PDB (3) and facilitate its development. Key statistics on the domain structure populations and characteristics are also presented.
A REVISED CATH CLASSIFICATION HIERARCHY: CATHSOLID
In order to provide information on the sequence diversity between superfamily members, we have introduced additional levels into the CATH hierarchy. The CATH hierarchical classification scheme now consists of nine levels. Class is derived from secondary structure content, and Architecture describes the gross orientation of secondary structures, independent of connectivity. The Topology level clusters structures into fold groups according to their topological connections and numbers of secondary structures. The Homologous superfamilies cluster proteins with highly similar structures, sequences and/or functions (4,5). The new extension of the CATH classification system now includes five 'SOLID' sequence levels. S, O, L and I further divide domains within the H-level using multi-linkage clustering based on similarities in sequence identity (35%, 60%, 95% and 100%) (see Table 1). The D-level acts as a counter within the I-level and is appended to the classification hierarchy to ensure that every domain in CATH has a unique CATHSOLID identification code (see Table 1). Specific details on the nature of the SOLID levels can be found in the 'General Information' section of the CATH website, http://www.cathdb.info. CATH only includes experimentally determined protein structures with a resolution of 4 Å or better, 40 residues in length or longer, and having 70% or more side chains resolved.
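As a rough illustration of the nine-level code described above, the following Python sketch splits a CATHSOLID identifier into its levels. It assumes the identifier is written as nine dot-separated integers (for example, the subtilisin superfamily 3.40.50.200 followed by five SOLID numbers); the function names and the example codes are hypothetical and are not taken from the CATH software.

```python
# Level letters in hierarchy order: C, A, T, H, then the five SOLID levels.
LEVELS = ("C", "A", "T", "H", "S", "O", "L", "I", "D")

# Sequence identity cut-offs (percent) quoted in the text for S, O, L and I.
IDENTITY_CUTOFFS = {"S": 35, "O": 60, "L": 95, "I": 100}


def parse_cathsolid(code):
    """Split a dot-separated CATHSOLID code into a {level: number} dict.

    Assumption: the code has exactly nine integer fields.
    """
    parts = code.split(".")
    if len(parts) != len(LEVELS):
        raise ValueError(f"expected {len(LEVELS)} fields, got {len(parts)}")
    return {level: int(part) for level, part in zip(LEVELS, parts)}


def superfamily(code):
    """Return the C.A.T.H prefix (homologous superfamily) of a code."""
    return ".".join(code.split(".")[:4])


def same_cluster(code_a, code_b, level):
    """True if two domains agree at every level down to and including `level`."""
    depth = LEVELS.index(level) + 1
    return code_a.split(".")[:depth] == code_b.split(".")[:depth]


if __name__ == "__main__":
    a = "3.40.50.200.1.2.3.4.5"   # hypothetical domain code
    b = "3.40.50.200.1.9.1.1.1"   # hypothetical domain code
    print(parse_cathsolid(a))
    print(superfamily(a))
    print(same_cluster(a, b, "S"), same_cluster(a, b, "O"))
```

Under these assumptions, two domains that share the same S-level prefix would fall in the same 35% sequence identity cluster, while the final D number only distinguishes individual domains.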
Table 1. CATH version 3.0 statistics
DOMAIN BOUNDARY ASSIGNMENTS
We have further improved our automated domain boundary prediction method, CATHEDRAL (4). This is used to search a newly determined multi-domain structure against a library of representative structures from different fold groups in the CATH database to recognise constituent domains. CATHEDRAL performs an initial rapid secondary structure comparison between structures using graph theory to identify putative fold matches, which are then more carefully aligned using a slower, more accurate dynamic programming method. A new scoring scheme has been implemented which combines information on the class of the domains being compared, their sizes, similarity in the structural environments and the number of equivalent residues. A support vector machine is used to combine the different scores and select the best fold match for each putative domain in a new multi-domain protein.
Benchmarking against a set of 964 'difficult' multi-domain chains, whose 1593 constituent domains were remotely related to folds in CATH (<35% sequence identity) and originated from 245 distinct fold groups and 462 superfamilies, showed that 90% of domains within these chains could be assigned to the correct fold group, and for 78% of them the domain boundaries were within ±15 residues of boundaries assigned by careful manual validation. Larger variations in domain boundaries are often due to the fact that in many families significant structural variation can occur during evolution, so that distant relatives vary considerably in size. If no close relative has been classified in CATH, it is likely that the only CATHEDRAL match will be to a relative with significant structural embellishments, thus making it harder to determine the correct boundaries.
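The ±15 residue criterion used in the benchmark can be pictured with a small Python sketch. This is only an illustration: it assumes each domain is a single contiguous segment given as a (start, end) pair of residue numbers, and that a prediction counts as correct when both ends fall within the tolerance. How the benchmark actually scored discontinuous domains is not described in the text, and the function names are hypothetical.

```python
def boundaries_agree(predicted, reference, tolerance=15):
    """True if both ends of a predicted domain lie within `tolerance`
    residues of the manually validated (reference) domain."""
    pred_start, pred_end = predicted
    ref_start, ref_end = reference
    return (abs(pred_start - ref_start) <= tolerance
            and abs(pred_end - ref_end) <= tolerance)


def fraction_within_tolerance(pairs, tolerance=15):
    """Fraction of (predicted, reference) domain pairs whose boundaries agree."""
    if not pairs:
        raise ValueError("no domain pairs supplied")
    hits = sum(boundaries_agree(p, r, tolerance) for p, r in pairs)
    return hits / len(pairs)


if __name__ == "__main__":
    # Hypothetical predictions paired with manually curated boundaries.
    pairs = [((1, 120), (10, 130)),     # both ends within 15 residues
             ((121, 300), (141, 300)),  # start is off by 20 residues
             ((5, 95), (1, 100)),
             ((200, 340), (198, 352))]
    print(fraction_within_tolerance(pairs))
```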
Since domain boundary assignment of remote homologues is one of the most time-consuming stages in the classification, we combine multiple sources of information for each new structure on a web page to guide manual curation. Pages display scores from a range of algorithms, which include structure-based methods [CATHEDRAL (4), SSAP (6,7), DETECTIVE (8), PUU (9), DOMAK (10)], sequence-based methods such as hidden Markov models (HMMs) (11), and relevant literature. These pages are now viewable for information on putative boundaries for new multi-domain structures currently
being classified in CATH (e.g. http://www.cathdb.info/cgi-bin/cath/Chain.pl?chain_id=2g3aA).
For protein chains which are closely related to chains that are already chopped in CATH, an automated protocol has been developed (ChopClose). ChopClose identifies any previously chopped chains that have sufficiently high sequence identity and overlap with the query chain. Using SSAP (6), the query is aligned against each of these chains in turn, and in each case the domain boundaries are inherited across the alignment. The process of inheriting the boundaries often requires some adjustments to be made to account for insertions, deletions or unresolved residues. If the inheritance from one of the chains meets various criteria (SSAP score ≥80, sequence identity >80%, RMSD ≤6.0 Å, longest end extension ≤10 residues, etc.), then the resulting boundaries are used to chop the chain automatically (AutoChop). For cases where ChopClose's best result does not meet all the criteria for automatic chopping, it is provided as support information for a manual domain boundary assignment. Refer to Figure 3 for the location of AutoChop/ChopClose within the CATH update protocol.
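The decision between automatic chopping and manual curation can be sketched as follows. This Python example only encodes the four thresholds listed in the text; the published protocol mentions further criteria ("etc.") that are not given here, so the sketch is necessarily incomplete. The class and function names are hypothetical, and ranking candidates by SSAP score is an assumption made purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class Inheritance:
    """Result of inheriting boundaries from one previously chopped chain."""
    source_chain: str
    ssap_score: float             # 0-100
    sequence_identity: float      # percent
    rmsd: float                   # in Angstroms
    longest_end_extension: int    # residues


def passes_autochop(result):
    """Apply the four thresholds quoted in the text (other criteria omitted)."""
    return (result.ssap_score >= 80
            and result.sequence_identity > 80
            and result.rmsd <= 6.0
            and result.longest_end_extension <= 10)


def decide(candidates):
    """Return ('AutoChop', best) if the best candidate passes every listed
    threshold, otherwise ('manual', best) so the best result can support
    manual curation. Returns ('manual', None) if there are no candidates.

    Assumption: 'best' means the highest SSAP score.
    """
    if not candidates:
        return ("manual", None)
    best = max(candidates, key=lambda r: r.ssap_score)
    return ("AutoChop" if passes_autochop(best) else "manual", best)


if __name__ == "__main__":
    results = [Inheritance("chainA", 91.2, 96.0, 1.4, 3),   # hypothetical values
               Inheritance("chainB", 84.0, 82.5, 5.1, 12)]
    stage, best = decide(results)
    print(stage, best.source_chain)
```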
NEW HOMOLOGUE RECOGNITION METHODS
We have assessed a number of HMM-based protocols for improving homologue recognition. A new protocol (Samosa), exploiting models built using multiple structure alignments to improve accuracy, gives some improvements in sensitivity (4–5%). However, a protocol exploiting an 8-fold expanded HMM library, based on sequence relatives of structural domains, gives an increase of nearly 10% in sensitivity (12). In addition, HMM–HMM-based approaches have been implemented using the PRC protocol of Madera and co-workers (http://supfam.org/PRC). These allow recognition of extremely remote homologues, some of which are not easily detected by the structure comparison methods (discussed further below). HMM-based database scans developed for the CATH classification protocol are collectively referred to as HMMscan below.
For some very remotely related homologues, confidence in an assignment can be improved by combining information from multiple prediction methods. We have investigated the benefits of using machine learning methods to do this automatically. A neural network was trained using a dataset of 14 000 diverse homologues (<35% sequence identity) and 14 000 non-homologous pairs, with data from different homologue comparison methods including structure comparison (CATHEDRAL, SSAP), sequence comparison (HMM–HMM) and information on functional similarity. The latter was obtained by comparing EC classification codes between close relatives of the distant homologues and using a semantic similarity scoring
scheme for comparing GO terms, based on a method developed by Lord et al. (13). On a separate validation set of 14 000 homologous pairs and 14 000 non-homologous pairs, 97% of the homologues can be recognised at an error rate of <4%.
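A figure such as "97% of homologues recognised at an error rate of <4%" comes from choosing a score threshold on a validation set. The Python sketch below shows one way such a number could be computed from classifier scores. It assumes that "error rate" means the fraction of non-homologous pairs scoring at or above the threshold, which the text does not state explicitly; the function name and the example scores are hypothetical.

```python
def sensitivity_at_error_rate(homolog_scores, nonhomolog_scores, max_error=0.04):
    """Best fraction of homologous pairs recovered at any score threshold
    whose error rate stays strictly below `max_error`.

    Assumption: error rate = fraction of non-homologous pairs with a
    score at or above the threshold.
    """
    if not homolog_scores or not nonhomolog_scores:
        raise ValueError("both score lists must be non-empty")
    best = 0.0
    for threshold in sorted(set(homolog_scores) | set(nonhomolog_scores)):
        error = (sum(s >= threshold for s in nonhomolog_scores)
                 / len(nonhomolog_scores))
        if error < max_error:
            recovered = (sum(s >= threshold for s in homolog_scores)
                         / len(homolog_scores))
            best = max(best, recovered)
    return best


if __name__ == "__main__":
    # Tiny hypothetical score lists, far smaller than the 14 000-pair sets.
    homologs = [0.9, 0.8, 0.7, 0.2]
    non_homologs = [0.1, 0.3, 0.75, 0.05]
    print(sensitivity_at_error_rate(homologs, non_homologs, max_error=0.3))
```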
NEW UPDATE PROTOCOL
Automatic methods
Previously, CATH data was generated using a group of independent programs and flat files. Over the past two years we have developed an update protocol for CATH that is driven by a suite of programs with a central library and a PostgreSQL database system. A classification pipeline has been established which links, in a completely automated fashion, the different programs that analyse the sequences and structures of both protein chains and domains. The CATH update protocol can essentially be divided into two parts: domain boundary assignment and domain homology classification (see Figure 3). The aim of the protocol is to minimise manual assignment and provide as much support as possible when manual validation is necessary.
Processing of both parts of the classification protocol is similar, requiring related meta-data and the triggering of the same automated algorithms. Methods include pairwise sequence similarity comparisons and scans by other homologue detection or fold recognition algorithms, such as HMMscan and CATHEDRAL, that provide data for either manual or automated assignment. Many of the automated steps in the protocol have been established as a web service, and the pipeline integrates automated steps together with 'holding stages' in which domains are held prior to processing and await the completion of manual validation of predictions (see below).
Web pages to support manual validation
For each manual stage in the classification (domain boundary assignment, DomChop, and homologue recognition, HomCheck) we have developed a suite of web pages bringing together all available meta-data from prediction algorithms (e.g. DBS, CATHEDRAL and HMMscan for DomChop; CATHEDRAL and HMMscan for HomCheck), information from the literature, and information from other family classifications with relevant data (e.g. Pfam). For each protein or domain shown on the pages, information on the statistical significance of matches is presented. The web pages will shortly be made viewable and will provide interim data on protein chains and domains not fully classified in CATH, for biologists interested in any entries pending classification.
OVERVIEW OF THE CURRENT RELEASE (VERSION 3.0)
Assigning domain boundaries and relationships between protein structures is computationally challenging. Since the last CATH release (version 2.6), the number of domains in the CATH database has increased by 20% in version 3.0 and now totals 86 151. This is a more than 10-fold increase in the number of domains classified in CATH since its creation. Improvements in automation, and also in the web-based resources used to aid manual validation, have allowed us to increase the proportion of hard-to-classify structures processed in CATH, and this is reflected in a significant increase in the proportion of new folds in the database, now more than 1000. The detailed breakdown of numbers of domains in the nine CATH levels is given in Table 1. We conducted an analysis of the domains in version 3.0 and have derived statistics for several fundamental features.
Percentage of new topologies
An analysis of the percentage of new folds arising from the early 1970s to the present is shown in Figure 1. The number of new folds has been decreasing over time with respect to the number of new structures being deposited, and it can be seen that currently approximately 2% of new structures classified in CATH are observed to be novel folds. For comparison, the number of domain structures solved over time is also graphically represented in Figure 1.
Number of domains within a protein chain
Integral to the construction of the CATH database is designating domain boundaries. We conducted an analysis of the number of chains versus the number of domains in a chain. It is interesting to note that 64% of all protein structures currently solved and classified in CATH are single-domain chains (data not shown). The next most prevalent are two-domain chains (27%), and following this we find that the number of chains containing three or more domains rapidly decreases. The average size of the single-domain chains is 159 residues.
The CATH Dictionary of Homologous Superfamilies (CATH-DHS)
The CATH-DHS has also been recently updated. Data on structural similarity and superfamily variability is presented as a significant update to the Dictionary of Homologous Superfamilies (DHS) web resource (14). The DHS also provides functional annotations of domains within each H-level (superfamily) in CATH v2.5.1.
For each superfamily, pair-wise structural similarity scores between relatives, measured by SSAP, are presented. The DHS now contains 3307 multiple structural alignments for 1459 superfamilies. For each superfamily, multiple alignments are generated for all the relatives and also for subgroups of structurally similar relatives and sequence-similar relatives. Alignments are performed using the residue-based CORA algorithm (15) and presented both as CORAPLOTS (14) and in the form of a 2DSEC diagram (16), alongside co-ordinate data of the superposed structures in PDB format. Sequence representations of the alignments are available to download in FASTA format. In the CORAPLOT images of the multiple alignment, residues in each domain are coloured according to ligand binding and residue type. EquivSEC plots are also shown that describe the variability in orientation and packing between equivalent secondary structures (16).
To identify sequence relatives for CATH superfamilies, sequences from UniProt (17) were scanned against HMMs of all CATH domains (12). Homologous sequences were identified as those hits with an E-value <0.01 and a 60% residue overlap with the CATH domain. This protocol recognised over one million domain sequences in UniProt which could be integrated in the CATH-DHS. The harvested sequences in each superfamily were compared against other relatives by BLAST (18) to determine the pair-wise sequence identity and then clustered at appropriate levels of sequence identity (35% and 95%) using multi-linkage clustering. Information and links to other functional databases, ENZYME (19), GO (Gene Ontology Consortium, 2000), KEGG (20), COG (21) and SWISSPROT (22), are also included by BLASTing the sequences from each superfamily against sequences provided by these resources. Only hits with 95% sequence identity and an 80% residue overlap were used to annotate sequences.
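The two filters described above (one for accepting a UniProt sequence as a superfamily relative, one for transferring functional annotation) can be sketched in Python. The default thresholds are this edit's reading of the figures in the text, whose decimal points and percent signs were lost in extraction, so they are exposed as parameters. Measuring overlap as the fraction of the CATH domain (or of the annotated sequence) covered by the alignment is an assumption, and the function names are hypothetical.

```python
def is_homologous_hit(evalue, aligned_residues, domain_length,
                      max_evalue=0.01, min_overlap=0.60):
    """HMM hit filter: E-value below the cut-off and enough of the CATH
    domain covered by the alignment (assumed definition of overlap)."""
    if domain_length <= 0:
        raise ValueError("domain_length must be positive")
    overlap = aligned_residues / domain_length
    return evalue < max_evalue and overlap >= min_overlap


def can_transfer_annotation(percent_identity, aligned_residues, target_length,
                            min_identity=95.0, min_overlap=0.80):
    """BLAST hit filter for annotation transfer: high identity and high
    overlap with the annotated sequence."""
    if target_length <= 0:
        raise ValueError("target_length must be positive")
    overlap = aligned_residues / target_length
    return percent_identity >= min_identity and overlap >= min_overlap


if __name__ == "__main__":
    # Hypothetical hits against a 150-residue CATH domain.
    print(is_homologous_hit(1e-5, 120, 150))
    print(is_homologous_hit(1e-5, 60, 150))
    print(can_transfer_annotation(97.0, 130, 150))
```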
Recent analysis of structural and functional divergence in highly populated CATH superfamilies (>5 structural relatives with <35% sequence identity) has been undertaken using data from the DHS. The 2DSEC algorithm was used to analyse multiple structural alignments of families and identify highly conserved structural cores and secondary structure embellishments, or decorations, to the common core. In some large superfamilies, extensive embellishments were observed outside the core, and although these secondary structure insertions were frequently discontinuous in the protein chain, they were often co-located in 3D space (16). In many cases, manual inspection revealed that the embellishments had aggregated to form a larger structural feature that was modifying the active site of the domain or creating new surfaces for domain or protein interactions. Data collected in the DHS clearly show a relationship between structural divergence within a superfamily, sequence divergence of this superfamily amongst predicted domains in the genomes, and the number of distinct functional groups that can be identified for the superfamily (see Figure 4).
Figure 4. Relationship between sequence variability, structural variability and functional diversity in CATH superfamilies. Structural variation in a CATH superfamily, as measured by the number of diverse structural subgroups (SSAP score <80 between groups)
LATERAL LINKS ACROSS THE CATH HIERARCHY TO CAPTURE EVOLUTIONARY DIVERGENCE AND EXPLORE THE STRUCTURAL CONTINUUM
Our analysis of structural divergence in CATH superfamilies (16) has revealed families where significant changes in the structures had occurred. In some cases 5-fold differences in the sizes of domains were identified, and sometimes it was apparent that the 'folds' of these very diverse relatives had effectively changed. Therefore, in these superfamilies more than one fold group can be identified, effectively breaking the hierarchical nature of the CATH classification, which implies that each relative within a CATH homologous superfamily should belong to the same CAT fold group.
In addition, an 'all versus all' HMM–HMM scan between all superfamily representatives revealed several cases of extremely remote homologues which had been classified into separate superfamilies and yet match with significant E-values. Structure comparison had failed to detect the relationship between these superfamilies because the structural divergence of the relatives was so extreme, sometimes constituting a change in architecture as well as fold group. In these cases homology was only suggested by the HMM-based scans and then manually validated by considering functional information and detailed evidence from the literature. In order to capture information on these distant homologies, links have been created between the superfamilies, both on our web pages and in the CATH database. The data can now be found as a link from the CATH homepage (http://www.cathdb.info).
In the near future we also plan to provide web pages presenting cases of significant structural overlaps between superfamilies or fold groups. For these cases we are not currently able to find any additional evidence to support a distant evolutionary relationship, and these examples highlight the recurrence of large structural motifs between some folds and the existence of a structural continuum in some regions of fold space.
FEATURES
The CATH database can be accessed at http://www.cathdb.info. The web interface may be browsed or, alternatively, searched with PDB codes or CATH domain identifiers. There is also a facility for keyword searches. With the version 3.0 release we now make the raw and processed data files available, which include, for example, CATH domain PDB files, sequences and dssp files, and they can be accessed through the CATH database main page. The Gene3D resource can be accessed through the CATH database or directly at http://www.cathdb.info/Gene3D. The DHS can be accessed through the CATH database or directly at http://www.cathdb.info/bsm/dhs.
Structural similarity is assessed using an automatic method (SSAP) (3,4), which scores 100 for identical proteins and generally returns scores above 80 for homologous proteins. More distantly related folds generally give scores above 70 (Topology or fold level), though in the absence of any sequence or functional similarity this may simply represent examples of convergent evolution, reinforcing the hypothesis that there exists a limited number of folds in nature (5,6).
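The SSAP score ranges quoted above can be summarised as a rule of thumb in Python. This is only a sketch of the "generally" thresholds stated in the text (100, above 80, above 70); the category labels and the function name are hypothetical, and a real classification would also weigh sequence and functional evidence.

```python
def interpret_ssap(score):
    """Map an SSAP score (0-100) to the rough categories given in the text."""
    if not 0 <= score <= 100:
        raise ValueError("SSAP scores range from 0 to 100")
    if score == 100:
        return "identical"
    if score > 80:
        return "likely homologous"
    if score > 70:
        return "similar fold (Topology level)"
    return "no clear structural relationship"


if __name__ == "__main__":
    for s in (100, 86.5, 74.0, 52.3):
        print(s, "->", interpret_ssap(s))
```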
Abstract
We report the latest release (version 1.4) of the CATH protein domains database (http://www.biochem.ucl.ac.uk/bsm/cath). This is a hierarchical classification of 13 359 protein domain structures into evolutionary families and structural groupings. We currently identify 827 homologous families in which the proteins have both structural similarity and sequence and/or functional similarity. These can be further clustered into 593 fold groups and 32 distinct architectures. Using our structural classification and associated data on protein functions stored in the database (EC identifiers, SWISS-PROT keywords and information from the Enzyme database and literature), we have been able to analyse the correlation between the 3D structure and function. More than 96% of folds in the PDB are associated with a single homologous family. However, within the superfolds, three or more different functions are observed. Considering enzyme functions, more than 95% of clearly homologous families exhibit either single or closely related functions, as demonstrated by the EC identifiers of their relatives. Our analysis supports the view that determining structures, for example as part of a 'structural genomics' initiative, will make a major contribution to interpreting genome data.
Introduction FOR CATH
The CATH classification of protein domain structures was established in 1993 (1) as a hierarchical clustering of protein domain structures into evolutionary families and structural groupings, depending on sequence and structure similarity. There are four major levels, corresponding to protein class, architecture, topology or fold, and homologous family (Fig. 1). Since 1995, information about these structural groups and protein families has been accessible over the Web (http://www.biochem.ucl.ac.uk/bsm/cath), together with summary information about each individual protein structure (PDBsum) (2).
CATH consists of both phylogenetic and phenetic descriptors for protein domain relationships. At the lowest levels in the hierarchy, proteins are grouped into evolutionary families
(Homologous families) for having either significant sequence similarity (≥35% identity) or high structural similarity and some sequence similarity (≥20% identity).
The CATH database of protein structures contains approximately 18 000 domains organized according to their (C)lass, (A)rchitecture, (T)opology and (H)omologous superfamily. Relationships between evolutionarily related structures (homologues) within the database have been used to test the sensitivity of various sequence search methods in order to identify relatives in GenBank and other sequence databases. Subsequent application of the most sensitive and efficient algorithms, gapped BLAST and the profile-based method Position Specific Iterated Basic Local Alignment Tool (PSI-BLAST), could be used to assign structural data to between 22% and 36% of microbial genomes in order to improve functional annotation and enhance understanding of biological mechanism. However, on a cautionary note, an analysis of functional conservation within fold groups and homologous superfamilies in the CATH database revealed that whilst function was conserved in nearly 55% of enzyme families, function had diverged considerably in some highly populated families. In these families, functional properties should be inherited far more cautiously, and the probable effects of substitutions in key functional residues must be carefully assessed.
Figure 1
Schematic representation of the (C)lass, (A)rchitecture and (T)opology/fold levels in the CATH database.
Figure 2
Snapshot of a web page showing data available in the CATH dictionary of homologous superfamilies for the subtilisin family (CATH id 3.40.50.200). Tables display the PDB codes for non-identical relatives in the family, together with EC identifier codes and information about the enzyme reactions. The multiple structural alignment shown has been coloured according to secondary structure assignments (red for helix, blue for strands).
The Architecture level in CATH groups proteins whose folds have similar 3D arrangements of secondary structures (e.g. barrel, sandwich or propellor), regardless of their connectivity, whilst the top level, Class, simply reflects the proportion of α-helix or β-strand secondary structures. Three major classes are recognised: mainly-α, mainly-β and α−β, since analysis revealed considerable overlap between the α+β and alternating α/β classes originally described by Levitt and Chothia (7).
Before classification, multidomain proteins are first separated into their constituent folds using a consensus method which seeks agreement between three independent algorithms (8). Whilst the protocol for updating CATH is largely automatic (9), several stages require manual validation, in particular establishing domain boundaries in proteins for which no consensus could be reached and checking the relationships of very distant homologues and proteins having borderline fold similarity. Although there are plans to assign the more regular architectures automatically, all architecture groupings are currently assigned manually.
A homologous family Dictionary is now available within CATH, which contains functional data, where available, for each protein within a homologous family. This includes EC identifiers, SWISS-PROT keywords and information from the Enzyme database or the literature (Fig. 2). Multiple structure-based alignments are also available, coloured according to secondary structure assignments or residue properties, and there are schematic plots showing domain representations annotated by protein-ligand interactions (DOMPLOTS) (A.E.Todd, C.A.Orengo and J.M.Thornton, submitted to Protein Engng). The topology of each domain is illustrated by schematic TOPS diagrams (http://www3.ebi.ac.uk/tops; 10).
Figure 3
CATH wheel plot showing the population of homologous families in different fold groups, architectures and classes. The wheel is coloured according to protein class (red, mainly-α; green, mainly-β; yellow, α−β; blue, few secondary structures). The size of the outer wheel represents the number of homologous families in CATH, whilst each band in the outer wheel corresponds to a single fold family. The size of each fold 'band' therefore reflects the number of homologous families having that fold. It can be seen that most fold families contain a single homologous family. The superfold families are shown as paler bands containing many homologous families. The inner wheel shows the population of homologous families in the different architectures.
We have also recently set up a Web Server (11) which enables the user to scan the CATH database with a newly determined protein structure and identify possible fold similarities or evolutionary relationships. There are also plans to incorporate sequence searches (using BLAST or PSI-BLAST) (12) to identify a probable fold for a new sequence.
The latest release of CATH (version 1.4, April 1998) contains 9342 protein chains from the PDB (13), which divide into 13 359 domain folds. Currently 32 different architectures are recognised. Since the last release, three new architectures have been described, including the five-bladed α-β propellor. Grouping proteins on the basis of sequence, structure and functional similarity gives 827 evolutionary homologous families (H-level). Recognising more distant structural similarity, with no accompanying sequence or functional similarity, gives rise to 593 different fold groups (T-level).
The population of the different levels in the CATH hierarchy is illustrated by the CATH wheel shown in Figure 3. It can be seen that several highly populated fold families, which we describe as superfolds (6) as they support a diverse range of sequences and more than three different functions, still account for nearly 30% of non-homologous structures.
Implications for Structural Genomics
As the sequence databases grow rapidly, the need to interpret these sequences and assign functions to
specific genes becomes increasingly important. Many techniques exist for matching protein sequences and thereby inheriting functional information. However, for very distant homologues there is often no
detectable sequence similarity, despite conservation of 3D structure and function. For these cases,
evolutionary relationships, and thereby functions, can only be assigned by comparing the structures.
Therefore a number of structural genomics initiatives are being proposed (14) which aim to identify
all the folds in nature, with the ultimate goal of being able to predict the function of a new protein from
its known or probable structure. The important questions to ask are: how many more folds do we need
to determine before we have the complete set, and how confident can we be in assigning function between proteins having similar structures?
In the current genomes, on average only 30–46% of sequences can be assigned to a structural family
by recognising sequence similarity to a protein of known structure (15,16). With only ∼600 unique
structures currently in the PDB, compared with ∼20 000 sequence families, it is clear that we still
need to determine many more structures if we are to understand biology at the molecular level.
However, analysis of recently deposited structural data is very revealing. Figure 4a illustrates the
distribution of 2159 new structural domains classified in the 10 months from June 1997 to March
1998. A large proportion of these (79%) were clearly homologous (≥30% identity) to proteins of known structure.
Of the remaining 443 structures (Fig. 4b), corresponding to 'new' sequences, we found only 8% were
novel folds, the remainder resembling a previously determined structure. Many of these, 199 (45%),
could be identified as clear homologues by having significant structure and sequence similarity (SSAP
≥80 and ≥20% sequence identity). A further 169 (38%) were probable homologues: although the sequence identity was below 20%, they had functional similarity and/or gave significant scores using
sequence search methods designed to detect very distant homologues (PSI-BLAST) (12). There
remained a further 40 (9%) proteins which were analogous, i.e. they had the same fold as a previous
entry but neither the sequence nor the function gave definite evidence of a common ancestor.
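The breakdown of the 443 domains can be checked with a few lines of arithmetic. The sketch below uses only the counts quoted in the paragraph above. The category labels and the helper name `percent` are ours, and the number of novel folds is derived as the remainder rather than taken from the text.

```python
# Counts quoted in the text for the 443 domains with 'new' sequences.
TOTAL_NEW = 443
counts = {
    "clear homologue": 199,
    "probable homologue": 169,
    "analogue": 40,
}
# Whatever is left over must be the novel folds (derived, not quoted).
counts["novel fold"] = TOTAL_NEW - sum(counts.values())


def percent(part, total):
    """Share of `part` in `total`, rounded to the nearest whole percent."""
    return round(100 * part / total)


shares = {label: percent(n, TOTAL_NEW) for label, n in counts.items()}
# Clear plus probable homologues: the fraction recognisable as homologous.
homologous_share = percent(
    counts["clear homologue"] + counts["probable homologue"], TOTAL_NEW
)

for label, n in counts.items():
    print(f"{label:20s} {n:4d} {shares[label]:3d}%")
print(f"homologues in total: {homologous_share}%")
```

The remainder of 35 domains corresponds to the 8% of novel folds, and the two homologue classes together give the 83% figure used later in the text.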
At the homologous superfamily level in CATH, a more detailed analysis of enzyme functions showed
that the majority of homologous enzyme families in CATH (>90%) contained proteins for which the
first three EC identifiers were the same. Considering those families where homologues have
significant sequence identity (≥20%) after structural alignment, 95% were found to have a single EC
identifier, whilst for families where proteins have more than 30% sequence similarity we observed
that 98% had a single EC code.
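To make "the first three EC identifiers were the same" concrete, the following minimal sketch counts how many leading fields a set of EC numbers share. The function name `shared_ec_levels` and the example EC numbers are illustrative assumptions. This is not the procedure used in the CATH analysis.

```python
def shared_ec_levels(ec_numbers):
    """Return how many leading EC fields (0 to 4) all the given numbers share."""
    split = [ec.split(".") for ec in ec_numbers]
    shared = 0
    # zip(*split) walks the four EC levels in parallel across all entries.
    for fields in zip(*split):
        if len(set(fields)) != 1:
            break
        shared += 1
    return shared


# Hypothetical family whose members differ only in the fourth field.
family = ["3.1.1.1", "3.1.1.7"]
print(shared_ec_levels(family))  # first three identifiers agree
```

A family would count as having "a single EC code" when the function returns 4, and as conserved to the third level when it returns 3.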
Although assigning function on the basis of homology is common practice, it is clear that some
caution should be exercised, particularly where there is little or no sequence similarity. There are also
some clear examples where homologues with significant sequence similarity perform different
functions. The role of 'gene recruitment' is especially clear in the eye lens proteins, which function as
enzymes in other cellular environments but which are used as structural proteins in this context (17).
The extent of such 'gene recruitment' and context-sensitive function is really not known at this time.
For enzymes it is clear that catalytic function can change and evolve, usually to act on a different but
related substrate. Similarly, within the lipocalin family (CATH id 24013010) several proteins are found with very similar structures which bind different fatty acids in the same region at the base of
the β-barrel (e.g. retinol, bilin, biotin).
Nearly half of the homologous families where two or more different EC numbers were observed
belong to the superfolds. This suggests that if a new protein is assigned to a superfold family, more
caution should be used when inheriting functional information, as there appears to be greater tolerance to changes in sequence, and ultimately function, for these families. However, it is interesting to note
that many of these were TIM barrel or Rossmann folds. These are superfolds in which the substrate or
ligand commonly binds in the same place: in the base of the β-barrel for the TIMs and at the
crossover of the polypeptide chain for the doubly wound Rossmann structures.
Assignment of Function Through Structure
One of the reasons for determining structures is to derive more information to facilitate the assignment of function. From our analysis of proteins in CATH we suggest that structural data can help to assign
function in several ways:
i. The structural data allow recognition of more distant homologues compared with sequence data; in our analysis 83% of structures with novel sequences could be assigned as homologues in this way (note that such assignment of function is again subject to the caveats
imposed by 'gene recruitment' discussed above).
ii. The structural data allow detailed inspection of the functional site, to suggest if and
how the function may have evolved. For example, if an enzyme has evolved to act on a
different substrate, the binding site may reveal, or at least suggest, possible changes in the
substrate.
iii. For the superfolds, similarity of structure does not necessarily mean similarity of
function. However, the active site/binding sites are often conserved, e.g. in the TIM barrel or
Rossmann fold structures the ligand always binds at the same end of the barrel or sheet.
iv. Some methods which aim to predict function ab initio from structure have already been developed and will increasingly be the focus of attention over the next few years. For
example, enzymes can often be identified by the presence of a major cleft, which also locates
the active site (18). Similarly, critical surface patches which are used for molecular
recognition in binding other proteins or ligands may be identified using knowledge-based approaches (19,20).
In summary, extrapolating the data from Figure 4 to a new genome, we can expect that of the 54–70%
of sequences which currently have no obvious sequence matches in the PDB, we will find nearly 80–90% to be homologous to a known family using the structural data alone. For the singlet folds this
will almost certainly reveal some clues to the function. For the superfolds, some folds will reveal
information on the functional class (e.g. enzyme for TIM barrels) or the location of the active site, if
not the specific function. Only 10–20% will be expected to be 'novel' folds. For these, the ab
initio methods referred to above may provide some clues to guide experiments. Therefore it is clear
that determining structures, as part of a 'structural genomics' initiative for example, will make a
major contribution to interpreting genome data.
JAIL
Just another interface library
Interfaces of macromolecules are a valuable basis for analysing the process of molecular recognition. JAIL classifies not only the interfaces between domain architectures, but also those between protein chains and those between proteins and nucleic acids.
Of course not
[Figure: Gt-alpha/Gi-alpha chimera (PDB-ID 1GOT) and the interfaces of 1GOT]
Interacting proteins are difficult to crystallize and are rarely present in the Protein Data Bank.
Nevertheless, it is essential to analyse the interacting parts of proteins in order to understand the
process of protein-protein docking.
To overcome this problem we have built the JAIL database. Since interacting domains
exhibit structural features similar to those of interacting proteins, all known interfaces between interacting
domains of the SCOP database were extracted and classified in JAIL.
Only a part of all protein structures is included in SCOP; new PDB entries in particular are
not yet annotated. To overcome this problem, all interfaces between protein
chains were additionally calculated and included in the database. This type of interface also comprises
the interacting parts of the assumed biological units. The last important type of interface
provided here consists of the interacting parts between proteins and nucleic acids.
Overall, the data set consists of about 180000 interfaces.
JAIL is a convenient tool for browsing the interface library and analysing single
interfaces. However, more general questions require large-scale analysis. For this purpose,
a detailed form enables the compilation of comprehensive non-redundant data sets for
download.
How is an interface defined?
A complete residue is part of an interface if at least one atom of the amino acid is located
within 4.5 Ångström of any atom of the interacting domain or chain. One part of
an interface must consist of at least 5 C-alpha atoms in the case of protein chains. In the case of nucleic acids, the relevant atoms are calculated on the basis of the phosphorus atoms
of the RNA/DNA backbone.
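The distance criterion above can be sketched in a few lines of Python. This is only an illustration of the definition. It is not JAIL's implementation, and the data layout (a dictionary mapping a residue label to a list of atom coordinates), the function names and the toy coordinates are all assumptions made for the example.

```python
import math

CUTOFF = 4.5        # distance cutoff in Ångström, as quoted in the text
MIN_RESIDUES = 5    # minimum size of one side of a protein interface


def interface_residues(chain_a, chain_b, cutoff=CUTOFF):
    """Residues of chain_a with at least one atom within `cutoff` of chain_b.

    Each chain is a dict: residue label -> list of (x, y, z) atom coordinates.
    """
    atoms_b = [xyz for atoms in chain_b.values() for xyz in atoms]
    hits = set()
    for res_id, atoms in chain_a.items():
        # The whole residue counts as soon as a single atom is close enough.
        if any(math.dist(a, b) <= cutoff for a in atoms for b in atoms_b):
            hits.add(res_id)
    return hits


def is_valid_side(residues, minimum=MIN_RESIDUES):
    """One C-alpha per residue, so the size check reduces to a residue count."""
    return len(residues) >= minimum


# Toy coordinates, invented for illustration only.
chain_a = {"A1": [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)], "A2": [(10.0, 0.0, 0.0)]}
chain_b = {"B1": [(4.0, 0.0, 0.0)], "B2": [(30.0, 0.0, 0.0)]}

print(interface_residues(chain_a, chain_b))
print(interface_residues(chain_b, chain_a))
```

The all-against-all distance loop is quadratic in the number of atoms. For real structures a spatial index (for example a k-d tree) would be the usual choice, but the brute-force form keeps the definition visible.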
What are biounits?
The primary coordinate file deposited in the PDB generally contains one asymmetric unit.
The asymmetric unit is the smallest portion of a crystal structure to which crystallographic
symmetry can be applied to generate one unit cell. The biological molecule (biounit) is believed
to be the functional unit of the protein. Frequently those units can be assumed or calculated when additional information is available. The biological units of many proteins are deposited
in a separate section of the PDB database and can be used for interface calculations.
In which way are redundant interfaces excluded in the download section?
Redundancy is excluded in two different ways: by structure and by sequence. The
sequential clustering is based on the CD-HIT program. The structural clustering is defined by
the protein families and superfamilies of the SCOP classification. The database classifies
proteins by domain architecture.
Which settings in the download section are best for my own research?
The selection of the datasets depends on the type of interaction (protein-protein or
protein-nucleic acid) and the level of diversity that is desired. A maximal sequence identity of 50%
results in a higher diversity than a setting of 95%. The default settings include interfaces of
domain-domain interactions as well as interfaces between interacting chains. All interfaces of
chains that were already treated by the SCOP domain interfaces are excluded by default.
This procedure results in a high number of interfaces that are still diverse enough for
statistical analysis.
What is meant by 'show conservation' in Jmol?
The conservation of protein sequences is defined by the mutation rates at each amino acid
position. For JAIL this information was retrieved from ConSurf. ConSurf is a derived
database merging structural and sequence information.
Database scheme
The search form accepts the following identifiers:
PDB-Id, e.g. 1aay
SCOP-Id, e.g. d1az0a_
EC-Number, e.g. 311
Accession number, e.g. P03697
Protein name, e.g. capsid protein
The search can be restricted to the following interface types: Domain/Domain (SCOP), Chain/Chain, Protein/Nucleic, Biounit/Biounit, or none. A full-text keyword search and a SCOP text search are also available, and the results can be limited to interfaces having a given location (intra, inter, or don't care).
MMDB
Experimentally resolved structures of proteins, RNA and DNA, derived from the Protein Data Bank (PDB), with value-added features such as explicit chemical graphs, computationally
identified 3D domains (compact substructures) that are used to identify similar 3D structures, as well as links to literature, similar sequences, information about chemicals bound to the structures, and more. These connections make it possible, for example, to find 3D structures for homologs of a protein sequence of interest, then interactively view the sequence-structure relationships, active sites, bound chemicals, journal articles, and more.
Three-dimensional structures are now known within many protein families, and it is
quite likely, in searching a sequence database, that one will encounter a homolog
with known structure. The goal of Entrez's 3D-structure database is to make this
information, and the functional annotation it can provide, easily accessible to molecular biologists. To this end Entrez's search engine provides three powerful
features: (i) Sequence and structure neighbors: one may select all sequences similar
to one of interest, for example, and link to any known 3D structures. (ii) Links
between databases: one may search by term matching in MEDLINE, for example, and
link to 3D structures reported in these articles. (iii) Sequence and structure
visualization: identifying a homolog with known structure, one may view
molecular-graphic and alignment displays to infer approximate 3D structure. In this article we
focus on two features of Entrez's Molecular Modeling Database (MMDB) not
described previously: links from individual biopolymer chains within 3D structures to a systematic taxonomy of organisms represented in molecular databases, and
links from individual chains (and compact 3D domains within them) to structure
neighbors, that is, other chains (and 3D domains) with similar 3D structure. MMDB may be
accessed at http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Structure
SUPERFAMILY is a database of structural and functional annotation for all
proteins and genomes.[1][2][3][4][5]
The SUPERFAMILY annotation is based on a collection of hidden Markov models, which
represent structural protein domains at the SCOP superfamily level.[6] A superfamily groups
together domains which have an evolutionary relationship. The annotation is produced by
scanning protein sequences from completely sequenced genomes against the hidden Markov
models.
For each protein you can:
Submit sequences for SCOP classification
View domain organisation, sequence alignments and protein sequence details
For each genome you can:
Examine superfamily assignments, phylogenetic trees, domain organisation lists and
networks
Check for over- and under-represented superfamilies within a genome
For each superfamily you can:
Inspect SCOP classification, functional annotation, Gene Ontology annotation, InterPro
abstract and genome assignments
Explore taxonomic distribution of a superfamily across the tree of life
All annotation, models and the database dump are freely available for download to everyone.
Purpose
SUPERFAMILY classifies amino acid sequences into known structural domains, especially
into SCOP superfamilies. The superfamilies are groups of proteins which have structural
evidence to support a common evolutionary ancestor but may not have detectable
sequence homology.
Major Features
Sequence search: Submit your protein or DNA sequence for SCOP superfamily and family level classification.
Keyword search: Search for superfamily, family or species names, plus sequence, SCOP, PDB or hidden Markov model IDs.
Domain assignments: Domain assignments, alignments and architectures for completely sequenced eukaryotic and prokaryotic organisms, plus sequence collections.
Comparative genomics tools: Browse unusual (over- and under-represented) superfamilies and families, adjacent domain pair lists and graphs, unique domain pairs, domain combinations, domain architecture co-occurrence networks and domain distribution across taxonomic kingdoms for each organism.
Genome statistics: For each genome: number of sequences, number of sequences with assignment, percentage of sequences with assignment, percentage total sequence coverage, number of domains assigned, number of superfamilies assigned, number of families assigned, average superfamily size, percentage produced by duplication, average sequence length, average length matched, number of domain pairs and number of unique domain architectures.
Gene Ontology: Domain-centric Gene Ontology (GO), automatically annotated, by Hai Fang.
Phenotype Ontology: Domain-centric phenotype/anatomy ontology including Disease Ontology, Human Phenotype, Mouse Phenotype, Worm Phenotype, Yeast Phenotype, Fly Phenotype, Fly Anatomy, Zebrafish Anatomy, Xenopus Anatomy, Arabidopsis Plant.
Superfamily annotation: InterPro abstracts for 1052 superfamilies and Gene Ontology (GO) annotation for 763 superfamilies.
Functional annotation: Functional annotation of SCOP 1.73 superfamilies, by Christine Vogel.
Phylogenetic trees: Trees are generated using heuristic parsimony methods and are based on protein domain architecture data for all genomes in SUPERFAMILY. Genome combinations or specific clades can be displayed as individual trees.
Similar domain architectures: Find the 10 domain architectures which are most similar to a domain architecture of interest.
Hidden Markov models: Produce SCOP domain assignments for your sequences using the SUPERFAMILY models. HMM visualisation by Martin Madera, e.g. model 0045110.
Profile comparison: Find remote domain matches when the HMM search fails to find a significant match. Profile comparison (PRC), for aligning and scoring two profile hidden Markov models, by Martin Madera.
Web services: Distributed Annotation Server and linking to SUPERFAMILY.
Downloads: Sequences, assignments, models, MySQL database and scripts, updated weekly.
Recent news
3rd June 2013 genetrainer being launched at LeWeb conference
The SUPERFAMILY database provides protein domain assignments, at the SCOP 'superfamily'
level, for the predicted protein sequences in over 400 completed genomes. A superfamily groups
together domains of different families which have a common evolutionary ancestor, based on
structural, functional and sequence data. SUPERFAMILY domain assignments are generated
using an expert curated set of profile hidden Markov models. All models and structural
assignments are available for browsing and download from http://supfam.org. The web interface
includes services such as domain architectures and alignment details for all protein assignments,
searchable domain combinations, domain occurrence network visualization, detection of over- or
under-represented superfamilies for a given genome by comparison with other genomes,
assignment of manually submitted sequences, and keyword searches. In this update we describe
the SUPERFAMILY database and outline two major developments: (i) incorporation of family
level assignments and (ii) a superfamily-level functional annotation. The SUPERFAMILY
database can be used for general protein evolution and superfamily-specific studies, genomic
annotation, and structural genomics target suggestion and assessment.
Alignment that contains only one of each pair of homologues that are represented in a CLUSTALW-derived phylogenetic tree linked by a branch of length less than a distance of 0.2 (see the related article).
SEG
A program of Wootton & Federhen [1] that detects regions of the query sequence that have low compositional complexity [2].
Sequence ID or ACC
Sequence identifiers or accession codes may be entered via the SMART homepage to initiate a query. You can use either UniProt or Ensembl sequence identifiers.
Signalling Domains
The original set of domains used in SMART were collected as those that satisfied one or both of two criteria:
1. cytoplasmic domains that possess kinase, phosphatase, ubiquitin ligase or phospholipase enzymatic activities, or those that stimulate GTPase activation or guanine nucleotide exchange
2. cytoplasmic domains that occur in at least two proteins with different domain
organisations, of which one also contains a domain that satisfies criterion 1
These domains mostly mediate or regulate the transduction of an extracellular signal towards the nucleus, resulting in the initiation of a cellular response. More recently, prokaryotic two-component signalling domains have been added to the SMART set.
SignalP
This program predicts the presence and location of signal peptide cleavage sites in amino acid sequences (SignalP home page).
SMART: Simple Modular Architecture Research Tool
Our main goal in providing this tool is to allow automatic identification and annotation of domains in user-supplied protein sequences.
Species
Numbers of domains present in a variety of selected taxa (animal, archaea, bacteria, fungi, plants and protozoa) are shown in annotation pages.
SwissProt
The SwissProt database is an extensively annotated and non-redundant collection of protein sequences. SwissProt annotations have been mined for SMART-derived annotations of alignments.
TMHMM2
This program predicts the location and topology of transmembrane helices (TMHMM2).
Thresholds
For each of the domains found by SMART, a combination of thresholds is used to distinguish between true and false hits. The different thresholds are described in the SMART paper.
WiseTools
A package that is based on database searches using profiles. Profiles may be generated using PairWise and then compared with sequence databases using SearchWise. Scores are generated for alignments that match the whole of the profile (using a positive profile) or else that match at least part of the profile (using a negative profile). Only the top-scoring optimal alignment of each
sequence is reported; hence SMART relies on re-iterating the search for new repeats until none are reported that score above threshold. Score thresholds have been set manually that are considered to represent a score just above the top-scoring true negative. Additional thresholds have been estimated for domains that are repeated in single polypeptides. A more recent package allows comparison of DNA sequences at the level of their conceptual translations, regardless of sequence error and introns (see Wise2).
What's new
Changes from version 6.0 to 7.0
Full text search engine
Perform a full text search of SMART and Pfam domain annotations, plus the complete protein descriptions for UniProt and Ensembl proteins.
metaSMART
Explore and compare domain architectures in various publicly available metagenomics datasets.
iTOL export and visualization
Domain architecture analysis results can be exported and visualized in interactive Tree Of Life. The new option can be found in the protein list function select list.
User interface cleanup
Various small changes to the UI, resulting in faster and easier navigation.
Changes from version 5.1 to 6.0
Metabolic pathways information
SMART domains and proteins available in the genomic mode now have basic metabolic pathways information. It is generated by mapping our genomic mode protein database to the KEGG orthologous groups.
Updated genomic mode protein database
The greatly expanded genomic mode database now includes sequences from 630 completely sequenced species.
Changes from version 5.0 to 5.1
SMART webservice
You can access SMART using our webservice. Check the WSDL file for details. The webservice is still under active development and only works for sequences and IDs which are in our database (complete Uniref100 and stable Ensembl genomes).
SMART DAS server
The SMART DAS server is available at the URL http://smart.embl.de/smart/das. It provides all the protein features from our database (SMART domains, signal peptides, transmembrane regions and coiled coils) for all Uniref100 and Ensembl proteins.
If you need help with these services or have questions or feedback, please contact us.
Changes from version 4.1 to 5.0
New protein database in Normal mode
SMART now uses UniProt as the main source of protein sequences. All Ensembl proteomes (except pre-releases) are also included. To lower redundancy in the database, the following procedure is used:
o only one copy of 100% identical proteins is kept (different IDs are still available)
o each species' proteins are separated
o CD-HIT clustering with a 96% identity cutoff is performed on each species separately
o the longest member of each protein cluster is used as the representative
o only representative cluster members and single proteins (i.e. proteins which are not
members of any clusters) are used in all domain architecture queries and for domain counts in the annotation pages
Even though the number of proteins in the database is almost doubled (current version has around 29 million proteins), the redundancy should be minimal.
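The redundancy-reduction steps listed above can be sketched as a greedy, longest-first clustering. This is a toy illustration under stated assumptions, not SMART's pipeline and not CD-HIT itself. The `identity` function uses Python's `difflib` similarity ratio as a crude stand-in for an alignment-based identity, exact duplicates are collapsed within each species (the text does not say at which stage this happens), and all sequences and IDs are invented.

```python
from difflib import SequenceMatcher


def identity(a, b):
    """Crude similarity in [0, 1]; a stand-in for alignment identity (assumption)."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def representatives(proteins, cutoff=0.96):
    """proteins: iterable of (protein_id, species, sequence).

    Returns (species, representative_id, sequence) tuples.
    """
    # Separate by species and keep one copy of identical sequences,
    # remembering every ID that shares the sequence.
    by_species = {}
    for pid, species, seq in proteins:
        by_species.setdefault(species, {}).setdefault(seq, []).append(pid)

    reps = []
    for species, seqs in by_species.items():
        chosen = []
        # Longest first, so the longest member of a cluster becomes its representative.
        for seq in sorted(seqs, key=len, reverse=True):
            if not any(identity(seq, rep) >= cutoff for rep in chosen):
                chosen.append(seq)
        reps.extend((species, seqs[rep][0], rep) for rep in chosen)
    return reps


# Invented example data.
full = "ACDEFGHIKLMNPQRSTVWY" * 3
proteins = [
    ("P1", "human", full),
    ("P2", "human", full),        # exact duplicate of P1
    ("P3", "human", full[:-1]),   # one residue shorter, clusters with P1
    ("P4", "human", "G" * 30),    # unrelated sequence
    ("P5", "mouse", full[:-1]),   # other species, clustered separately
]
reps = representatives(proteins)
print([(species, pid) for species, pid, _ in reps])
```

Sorting by length before clustering is what makes "longest member as representative" fall out automatically: a shorter sequence can only ever join a cluster that an equally long or longer one has already founded.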
Domain architecture invention dating
As a further step from the single domain to the understanding of multi-domain proteins, SMART now predicts the taxonomic class where the concept of a protein, that is its domain architecture, was invented. The domain architecture is defined as the linear order of all SMART domains in the protein sequence. To derive the point of its invention, all proteins with the same domain architecture are mapped onto NCBI's taxonomy. The last common ancestor of all organisms containing at least one protein with the domain architecture is defined as the point of its origin.
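The last-common-ancestor step described above can be illustrated with a short sketch. It assumes each organism is given as a root-to-leaf lineage list. The lineages below are hand-written and heavily abbreviated, and the function name is ours, so this shows the idea and not SMART's code.

```python
def last_common_ancestor(lineages):
    """Deepest taxon shared by all root-to-leaf lineages, or None if none is shared."""
    common = None
    # Walk down from the root while every lineage agrees on the taxon.
    for taxa in zip(*lineages):
        if len(set(taxa)) != 1:
            break
        common = taxa[0]
    return common


# Simplified lineages for three organisms that (hypothetically) all contain
# a protein with the same domain architecture.
human = ["cellular organisms", "Eukaryota", "Metazoa", "Chordata", "Mammalia", "Homo sapiens"]
mouse = ["cellular organisms", "Eukaryota", "Metazoa", "Chordata", "Mammalia", "Mus musculus"]
fly = ["cellular organisms", "Eukaryota", "Metazoa", "Arthropoda", "Insecta", "Drosophila melanogaster"]

print(last_common_ancestor([human, mouse]))
print(last_common_ancestor([human, mouse, fly]))
```

Adding the fly lineage moves the inferred point of origin from Mammalia up to Metazoa, which is how a single distant occurrence can push the "invention" date of an architecture much further back.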
Changes from version 4.0 to 4.1
Two modes of operation: Normal and Genomic
For more details visit the change mode page.
Intrinsic protein disorder prediction
You can now include DisEMBL prediction of protein disorder in your search parameters. The results are displayed as blue regions in protein schematics. DisEMBL's HOTLOOPS and REM465 methods are currently used. Visit the DisEMBL page for more information on the method.
Catalytic activity check
SMART now includes data on specific requirements for catalytic activity for some of our domains (50 catalytic domains at the moment). If the required amino acids are not present in the predicted domain, it will be marked as 'Inactive'. The domain annotation page will show you the details on which amino acids are missing and the links to relevant literature.
Taxonomic trees
Architecture query results are now displayed as simple taxonomic trees. In addition to individual proteins, you can select any taxonomic node (or multiple nodes) and display all the proteins in those nodes. The Evolution section of domain annotation pages is now also represented as a taxonomic tree.
The basic tree contains only several hand-picked representative species, with a link to the full tree.
User interface redesigned
SMART has been completely rewritten and all pages conform to the XHTML 1.0 Strict and CSS Level 2 standards. We recommend a modern standards-compliant browser for the best experience. Mozilla Firefox and Opera are our favorites.
Changes from version 3.5 to 4.0
Intron positions are shown in protein schematics
For proteins that match any of the Ensembl predictions, SMART will show intron positions as vertical coloured lines in graphical representations. This information is retrieved from a pre-calculated mapping of Ensembl gene structures to protein sequences.
The vertical line at the end of the protein is not an actual intron, but a mark to show that intron mapping was performed. If that is the only line, there are no introns annotated. If there is no line at all, there is no data available in Ensembl for that particular sequence.
You can switch off intron display on your SMART preferences page.
Alternative splicing information
Since SMART now incorporates Ensembl genomes, the Additional information page shows a list of alternative splices of the gene encoding the analyzed protein (if there are any). It is possible to either display SMART protein annotation for any of the alternative splices, or get a graphical multiple sequence alignment of all of them.
Orthology information
There are 2 separate sets of orthologs for each Ensembl protein: 1:1 reciprocal best matches in other genomes, and orthologous groups with reciprocal best hits from all genomes analyzed (i.e. each of these proteins has exactly one ortholog in all 6 genomes). This data is displayed on the Additional information page.
Graphical multiple sequence alignments
Orthologous groups and different alternative splices can be displayed as graphical multiple
sequence alignments. Proteins are aligned using ClustalW. Domains, intrinsic features and introns are mapped onto the alignment, with their positions adjusted according to gaps (black boxes).
Changes from version 3.4 to 3.5
Features of all Ensembl genomes are stored in SMART
You can use standard Ensembl protein identifiers (for example ENSP00000264122) and sequences in all queries.
Batch access
Quickly display results for hundreds of sequences. Currently limited to 500 sequences per access, 2000 per day.
Smarter handling of identical sequences
If there are multiple IDs associated with a particular sequence, an extra table will be displayed
showing all of them.
Changes from version 3.3 to 3.4
Search structure based profiles using RPS-Blast
Clicking on the 'search schnipsel and structures' checkbox will now also initiate a search of profiles based on SCOP domain families using RPS-Blast. These profiles were kindly provided by Steffen Schmidt (see Schmidt et al., J Chem Inf Comput Sci 2002 (42) 405-7).
Improved Architecture analysis
In addition to standard Domain selection querying, it is now possible to do queries based on GO (Gene Ontology) terms associated with domains. In the first step, you get a list of domains matching the GO terms entered. After selecting the domains of interest from the list, proteins containing those domains are displayed. Use the Taxonomic selection box to limit the results based on taxonomic ranges.
Pfam domains are stored in the database
The SMART database now contains precomputed results for all Pfam domains. To use Pfam domains in the architecture queries, prepend the domain name with Pfam: (for example TyrKc AND Pfam:Fz AND TRANS).
Try our SMART Toolbar for the Mozilla web browser.
Changes from version 3.2 to 3.3
Fantastic new protein picture generator
Proteins are now displayed as dynamically generated PNG images. This means that you can download the entire protein representation as a single image. You have our permission to do this and use these diagrams in any way that you like; do acknowledge us, though. All domain 'bubbles' have been script-generated using The Gimp and its Perl-Fu extension. The script is GPL'd and is available for download.
Fancy colour alignments
All alignments we generate are coloured in using Leo Goodstadt's excellent CHROMA program. CHROMA is available for download, and it's great. You may experience some problems if you're using a klunky browser. This will be fixed when you change your browser.
Excellent transmembrane domains
These are now calculated using the fine TMHMM2 program, kindly provided by Anders Krogh and co-workers.
Selective SMART
We now store intrinsic features such as transmembrane domains and signal peptides in our database. This means that these can be queried for (e.g. SIGNAL AND TyrKc). The feature names are: signal peptide, SIGNAL; transmembrane domain, TRANS; coiled coil, COIL.
Technical changes
SMART code has been modified to run under the Apache mod_perl module. SMART is now using Apache::DBI for persistent database connections. The database engine has been updated to PostgreSQL 7.1. These changes resulted in significant speed improvements.
Bug fixes, as usual
The show_many_proteins script now uses the POST method, so there is no longer a limit on the number of proteins you can display. Fixed a display problem with proteins having thousands of different representations (if you try to display those, make sure you have a good browser, like Mozilla or Opera, and a bunch of RAM).
Changes from version 3.1 to 3.2
Start page
There is an indicator of the current database status in the header now. If the database is down or is being updated, you'll know immediately.
Literature
Changed literature identifiers to PMID. New secondary literature generator that parses all neighbouring papers, not just the first 100. Numerous small bug fixes and improvements.
Changes from version 3.0 to 3.1
Startup page
The start page now includes selective SMART and allows searching for keywords in the annotation of domains.
Schnipsel Blast
The results of a Schnipsel Blast search are included in the bubble diagram. Additionally, PDB is searched.
Taxonomic breakdown
When selecting multiple proteins, e.g. via selective SMART or the annotation pages, an overview of the taxonomy of all species (tax break) is offered.
Links
Links don't contain version numbers. This allows stable links from external sources.
Selective SMART
Allows searching for multiple copies of domains and is case insensitive. You can now search for, e.g., Sh3 AND sH3 AND sh3.
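The multiple-copy and case-insensitivity rules can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not SMART's actual implementation: the function name `matches_query` and the representation of a protein as a plain list of domain and feature names are hypothetical, and only AND-joined queries are handled.

```python
import re
from collections import Counter


def matches_query(query: str, protein_domains: list[str]) -> bool:
    """Check an AND-only architecture query against one protein.

    Hypothetical helper: a term repeated N times in the query requires at
    least N copies of that domain in the protein; names are compared
    without regard to case.
    """
    # Split on the AND keyword, surrounded by whitespace.
    terms = re.split(r"\s+and\s+", query.strip(), flags=re.IGNORECASE)
    required = Counter(term.strip().lower() for term in terms if term.strip())
    present = Counter(name.lower() for name in protein_domains)
    # Counter returns 0 for names that are absent from the protein.
    return all(present[name] >= count for name, count in required.items())
```

With this sketch, the query `Sh3 AND sH3 AND sh3` only matches proteins that carry at least three SH3 domains, whichever way the names are capitalised.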
Annotation
You can align your query sequence to the SMART alignment using hmmalign.
Update of underlying database
SMART now uses PostgreSQL 6.5.2.
Changes from version 2.0 to 3.0
Digest output
SMART now only produces a single diagram representing a best interpretation of all the annotation that has been performed. A comprehensive summary of the results is also provided in table format.
selective SMART
Selective SMART allows you to look for proteins with combinations of specific domains in different species or taxonomic ranges.
alert SMART
The SMART database gets updated about once a week. If you are interested in specific domains, or combinations of them, in specific taxonomic ranges, you can use the SMART alerting service. This provides the identities of newly deposited proteins that match your query.
Domain queries
You can ask for proteins having the same domain order/composition as your query protein.
SMARTed Genomes
You can get the result of SMARTing the genomes of Caenorhabditis elegans and Saccharomyces cerevisiae via the annotation pages.
Faster PFAM searches
The PFAM searches now run on a PVM cluster.
Changes from version 1.03 to 2.0
Domain coverage
The original set of signalling domains has now been extended to include extracellular domains
Default search method is now HMMer
The previously used underlying methodology of SMART (i.e. SWise) tended to over-extend gaps, leading to problems in defining domain borders. Although a SWise search of the SMART database remains possible (using the "Wise searching SMART database" option available from the Home Page), the default searching method is now HMMer2. This now allows improved statistical estimates (Expectation- or E-values) of the significance of a domain hit.
Complete rewrite of the annotation pages
As with the previous version (1.03), annotation pages provide information concerning domain functions and include hyperlinks to PubMed. However, we now offer automatically derived data showing the taxonomic range and the predicted cellular localisation of proteins containing the domain in question.
Literature database
In addition to the manually derived literature sources given in version 1.03, we now provide automatically derived secondary literature from the annotation pages. These are extracted from PubMed and are cross-linked to additional abstracts.
The CATH domain structure database: new protocols and classification levels give a more comprehensive resource for exploring evolution
Lesley H. Greene, Tony E. Lewis, Sarah Addou, Alison Cuff, Tim Dallman, Mark Dibley, Oliver Redfern, Frances Pearl, Rekha Nambudiry, Adam Reid, Ian Sillitoe, Corin Yeats, Janet M. Thornton and Christine A. Orengo
WHAT'S NEW
We report the latest release (version 3.0) of the CATH protein domain database (http://www.cathdb.info). There has been a 20% increase in the number of structural domains classified in CATH, up to 86 151 domains. Release 3.0 comprises 1110 fold groups and 2147 homologous superfamilies. To cope with the increases in diverse structural homologues being determined by the structural genomics initiatives, more sensitive methods have been developed for identifying boundaries in multi-domain proteins and for recognising homologues. The CATH classification update is now being driven by an integrated pipeline that links these automated procedures with validation steps that have been made easier by the provision of information-rich web pages summarising comparison scores and relevant links to external sites for each domain being classified. An analysis of the population of domains in the CATH hierarchy and several domain characteristics are presented for version 3.0. We also report an update of the CATH Dictionary of Homologous Structures (CATH-DHS), which now contains multiple structural alignments, consensus information and functional annotations for 1459 well-populated superfamilies in CATH. CATH is directly linked to the Gene3D database, which is a projection of CATH structural data onto ~2 million sequences in completed genomes and UniProt.
GENERAL INTRODUCTION
The number of new structures being deposited in the Protein Data Bank (PDB) continues to grow at a considerable rate. In addition, structures being targeted by worldwide structural genomics initiatives are more likely to be novel or only very remotely related to domains previously classified. Only 2% of structures currently solved by conventional crystallography or NMR are likely to adopt novel folds (see Figures 1 and 2). A higher proportion of new folds is expected among structures solved by structural genomics. Although the influx of more diverse structures and subsequent analysis will inform our understanding of how domains evolve, it has resulted in increasing lags between the numbers of structures being deposited and classified. In response to this situation, we have significantly improved our automated and manual protocols for domain boundary assignment and homologue recognition.
Figure 1. Annual decrease in the percentage of new structures classified in CATH which are observed to possess a novel fold. The raw data for years 1972–2005 was fit to a single exponential equation by nonlinear regression using Sigma Plot (SPSS Version
Figure 2. Annual proportion of protein structures deposited in the PDB which are classified in CATH, rejected or pending classification. The colour scheme reflects different categories of PDB chains. Black: not accepted by the CATH criteria. Red: unprocessed chains
Significant changes have been implemented in the CATH classification protocol to achieve a more highly automated system. First, a seamless flow of structures between the constituent programs has been achieved by building a pipeline which integrates web services for each major comparison stage in the classification (see Figure 3). Secondly, completely automatic decisions are now being made for new protein chains with close relatives already assigned in the CATH database. There are two situations that preclude the CATH update process from being fully automated. We rely on expert manual curation for particularly challenging protein domain boundary assignments (DomChop stage) and also for classifications of remote folds and homologues (HomCheck stage). These two manual stages will remain an integral part of the system (Figure 3).
Figure 3. Flow diagram of the CATH classification pipeline. This schematic illustrates the processes involved in classifying newly determined structures in CATH. The CATH update protocol workflow from new chain to assigned domain is split into two main processes
In this paper we report our ongoing development of the automated procedures. These critical new features should better enable CATH to keep pace with the PDB (3) and facilitate its development. Key statistics on the domain structure populations and characteristics are also presented.
A REVISED CATH CLASSIFICATION HIERARCHY: CATHSOLID
In order to provide information on the sequence diversity between superfamily members, we have introduced additional levels into the CATH hierarchy. The CATH hierarchical classification scheme now consists of nine levels. Class is derived from secondary structure content, and Architecture describes the gross orientation of secondary structures, independent of connectivity. The Topology level clusters structures into fold groups according to their topological connections and numbers of secondary structures. The Homologous superfamilies cluster proteins with highly similar structures, sequences and/or functions (4,5). The new extension of the CATH classification system now includes five 'SOLID' sequence levels. The S, O, L and I levels further divide domains within the H-level using multi-linkage clustering based on similarities in sequence identity (35%, 60%, 95% and 100%) (see Table 1). The D-level acts as a counter within the I-level and is appended to the classification hierarchy to ensure that every domain in CATH has a unique CATHSOLID identification code (see Table 1). Specific details on the nature of the SOLID-levels can be found in the 'General Information' section of the CATH website, http://www.cathdb.info. CATH only includes experimentally determined protein structures with a 4 Å resolution or better, 40 residues in length or longer and having 70% or more side chains resolved.
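The nine-level code lends itself to a small parsing sketch. The version below is an illustration only: it assumes the identifier is written as nine dot-separated integers in the order C, A, T, H, S, O, L, I, D, and the function names are hypothetical rather than part of any CATH software.

```python
# Level names in hierarchy order: the four classic CATH levels followed by
# the five SOLID sequence levels described in the text.
LEVELS = ("C", "A", "T", "H", "S", "O", "L", "I", "D")


def parse_cathsolid(identifier: str) -> dict[str, int]:
    """Split a dot-separated CATHSOLID code into its nine named levels.

    Assumption: every level is an integer and all nine are present.
    """
    parts = identifier.split(".")
    if len(parts) != len(LEVELS):
        raise ValueError(
            f"expected {len(LEVELS)} levels, got {len(parts)}: {identifier!r}"
        )
    return dict(zip(LEVELS, (int(part) for part in parts)))


def superfamily_code(identifier: str) -> str:
    """Return the C.A.T.H prefix, i.e. the homologous superfamily."""
    levels = parse_cathsolid(identifier)
    return ".".join(str(levels[name]) for name in ("C", "A", "T", "H"))
```

Two domains share a superfamily when their first four levels agree, and the trailing D counter is what keeps otherwise identical codes unique.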
Table 1. CATH version 3.0 statistics
DOMAIN BOUNDARY ASSIGNMENTS
We have further improved our automated domain boundary prediction method, CATHEDRAL (4). This is used to search a newly determined multi-domain structure against a library of representative structures from different fold groups in the CATH database to recognise constituent domains. CATHEDRAL performs an initial rapid secondary structure comparison between structures using graph theory to identify putative fold matches, which are then more carefully aligned using a slower, more accurate dynamic programming method. A new scoring scheme has been implemented which combines information on the class of the domains being compared, their sizes, similarity in the structural environments and the number of equivalent residues. A support vector machine is used to combine the different scores and select the best fold match for each putative domain in a new multi-domain protein.
Benchmarking against a set of 964 'difficult' multi-domain chains, whose 1593 constituent domains were remotely related to folds in CATH (<35% sequence identity) and originated from 245 distinct fold groups and 462 superfamilies, showed that 90% of domains within these chains could be assigned to the correct fold group, and for 78% of them the domain boundaries were within ±15 residues of boundaries assigned by careful manual validation. Larger variations in domain boundaries are often due to the fact that in many families significant structural variation can occur during evolution, so that distant relatives vary considerably in size. If no close relative has been classified in CATH, it is likely that the only CATHEDRAL match will be to a relative with significant structural embellishments, thus making it harder to determine the correct boundaries.
Since domain boundary assignment of remote homologues is one of the most time-consuming stages in the classification, we combine multiple sources of information for each new structure on a web page to guide manual curation. Pages display scores from a range of algorithms, which include structure-based methods, CATHEDRAL (4), SSAP (6,7), DETECTIVE (8), PUU (9), DOMAK (10), sequence-based methods such as hidden Markov models (HMMs) (11), and relevant literature. These pages are now viewable for information on putative boundaries for new multi-domain structures currently
being classified in CATH (e.g. http://www.cathdb.info/cgi-bin/cath/Chain.pl?chain_id=2g3aA).
For protein chains which are closely related to chains that are already chopped in CATH, an automated protocol has been developed (ChopClose). ChopClose identifies any previously chopped chains that have sufficiently high sequence identity and overlap with the query chain. Using SSAP (6), the query is aligned against each of these chains in turn, and in each case the domain boundaries are inherited across the alignment. The process of inheriting the boundaries often requires some adjustments to be made to account for insertions, deletions or unresolved residues. If the inheritance from one of the chains meets various criteria (SSAP score ≥80, sequence identity >80%, RMSD ≤6.0 Å, longest end extension ≤10 residues, etc.), then the resulting boundaries are used to chop the chain automatically (AutoChop). For cases where ChopClose's best result does not meet all the criteria for automatic chopping, it is provided as support information for a manual domain boundary assignment.
Refer to Figure 3 for the location of AutoChop/ChopClose within the CATH update protocol.
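The automatic chopping decision can be sketched as a simple threshold check. This is a minimal illustration, not the CATH code: it assumes the cut-offs read as SSAP score ≥80, sequence identity >80%, RMSD ≤6.0 Å and longest end extension ≤10 residues, it leaves out the further criteria the text covers with "etc.", and the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class InheritedChopping:
    """Scores for boundaries inherited from one previously chopped chain."""

    ssap_score: float            # SSAP structural alignment score (0 to 100)
    sequence_identity: float     # percent identity to the chopped chain
    rmsd: float                  # RMSD of the alignment, in angstroms
    longest_end_extension: int   # residues beyond the inherited boundaries


def can_autochop(result: InheritedChopping) -> bool:
    """Apply the four thresholds quoted in the text (others are omitted)."""
    return (
        result.ssap_score >= 80
        and result.sequence_identity > 80
        and result.rmsd <= 6.0
        and result.longest_end_extension <= 10
    )


def route_chain(results: list[InheritedChopping]) -> str:
    """Decide whether a chain is chopped automatically or sent to curators."""
    if any(can_autochop(result) for result in results):
        return "AutoChop"
    # The best ChopClose result is still passed on as supporting evidence.
    return "manual DomChop"
```

Using `any` mirrors the wording that inheritance from one of the chains is enough to trigger automatic chopping.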
NEW HOMOLOGUE RECOGNITION METHODS
We have assessed a number of HMM-based protocols for improving homologue recognition. A new protocol (Samosa) exploiting models built using multiple structure alignments to improve accuracy gives some improvements in sensitivity (4–5%). However, a protocol exploiting an 8-fold expanded HMM library based on sequence relatives of structural domains gives an increase of nearly 10% in sensitivity (12). In addition, HMM–HMM-based approaches have been implemented using the PRC protocol of Madera and co-workers (http://supfam.org/PRC). These allow recognition of extremely remote homologues, some of which are not easily detected by the structure comparison methods (discussed further below). HMM-based database scans developed for the CATH classification protocol are collectively referred to as HMMscan below.
For some very remotely related homologues, confidence in an assignment can be improved by combining information from multiple prediction methods. We have investigated the benefits of using machine learning methods to do this automatically. A neural network was trained using a dataset of 14 000 diverse homologues (<35% sequence identity) and 14 000 non-homologous pairs, with data from different homologue comparison methods including structure comparison (CATHEDRAL, SSAP), sequence comparison (HMM–HMM) and information on functional similarity. The latter was obtained by comparing EC classification codes between close relatives of the distant homologues and using a semantic similarity scoring
scheme for comparing GO terms, based on a method developed by Lord et al. (13). On a separate validation set of 14 000 homologous pairs and 14 000 non-homologous pairs, 97% of the homologues can be recognised at an error rate of <4%.
NEW UPDATE PROTOCOL
Automatic methods
Previously CATH data was generated using a group of independent
programs and flat files Over the past two years we have developed an
update protocol for CATH that is driven by a suite of programs with a
central library and a PostgreSQL database system A classification
pipeline has been established which links in a completely automated
fashion the different programs that analyse the sequences and structures of both protein chains and domains. The CATH update protocol can
essentially be divided into two parts domain boundary assignment and
domain homology classification (see Figure 3) The aim of the protocol is
to minimise manual assignment and provide as much support as
possible when manual validation is necessary
Processing of both parts of the classification protocol is similar, requiring related meta-data and the triggering of the same automated algorithms. Methods include pairwise sequence similarity comparisons and scans by other homologue detection or fold recognition algorithms, such as HMMscan and CATHEDRAL, that provide data for either manual or automated assignment. Many of the automated steps in the protocol have been established as a web service, and the pipeline integrates automated steps together with 'holding stages' in which domains are held prior to processing and await the completion of manual validation of predictions (see below).
Web pages to support manual validation
For each manual stage in the classification (domain boundary assignment, DomChop, and homologue recognition, HomCheck) we have developed a suite of web pages bringing together all available meta-data from prediction algorithms (e.g. DBS, CATHEDRAL, HMMscan for DomChop; CATHEDRAL, HMMscan for HomCheck), information from the literature and information from other family classifications with relevant data (e.g. Pfam). For each protein or domain shown on the pages, information on the statistical significance of matches is presented. The web pages will shortly be made viewable and will provide interim data on protein chains and domains not fully classified in CATH for biologists interested in any entries pending classification.
OVERVIEW OF THE CURRENT RELEASE (VERSION 3.0)
Assigning domain boundaries and relationships between protein structures is computationally challenging. Since the last CATH release (version 2.6), the number of domains in the CATH database has increased by 20% in version 3.0 and now totals 86 151. This is a more than 10-fold increase in the number of domains classified in CATH since its creation. Improvements in automation and also in the web-based resources used to aid manual validation have allowed us to increase the proportion of hard-to-classify structures processed in CATH, and this is reflected in a significant increase in the proportion of new folds in the database, now more than 1000. The detailed breakdown of numbers of domains in the nine CATH levels is given in Table 1. We conducted an analysis of the domains in version 3.0 and have derived statistics for several fundamental features.
Percentage of new topologies
An analysis of the percentage of new folds arising from the early 1970s to the present is shown in Figure 1. The number of new folds has been decreasing over time with respect to the number of new structures being deposited, and it can be seen that currently approximately 2% of new structures classified in CATH are observed to be novel folds. For comparison, the number of domain structures solved over time is also graphically represented in Figure 1.
Number of domains within a protein chain
Integral to the construction of the CATH database is designating domain boundaries. We conducted an analysis of the number of chains versus number of domains in a chain. It is interesting to note that 64% of all protein structures currently solved and classified in CATH are single-domain chains (data not shown). The next most prevalent are two-domain chains (27%), and following this we find that the number of chains containing three or more domains rapidly decreases. The average size of the single-domain chains is 159 residues in length.
The CATH Dictionary of Homologous Superfamilies (CATH-DHS)
The CATH-DHS has also been recently updated. Data on structural similarity and superfamily variability is presented as a significant update to the Dictionary of Homologous Superfamilies (DHS) web resource (14). The DHS also provides functional annotations of domains within each H-level (superfamily) in CATH v2.5.1.
For each superfamily, pair-wise structural similarity scores between relatives measured by SSAP are presented. The DHS now contains 3307 multiple structural alignments for 1459 superfamilies. For each superfamily, multiple alignments are generated for all the relatives and also for subgroups of structurally similar relatives and sequence-similar relatives. Alignments are performed using the residue-based CORA algorithm (15) and presented both as CORAPLOTS (14) and in the form of a 2DSEC diagram (16), alongside co-ordinate data of the superposed structures in PDB format. Sequence representations of the alignments are available to download in FASTA format. In the CORAPLOT images of the multiple alignment, residues in each domain are coloured according to ligand binding and residue type. EquivSEC plots are also shown that describe the variability in orientation and packing between equivalent secondary structures (16).
To identify sequence relatives for CATH superfamilies, sequences from UniProt (17) were scanned against HMMs of all CATH domains (12). Homologous sequences were identified as those hits with an E-value <0.01 and a 60% residue overlap with the CATH domain. This protocol recognised over one million domain sequences in UniProt which could be integrated in the CATH-DHS. The harvested sequences in each superfamily were compared against other relatives by BLAST (18) to determine the pair-wise sequence identity and then clustered at appropriate levels of sequence identity (35% and 95%) using multi-linkage clustering. Information and links to other functional databases, ENZYME (19), GO (Gene Ontology Consortium, 2000), KEGG (20), COG (21) and SWISSPROT (22), are also included by BLASTing the sequences from each superfamily against sequences provided by these resources. Only hits with 95% sequence identity and an 80% residue overlap were used to annotate sequences.
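The hit-selection step can be sketched as a filter over scan results. This is an illustration under assumptions: the thresholds are read as an E-value below 0.01 and an overlap of at least 60%, overlap is taken to mean aligned residues divided by the length of the CATH domain (the text does not define it), and the `HmmHit` record and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class HmmHit:
    """One match between a UniProt sequence and a CATH domain HMM."""

    sequence_id: str
    e_value: float
    aligned_residues: int   # residues of the domain covered by the match
    domain_length: int      # length of the CATH domain behind the HMM


def is_homologue(hit: HmmHit, max_e_value: float = 0.01,
                 min_overlap: float = 0.60) -> bool:
    """Keep hits that are both significant and cover enough of the domain."""
    overlap = hit.aligned_residues / hit.domain_length
    return hit.e_value < max_e_value and overlap >= min_overlap


def select_homologues(hits: list[HmmHit]) -> list[str]:
    """Return the identifiers of sequences accepted as domain relatives."""
    return [hit.sequence_id for hit in hits if is_homologue(hit)]
```

The same function could be reused for the stricter annotation step by passing different thresholds, for example an 80% overlap, although the identity cut-off used there would need a separate check.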
Recent analysis of structural and functional divergence in highly populated CATH superfamilies (>5 structural relatives with <35% sequence identity) has been undertaken using data from the DHS. The 2DSEC algorithm was used to analyse multiple structural alignments of families and identify highly conserved structural cores and secondary structure embellishments or decorations to the common core. In some large superfamilies, extensive embellishments were observed outside the core, and although these secondary structure insertions were frequently discontinuous in the protein chain, they were often co-located in 3D space (16). In many cases, manual inspection revealed that the embellishment had aggregated to form a larger structural feature that was modifying the active site of the domain or creating new surfaces for domain or protein interactions. Data collected in the DHS clearly show a relationship between structural divergence within a superfamily, sequence divergence of this superfamily amongst predicted domains in the genomes, and the number of distinct functional groups that can be identified for the superfamily (see Figure 4).
Figure 4. Relationship between sequence variability, structural variability and functional diversity in CATH superfamilies. Structural variation in a CATH superfamily as measured by the number of diverse structural subgroups (SSAP score <80 between groups)
LATERAL LINKS ACROSS THE CATH HIERARCHY TO CAPTURE EVOLUTIONARY DIVERGENCE AND EXPLORE THE STRUCTURAL CONTINUUM
Our analysis of structural divergence in CATH superfamilies (16) has revealed families where significant changes in the structures had occurred. In some cases 5-fold differences in the sizes of domains were identified, and sometimes it was apparent that the 'folds' of these very diverse relatives had effectively changed. Therefore, in these superfamilies more than one fold group can be identified, effectively breaking the hierarchical nature of the CATH classification, which implies that each relative within a CATH homologous superfamily should belong to the same CAT fold group.
In addition, an 'all versus all' HMM–HMM scan between all superfamily representatives revealed several cases of extremely remote homologues which had been classified into separate superfamilies and yet match with significant E-values. Structure comparison had failed to detect the relationship between these superfamilies because the structural divergence of the relatives was so extreme, sometimes constituting a change in architecture as well as fold group. In these cases homology was only suggested by the HMM-based scans and then manually validated by considering functional information and detailed evidence from the literature. In order to capture information on these distant homologies, links have been created between the superfamilies, both on our web pages and in the CATH database. The data can now be found as a link from the CATH homepage (http://www.cathdb.info).
In the near future we also plan to provide web pages presenting cases of significant structural overlaps between superfamilies or fold groups. For these cases we are not currently able to find any additional evidence to support a distant evolutionary relationship, and these examples highlight the recurrence of large structural motifs between some folds and the existence of a structural continuum in some regions of fold space.
FEATURES
The CATH database can be accessed at http://www.cathdb.info. The web interface may be browsed or, alternatively, searched with PDB codes or CATH domain identifiers. There is also a facility for keyword searches. With the version 3.0 release we now make the raw and processed data files available, which include, for example, CATH domain PDB files, sequences and dssp files, and they can be accessed through the CATH database main page. The Gene3D resource can be accessed through the CATH database or directly at http://www.cathdb.info/Gene3D. The DHS can be accessed through the CATH database or directly at http://www.cathdb.info/bsm/dhs.
Structural similarity is assessed using an automatic method (SSAP) (3,4), which scores 100 for identical proteins and generally returns scores above 80 for homologous proteins. More distantly related folds generally give scores above 70 (Topology or fold level), though in the absence of any sequence or functional similarity this may simply represent examples of convergent evolution, reinforcing the hypothesis that there exists a limited number of folds in nature (5,6).
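As a rough reading aid, the SSAP score bands can be written out as a lookup. This is only a sketch: the text says the thresholds hold "generally", so the labels are hints rather than classifications, and the function name is hypothetical.

```python
def interpret_ssap(score: float) -> str:
    """Map an SSAP score (0 to 100) onto the bands quoted in the text."""
    if not 0 <= score <= 100:
        raise ValueError(f"SSAP scores range from 0 to 100, got {score}")
    if score == 100:
        return "identical structures"
    if score > 80:
        return "likely homologous"
    if score > 70:
        return "similar fold (Topology level)"
    return "no clear structural relationship"
```

A score in the 70 to 80 band on its own does not establish common ancestry; as noted above, sequence or functional similarity is needed to rule out convergence.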
Abstract
We report the latest release (version 1.4) of the CATH protein domains database (http://www.biochem.ucl.ac.uk/bsm/cath). This is a hierarchical classification of 13 359 protein domain structures into evolutionary families and structural groupings. We currently identify 827 homologous families in which the proteins have both structural similarity and sequence and/or functional similarity. These can be further clustered into 593 fold groups and 32 distinct architectures. Using our structural classification and associated data on protein functions stored in the database (EC identifiers, SWISS-PROT keywords and information from the Enzyme database and literature), we have been able to analyse the correlation between the 3D structure and function. More than 96% of folds in the PDB are associated with a single homologous family. However, within the superfolds, three or more different functions are observed. Considering enzyme functions, more than 95% of clearly homologous families exhibit either single or closely related functions, as demonstrated by the EC identifiers of their relatives. Our analysis supports the view that determining structures, for example as part of a 'structural genomics' initiative, will make a major contribution to interpreting genome data.
Introduction FOR CATH
The CATH classification of protein domain structures was established in 1993 (1) as a
hierarchical clustering of protein domain structures into evolutionary families and structural
groupings depending on sequence and structure similarity There are four major levels
corresponding to protein class, architecture, topology or fold, and homologous family (Fig. 1).
Since 1995 information about these structural groups and protein families has been accessible
over the Web (http://www.biochem.ucl.ac.uk/bsm/cath), together with summary information
about each individual protein structure (PDBsum) (2).
CATH consists of both phylogenetic and phenetic descriptors for protein domain relationships
At the lowest levels in the hierarchy proteins are grouped into evolutionary families
(Homologous families) for having either significant sequence similarity (≥35% identity) or high
structural similarity and some sequence similarity (≥20% identity).
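The two routes into a homologous family can be sketched as a single predicate. This is an illustration under assumptions: the identity cut-offs are read as 35% and 20%, and because this passage does not quantify "high structural similarity", the sketch borrows the SSAP score of 80 mentioned elsewhere in these notes as a default. The function name is hypothetical.

```python
def same_homologous_family(sequence_identity: float, ssap_score: float,
                           ssap_threshold: float = 80.0) -> bool:
    """Apply the either/or rule for grouping two domains into one family.

    sequence_identity is a percentage. ssap_threshold is an assumed
    stand-in for "high structural similarity".
    """
    # Route 1: sequence similarity alone is significant.
    if sequence_identity >= 35:
        return True
    # Route 2: weaker sequence similarity backed by high structural similarity.
    return sequence_identity >= 20 and ssap_score >= ssap_threshold
```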
The CATH database of protein structures contains approximately 18 000 domains
organized according to their (C)lass, (A)rchitecture, (T)opology and (H)omologous
superfamily. Relationships between evolutionarily related structures (homologues) within the
database have been used to test the sensitivity of various sequence search methods in order to
identify relatives in GenBank and other sequence databases. Subsequent application of the most
sensitive and efficient algorithms, gapped BLAST and the profile-based method Position Specific
Iterated Basic Local Alignment Tool (PSI-BLAST), could be used to assign structural data to
between 22% and 36% of microbial genomes in order to improve functional annotation and
enhance understanding of biological mechanism. However, on a cautionary note, an analysis of
functional conservation within fold groups and homologous superfamilies in the CATH database
revealed that whilst function was conserved in nearly 55% of enzyme families, function had
diverged considerably in some highly populated families. In these families, functional properties
should be inherited far more cautiously, and the probable effects of substitutions in key functional
residues must be carefully assessed.
Figure 1
Schematic representation of the (C)lass, (A)rchitecture and (T)opology/fold levels in the CATH
database.
Figure 2
Snapshot of a web page showing data available in the CATH dictionary of homologous superfamilies
for the subtilisin family (CATH id 3.40.50.200). Tables display the PDB codes for non-identical relatives in the family, together with EC identifier codes and information about the enzyme reactions.
The multiple structural alignment shown has been coloured according to secondary structure
assignments (red for helix, blue for strands).
The Architecture level in CATH groups proteins whose folds have similar 3D arrangements of secondary structures (e.g. barrel, sandwich or propellor), regardless of their connectivity, whilst the top level, Class, simply reflects the proportion of α-helix or β-strand secondary structures. Three major classes are recognised, mainly-α, mainly-β and α-β, since analysis revealed considerable overlap between the α+β and alternating α/β classes originally described by Levitt and Chothia (7).
Before classification, multidomain proteins are first separated into their constituent folds using a consensus method which seeks agreement between three independent algorithms (8). Whilst the protocol for updating CATH is largely automatic (9), several stages require manual validation, in particular establishing domain boundaries in proteins for which no consensus could be reached and checking the relationships of very distant homologues and proteins having borderline fold similarity.
Although there are plans to assign the more regular architectures automatically, all architecture groupings are currently assigned manually.
A homologous family Dictionary is now available within CATH which contains functional data, where available, for each protein within a homologous family. This includes EC identifiers, SWISS-PROT keywords and information from the Enzyme database or the literature (Fig. 2). Multiple structure-based alignments are also available, coloured according to secondary structure assignments or residue properties, and there are schematic plots showing domain representations annotated by protein-ligand interactions (DOMPLOTS) (A.E. Todd, C.A. Orengo and J.M. Thornton, submitted to Protein Engng). The topology of each domain is illustrated by schematic TOPS diagrams (http://www3.ebi.ac.uk/tops) (10).
Figure 3
CATH wheel plot showing the population of homologous families in different fold groups, architectures and classes. The wheel is coloured according to protein class (red, mainly-α; green, mainly-β; yellow, αβ; blue, few secondary structures). The size of the outer wheel represents the number of homologous families in CATH, whilst each band in the outer wheel corresponds to a single fold family. The size of each fold 'band' therefore reflects the number of homologous families having that fold. It can be seen that most fold families contain a single homologous family. The superfold families are shown as paler bands containing many homologous families. The inner wheel shows the population of homologous families in the different architectures.
We have also recently set up a Web Server (11) which enables the user to scan the CATH database with a newly determined protein structure and identify possible fold similarities or evolutionary relationships. There are also plans to incorporate sequence searches (using BLAST or PSI-BLAST) (12) to identify a probable fold for a new sequence.
The latest release of CATH (version 1.4, April 1998) contains 9342 protein chains from the PDB (13), which divide into 13 359 domain folds. Currently 32 different architectures are recognised. Since the last release, three new architectures have been described, including the five-bladed α-β propellor. Grouping proteins on the basis of sequence, structure and functional similarity gives 827 evolutionary homologous families (H-level), whilst recognising more distant structural similarity with no accompanying sequence or function similarity gives rise to 593 different fold groups (T-level).
The population of the different levels in the CATH hierarchy is illustrated by the CATH wheel shown in Figure 3. It can be seen that several highly populated fold families, which we describe as superfolds (6) as they support a diverse range of sequences and more than three different functions, still account for nearly 30% of non-homologous structures.
Implications for Structural Genomics
As the sequence databases grow rapidly, the need to interpret these sequences and assign functions to specific genes becomes increasingly important. Many techniques exist for matching protein sequences and thereby inheriting functional information. However, for very distant homologues there is often no detectable sequence similarity, despite conservation of 3D structure and function. For these cases, evolutionary relationships, and thereby functions, can only be assigned by comparing the structures. Therefore, a number of structural genomics initiatives are being proposed (14) which aim to identify all the folds in nature, with the ultimate goal of being able to predict the function of a new protein from its known or probable structure. The important questions to ask are: how many more folds do we need to determine before we have the complete set, and how confident can we be in assigning function between proteins having similar structures?
In the current genomes, on average only 30–46% of sequences can be assigned to a structural family by recognising sequence similarity to a protein of known structure (15,16). With only ∼600 unique structures currently in the PDB, compared with ∼20 000 sequence families, it is clear that we still need to determine many more structures if we are to understand biology at the molecular level.
However, analysis of recently deposited structural data is very revealing. Figure 4a illustrates the distribution of 2159 new structural domains classified in the 10 months from June 1997 to March 1998. A large proportion of these (79%) were clearly homologous (≥30% identity) to proteins of known structure.
Of the remaining 443 structures (Fig. 4b), corresponding to 'new' sequences, we found only 8% were novel folds, the remainder resembling a previously determined structure. Many of these, 199 (45%), could be identified as clear homologues by having significant structure and sequence similarity (SSAP ≥80 and ≥20% sequence identity). A further 169 (38%) were probable homologues as, although the sequence identity was below 20%, they had functional similarity and/or gave significant scores using sequence search methods designed to detect very distant homologues (PSI-BLAST) (12). There remained a further 40 (9%) proteins which were analogous, i.e. they had the same fold as a previous entry but neither the sequence nor the function gave definite evidence of a common ancestor.
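The figures quoted above fit together arithmetically, which a few lines of code can confirm. This is only a consistency check on the counts given in the text; the variable names are illustrative, and the number of novel folds is derived here by subtraction rather than quoted from the source.

```python
# Counts quoted in the text for domains classified June 1997 to March 1998.
total_new_domains = 2159
new_sequences = 443          # not clearly homologous by sequence (Fig. 4b)
clear_homologues = 199
probable_homologues = 169
analogues = 40

def pct(part, whole):
    """Percentage rounded to the nearest whole number."""
    return round(100 * part / whole)

# Derived values (not stated as counts in the text).
clearly_homologous = total_new_domains - new_sequences
novel_folds = new_sequences - clear_homologues - probable_homologues - analogues

print(pct(clearly_homologous, total_new_domains))                    # share with >=30% identity
print(pct(clear_homologues, new_sequences))                          # clear homologues
print(pct(probable_homologues, new_sequences))                       # probable homologues
print(pct(analogues, new_sequences))                                 # analogues
print(pct(novel_folds, new_sequences))                               # novel folds
print(pct(clear_homologues + probable_homologues, new_sequences))    # all homologues
```

The last value is the proportion of structures with novel sequences that can still be assigned as homologues once structure is taken into account.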
At the homologous superfamily level in CATH, a more detailed analysis of enzyme functions showed that the majority of homologous enzyme families in CATH (>90%) contained proteins for which the first three EC identifiers were the same. Considering those families where homologues have significant sequence identity (≥20%) after structural alignment, 95% were found to have a single EC identifier, whilst for families where proteins have more than 30% sequence similarity we observed that 98% had a single EC code.
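The comparison of "the first three EC identifiers" can be sketched as a prefix test on dotted EC numbers. This is a minimal illustration, not the analysis pipeline used by the authors; the function names are made up, and the EC numbers in the example are chosen only to show one family that agrees at three levels and one that does not.

```python
def ec_prefix(ec_number, levels=3):
    """Return the first `levels` fields of an EC number such as '3.4.21.62'."""
    return tuple(ec_number.split(".")[:levels])

def family_shares_ec(ec_numbers, levels=3):
    """True if every enzyme in the family agrees on the first `levels` EC fields."""
    return len({ec_prefix(ec, levels) for ec in ec_numbers}) == 1

# Two serine endopeptidases differing only in the fourth field.
print(family_shares_ec(["3.4.21.62", "3.4.21.64"]))             # agree to three levels
print(family_shares_ec(["3.4.21.62", "3.4.21.64"], levels=4))   # differ at the fourth
# A peptidase and a glycosidase diverge already at the second field.
print(family_shares_ec(["3.4.21.62", "3.2.1.17"]))
```

Lowering `levels` relaxes the test, which is one way to express how far functional annotation can safely be inherited within a family.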
Although assigning function on the basis of homology is common practice, it is clear that some caution should be exercised, particularly where there is little or no sequence similarity. There are also some clear examples where homologues with significant sequence similarity perform different functions. The role of 'gene recruitment' is especially clear in the eye lens proteins, which function as enzymes in other cellular environments but which are used as structural proteins in this context (17). The extent of such 'gene recruitment' and context-sensitive function is really not known at this time. For enzymes, it is clear that catalytic function can change and evolve, usually to act on a different but related substrate. Similarly, within the lipocalin family (CATH id 2.40.130.10), several proteins are found with very similar structures which bind different fatty acids in the same region at the base of the β-barrel (e.g. retinol, bilin, biotin).
Nearly half of the homologous families where two or more different EC numbers were observed belong to the superfolds. This suggests that if a new protein is assigned to a superfold family, more caution should be used when inheriting functional information, as there appears to be greater tolerance to changes in sequence, and ultimately function, for these families. However, it is interesting to note that many of these were TIM barrel or Rossmann folds. These are superfolds in which the substrate or ligand commonly binds in the same place: in the base of the β-barrel for the TIMs and at the crossover of the polypeptide chain for the doubly wound Rossmann structures.
Assignment of Function Through Structure
One of the reasons for determining structures is to derive more information to facilitate the assignment of function. From our analysis of proteins in CATH, we suggest that structural data can help to assign function in several ways:
i. The structural data allow recognition of more distant homologues compared with sequence data; in our analysis, 83% of structures with novel sequences could be assigned as homologues in this way (note that such assignment of function is again subject to the caveats imposed by 'gene recruitment' discussed above).
ii. The structural data allow detailed inspection of the functional site, to suggest if and how the function may have evolved. For example, if an enzyme has evolved to act on a different substrate, the binding site may reveal, or at least suggest, possible changes in the substrate.
iii. For the superfolds, similarity of structure does not necessarily mean similarity of function. However, the active site/binding sites are often conserved, e.g. in the TIM barrel or Rossmann fold structures the ligand always binds at the same end of the barrel or sheet.
iv. Some methods have already been developed, and will increasingly be the focus of attention over the next few years, which aim to predict function ab initio from structure. For example, enzymes can often be identified by the presence of a major cleft, which also locates the active site (18). Similarly, critical surface patches which are used for molecular recognition in binding other proteins or ligands may be identified using knowledge-based approaches (19,20).
In summary, extrapolating the data from Figure 4 to a new genome, we can expect that of the 54–70% of sequences which currently have no obvious sequence matches in the PDB, we will find nearly 80–90% to be homologous to a known family using the structural data alone. For the singlet folds this will almost certainly reveal some clues to the function. For the superfolds, some folds will reveal information on the functional class (e.g. enzyme for TIM barrels) or the location of the active site, if not the specific function. Only 10–20% will be expected to be 'novel' folds. For these, the ab initio methods referred to above may provide some clues to guide experiments. Therefore it is clear that determining structures, as part of a 'structural genomics' initiative for example, will make a major contribution to interpreting genome data.
JAIL
Just Another Interface Library
Interfaces of macromolecules are a valuable basis to analyse the process of molecular recognition. JAIL classifies not only the interfaces between domain architectures but also those between protein chains and those between proteins and nucleic acids.
Figure: Gt-alpha/Gi-alpha chimera (PDB-ID 1GOT) and the interfaces of 1GOT.
Interacting proteins are difficult to crystallize and are rarely present in the Protein Data Bank. Nevertheless, it is essential to analyse the interacting parts of the proteins to understand the process of protein-protein docking.
To overcome this problem, we have built up the JAIL database. Since interacting domains exhibit structural features similar to those of proteins, all known interfaces between interacting domains of the SCOP database were extracted and classified in JAIL.
Only a part of all protein structures is included in SCOP. In particular, new PDB entries are not yet annotated. To overcome this problem, all interfaces between protein chains were additionally calculated and included in the database. This type of interface also comprises the interacting parts of the assumed biological units. The last important type of interface provided here is composed of the interacting parts between proteins and nucleic acids. Overall, the data set consists of about 180000 interfaces.
JAIL is a convenient tool to browse through the interface library and to analyze single interfaces. However, more general questions require large-scale analysis. For this purpose, a detailed form enables the compiling of comprehensive non-redundant data sets for download.
How is an interface defined?
A complete residue is part of an interface if at least one atom of the amino acid is located within 4.5 Ångström of any atom of the interacting domain or chain. One part of an interface must consist of at least 5 C-alpha atoms in the case of protein chains. In the case of nucleic acids, the relevant atoms are calculated on the basis of the phosphorus atoms of the RNA/DNA backbone.
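The distance criterion above can be sketched in a few lines. This is not JAIL's own code: it assumes the cutoff is 4.5 Ångström, represents each atom as a hypothetical (residue id, coordinates) pair, and uses a brute-force all-against-all comparison that is only suitable for small examples. The minimum-size rule (5 C-alpha atoms per side) and the nucleic acid case are left out.

```python
import math

CUTOFF = 4.5  # Ångström; assumed reading of the cutoff quoted in the text

def interface_residues(chain_a, chain_b, cutoff=CUTOFF):
    """Residues of chain_a with at least one atom within `cutoff` of any atom of chain_b.

    Each chain is a list of (residue_id, (x, y, z)) atom records.
    """
    residues = set()
    for res_id, xyz in chain_a:
        if res_id in residues:
            continue  # the whole residue is already counted
        if any(math.dist(xyz, other) <= cutoff for _, other in chain_b):
            residues.add(res_id)
    return residues

# Toy coordinates, invented for illustration only.
chain_a = [(1, (0.0, 0.0, 0.0)), (1, (1.0, 0.0, 0.0)), (2, (10.0, 0.0, 0.0))]
chain_b = [(101, (0.0, 0.0, 4.0)), (102, (30.0, 0.0, 0.0))]

print(interface_residues(chain_a, chain_b))  # residues of A contacting B
print(interface_residues(chain_b, chain_a))  # residues of B contacting A
```

An interface has two sides, so the function is called once per direction. For real structures a spatial index (for example a grid or k-d tree) would replace the nested loop.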
What are biounits?
The primary coordinate file deposited in the PDB generally contains one asymmetric unit. The asymmetric unit is the smallest portion of a crystal structure to which crystallographic symmetry can be applied to generate one unit cell. The biological molecule (biounit) is believed to be the functional unit of the protein. Frequently, those units can be assumed or calculated when additional information is available. The biological units of many proteins are deposited in a separate section of the PDB database and can be used for interface calculations.
In which way are redundant interfaces excluded in the download section?
The redundancy is excluded in two different ways: by structure and by sequence. The sequence clustering is based on the Cd-hit program. The structural clustering is defined by the protein families and superfamilies of the SCOP classification. The database classifies proteins by domain architecture.
Which settings in the download section are best for my own research?
The selection of the datasets depends on the type of interactions (protein-protein or protein-nucleic acid) and the level of diversity that is desired. A maximum sequence identity of 50% results in a higher diversity than a setting of 95%. The default settings include interfaces of domain-domain interactions as well as interfaces between interacting chains. All interfaces of chains that were already covered by the SCOP domain interfaces are excluded by default. This procedure results in a high number of interfaces that are still diverse enough for statistical analysis.
What is meant by 'show conservation' in Jmol?
The conservation of protein sequences is defined by the mutation rates at each amino acid position. For JAIL this information was retrieved from ConSurf. ConSurf is a derived database merging structural and sequence information.
Database scheme
Search for:
PDB-Id, e.g. 1aay
SCOP-Id, e.g. d1az0a_
EC-Number, e.g. 311
Accession number, e.g. P03697
Protein name, e.g. capsid protein
Search in the following interface types: Domain/Domain (SCOP), Chain/Chain, Protein/Nucleic, Biounit/Biounit, None
Full-text search and SCOP text search: by keyword
Consider only interfaces having the following location: intra, inter, don't care
MMDB
Experimentally resolved structures of proteins, RNA and DNA, derived from the Protein Data Bank (PDB), with value-added features such as explicit chemical graphs, computationally identified 3D domains (compact substructures) that are used to identify similar 3D structures, as well as links to literature, similar sequences, information about chemicals bound to the structures, and more. These connections make it possible, for example, to find 3D structures for homologs of a protein sequence of interest, then interactively view the sequence-structure relationships, active sites, bound chemicals, journal articles, and more.
Three-dimensional structures are now known within many protein families and it is quite likely, in searching a sequence database, that one will encounter a homolog with known structure. The goal of Entrez's 3D-structure database is to make this information, and the functional annotation it can provide, easily accessible to molecular biologists. To this end, Entrez's search engine provides three powerful features: (i) Sequence and structure neighbors: one may select all sequences similar to one of interest, for example, and link to any known 3D structures. (ii) Links between databases: one may search by term matching in MEDLINE, for example, and link to 3D structures reported in these articles. (iii) Sequence and structure visualization: identifying a homolog with known structure, one may view molecular-graphic and alignment displays to infer approximate 3D structure. In this article we focus on two features of Entrez's Molecular Modeling Database (MMDB) not described previously: links from individual biopolymer chains within 3D structures to a systematic taxonomy of organisms represented in molecular databases, and links from individual chains (and compact 3D domains within them) to structure neighbors, other chains (and 3D domains) with similar 3D structure. MMDB may be accessed at http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Structure
SUPERFAMILY is a database of structural and functional annotation for all proteins and genomes.[1][2][3][4][5] The SUPERFAMILY annotation is based on a collection of hidden Markov models, which represent structural protein domains at the SCOP superfamily level.[6] A superfamily groups together domains which have an evolutionary relationship. The annotation is produced by scanning protein sequences from completely sequenced genomes against the hidden Markov models.
For each protein you can:
Submit sequences for SCOP classification
View domain organisation, sequence alignments and protein sequence details
For each genome you can:
Examine superfamily assignments, phylogenetic trees, domain organisation lists and networks
Check for over- and under-represented superfamilies within a genome
For each superfamily you can:
Inspect SCOP classification, functional annotation, Gene Ontology annotation, InterPro abstract and genome assignments
Explore taxonomic distribution of a superfamily across the tree of life
All annotation, models and the database dump are freely available for download to everyone.
Purpose
SUPERFAMILY classifies amino acid sequences into known structural domains, especially into SCOP superfamilies. The superfamilies are groups of proteins which have structural evidence to support a common evolutionary ancestor but may not have detectable sequence homology.
Major Features
Sequence search: Submit your protein or DNA sequence for SCOP superfamily and family level classification.
Keyword search: Search for superfamily, family or species names, plus sequence, SCOP, PDB or hidden Markov model IDs.
Domain assignments: Domain assignments, alignments and architectures for completely sequenced eukaryotic and prokaryotic organisms, plus sequence collections.
Comparative genomics tools: Browse unusual (over- and under-represented) superfamilies and families, adjacent domain pair lists and graphs, unique domain pairs, domain combinations, domain architecture co-occurrence networks and domain distribution across taxonomic kingdoms for each organism.
Genome statistics: For each genome: number of sequences, number of sequences with assignment, percentage of sequences with assignment, percentage total sequence coverage, number of domains assigned, number of superfamilies assigned, number of families assigned, average superfamily size, percentage produced by duplication, average sequence length, average length matched, number of domain pairs and number of unique domain architectures.
Gene Ontology: Domain-centric Gene Ontology (GO), automatically annotated, by Hai Fang.
Phenotype Ontology: Domain-centric phenotype/anatomy ontology, including Disease Ontology, Human Phenotype, Mouse Phenotype, Worm Phenotype, Yeast Phenotype, Fly Phenotype, Fly Anatomy, Zebrafish Anatomy, Xenopus Anatomy, Arabidopsis Plant.
Superfamily annotation: InterPro abstracts for 1052 superfamilies and Gene Ontology (GO) annotation for 763 superfamilies.
Functional annotation: Functional annotation of SCOP 1.73 superfamilies, by Christine Vogel.
Phylogenetic trees: Trees are generated using heuristic parsimony methods and are based on protein domain architecture data for all genomes in SUPERFAMILY. Genome combinations or specific clades can be displayed as individual trees.
Similar domain architectures: Find the 10 domain architectures which are most similar to a domain architecture of interest.
Hidden Markov models: Produce SCOP domain assignments for your sequences using the SUPERFAMILY models. HMM visualisation by Martin Madera, e.g. model 0045110.
Profile comparison: Find remote domain matches when the HMM search fails to find a significant match. Profile comparison (PRC), for aligning and scoring two profile hidden Markov models, by Martin Madera.
Web services: Distributed Annotation Server and linking to SUPERFAMILY.
Downloads: Sequences, assignments, models, MySQL database and scripts, updated weekly.
The SUPERFAMILY database provides protein domain assignments, at the SCOP 'superfamily' level, for the predicted protein sequences in over 400 completed genomes. A superfamily groups together domains of different families which have a common evolutionary ancestor based on structural, functional and sequence data. SUPERFAMILY domain assignments are generated using an expert curated set of profile hidden Markov models. All models and structural assignments are available for browsing and download from http://supfam.org. The web interface includes services such as domain architectures and alignment details for all protein assignments, searchable domain combinations, domain occurrence network visualization, detection of over- or under-represented superfamilies for a given genome by comparison with other genomes, assignment of manually submitted sequences and keyword searches. In this update we describe the SUPERFAMILY database and outline two major developments: (i) incorporation of family level assignments and (ii) a superfamily-level functional annotation. The SUPERFAMILY database can be used for general protein evolution and superfamily-specific studies, genomic annotation, and structural genomics target suggestion and assessment.
sequence is reported; hence SMART relies on re-iterating the search for new repeats until none are reported that score above threshold. Score thresholds have been set manually that are considered to represent a score just above the top-scoring true negative. Additional thresholds have been estimated for domains that are repeated in single polypeptides. A more recent package allows comparison of DNA sequences at the level of their conceptual translations, regardless of sequence error and introns (see Wise2).
What's new
Changes from version 6.0 to 7.0
Full text search engine
Perform a full text search of SMART and Pfam domain annotations, plus the complete protein descriptions for Uniprot and Ensembl proteins.
metaSMART
Explore and compare domain architectures in various publicly available metagenomics datasets.
iTOL export and visualization
Domain architecture analysis results can be exported and visualized in interactive Tree Of Life. The new option can be found in the protein list function select list.
User interface cleanup
Various small changes to the UI, resulting in faster and easier navigation.
Changes from version 5.1 to 6.0
Metabolic pathways information
SMART domains and proteins available in the genomic mode now have basic metabolic pathways information. It is generated by mapping our genomic mode protein database to the KEGG orthologous groups.
Updated genomic mode protein database
The greatly expanded genomic mode database now includes sequences from 630 completely sequenced species.
Changes from version 5.0 to 5.1
SMART webservice
You can access SMART using our webservice. Check the WSDL file for details. The webservice is still under active development and only works for sequences and IDs which are in our database (complete Uniref100 and stable Ensembl genomes).
SMART DAS server
The SMART DAS server is available at URL http://smart.embl.de/smart/das. It provides all the protein features from our database (SMART domains, signal peptides, transmembrane regions and coiled coils) for all Uniref100 and Ensembl proteins.
If you need help with these services or have questions/feedback, please contact us.
Changes from version 4.1 to 5.0
New protein database in Normal mode
SMART now uses Uniprot as the main source of protein sequences. All Ensembl proteomes (except pre-releases) are also included. To lower redundancy in the database, the following procedure is used:
o only one copy of 100% identical proteins is kept (different IDs are still available)
o each species' proteins are separated
o CD-HIT clustering with 96% identity cutoff is performed on each species separately
o longest member of each protein cluster is used as the representative
o only representative cluster members and single proteins (i.e. proteins which are not members of any clusters) are used in all domain architecture queries and for domain counts in the annotation pages
Even though the number of proteins in the database is almost doubled (current version has around 2.9 million proteins), the redundancy should be minimal.
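The redundancy-reduction procedure listed above can be sketched as two small steps around an external clustering tool. This is an illustration under stated assumptions, not SMART's code: the clustering itself (CD-HIT in the text) is not run here, and its result is passed in as plain lists of IDs for one species. All names and sequences are invented.

```python
def collapse_identical(seqs):
    """Keep one ID per distinct sequence; remember every ID that shares it.

    seqs: dict mapping protein ID -> sequence (one species).
    Returns (kept, aliases): kept maps the retained ID to its sequence,
    aliases maps the retained ID to all IDs with that sequence.
    """
    ids_by_seq = {}
    for pid, seq in seqs.items():
        ids_by_seq.setdefault(seq, []).append(pid)
    kept = {ids[0]: seq for seq, ids in ids_by_seq.items()}
    aliases = {ids[0]: ids for ids in ids_by_seq.values()}
    return kept, aliases

def representatives(seqs, clusters):
    """Longest member of each cluster, plus proteins that are in no cluster.

    clusters: list of lists of IDs, assumed to come from an external
    clustering run (for example at a 96% identity cutoff).
    """
    reps = []
    clustered = set()
    for members in clusters:
        clustered.update(members)
        reps.append(max(members, key=lambda pid: len(seqs[pid])))
    reps.extend(pid for pid in seqs if pid not in clustered)
    return reps

# Toy input for a single species.
seqs = {"A": "MKT", "B": "MKT", "C": "MKTAYIAK", "D": "MKTAYIA", "E": "GGG"}
kept, aliases = collapse_identical(seqs)
print(aliases["A"])                          # IDs sharing the first sequence
print(representatives(kept, [["C", "D"]]))   # cluster representative and singletons
```

Keeping the alias table means that queries by any of the original IDs can still be resolved to the retained entry, which matches the remark that different IDs remain available.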
Domain architecture invention dating
As a further step from the single domain to the understanding of multi-domain proteins, SMART now predicts the taxonomic class where the concept of a protein, that is its domain architecture, was invented. The domain architecture is defined as the linear order of all SMART domains in the protein sequence. To derive the point of its invention, all proteins with the same domain architecture are mapped onto NCBI's taxonomy. The last common ancestor of all organisms containing at least one protein with the domain architecture is defined as the point of its origin.
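The "last common ancestor" step described above can be sketched as finding the longest shared prefix of root-to-leaf lineages. This is a minimal illustration, not SMART's implementation; the lineages below are heavily abbreviated and serve only as example input, whereas a real analysis would take full lineages from the NCBI taxonomy.

```python
def last_common_ancestor(lineages):
    """Deepest taxon shared by all lineages.

    lineages: list of root-to-leaf lists of taxon names, one per organism
    that contains a protein with the architecture of interest.
    Returns None if the lineages share no taxon.
    """
    lca = None
    for ranks in zip(*lineages):      # walk down from the root, level by level
        if len(set(ranks)) != 1:      # lineages diverge at this level
            break
        lca = ranks[0]
    return lca

# Abbreviated example lineages (illustrative only).
human = ["cellular organisms", "Eukaryota", "Opisthokonta", "Metazoa", "Chordata", "Homo sapiens"]
fly = ["cellular organisms", "Eukaryota", "Opisthokonta", "Metazoa", "Arthropoda", "Drosophila melanogaster"]
yeast = ["cellular organisms", "Eukaryota", "Opisthokonta", "Fungi", "Saccharomyces cerevisiae"]

print(last_common_ancestor([human, fly]))          # architecture seen only in animals
print(last_common_ancestor([human, fly, yeast]))   # architecture also seen in a fungus
```

Adding a single organism from a more distant group moves the inferred point of origin towards the root, so the result is sensitive to sampling and to any mis-annotated sequence.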
Changes from version 4.0 to 4.1
Two modes of operation: Normal and Genomic
For more details, visit the change mode page.
Intrinsic protein disorder prediction
You can now include DisEMBL prediction of protein disorder in your search parameters. The results are displayed as blue regions in protein schematics. DisEMBL's HOTLOOPS and REM465 methods are currently used. Visit the DisEMBL page for more information on the method.
Catalytic activity check
SMART now includes data on specific requirements for catalytic activity for some of our domains (50 catalytic domains at the moment). If the required amino acids are not present in the predicted domain, it will be marked as 'Inactive'. The domain annotation page will show you the details on which amino acids are missing and the links to relevant literature.
Taxonomic trees
Architecture query results are now displayed as simple taxonomic trees. In addition to individual proteins, you can select any taxonomic node (or multiple nodes) and display all the proteins in those nodes. The 'Evolution' section of domain annotation pages is now also represented as a taxonomic tree. The basic tree contains only several hand-picked representative species, with a link to the full tree.
User interface redesigned
SMART has been completely rewritten and all pages conform to the XHTML 1.0 Strict and CSS Level 2 standards. We recommend a modern standards-compliant browser for the best experience. Mozilla Firefox and Opera are our favorites.
Changes from version 3.5 to 4.0
Intron positions are shown in protein schematics
For proteins that match any of the Ensembl predictions, SMART will show intron positions as vertical coloured lines in graphical representations. This information is retrieved from a pre-calculated mapping of Ensembl gene structures to protein sequences. The vertical line at the end of the protein is not an actual intron, but a mark to show that intron mapping was performed. If that is the only line, there are no introns annotated. If there is no line at all, there is no data available in Ensembl for that particular sequence. You can switch off intron display on your SMART preferences page.
Alternative splicing information
Since SMART now incorporates Ensembl genomes, the 'Additional information' page shows a list of alternative splices of the gene encoding the analyzed protein (if there are any). It is possible to either display SMART protein annotation for any of the alternative splices or get a graphical multiple sequence alignment of all of them.
Orthology information
There are 2 separate sets of orthologs for each Ensembl protein: 1:1 reciprocal best matches in other genomes, and orthologous groups with reciprocal best hits from all genomes analyzed (i.e. each of these proteins has exactly one ortholog in all 6 genomes). This data is displayed on the 'Additional information' page.
Graphical multiple sequence alignments
Orthologous groups and different alternative splices can be displayed as graphical multiple sequence alignments. Proteins are aligned using ClustalW. Domains, intrinsic features and introns are mapped onto the alignment, with their positions adjusted according to gaps (black boxes).
Changes from version 3.4 to 3.5
Features of all Ensembl genomes are stored in SMART
You can use standard Ensembl protein identifiers (for example, ENSP00000264122) and sequences in all queries.
Batch access
Quickly display results for hundreds of sequences. Currently limited to 500 sequences per access, 2000 per day.
Smarter handling of identical sequences
If there are multiple IDs associated with a particular sequence, an extra table will be displayed showing all of them.
Changes from version 3.3 to 3.4
Search structure based profiles using RPS-Blast
Clicking on the 'search schnipsel and structures' checkbox will now also initiate a search of profiles based on SCOP domain families using RPS-Blast. These profiles were kindly provided by Steffen Schmidt (see Schmidt et al., J. Chem. Inf. Comput. Sci. 2002 (42) 405-7).
Improved Architecture analysis
In addition to standard 'Domain selection' querying, it is now possible to do queries based on GO (Gene Ontology) terms associated with domains. In the first step, you get a list of domains matching the GO terms entered. After selecting the domains of interest from the list, proteins containing those domains are displayed. Use the 'Taxonomic selection' box to limit the results based on taxonomic ranges.
Pfam domains are stored in the database
The SMART database now contains precomputed results for all Pfam domains. To use Pfam domains in the architecture queries, prepend the domain name with 'Pfam:' (for example, TyrKc AND Pfam:Fz AND TRANS).
Try our SMART Toolbar for the Mozilla web browser.
Changes from version 3.2 to 3.3
Fantastic new protein picture generator
Proteins are now displayed as dynamically generated PNG images. This means that you can download the entire protein representation as a single image. You have our permission to do this and use these diagrams in any way that you like; do acknowledge us though. All domain bubbles have been script-generated using The Gimp and its Perl-Fu extension. The script is GPL'd and is available for download.
Fancy colour alignments
All alignments we generate are coloured in using Leo Goodstadt's excellent CHROMA program. CHROMA is freely available and it's great. You may experience some problems if you're using a klunky browser. This will be fixed when you change your browser.
Excellent transmembrane domains
These are now calculated using the fine TMHMM2 program, kindly provided by Anders Krogh and co-workers.
Selective SMART
We now store intrinsic features such as transmembrane domains and signal peptides in our database. This means that these can be queried for (e.g. SIGNAL AND TyrKc). The feature names are: signal peptide, SIGNAL; transmembrane domain, TRANS; coiled coil, COIL.
Technical changes
SMART code has been modified to run under the Apache mod_perl module. SMART is now using Apache::DBI for persistent database connections. The database engine has been updated to PostgreSQL 7.1. These changes resulted in significant speed improvements.
Bug fixes, as usual
The show_many_proteins script now uses the POST method, so there is no longer a limit on the number of proteins you can display. Fixed a display problem with proteins having thousands of different representations (if you try to display those, make sure you have a good browser, like Mozilla or Opera, and a bunch of RAM).
Changes from version 3.1 to 3.2
Start page
There is now an indicator of the current database status in the header. If the database is down or is being updated, you'll know immediately.
Literature
Changed literature identifiers to PMID. New secondary literature generator that parses all neighbouring papers, not just the first 100.
Numerous small bug fixes and improvements
Changes from version 3.0 to 3.1
Startup page
The start page now includes selective SMART and allows searching for keywords in the annotation of domains.
Schnipsel Blast
The results of a schnipsel Blast search are included in the bubble diagram. Additionally, PDB is searched.
Taxonomic breakdown
When selecting multiple proteins, e.g. via selective SMART or the annotation pages, an overview of the taxonomy of all species (tax break) is offered.
Links
Links don't contain version numbers. This allows stable links from external sources.
Selective SMART
Allows searching for multiple copies of domains and is case insensitive. You can now search for e.g. Sh3 AND sH3 AND sh3.
Annotation
You can align your query sequence to the SMART alignment using hmmalign.
Update of underlying database
SMART now uses PostgreSQL 6.5.2.
Changes from version 2.0 to 3.0
Digest output
SMART now only produces a single diagram representing a best interpretation of all the annotation that has been performed. A comprehensive summary of the results is also provided in table format.
Selective SMART
Selective SMART allows searching for proteins with combinations of specific domains in different species or taxonomic ranges.
Alert SMART
The SMART database gets updated about once a week. If you are interested in specific domains, or combinations of them, in specific taxonomic ranges, you can use the SMART alerting service. This provides the identities of newly deposited proteins that match your query.
Domain queries
You can ask for proteins having the same domain order/composition as your query protein.
SMARTed genomes
You can get the result of SMARTing the genomes of Caenorhabditis elegans and Saccharomyces cerevisiae via the annotation pages.
Faster PFAM searches
The PFAM searches now run on a PVM cluster.
Changes from version 1.03 to 2.0
Domain coverage
The original set of signalling domains has now been extended to include extracellular domains.
Default search method is now HMMer
The previously used underlying methodology of SMART (i.e. SWise) tended to over-extend gaps, leading to problems in defining domain borders. Although a SWise search of the SMART database remains possible (using the "Wise searching SMART database" option available from the Home Page), the default searching method is now HMMer2. This allows improved statistical estimates (Expectation- or E-values) of the significance of a domain hit.
Complete rewrite of the annotation pages
As with the previous version (1.03), annotation pages provide information concerning domain functions and include hyperlinks to PubMed. However, we now offer automatically derived data showing the taxonomic range and the predicted cellular localisation of proteins containing the domain in question.
Literature database
In addition to the manually derived literature sources given in version 1.03, we now provide automatically derived secondary literature from the annotation pages. These are extracted from PubMed and are cross-linked to additional abstracts.
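The selective SMART queries described in the changelog above (terms joined by AND, case-insensitive, with a repeated term asking for multiple copies of a domain) can be illustrated with a small sketch. This is not SMART's own code: the function names and the representation of a protein as a plain list of feature names are assumptions made for illustration only.

```python
import re
from collections import Counter


def parse_and_query(query):
    """Split an 'A AND B AND C' query into a case-insensitive multiset of terms.

    Hypothetical helper: only the AND operator from the notes is handled.
    """
    terms = re.split(r"\s+AND\s+", query.strip())
    return Counter(term.lower() for term in terms if term)


def matches(query, protein_features):
    """Return True if the protein carries every queried term often enough.

    Repeating a term (e.g. 'Sh3 AND sH3 AND sh3') asks for that many copies.
    """
    wanted = parse_and_query(query)
    present = Counter(feature.lower() for feature in protein_features)
    return all(present[term] >= count for term, count in wanted.items())


if __name__ == "__main__":
    protein = ["SIGNAL", "SH3", "SH3", "SH3", "TyrKc"]
    print(matches("Sh3 AND sH3 AND sh3", protein))
    print(matches("SIGNAL AND TyrKc", protein))
```

Counting terms in a multiset is what lets one query express both "has this domain" and "has at least three copies of this domain" without extra syntax.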
The CATH domain structure database: new protocols and classification levels give a more comprehensive resource for exploring evolution
Lesley H. Greene, Tony E. Lewis, Sarah Addou, Alison Cuff, Tim Dallman, Mark Dibley, Oliver Redfern, Frances Pearl, Rekha Nambudiry, Adam Reid, Ian Sillitoe, Corin Yeats, Janet M. Thornton and Christine A. Orengo
WHAT'S NEW
We report the latest release (version 3.0) of the CATH protein domain database (http://www.cathdb.info). There has been a 20% increase in the number of structural domains classified in CATH, up to 86 151 domains. Release 3.0 comprises 1110 fold groups and 2147 homologous superfamilies. To cope with the increases in diverse structural homologues being determined by the structural genomics initiatives, more sensitive methods have been developed for identifying boundaries in multi-domain proteins and for recognising homologues. The CATH classification update is now being driven by an integrated pipeline that links these automated procedures with validation steps that have been made easier by the provision of information-rich web pages summarising comparison scores and relevant links to external sites for each domain being classified. An analysis of the population of domains in the CATH hierarchy and several domain characteristics are presented for version 3.0. We also report an update of the CATH Dictionary of homologous structures (CATH-DHS), which now contains multiple structural alignments, consensus information and functional annotations for 1459 well-populated superfamilies in CATH. CATH is directly linked to the Gene3D database, which is a projection of CATH structural data onto ∼2 million sequences in completed genomes and UniProt.
GENERAL INTRODUCTION
The number of new structures being deposited in the Protein Data Bank (PDB) continues to grow at a considerable rate. In addition, structures being targeted by worldwide structural genomics initiatives are more likely to be novel or only very remotely related to domains previously classified. Only 2% of structures currently solved by conventional crystallography or NMR are likely to adopt novel folds (see Figures 1 and 2). A higher proportion of new folds is expected among structural genomics structures. Although the influx of more diverse structures and subsequent analysis will inform our understanding of how domains evolve, it has resulted in increasing lags between the numbers of structures being deposited and classified. In response to this situation, we have significantly improved our automated and manual protocols for domain boundary assignment and homologue recognition.
Figure 1. Annual decrease in the percentage of new structures classified in CATH which are observed to possess a novel fold. The raw data for years 1972–2005 was fit to a single exponential equation by nonlinear regression using Sigma Plot (SPSS Version
Figure 2. Annual proportion of protein structures deposited in the PDB which are classified in CATH, rejected or pending classification. The colour scheme reflects different categories of PDB chains. Black: not accepted by the CATH criteria. Red: unprocessed chains
Significant changes have been implemented in the CATH classification protocol to achieve a more highly automated system. A seamless flow of structures between the constituent programs has been achieved by building a pipeline which integrates web services for each major comparison stage in the classification (see Figure 3). Secondly, completely automatic decisions are now being made for new protein chains with close relatives already assigned in the CATH database. There are two situations that preclude the CATH update process from being fully automated. We rely on expert manual curation for particularly challenging protein domain boundary assignments (DomChop stage) and also for classifications of remote folds and homologues (HomCheck stage). These two manual stages will remain an integral part of the system (Figure 3).
Figure 3. Flow diagram of the CATH classification pipeline. This schematic illustrates the processes involved in classifying newly determined structures in CATH. The CATH update protocol workflow, from new chain to assigned domain, is split into two main processes
In this paper we report our ongoing development of the automated procedures. These critical new features should better enable CATH to keep pace with the PDB (3) and facilitate its development. Key statistics on the domain structure populations and characteristics are also presented.
A REVISED CATH CLASSIFICATION HIERARCHY: CATHSOLID
In order to provide information on the sequence diversity between superfamily members, we have introduced additional levels into the CATH hierarchy. The CATH hierarchical classification scheme now consists of nine levels. Class is derived from secondary structure content, and Architecture describes the gross orientation of secondary structures, independent of connectivity. The Topology level clusters structures into fold groups according to their topological connections and numbers of secondary structures. The Homologous superfamilies cluster proteins with highly similar structures, sequences and/or functions (4,5). The new extension of the CATH classification system now includes five 'SOLID' sequence levels. The S, O, L and I levels further divide domains within the H-level using multi-linkage clustering based on similarities in sequence identity (35, 60, 95 and 100%) (see Table 1). The D-level acts as a counter within the I-level and is appended to the classification hierarchy to ensure that every domain in CATH has a unique CATHsolid identification code (see Table 1). Specific details on the nature of the SOLID-levels can be found in the 'General Information' section of the CATH website, http://www.cathdb.info. CATH only includes experimentally determined protein structures with a 4 Å resolution or better, 40 residues in length or longer, and having 70% or more side chains resolved.
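To make the nine-level scheme and the inclusion criteria concrete, here is a minimal sketch. It assumes a CATHsolid code is written as nine dot-separated integers in the order C, A, T, H, S, O, L, I, D; that textual format, the function names and the example code are assumptions for illustration, not something stated in the paragraph above. The inclusion check covers only the three criteria quoted in the text.

```python
LEVELS = ("C", "A", "T", "H", "S", "O", "L", "I", "D")

# Sequence identity cut-offs (percent) quoted in the text for the S, O, L, I levels.
# The D level is a counter, so it has no cut-off.
SOLI_IDENTITY_PERCENT = {"S": 35, "O": 60, "L": 95, "I": 100}


def parse_cathsolid(code):
    """Map a dotted nine-field code onto the level names (format is assumed)."""
    parts = code.split(".")
    if len(parts) != len(LEVELS):
        raise ValueError("expected %d fields, got %d" % (len(LEVELS), len(parts)))
    return {level: int(part) for level, part in zip(LEVELS, parts)}


def meets_inclusion_criteria(resolution_angstrom, n_residues, fraction_side_chains_resolved):
    """Apply the three criteria quoted in the text.

    How structures without a resolution value (e.g. NMR) are handled is not
    described in the text, so this sketch simply requires a number.
    """
    return (
        resolution_angstrom <= 4.0
        and n_residues >= 40
        and fraction_side_chains_resolved >= 0.70
    )


if __name__ == "__main__":
    # Hypothetical identifier, used only to show the parsing.
    print(parse_cathsolid("3.40.50.200.1.2.3.4.5"))
    print(meets_inclusion_criteria(2.1, 159, 0.98))
```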
Table 1. CATH version 3.0 statistics
DOMAIN BOUNDARY ASSIGNMENTS
We have further improved our automated domain boundary prediction method, CATHEDRAL (4). This is used to search a newly determined multi-domain structure against a library of representative structures from different fold groups in the CATH database to recognise constituent domains. CATHEDRAL performs an initial rapid secondary structure comparison between structures, using graph theory to identify putative fold matches, which are then more carefully aligned using a slower, more accurate dynamic programming method. A new scoring scheme has been implemented which combines information on the class of the domains being compared, their sizes, similarity in the structural environments and the number of equivalent residues. A support vector machine is used to combine the different scores and select the best fold match for each putative domain in a new multi-domain protein.
Benchmarking against a set of 964 'difficult' multi-domain chains, whose 1593 constituent domains were remotely related to folds in CATH (<35% sequence identity) and originated from 245 distinct fold groups and 462 superfamilies, showed that 90% of domains within these chains could be assigned to the correct fold group, and for 78% of them the domain boundaries were within ±15 residues of boundaries assigned by careful manual validation. Larger variations in domain boundaries are often due to the fact that in many families significant structural variation can occur during evolution, so that distant relatives vary considerably in size. If no close relative has been classified in CATH, it is likely that the only CATHEDRAL match will be to a relative with significant structural embellishments, thus making it harder to determine the correct boundaries.
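The ±15 residue agreement used in the benchmark above can be expressed as a simple check. This is only a sketch of one plausible reading: it assumes each chain's boundaries are given as a list of residue positions and that predicted and curated boundaries are compared pairwise in sorted order. The function name and data layout are hypothetical.

```python
def boundaries_within_tolerance(predicted, curated, tolerance=15):
    """Return True if every predicted boundary lies within `tolerance`
    residues of the corresponding manually curated boundary.

    A differing number of boundaries is treated as a disagreement.
    """
    if len(predicted) != len(curated):
        return False
    pairs = zip(sorted(predicted), sorted(curated))
    return all(abs(p - c) <= tolerance for p, c in pairs)


if __name__ == "__main__":
    # Hypothetical three-domain chain: two boundaries each.
    print(boundaries_within_tolerance([120, 260], [130, 250]))
    print(boundaries_within_tolerance([120, 260], [150, 250]))
```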
Since domain boundary assignment of remote homologues is one of the most time-consuming stages in the classification, we combine multiple information for each new structure on a web page to guide manual curation. Pages display scores from a range of algorithms, which include structure-based methods [CATHEDRAL (4), SSAP (6,7), DETECTIVE (8), PUU (9), DOMAK (10)], sequence-based methods such as hidden Markov models (HMMs) (11), and relevant literature. These pages are now viewable for information on putative boundaries for new multi-domain structures currently
being classified in CATH (e.g. http://www.cathdb.info/cgi-bin/cath/Chain.pl?chain_id=2g3aA).
For protein chains which are closely related to chains that are already chopped in CATH, an automated protocol has been developed (ChopClose). ChopClose identifies any previously chopped chains that have sufficiently high sequence identity and overlap with the query chain. Using SSAP (6), the query is aligned against each of these chains in turn, and in each case the domain boundaries are inherited across the alignment. The process of inheriting the boundaries often requires some adjustments to be made to account for insertions, deletions or unresolved residues. If the inheritance from one of the chains meets various criteria (SSAP score ≥80, sequence identity >80%, RMSD ≤6.0 Å, longest end extension ≤10 residues, etc.), then the resulting boundaries are used to chop the chain automatically (AutoChop). For cases where ChopClose's best result does not meet all the criteria for automatic chopping, it is provided as support information for a manual domain boundary assignment.
Refer to Figure 3 for the location of AutoChop/ChopClose within the CATH update protocol.
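The AutoChop decision described above amounts to a threshold test on each inherited result. The sketch below encodes only the four criteria listed in the text, which ends its list with "etc.", so the real protocol applies further checks not shown here. The class and function names are hypothetical, and ranking candidates by SSAP score to pick the "best" result is an assumption.

```python
from dataclasses import dataclass


@dataclass
class InheritanceResult:
    """Scores for boundaries inherited from one previously chopped chain."""
    ssap_score: float
    sequence_identity: float      # percent
    rmsd: float                   # in Angstrom
    longest_end_extension: int    # residues


def can_autochop(result):
    """Test the four thresholds quoted in the text (others exist but are not listed)."""
    return (
        result.ssap_score >= 80
        and result.sequence_identity > 80
        and result.rmsd <= 6.0
        and result.longest_end_extension <= 10
    )


def triage(results):
    """Return (best_result, route).

    Route is 'AutoChop' if any result passes, 'manual-support' if the best
    result is only passed on as evidence for a curator, and 'manual' if
    there are no candidates. Ranking by SSAP score is an assumption.
    """
    if not results:
        return None, "manual"
    passing = [r for r in results if can_autochop(r)]
    pool = passing if passing else results
    best = max(pool, key=lambda r: r.ssap_score)
    return best, ("AutoChop" if passing else "manual-support")
```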
NEW HOMOLOGUE RECOGNITION METHODS
We have assessed a number of HMM-based protocols for improving homologue recognition. A new protocol (Samosa), exploiting models built using multiple structure alignments to improve accuracy, gives some improvements in sensitivity (4–5%). However, a protocol exploiting an 8-fold expanded HMM library based on sequence relatives of structural domains gives an increase of nearly 10% in sensitivity (12). In addition, HMM–HMM-based approaches have been implemented using the PRC protocol of Madera and co-workers (http://supfam.org/PRC). These allow recognition of extremely remote homologues, some of which are not easily detected by the structure comparison methods (discussed further below). HMM-based database scans developed for the CATH classification protocol are collectively referred to as HMMscan below.
For some very remotely related homologues, confidence in an assignment can be improved by combining information from multiple prediction methods. We have investigated the benefits of using machine learning methods to do this automatically. A neural network was trained using a dataset of 14 000 diverse homologues (<35% sequence identity) and 14 000 non-homologous pairs, with data from different homologue comparison methods including structure comparison (CATHEDRAL, SSAP), sequence comparison (HMM–HMM) and information on functional similarity. The latter was obtained by comparing EC classification codes between close relatives of the distant homologues and using a semantic similarity scoring
scheme for comparing GO terms based on a method developed by Lord et al. (13). On a separate validation set of 14 000 homologous pairs and 14 000 non-homologous pairs, 97% of the homologues can be recognised at an error rate of <4%.
NEW UPDATE PROTOCOL
Automatic methods
Previously, CATH data was generated using a group of independent programs and flat files. Over the past two years we have developed an update protocol for CATH that is driven by a suite of programs with a central library and a PostgreSQL database system. A classification pipeline has been established which links, in a completely automated fashion, the different programs that analyse the sequences and structures of both protein chains and domains. The CATH update protocol can essentially be divided into two parts: domain boundary assignment and domain homology classification (see Figure 3). The aim of the protocol is to minimise manual assignment and provide as much support as possible when manual validation is necessary.
Processing in both parts of the classification protocol is similar, requiring related meta-data and the triggering of the same automated algorithms. Methods include pairwise sequence similarity comparisons and scans by other homologue detection or fold recognition algorithms, such as HMMscan and CATHEDRAL, that provide data for either manual or automated assignment. Many of the automated steps in the protocol have been established as a web service, and the pipeline integrates the automated steps together with 'holding stages' in which domains are held prior to processing and await the completion of manual validation of predictions (see below).
Web pages to support manual validation
For each manual stage in the classification (domain boundary assignment, DomChop, and homologue recognition, HomCheck), we have developed a suite of web pages bringing together all available meta-data from prediction algorithms (e.g. DBS, CATHEDRAL and HMMscan for DomChop; CATHEDRAL and HMMscan for HomCheck), information from the literature and information from other family classifications with relevant data (e.g. Pfam). For each protein or domain shown on the pages, information on the statistical significance of matches is presented. The web pages will shortly be made viewable and will provide interim data on protein chains and domains not fully classified in CATH, for biologists interested in any entries pending classification.
OVERVIEW OF THE CURRENT RELEASE (VERSION 3.0)
Assigning domain boundaries and relationships between protein structures is computationally challenging. Since the last CATH release (version 2.6), the number of domains in the CATH database has increased by 20% in version 3.0 and now totals 86 151. This is a more than 10-fold increase in the number of domains classified in CATH since its creation. Improvements in automation, and also in the web-based resources used to aid manual validation, have allowed us to increase the proportion of hard-to-classify structures processed in CATH, and this is reflected in a significant increase in the proportion of new folds in the database (now more than 1000). The detailed breakdown of numbers of domains in the nine CATH levels is given in Table 1. We conducted an analysis of the domains in version 3.0 and have derived statistics for several fundamental features.
Percentage of new topologies
An analysis of the percentage of new folds arising from the early 1970s to the present is shown in Figure 1. The number of new folds has been decreasing over time with respect to the number of new structures being deposited, and it can be seen that currently approximately 2% of new structures classified in CATH are observed to be novel folds. For comparison, the number of domain structures solved over time is also graphically represented in Figure 1.
Number of domains within a protein chain
Integral to the construction of the CATH database is designating domain boundaries. We conducted an analysis of the number of chains versus the number of domains in a chain. It is interesting to note that 64% of all protein structures currently solved and classified in CATH are single-domain chains (data not shown). The next most prevalent are two-domain chains (27%), and following this we find that the number of chains containing three or more domains rapidly decreases. The average size of the single-domain chains is 159 residues in length.
The CATH Dictionary of Homologous Superfamilies (CATH-DHS)
The CATH-DHS has also been recently updated. Data on structural similarity and superfamily variability is presented as a significant update to the Dictionary of Homologous Superfamilies (DHS) web resource (14). The DHS also provides functional annotations of domains within each H-level (superfamily) in CATH v2.5.1.
For each superfamily, pair-wise structural similarity scores between relatives, measured by SSAP, are presented. The DHS now contains 3307 multiple structural alignments for 1459 superfamilies. For each superfamily, multiple alignments are generated for all the relatives and also for subgroups of structurally similar relatives and sequence-similar relatives. Alignments are performed using the residue-based CORA algorithm (15) and presented both as CORAPLOTS (14) and in the form of a 2DSEC diagram (16), alongside co-ordinate data of the superposed structures in PDB format. Sequence representations of the alignments are available to download in FASTA format. In the CORAPLOT images of the multiple alignment, residues in each domain are coloured according to ligand binding and residue type. EquivSEC plots are also shown that describe the variability in orientation and packing between equivalent secondary structures (16).
To identify sequence relatives for CATH superfamilies, sequences from UniProt (17) were scanned against HMMs of all CATH domains (12). Homologous sequences were identified as those hits with an E-value <0.01 and a 60% residue overlap with the CATH domain. This protocol recognised over one million domain sequences in UniProt which could be integrated in the CATH-DHS. The harvested sequences in each superfamily were compared against other relatives by BLAST (18) to determine the pair-wise sequence identity and then clustered at appropriate levels of sequence identity (35 and 95%) using multi-linkage clustering. Information and links to other functional databases, ENZYME (19), GO (Gene Ontology Consortium, 2000), KEGG (20), COG (21) and SWISS-PROT (22), are also included by BLASTing the sequences from each superfamily against sequences provided by these resources. Only hits with 95% sequence identity and an 80% residue overlap were used to annotate sequences.
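The two steps in the paragraph above, filtering HMM hits and then clustering the harvested sequences by pairwise identity, can be sketched as follows. Several things here are assumptions rather than facts from the text: the E-value cut-off is read as 0.01, the overlap is treated as a minimum fraction, and "multi-linkage clustering" is interpreted as requiring a sequence to reach the identity threshold against every existing member of a cluster before joining it. The greedy, order-dependent loop is a simplification for illustration and is not the CATH implementation.

```python
def is_homologous_hit(e_value, overlap_fraction, max_e_value=0.01, min_overlap=0.60):
    """Keep a hit only if it is significant and covers enough of the domain."""
    return e_value < max_e_value and overlap_fraction >= min_overlap


def multi_linkage_clusters(sequence_ids, identity, threshold):
    """Greedy grouping in which a sequence joins a cluster only if its
    identity to every current member is at least `threshold` (percent).

    `identity(a, b)` is a caller-supplied function, e.g. a lookup into
    pairwise BLAST results.
    """
    clusters = []
    for seq in sequence_ids:
        for cluster in clusters:
            if all(identity(seq, member) >= threshold for member in cluster):
                cluster.append(seq)
                break
        else:
            # No existing cluster accepted the sequence: start a new one.
            clusters.append([seq])
    return clusters


if __name__ == "__main__":
    # Hypothetical pairwise identities (percent) for three sequences.
    table = {
        frozenset(("a", "b")): 96,
        frozenset(("a", "c")): 96,
        frozenset(("b", "c")): 50,
    }
    lookup = lambda x, y: table[frozenset((x, y))]
    print(multi_linkage_clusters(["a", "b", "c"], lookup, 95))
    print(multi_linkage_clusters(["a", "b", "c"], lookup, 35))
```

Requiring agreement with every member, rather than with any one member, is what stops a chain of pairwise-similar sequences from merging relatives that are themselves below the threshold.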
Recent analysis of structural and functional divergence in highly populated CATH superfamilies (>5 structural relatives with <35% sequence identity) has been undertaken using data from the DHS. The 2DSEC algorithm was used to analyse multiple structural alignments of families and identify highly conserved structural cores and secondary structure embellishments or decorations to the common core. In some large superfamilies, extensive embellishments were observed outside the core, and although these secondary structure insertions were frequently discontinuous in the protein chain, they were often co-located in 3D space (16). In many cases, manual inspection revealed that the embellishment had aggregated to form a larger structural feature that was modifying the active site of the domain or creating new surfaces for domain or protein interactions. Data collected in the DHS clearly shows a relationship between structural divergence within a superfamily, sequence divergence of this superfamily amongst predicted domains in the genomes, and the number of distinct functional groups that can be identified for the superfamily (see Figure 4).
Figure 4. Relationship between sequence variability, structural variability and functional diversity in CATH superfamilies. Structural variation in a CATH superfamily, as measured by the number of diverse structural subgroups (SSAP score <80 between groups)
LATERAL LINKS ACROSS THE CATH HIERARCHY TO CAPTURE EVOLUTIONARY DIVERGENCE AND EXPLORE THE STRUCTURAL CONTINUUM
Our analysis of structural divergence in CATH superfamilies (16) has revealed families where significant changes in the structures had occurred. In some cases 5-fold differences in the sizes of domains were identified, and sometimes it was apparent that the 'folds' of these very diverse relatives had effectively changed. Therefore, in these superfamilies more than one fold group can be identified, effectively breaking the hierarchical nature of the CATH classification, which implies that each relative within a CATH homologous superfamily should belong to the same CAT fold group.
In addition, an 'all versus all' HMM–HMM scan between all superfamily representatives revealed several cases of extremely remote homologues which had been classified into separate superfamilies and yet match with significant E-values. Structure comparison had failed to detect the relationship between these superfamilies because the structural divergence of the relatives was so extreme, sometimes constituting a change in architecture as well as fold group. In these cases homology was only suggested by the HMM-based scans and then manually validated by considering functional information and detailed evidence from the literature. In order to capture information on these distant homologies, links have been created between the superfamilies, both on our web pages and in the CATH database. The data can now be found as a link from the CATH homepage (http://www.cathdb.info).
In the near future we also plan to provide web pages presenting cases of significant structural overlaps between superfamilies or fold groups. For these cases we are not currently able to find any additional evidence to support a distant evolutionary relationship, and these examples highlight the recurrence of large structural motifs between some folds and the existence of a structural continuum in some regions of fold space.
FEATURES
The CATH database can be accessed at http://www.cathdb.info. The web interface may be browsed or, alternatively, searched with PDB codes or CATH domain identifiers. There is also a facility for keyword searches. With the version 3.0 release we now make the raw and processed data files available, which include, for example, CATH domain PDB files, sequences and DSSP files, and they can be accessed through the CATH database main page. The Gene3D resource can be accessed through the CATH database or directly at http://www.cathdb.info/Gene3D. The DHS can be accessed through the CATH database or directly at http://www.cathdb.info/bsm/dhs.
Structural similarity is assessed using an automatic method (SSAP) (3,4), which scores 100 for identical proteins and generally returns scores above 80 for homologous proteins. More distantly related folds generally give scores above 70 (Topology or fold level), though in the absence of any sequence or functional similarity this may simply represent examples of convergent evolution, reinforcing the hypothesis that there exists a limited number of folds in nature (5,6).
Abstract
We report the latest release (version 1.4) of the CATH protein domains database (http://www.biochem.ucl.ac.uk/bsm/cath). This is a hierarchical classification of 13 359 protein domain structures into evolutionary families and structural groupings. We currently identify 827 homologous families in which the proteins have both structural similarity and sequence and/or functional similarity. These can be further clustered into 593 fold groups and 32 distinct architectures. Using our structural classification and associated data on protein functions stored in the database (EC identifiers, SWISS-PROT keywords and information from the Enzyme database and literature), we have been able to analyse the correlation between the 3D structure and function. More than 96% of folds in the PDB are associated with a single homologous family. However, within the superfolds three or more different functions are observed. Considering enzyme functions, more than 95% of clearly homologous families exhibit either single or closely related functions, as demonstrated by the EC identifiers of their relatives. Our analysis supports the view that determining structures, for example as part of a 'structural genomics' initiative, will make a major contribution to interpreting genome data.
Introduction FOR CATH
The CATH classification of protein domain structures was established in 1993 (1) as a hierarchical clustering of protein domain structures into evolutionary families and structural groupings, depending on sequence and structure similarity. There are four major levels, corresponding to protein class, architecture, topology or fold, and homologous family (Fig. 1). Since 1995, information about these structural groups and protein families has been accessible over the Web (http://www.biochem.ucl.ac.uk/bsm/cath), together with summary information about each individual protein structure (PDBsum) (2).
CATH consists of both phylogenetic and phenetic descriptors for protein domain relationships. At the lowest levels in the hierarchy, proteins are grouped into evolutionary families
(Homologous families) for having either significant sequence similarity (≥35% identity) or high structural similarity and some sequence similarity (≥20% identity).
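The homologous-family rule just stated can be written as a two-branch test. The text does not give a number for "high structural similarity" at this point; the sketch borrows the SSAP score of 80 that the earlier passage associates with homologous proteins, which is an assumption, as are the function and parameter names.

```python
def same_homologous_family(sequence_identity, ssap_score, ssap_cutoff=80.0):
    """Apply the two criteria from the text.

    sequence_identity: pairwise identity in percent.
    ssap_score: structural similarity score (100 for identical proteins).
    ssap_cutoff: assumed stand-in for 'high structural similarity'.
    """
    if sequence_identity >= 35:
        # Significant sequence similarity alone is sufficient.
        return True
    # Otherwise require high structural similarity plus some sequence similarity.
    return ssap_score > ssap_cutoff and sequence_identity >= 20


if __name__ == "__main__":
    print(same_homologous_family(40, 60))
    print(same_homologous_family(25, 85))
    print(same_homologous_family(15, 90))
```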
The CATH database of protein structures contains approximately 18 000 domains organized according to their (C)lass, (A)rchitecture, (T)opology and (H)omologous superfamily. Relationships between evolutionarily related structures (homologues) within the database have been used to test the sensitivity of various sequence search methods in order to identify relatives in GenBank and other sequence databases. Subsequent application of the most sensitive and efficient algorithms, gapped BLAST and the profile-based method Position Specific Iterated Basic Local Alignment Tool (PSI-BLAST), could be used to assign structural data to between 22 and 36% of microbial genomes in order to improve functional annotation and enhance understanding of biological mechanism. However, on a cautionary note, an analysis of functional conservation within fold groups and homologous superfamilies in the CATH database revealed that whilst function was conserved in nearly 55% of enzyme families, function had diverged considerably in some highly populated families. In these families, functional properties should be inherited far more cautiously, and the probable effects of substitutions in key functional residues must be carefully assessed.
Figure 1
Schematic representation of the (C)lass, (A)rchitecture and (T)opology/fold levels in the CATH database.
Figure 2
Snapshot of a web page showing data available in the CATH dictionary of homologous superfamilies for the subtilisin family (CATH id 3.40.50.200). Tables display the PDB codes for non-identical relatives in the family, together with EC identifier codes and information about the enzyme reactions. The multiple structural alignment shown has been coloured according to secondary structure assignments (red for helix, blue for strands).
The Architecture level in CATH groups proteins whose folds have similar 3D arrangements of secondary structures (e.g. barrel, sandwich or propeller), regardless of their connectivity, whilst the top level, Class, simply reflects the proportion of α-helix or β-strand secondary structures. Three major classes are recognised (mainly-α, mainly-β and α-β), since analysis revealed considerable overlap between the α+β and alternating α/β classes originally described by Levitt and Chothia (7).
Before classification, multidomain proteins are first separated into their constituent folds using a consensus method which seeks agreement between three independent algorithms (8). Whilst the protocol for updating CATH is largely automatic (9), several stages require manual validation, in particular establishing domain boundaries in proteins for which no consensus could be reached and checking the relationships of very distant homologues and proteins having borderline fold similarity. Although there are plans to assign the more regular architectures automatically, all architecture groupings are currently assigned manually.
A homologous family Dictionary is now available within CATH, which contains functional data, where available, for each protein within a homologous family. This includes EC identifiers, SWISS-PROT keywords and information from the Enzyme database or the literature (Fig. 2). Multiple structure-based alignments are also available, coloured according to secondary structure assignments or residue properties, and there are schematic plots showing domain representations annotated by protein-ligand interactions (DOMPLOTS) (A.E. Todd, C.A. Orengo and J.M. Thornton, submitted to Protein Engng). The topology of each domain is illustrated by schematic TOPS diagrams (http://www3.ebi.ac.uk/tops; 10).
Figure 3
CATH wheel plot showing the population of homologous families in different fold groups, architectures and classes. The wheel is coloured according to protein class (red, mainly-α; green, mainly-β; yellow, α-β; blue, few secondary structures). The size of the outer wheel represents the number of homologous families in CATH, whilst each band in the outer wheel corresponds to a single fold family. The size of each fold 'band' therefore reflects the number of homologous families having that fold. It can be seen that most fold families contain a single homologous family. The superfold families are shown as paler bands containing many homologous families. The inner wheel shows the population of homologous families in the different architectures.
We have also recently set up a Web Server (11) which enables the user to scan the CATH database with a newly determined protein structure and identify possible fold similarities or evolutionary relationships. There are also plans to incorporate sequence searches (using BLAST or PSI-BLAST) (12) to identify a probable fold for a new sequence.
The latest release of CATH (version 1.4, April 1998) contains 9342 protein chains from the PDB (13), which divide into 13 359 domain folds. Currently 32 different architectures are recognised. Since the last release three new architectures have been described, including the five-bladed α-β propeller. Grouping proteins on the basis of sequence, structure and functional similarity gives 827 evolutionary homologous families (H-level), whilst recognising more distant structural similarity with no accompanying sequence or function similarity gives rise to 593 different fold groups (T-level).
The population of the different levels in the CATH hierarchy is illustrated by the CATH wheel shown in Figure 3. It can be seen that several highly populated fold families, which we describe as superfolds (6) as they support a diverse range of sequences and more than three different functions, still account for nearly 30% of non-homologous structures.
Implications for Structural Genomics
As the sequence databases grow rapidly, the need to interpret these sequences and assign functions to specific genes becomes increasingly important. Many techniques exist for matching protein sequences and thereby inheriting functional information. However, for very distant homologues there is often no detectable sequence similarity despite conservation of 3D structure and function. For these cases evolutionary relationships, and thereby functions, can only be assigned by comparing the structures. Therefore a number of structural genomics initiatives are being proposed (14) which aim to identify all the folds in nature, with the ultimate goal of being able to predict the function of a new protein from its known or probable structure. The important questions to ask are: how many more folds do we need to determine before we have the complete set, and how confident can we be in assigning function between proteins having similar structures?
In the current genomes, on average only 30-46% of sequences can be assigned to a structural family by recognising sequence similarity to a protein of known structure (15,16). With only ~600 unique structures currently in the PDB, compared with ~20 000 sequence families, it is clear that we still need to determine many more structures if we are to understand biology at the molecular level. However, analysis of recently deposited structural data is very revealing. Figure 4a illustrates the distribution of 2159 new structural domains classified in the 10 months from June 1997 to March 1998. A large proportion of these (79%) were clearly homologous (≥30% identity) to proteins of known structure.
Of the remaining 443 structures (Fig. 4b), corresponding to 'new' sequences, we found only 8% were novel folds, the remainder resembling a previously determined structure. Many of these, 199 (45%), could be identified as clear homologues by having significant structure and sequence similarity (SSAP score ≥80 and ≥20% sequence identity). A further 169 (38%) were probable homologues as, although the sequence identity was below 20%, they had functional similarity and/or gave significant scores using sequence search methods designed to detect very distant homologues (PSI-BLAST) (12). There remained a further 40 (9%) proteins which were analogous, i.e. they had the same fold as a previous entry but neither the sequence nor the function gave definite evidence of a common ancestor.
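The four categories described above can be written as a small decision rule. The following is a hedged sketch only: the function name, its arguments and the order of the checks are assumptions made for illustration, and borderline cases that the text does not cover (for example a low SSAP score with more than 20% identity and no other evidence) are simply treated as analogues here. The last lines re-derive the quoted percentages from the counts given in the text.

```python
def classify_new_structure(same_fold, ssap_score, seq_identity,
                           functional_similarity=False,
                           distant_search_hit=False):
    """Assign a structure with a 'new' sequence to one of four categories.

    same_fold             -- the structure resembles a previously determined fold
    ssap_score            -- structural similarity score (0-100)
    seq_identity          -- percent sequence identity after structural alignment
    functional_similarity -- evidence of a shared function
    distant_search_hit    -- significant score from a distant-homologue search
    """
    if not same_fold:
        return "novel fold"
    if ssap_score >= 80 and seq_identity >= 20:
        return "clear homologue"
    if functional_similarity or distant_search_hit:
        return "probable homologue"
    return "analogue"


# Counts taken from the text; the novel-fold count is what remains of 443.
counts = {"clear homologue": 199, "probable homologue": 169, "analogue": 40}
counts["novel fold"] = 443 - sum(counts.values())
percentages = {name: round(100 * n / 443) for name, n in counts.items()}
print(percentages)
```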
At the homologous superfamily level in CATH, a more detailed analysis of enzyme functions showed that the majority of homologous enzyme families in CATH (>90%) contained proteins for which the first three EC identifiers were the same. Considering those families where homologues have significant sequence identity (≥20%) after structural alignment, 95% were found to have a single EC identifier, whilst for families where proteins have more than 30% sequence similarity we observed that 98% had a single EC code.
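The comparison of "the first three EC identifiers" can be sketched as follows. This is an illustrative helper, not the authors' procedure; the function names are hypothetical, and the example EC numbers (3.4.21.62 for subtilisin, 3.4.21.64 for endopeptidase K, 1.1.1.1 for alcohol dehydrogenase) are used purely as sample input.

```python
def ec_prefix(ec_number, levels=3):
    """Return the first `levels` fields of an EC number such as '3.4.21.62'."""
    return tuple(ec_number.strip().split(".")[:levels])


def share_ec_prefix(ec_numbers, levels=3):
    """True if every EC number in a non-empty family agrees on the first `levels` fields."""
    prefixes = {ec_prefix(ec, levels) for ec in ec_numbers}
    return len(prefixes) == 1


# Two serine endopeptidases agree at three levels but not at four.
family = ["3.4.21.62", "3.4.21.64"]
print(share_ec_prefix(family, levels=3), share_ec_prefix(family, levels=4))
```

Setting `levels=4` corresponds to asking whether a family has a single EC code, the stricter criterion quoted for families with higher sequence identity.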
Although assigning function on the basis of homology is common practice, it is clear that some caution should be exercised, particularly where there is little or no sequence similarity. There are also some clear examples where homologues with significant sequence similarity perform different functions. The role of 'gene recruitment' is especially clear in the eye lens proteins, which function as enzymes in other cellular environments but which are used as structural proteins in this context (17). The extent of such 'gene recruitment' and context-sensitive function is really not known at this time. For enzymes, it is clear that catalytic function can change and evolve, usually to act on a different but related substrate. Similarly, within the lipocalin family (CATH id 2.40.130.10) several proteins are found with very similar structures which bind different fatty acids in the same region at the base of the β-barrel (e.g. retinol, bilin, biotin).
Nearly half of the homologous families where two or more different EC numbers were observed belong to the superfolds. This suggests that if a new protein is assigned to a superfold family, more caution should be used when inheriting functional information, as there appears to be greater tolerance to changes in sequence, and ultimately function, for these families. However, it is interesting to note that many of these were TIM barrel or Rossmann folds. These are superfolds in which the substrate or ligand commonly binds in the same place: in the base of the β-barrel for the TIMs and at the crossover of the polypeptide chain for the doubly wound Rossmann structures.
Assignment of Function Through Structure
One of the reasons for determining structures is to derive more information to facilitate the assignment of function. From our analysis of proteins in CATH we suggest that structural data can help to assign function in several ways.
i. The structural data allow recognition of more distant homologues compared with sequence data; in our analysis 83% of structures with novel sequences could be assigned as homologues in this way (note that such assignment of function is again subject to the caveats imposed by 'gene recruitment' discussed above).
ii. The structural data allow detailed inspection of the functional site, to suggest if and how the function may have evolved. For example, if an enzyme has evolved to act on a different substrate, the binding site may reveal, or at least suggest, possible changes in the substrate.
iii. For the superfolds, similarity of structure does not necessarily mean similarity of function. However, the active site/binding sites are often conserved, e.g. in the TIM barrel or Rossmann fold structures the ligand always binds at the same end of the barrel or sheet.
iv. Some methods have already been developed, and will increasingly be the focus of attention over the next few years, which aim to predict function ab initio from structure. For example, enzymes can often be identified by the presence of a major cleft, which also locates the active site (18). Similarly, critical surface patches which are used for molecular recognition in binding other proteins or ligands may be identified using knowledge-based approaches (19,20).
In summary, extrapolating the data from Figure 4 to a new genome, we can expect that, of the 54-70% of sequences which currently have no obvious sequence matches in the PDB, we will find nearly 80-90% to be homologous to a known family using the structural data alone. For the singlet folds this will almost certainly reveal some clues to the function. For the superfolds, some folds will reveal information on the functional class (e.g. enzyme for TIM barrels) or the location of the active site, if not the specific function. Only 10-20% will be expected to be 'novel' folds. For these, the ab initio methods referred to above may provide some clues to guide experiments. Therefore it is clear that determining structures, as part of a 'structural genomics' initiative for example, will make a major contribution to interpreting genome data.
JAIL
Just Another Interface Library
Interfaces of macromolecules are a valuable basis for analysing the process of molecular recognition. JAIL classifies not only the interfaces between domain architectures but also those between protein chains and those between proteins and nucleic acids.
Figure: Gt-alpha/Gi-alpha chimera (PDB-ID 1GOT) and the interfaces of 1GOT.
Interacting proteins are difficult to crystallize and are rarely present within the Protein Data Bank. Nevertheless, it is essential to analyse the interacting parts of the proteins to understand the process of protein-protein docking.
To overcome this problem we have built up the JAIL database. Since interacting domains exhibit structural features similar to those of proteins, all known interfaces between interacting domains of the SCOP database were extracted and classified in JAIL.
Only a part of all protein structures is included in SCOP. In particular, new PDB entries are not yet annotated. To overcome this problem, all interfaces between protein chains were additionally calculated and included in the database. This type of interface also comprises the interacting parts of the assumed biological units. The last important type of interface provided here is composed of the interacting parts between proteins and nucleic acids. Overall, the data set consists of about 180 000 interfaces.
JAIL is a convenient tool for browsing through the interface library and analysing single interfaces. However, more general questions require large-scale analysis. For this purpose, a detailed form enables the compiling of comprehensive non-redundant data sets for download.
How is an interface defined?
A complete residue is part of an interface if at least one atom of the amino acid is located within 4.5 Angstrom of any atom of the interacting domain or chain. One part of an interface must consist of at least 5 C-alpha atoms in the case of protein chains. In the case of nucleic acids, the relevant atoms are calculated on the basis of the phosphorus atoms of the RNA/DNA backbone.
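The distance criterion above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the JAIL implementation: the input format (a dictionary from a residue label to a list of atom coordinates), the function names and the reading of "at least 5 C-alpha atoms" as "at least 5 residues on that side" are all choices made here for the example. A real calculation over whole structures would use a spatial index rather than comparing every pair of atoms.

```python
import math

CUTOFF_ANGSTROM = 4.5   # distance threshold quoted in the text
MIN_RESIDUES = 5        # one C-alpha per residue, so 5 C-alpha atoms = 5 residues


def interface_residues(side, partner, cutoff=CUTOFF_ANGSTROM):
    """Return residues of `side` with at least one atom within `cutoff` of `partner`.

    Both arguments map a residue label to a list of (x, y, z) atom coordinates.
    """
    partner_atoms = [xyz for atoms in partner.values() for xyz in atoms]
    return [
        residue_id
        for residue_id, atoms in side.items()
        if any(math.dist(a, b) <= cutoff for a in atoms for b in partner_atoms)
    ]


def is_large_enough(residues, minimum=MIN_RESIDUES):
    """Apply the minimum-size rule to one side of a protein interface."""
    return len(residues) >= minimum


# Toy coordinates: residue A1 is 3 Angstrom from B1, residue A2 is 7 Angstrom away.
chain_a = {"A1": [(0.0, 0.0, 0.0)], "A2": [(10.0, 0.0, 0.0)]}
chain_b = {"B1": [(3.0, 0.0, 0.0)]}
print(interface_residues(chain_a, chain_b))
```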
What are biounits?
The primary coordinate file deposited in the PDB generally contains one asymmetric unit. The asymmetric unit is the smallest portion of a crystal structure to which crystallographic symmetry can be applied to generate the unit cell. The biological molecule (biounit) is believed to be the functional unit of the protein. Frequently, those units can be assumed or calculated when additional information is available. The biological units of many proteins are deposited in a separate section of the PDB database and can be used for interface calculations.
In which way are redundant interfaces excluded in the download section?
Redundancy is excluded in two different ways: by structure and by sequence. The sequence clustering is based on the CD-HIT program. The structural clustering is defined by the protein families and superfamilies of the SCOP classification. The database classifies proteins by domain architecture.
Which settings in the download section are best for my own research?
The selection of the datasets depends on the type of interactions (protein-protein or protein-nucleic acid) and the level of diversity that is desired. A maximum sequence identity of 50% results in a higher diversity than a setting of 95%. The default settings include interfaces of domain-domain interactions as well as interfaces between interacting chains. All interfaces of chains that were already covered by the SCOP domain interfaces are excluded by default. This procedure results in a high number of interfaces that are still diverse enough for statistical analysis.
What is meant by 'show conservation' in Jmol?
The conservation of protein sequences is defined by the mutation rates at each amino acid position. For JAIL this information was retrieved from ConSurf. ConSurf is a derived database merging structural and sequence information.
Database scheme
Search for:
PDB-Id (e.g. 1aay)
SCOP-Id (e.g. d1az0a_)
EC-Number (e.g. 3.1.1)
Accession number (e.g. P03697)
Protein name (e.g. capsid protein)
Search in the following interface types: Domain/Domain (SCOP), Chain/Chain, Protein/Nucleic, Biounit/Biounit
Full-text search and SCOP text search: by keyword
Consider only interfaces having the following location: intra, inter, don't care
MMDB
Experimentally resolved structures of proteins, RNA and DNA, derived from the Protein Data Bank (PDB), with value-added features such as explicit chemical graphs and computationally identified 3D domains (compact substructures) that are used to identify similar 3D structures, as well as links to literature, similar sequences, information about chemicals bound to the structures, and more. These connections make it possible, for example, to find 3D structures for homologs of a protein sequence of interest, then interactively view the sequence-structure relationships, active sites, bound chemicals, journal articles and more.
Three-dimensional structures are now known within many protein families, and it is quite likely, in searching a sequence database, that one will encounter a homolog with known structure. The goal of Entrez's 3D-structure database is to make this information, and the functional annotation it can provide, easily accessible to molecular biologists. To this end Entrez's search engine provides three powerful features: (i) Sequence and structure neighbors: one may select all sequences similar to one of interest, for example, and link to any known 3D structures. (ii) Links between databases: one may search by term matching in MEDLINE, for example, and link to 3D structures reported in these articles. (iii) Sequence and structure visualization: identifying a homolog with known structure, one may view molecular-graphic and alignment displays to infer approximate 3D structure. In this article we focus on two features of Entrez's Molecular Modeling Database (MMDB) not described previously: links from individual biopolymer chains within 3D structures to a systematic taxonomy of organisms represented in molecular databases, and links from individual chains (and compact 3D domains within them) to structure neighbors, other chains (and 3D domains) with similar 3D structure. MMDB may be accessed at http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Structure
SUPERFAMILY is a database of structural and functional annotation for all proteins and genomes.[1][2][3][4][5]
The SUPERFAMILY annotation is based on a collection of hidden Markov models, which represent structural protein domains at the SCOP superfamily level.[6] A superfamily groups together domains which have an evolutionary relationship. The annotation is produced by scanning protein sequences from completely sequenced genomes against the hidden Markov models.
For each protein you can:
Submit sequences for SCOP classification
View domain organisation, sequence alignments and protein sequence details
For each genome you can:
Examine superfamily assignments, phylogenetic trees, domain organisation lists and networks
Check for over- and under-represented superfamilies within a genome
For each superfamily you can:
Inspect SCOP classification, functional annotation, Gene Ontology annotation, InterPro abstract and genome assignments
Explore taxonomic distribution of a superfamily across the tree of life
All annotation, models and the database dump are freely available for download to everyone.
Purpose
SUPERFAMILY classifies amino acid sequences into known structural domains, especially into SCOP superfamilies. The superfamilies are groups of proteins which have structural evidence to support a common evolutionary ancestor but may not have detectable sequence homology.
Major Features
Sequence search: Submit your protein or DNA sequence for SCOP superfamily and family level classification.
Keyword search: Search for superfamily, family or species names, plus sequence, SCOP, PDB or hidden Markov model IDs.
Domain assignments: Domain assignments, alignments and architectures for completely sequenced eukaryotic and prokaryotic organisms, plus sequence collections.
Comparative genomics tools: Browse unusual (over- and under-represented) superfamilies and families, adjacent domain pair lists and graphs, unique domain pairs, domain combinations, domain architecture co-occurrence networks and domain distribution across taxonomic kingdoms for each organism.
Genome statistics: For each genome: number of sequences, number of sequences with assignment, percentage of sequences with assignment, percentage total sequence coverage, number of domains assigned, number of superfamilies assigned, number of families assigned, average superfamily size, percentage produced by duplication, average sequence length, average length matched, number of domain pairs and number of unique domain architectures.
Gene Ontology: Domain-centric Gene Ontology (GO), automatically annotated, by Hai Fang.
Phenotype Ontology: Domain-centric phenotype/anatomy ontology, including Disease Ontology, Human Phenotype, Mouse Phenotype, Worm Phenotype, Yeast Phenotype, Fly Phenotype, Fly Anatomy, Zebrafish Anatomy, Xenopus Anatomy and Arabidopsis Plant.
Superfamily annotation: InterPro abstracts for 1052 superfamilies and Gene Ontology (GO) annotation for 763 superfamilies.
Functional annotation: Functional annotation of SCOP 1.73 superfamilies by Christine Vogel.
Phylogenetic trees: Trees are generated using heuristic parsimony methods and are based on protein domain architecture data for all genomes in SUPERFAMILY. Genome combinations or specific clades can be displayed as individual trees.
Similar domain architectures: Find the 10 domain architectures which are most similar to a domain architecture of interest.
Hidden Markov models: Produce SCOP domain assignments for your sequences using the SUPERFAMILY models. HMM visualisation by Martin Madera, e.g. model 0045110.
Profile comparison: Find remote domain matches when the HMM search fails to find a significant match. Profile comparison (PRC), for aligning and scoring two profile hidden Markov models, by Martin Madera.
Web services: Distributed Annotation Server and linking to SUPERFAMILY.
Downloads: Sequences, assignments, models, MySQL database and scripts, updated weekly.
The SUPERFAMILY database provides protein domain assignments, at the SCOP superfamily level, for the predicted protein sequences in over 400 completed genomes. A superfamily groups together domains of different families which have a common evolutionary ancestor based on structural, functional and sequence data. SUPERFAMILY domain assignments are generated using an expert curated set of profile hidden Markov models. All models and structural assignments are available for browsing and download from http://supfam.org. The web interface includes services such as domain architectures and alignment details for all protein assignments, searchable domain combinations, domain occurrence network visualization, detection of over- or under-represented superfamilies for a given genome by comparison with other genomes, assignment of manually submitted sequences and keyword searches. In this update we describe the SUPERFAMILY database and outline two major developments: (i) incorporation of family level assignments and (ii) a superfamily-level functional annotation. The SUPERFAMILY database can be used for general protein evolution and superfamily-specific studies, genomic annotation, and structural genomics target suggestion and assessment.
o The longest member of each protein cluster is used as the representative.
o Only representative cluster members and single proteins (i.e. proteins which are not members of any clusters) are used in all domain architecture queries and for domain counts in the annotation pages.
Even though the number of proteins in the database has almost doubled (the current version has around 2.9 million proteins), the redundancy should be minimal.
Domain architecture invention dating
As a further step from the single domain to the understanding of multi-domain proteins, SMART now predicts the taxonomic class where the concept of a protein, that is its domain architecture, was invented. The domain architecture is defined as the linear order of all SMART domains in the protein sequence. To derive the point of its invention, all proteins with the same domain architecture are mapped onto NCBI's taxonomy. The last common ancestor of all organisms containing at least one protein with the domain architecture is defined as the point of its origin.
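The dating step described above amounts to finding the deepest taxonomic node shared by every organism that carries the architecture. The sketch below illustrates that idea only; it is not SMART's code. It assumes each organism is given as a root-first list of taxon names, and the shortened lineages in the example are toy data rather than full NCBI taxonomy records.

```python
def last_common_ancestor(lineages):
    """Return the deepest taxon shared by all root-first lineages, or None."""
    if not lineages:
        raise ValueError("at least one lineage is required")
    shared = None
    # zip stops at the shortest lineage, so only comparable ranks are visited.
    for taxa_at_depth in zip(*lineages):
        if all(taxon == taxa_at_depth[0] for taxon in taxa_at_depth):
            shared = taxa_at_depth[0]
        else:
            break
    return shared


# Toy lineages for three organisms that contain the same domain architecture.
human = ["cellular organisms", "Eukaryota", "Metazoa", "Chordata"]
fly = ["cellular organisms", "Eukaryota", "Metazoa", "Arthropoda"]
yeast = ["cellular organisms", "Eukaryota", "Fungi"]

print(last_common_ancestor([human, fly]))         # Metazoa
print(last_common_ancestor([human, fly, yeast]))  # Eukaryota
```

Adding a single protein from a more distant organism moves the inferred point of origin towards the root, which is why the result depends on how completely the architecture has been sampled.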
Changes from version 4.0 to 4.1
Two modes of operation: Normal and Genomic
For more details, visit the change mode page.
Intrinsic protein disorder prediction
You can now include DisEMBL prediction of protein disorder in your search parameters. The results are displayed as blue regions in protein schematics. DisEMBL's HOTLOOPS and REM465 methods are currently used. Visit the DisEMBL page for more information on the method.
Catalytic activity check
SMART now includes data on specific requirements for catalytic activity for some of our domains (50 catalytic domains at the moment). If the required amino acids are not present in the predicted domain, it will be marked as 'Inactive'. The domain annotation page will show you the details on which amino acids are missing and the links to relevant literature.
Taxonomic trees
Architecture query results are now displayed as simple taxonomic trees. In addition to individual proteins, you can select any taxonomic node (or multiple nodes) and display all the proteins in those nodes. The Evolution section of domain annotation pages is now also represented as a taxonomic tree. The basic tree contains only several hand-picked representative species, with a link to the full tree.
User interface redesigned
SMART has been completely rewritten and all pages conform to the XHTML 1.0 Strict and CSS Level 2 standards. We recommend a modern standards-compliant browser for the best experience. Mozilla Firefox and Opera are our favorites.
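The catalytic activity check listed above can be pictured as a lookup of required residues at given positions. The following is a hypothetical sketch, not SMART's method: the function name, the use of 1-based positions within the predicted domain sequence and the example requirements are all assumptions made for illustration (a real check would more likely refer to alignment columns of the domain model).

```python
def missing_catalytic_residues(domain_sequence, required):
    """Return the requirements that the predicted domain does not satisfy.

    required maps a 1-based position to the set of amino acids allowed there.
    An empty result means every required residue is present.
    """
    missing = {}
    for position, allowed in required.items():
        in_range = 1 <= position <= len(domain_sequence)
        if not in_range or domain_sequence[position - 1] not in allowed:
            missing[position] = allowed
    return missing


def activity_label(domain_sequence, required):
    """Label a predicted domain 'Inactive' if any required residue is missing."""
    return "Inactive" if missing_catalytic_residues(domain_sequence, required) else ""


# Made-up requirement: Asp at position 2 and His at position 5.
requirements = {2: {"D"}, 5: {"H"}}
print(missing_catalytic_residues("GASGHA", requirements))
```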
Changes from version 3.5 to 4.0
Intron positions are shown in protein schematics
For proteins that match any of the Ensembl predictions, SMART will show intron positions as vertical coloured lines in graphical representations. This information is retrieved from a pre-calculated mapping of Ensembl gene structures to protein sequences.
The vertical line at the end of the protein is not an actual intron, but a mark to show that intron mapping was performed. If that is the only line, there are no introns annotated. If there is no line at all, there is no data available in Ensembl for that particular sequence.
You can switch off intron display on your SMART preferences page.
Alternative splicing information
Since SMART now incorporates Ensembl genomes, the Additional information page shows a list of alternative splices of the gene encoding the analyzed protein (if there are any). It is possible either to display SMART protein annotation for any of the alternative splices or to get a graphical multiple sequence alignment of all of them.
Orthology information
There are 2 separate sets of orthologs for each Ensembl protein: 1:1 reciprocal best matches in other genomes, and orthologous groups with reciprocal best hits from all genomes analyzed (i.e. each of these proteins has exactly one ortholog in all 6 genomes). This data is displayed on the Additional information page.
Graphical multiple sequence alignments
Orthologous groups and different alternative splices can be displayed as graphical multiple sequence alignments. Proteins are aligned using ClustalW. Domains, intrinsic features and introns are mapped onto the alignment, with their positions adjusted according to gaps (black boxes).
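Adjusting feature positions "according to gaps" means converting a residue number in the ungapped protein into a column number in the aligned row. A minimal sketch of that conversion is given below; the function names and the toy aligned sequence are illustrative assumptions, not part of SMART or ClustalW.

```python
def residue_to_column(aligned_sequence, position, gap="-"):
    """Map a 1-based residue position to its 1-based alignment column."""
    residues_seen = 0
    for column, symbol in enumerate(aligned_sequence, start=1):
        if symbol != gap:
            residues_seen += 1
            if residues_seen == position:
                return column
    raise ValueError(f"sequence has fewer than {position} residues")


def map_feature(aligned_sequence, start, end):
    """Map a feature given as residue start/end onto alignment columns."""
    return (residue_to_column(aligned_sequence, start),
            residue_to_column(aligned_sequence, end))


# Toy aligned row: residues M K L V A with gaps inserted by the alignment.
row = "MK--LV-A"
print(map_feature(row, 3, 5))  # residues 3..5 occupy columns 5..8
```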
Changes from version 3.4 to 3.5
Features of all Ensembl genomes are stored in SMART
You can use standard Ensembl protein identifiers (for example ENSP00000264122) and sequences in all queries.
Batch access
Quickly display results for hundreds of sequences. Currently limited to 500 sequences per access and 2000 per day.
Smarter handling of identical sequences
If there are multiple IDs associated with a particular sequence, an extra table will be displayed showing all of them.
Changes from version 3.3 to 3.4
Search structure-based profiles using RPS-Blast
Clicking on the 'search schnipsel and structures' checkbox will now also initiate a search of profiles based on SCOP domain families using RPS-Blast. These profiles were kindly provided by Steffen Schmidt (see Schmidt et al., J Chem Inf Comput Sci 2002 (42) 405-7).
Improved Architecture analysis
In addition to standard Domain selection querying, it is now possible to do queries based on GO (Gene Ontology) terms associated with domains. In the first step you get a list of domains matching the GO terms entered. After selecting the domains of interest from the list, proteins containing those domains are displayed. Use the Taxonomic selection box to limit the results based on taxonomic ranges.
Pfam domains are stored in the database
The SMART database now contains precomputed results for all Pfam domains. To use Pfam domains in the architecture queries, prepend the domain name with 'Pfam:' (for example 'TyrKc AND Pfam:Fz AND TRANS').
Try our SMART Toolbar for the Mozilla web browser.
Changes from version 3.2 to 3.3
Fantastic new protein picture generator
Proteins are now displayed as dynamically generated PNG images. This means that you can download the entire protein representation as a single image. You have our permission to do this and use these diagrams in any way that you like; do acknowledge us, though. All domain bubbles have been script-generated using The Gimp and its Perl-Fu extension. The script is GPL'd and is available for download.
Fancy colour alignments
All alignments we generate are coloured in using Leo Goodstadt's excellent CHROMA program. You may experience some problems if you're using a clunky browser. This will be fixed when you change your browser.
Excellent transmembrane domains
These are now calculated using the fine TMHMM2 program, kindly provided by Anders Krogh and co-workers.
Selective SMART
We now store intrinsic features such as transmembrane domains and signal peptides in our database. This means that these can be queried for (e.g. 'SIGNAL AND TyrKc'). The feature names are: signal peptide, SIGNAL; transmembrane domain, TRANS; coiled coil, COIL.
Technical changes
SMART code has been modified to run under the Apache mod_perl module. SMART is now using Apache::DBI for persistent database connections. The database engine has been updated to PostgreSQL 7.1. These changes resulted in significant speed improvements.
Bug fixes, as usual
The show_many_proteins script now uses the POST method, so there is no longer a limit on the number of proteins you can display. Fixed a display problem with proteins having thousands of different representations (if you try to display those, make sure you have a good browser, like Mozilla or Opera, and a bunch of RAM).
Changes from version 3.1 to 3.2
Start page
There is now an indicator of the current database status in the header. If the database is down or is being updated, you'll know immediately.
Literature
Changed literature identifiers to PMID. New secondary literature generator that parses all neighbouring papers, not just the first 100.
Numerous small bug fixes and improvements.
Changes from version 3.0 to 3.1
Startup page
The start page now includes selective SMART and allows you to search for keywords in the annotation of domains.
Schnipsel Blast
The results of a schnipsel Blast search are included in the bubble diagram. Additionally, PDB is searched.
Taxonomic breakdown
When selecting multiple proteins, e.g. via selective SMART or the annotation pages, an overview of the taxonomy of all species ('tax break') is offered.
Links
Links don't contain version numbers. This allows stable links from external sources.
Selective SMART
Allows you to search for multiple copies of domains and is case insensitive. You can now search for e.g. 'Sh3 AND sH3 AND sh3'.
Annotation
You can align your query sequence to the SMART alignment using hmmalign.
Update of underlying database
SMART now uses PostgreSQL 6.5.2.
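The selective SMART behaviour noted in this list (terms joined by AND, case-insensitive names, and repeated terms meaning multiple copies of a domain) can be illustrated with a small matcher. This is a sketch under assumptions, not SMART's query engine: the function name is hypothetical, only the AND operator is handled, and a protein is represented simply as a list of its domain and feature names.

```python
from collections import Counter


def matches_query(query, features):
    """Check an 'X AND Y AND X' style query against a protein's features.

    Names are compared case-insensitively, and a term repeated n times
    requires at least n copies of that domain or feature in the protein.
    """
    required = Counter(term.strip().lower() for term in query.split(" AND "))
    present = Counter(name.lower() for name in features)
    # Counter subtraction keeps only terms whose requirement is not yet met.
    return not (required - present)


two_sh3 = ["SH3", "SH3", "SH2", "TyrKc"]
three_sh3 = ["SH3", "SH3", "SH3"]

print(matches_query("Sh3 AND sH3 AND sh3", two_sh3))    # False: only two copies
print(matches_query("Sh3 AND sH3 AND sh3", three_sh3))  # True
```

Using a multiset (`Counter`) rather than a plain set is what lets a repeated term express a copy-number requirement.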
Changes from version 2.0 to 3.0
Digest output
SMART now only produces a single diagram representing a best interpretation of all the annotation that has been performed. A comprehensive summary of the results is also provided in table format.
Selective SMART
Selective SMART allows you to look for proteins with combinations of specific domains in different species or taxonomic ranges.
Alert SMART
The SMART database is updated about once a week. If you are interested in specific domains, or combinations of them, in specific taxonomic ranges, you can use the SMART alerting service. This provides the identities of newly deposited proteins that match your query.
Domain queries
You can ask for proteins having the same domain order or composition as your query protein.
SMARTed Genomes
You can get the result of SMARTing the genomes of Caenorhabditis elegans and Saccharomyces cerevisiae via the annotation pages.
Faster PFAM searches
The PFAM searches now run on a PVM cluster.
Changes from version 1.03 to 2.0
Domain coverage
The original set of signalling domains has now been extended to include extracellular domains.
Default search method is now HMMer
The previously used underlying methodology of SMART (i.e. SWise) tended to over-extend gaps, leading to problems in defining domain borders. Although a SWise search of the SMART database remains possible (using the 'Wise searching SMART database' option available from the Home Page), the default searching method is now HMMer2. This allows improved statistical estimates (Expectation- or E-values) of the significance of a domain hit.
Complete rewrite of the annotation pages
As with the previous version (1.03), annotation pages provide information concerning domain functions and include hyperlinks to PubMed. However, we now offer automatically derived data showing the taxonomic range and the predicted cellular localisation of proteins containing the domain in question.
Literature database
In addition to the manually derived literature sources given in version 1.03, we now provide automatically derived secondary literature from the annotation pages. These are extracted from PubMed and are cross-linked to additional abstracts.
The CATH domain structure database: new protocols and classification levels give a more comprehensive resource for exploring evolution
Lesley H. Greene, Tony E. Lewis, Sarah Addou, Alison Cuff, Tim Dallman, Mark Dibley, Oliver Redfern, Frances Pearl, Rekha Nambudiry, Adam Reid, Ian Sillitoe, Corin Yeats, Janet M. Thornton and Christine A. Orengo
WHAT'S NEW
We report the latest release (version 3.0) of the CATH protein domain database (http://www.cathdb.info). There has been a 20% increase in the number of structural domains classified in CATH, up to 86 151 domains. Release 3.0 comprises 1110 fold groups and 2147 homologous superfamilies. To cope with the increases in diverse structural homologues being determined by the structural genomics initiatives, more sensitive methods have been developed for identifying boundaries in multi-domain proteins and for recognising homologues. The CATH classification update is now being driven by an integrated pipeline that links these automated procedures with validation steps that have been made easier by the provision of information-rich web pages summarising comparison scores and relevant links to external sites for each domain being classified. An analysis of the population of domains in the CATH hierarchy and several domain characteristics are presented for version 3.0. We also report an update of the CATH Dictionary of Homologous Structures (CATH-DHS), which now contains multiple structural alignments, consensus information and functional annotations for 1459 well-populated superfamilies in CATH. CATH is directly linked to the Gene3D database, which is a projection of CATH structural data onto ∼2 million sequences in completed genomes and UniProt.
GENERAL INTRODUCTION
The number of new structures being deposited in the Protein Data Bank (PDB) continues to grow at a considerable rate. In addition, structures being targeted by worldwide structural genomics initiatives are more likely to be novel or only very remotely related to domains previously classified. Only ∼2% of structures currently solved by conventional crystallography or NMR are likely to adopt novel folds (see Figures 1 and 2). A higher proportion of new folds is expected among structures solved by structural genomics. Although the influx of more diverse structures and subsequent analysis will inform our understanding of how domains evolve, it has resulted in increasing lags between the numbers of structures being deposited and classified. In response to this situation we have significantly improved our automated and manual protocols for domain boundary assignment and homologue recognition.
Figure 1. Annual decrease in the percentage of new structures classified in CATH which are observed to possess a novel fold. The raw data for years 1972–2005 was fit to a single exponential equation by nonlinear regression using Sigma Plot (SPSS Version
Figure 2. Annual proportion of protein structures deposited in the PDB which are classified in CATH, rejected or pending classification. The colour scheme reflects different categories of PDB chains. Black: not accepted by the CATH criteria. Red: unprocessed chains.
Significant changes have been implemented in the CATH classification protocol to achieve a more highly automated system. First, a seamless flow of structures between the constituent programs has been achieved by building a pipeline which integrates web services for each major comparison stage in the classification (see Figure 3). Secondly, completely automatic decisions are now being made for new protein chains with close relatives already assigned in the CATH database. There are two situations that preclude the CATH update process from being fully automated. We rely on expert manual curation for particularly challenging protein domain boundary assignments (DomChop stage) and also for classifications of remote folds and homologues (HomCheck stage). These two manual stages will remain an integral part of the system (Figure 3).
Figure 3. Flow diagram of the CATH classification pipeline. This schematic illustrates the processes involved in classifying newly determined structures in CATH. The CATH update protocol workflow, from new chain to assigned domain, is split into two main processes.
In this paper we report our ongoing development of the automated procedures. These critical new features should better enable CATH to keep pace with the PDB (3) and facilitate its development. Key statistics on the domain structure populations and characteristics are also presented.
A REVISED CATH CLASSIFICATION HIERARCHY: CATHSOLID
In order to provide information on the sequence diversity between superfamily members, we have introduced additional levels into the CATH hierarchy. The CATH hierarchical classification scheme now consists of nine levels. Class is derived from secondary structure content, and Architecture describes the gross orientation of secondary structures, independent of connectivity. The Topology level clusters structures into fold groups according to their topological connections and numbers of secondary structures. The Homologous superfamilies cluster proteins with highly similar structures, sequences and/or functions (4,5). The new extension of the CATH classification system now includes five 'SOLID' sequence levels. The S, O, L and I levels further divide domains within the H-level using multi-linkage clustering based on similarities in sequence identity (35%, 60%, 95% and 100%) (see Table 1). The D-level acts as a counter within the I-level and is appended to the classification hierarchy to ensure that every domain in CATH has a unique CATHsolid identification code (see Table 1). Specific details on the nature of the SOLID-levels can be found in the 'General Information' section of the CATH website, http://www.cathdb.info. CATH only includes experimentally determined protein structures with a resolution of 4 Å or better, 40 residues in length or longer, and having 70% or more side chains resolved.
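The nine-level scheme can be pictured as a nine-part identifier. The sketch below is only an illustration: the dot-separated layout, the class name `CathSolidId` and the helper names are assumptions made here, not a format taken from the CATH documentation.

```python
from typing import NamedTuple


class CathSolidId(NamedTuple):
    """The nine levels of a CATHSOLID code, in hierarchy order."""
    class_: int
    architecture: int
    topology: int
    homologous_superfamily: int
    s35: int   # S level: 35% sequence identity cluster
    o60: int   # O level: 60% sequence identity cluster
    l95: int   # L level: 95% sequence identity cluster
    i100: int  # I level: 100% sequence identity cluster
    d: int     # D level: counter within the I level


def parse_cathsolid(code: str) -> CathSolidId:
    """Split a nine-part code into named levels.

    The dot-separated layout is an assumption made for this sketch.
    """
    parts = code.strip().split(".")
    if len(parts) != 9:
        raise ValueError(f"expected 9 levels, got {len(parts)}: {code!r}")
    return CathSolidId(*(int(p) for p in parts))


def same_cluster(a: CathSolidId, b: CathSolidId, depth: int) -> bool:
    """True if two domains share the first `depth` levels (1 = Class, 9 = D)."""
    return a[:depth] == b[:depth]
```

For example, two domains whose codes agree on the first five parts would fall in the same 35% sequence identity cluster, while differing at the sixth part would place them in different 60% clusters.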
Table 1. CATH version 3.0 statistics
DOMAIN BOUNDARY ASSIGNMENTS
We have further improved our automated domain boundary prediction method, CATHEDRAL (4). This is used to search a newly determined multi-domain structure against a library of representative structures from different fold groups in the CATH database to recognise constituent domains. CATHEDRAL performs an initial rapid secondary structure comparison between structures, using graph theory to identify putative fold matches, which are then more carefully aligned using a slower, more accurate dynamic programming method. A new scoring scheme has been implemented which combines information on the class of the domains being compared, their sizes, similarity in the structural environments and the number of equivalent residues. A support vector machine is used to combine the different scores and select the best fold match for each putative domain in a new multi-domain protein.
Benchmarking against a set of 964 'difficult' multi-domain chains, whose 1593 constituent domains were remotely related to folds in CATH (<35% sequence identity) and originated from 245 distinct fold groups and 462 superfamilies, showed that 90% of domains within these chains could be assigned to the correct fold group, and for 78% of them the domain boundaries were within ±15 residues of boundaries assigned by careful manual validation. Larger variations in domain boundaries are often due to the fact that in many families significant structural variation can occur during evolution, so that distant relatives vary considerably in size. If no close relative has been classified in CATH, it is likely that the only CATHEDRAL match will be to a relative with significant structural embellishments, thus making it harder to determine the correct boundaries.
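A boundary-agreement check of the kind used in such a benchmark might look like the following sketch. It assumes each domain is given as an ordered list of (start, end) residue numbers per segment; the function name and the data layout are illustrative, not taken from CATHEDRAL.

```python
def boundaries_agree(predicted, reference, tolerance=15):
    """Compare predicted and manually validated domain boundaries.

    Both arguments are lists of (start, end) residue numbers, one pair per
    segment, in the same order. Every start and end must lie within
    `tolerance` residues of its reference value.
    """
    if len(predicted) != len(reference):
        return False
    return all(
        abs(p_start - r_start) <= tolerance and abs(p_end - r_end) <= tolerance
        for (p_start, p_end), (r_start, r_end) in zip(predicted, reference)
    )
```

Treating a mismatch in the number of segments as a disagreement is a design choice of this sketch; a real benchmark could score such cases differently.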
Since domain boundary assignment of remote homologues is one of the most time-consuming stages in the classification, we combine multiple sources of information for each new structure on a web page to guide manual curation. Pages display scores from a range of algorithms, which include structure-based methods (CATHEDRAL (4), SSAP (6,7), DETECTIVE (8), PUU (9), DOMAK (10)), sequence-based methods such as hidden Markov models (HMMs) (11), and relevant literature. These pages are now viewable for information on putative boundaries for new multi-domain structures currently being classified in CATH (e.g. http://www.cathdb.info/cgi-bin/cath/Chain.pl?chain_id=2g3aA).
For protein chains which are closely related to chains that are already chopped in CATH, an automated protocol has been developed (ChopClose). ChopClose identifies any previously chopped chains that have sufficiently high sequence identity and overlap with the query chain. Using SSAP (6), the query is aligned against each of these chains in turn, and in each case the domain boundaries are inherited across the alignment. The process of inheriting the boundaries often requires some adjustments to be made to account for insertions, deletions or unresolved residues. If the inheritance from one of the chains meets various criteria (SSAP score ≥80, sequence identity >80%, RMSD ≤6.0 Å, longest end extension ≤10 residues, etc.), then the resulting boundaries are used to chop the chain automatically (AutoChop). For cases where ChopClose's best result does not meet all the criteria for automatic chopping, it is provided as support information for a manual domain boundary assignment. Refer to Figure 3 for the location of AutoChop/ChopClose within the CATH update protocol.
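The automatic-versus-manual decision can be sketched as a simple threshold test. This is a minimal illustration using only the four criteria named in the text (the text ends the list with "etc.", so the real protocol applies more); the record type `InheritedChop`, the function names, and the choice of ranking candidates by SSAP score are assumptions of this sketch.

```python
from dataclasses import dataclass


@dataclass
class InheritedChop:
    """Scores for boundaries inherited from one chopped chain (hypothetical record)."""
    ssap_score: float
    sequence_identity: float    # percent
    rmsd: float                 # angstroms
    longest_end_extension: int  # residues


def passes_autochop(c: InheritedChop) -> bool:
    """Apply the four thresholds quoted in the text; other criteria are omitted."""
    return (
        c.ssap_score >= 80
        and c.sequence_identity > 80
        and c.rmsd <= 6.0
        and c.longest_end_extension <= 10
    )


def choose_chop(candidates):
    """Return (best candidate, mode), where mode is 'auto' or 'manual'.

    Picking the best candidate by SSAP score is an assumption of this sketch.
    """
    if not candidates:
        return None, "manual"
    best = max(candidates, key=lambda c: c.ssap_score)
    return best, ("auto" if passes_autochop(best) else "manual")
```

In the manual case the best candidate is still returned, mirroring the idea that it is passed on as supporting information for the curator.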
NEW HOMOLOGUE RECOGNITION METHODS
We have assessed a number of HMM-based protocols for improving homologue recognition. A new protocol (Samosa), exploiting models built using multiple structure alignments to improve accuracy, gives some improvements in sensitivity (4–5%). However, a protocol exploiting an 8-fold expanded HMM library based on sequence relatives of structural domains gives an increase of nearly 10% in sensitivity (12). In addition, HMM–HMM-based approaches have been implemented using the PRC protocol of Madera and co-workers (http://supfam.org/PRC). These allow recognition of extremely remote homologues, some of which are not easily detected by the structure comparison methods (discussed further below). HMM-based database scans developed for the CATH classification protocol are collectively referred to as HMMscan below.
For some very remotely related homologues, confidence in an assignment can be improved by combining information from multiple prediction methods. We have investigated the benefits of using machine learning methods to do this automatically. A neural network was trained using a dataset of 14 000 diverse homologues (<35% sequence identity) and 14 000 non-homologous pairs, with data from different homologue comparison methods including structure comparison (CATHEDRAL, SSAP), sequence comparison (HMM–HMM) and information on functional similarity. The latter was obtained by comparing EC classification codes between close relatives of the distant homologues and using a semantic similarity scoring scheme for comparing GO terms, based on a method developed by Lord et al. (13). On a separate validation set of 14 000 homologous pairs and 14 000 non-homologous pairs, 97% of the homologues can be recognised at an error rate of <4%.
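Figures of this kind can be computed from scored pairs once a decision threshold is fixed. The sketch below assumes that "error rate" means the fraction of accepted pairs that are actually non-homologous; the text does not define the term, so this definition and the function name are assumptions.

```python
def sensitivity_and_error_rate(homolog_scores, nonhomolog_scores, threshold):
    """Evaluate a score threshold on labelled validation pairs.

    Sensitivity: fraction of true homologous pairs scoring at or above the
    threshold. Error rate (assumed definition): fraction of all accepted
    pairs that are non-homologous.
    """
    true_pos = sum(s >= threshold for s in homolog_scores)
    false_pos = sum(s >= threshold for s in nonhomolog_scores)
    sensitivity = true_pos / len(homolog_scores)
    accepted = true_pos + false_pos
    error_rate = false_pos / accepted if accepted else 0.0
    return sensitivity, error_rate
```

Sweeping the threshold over the classifier's output range would trace out the trade-off between the two quantities.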
NEW UPDATE PROTOCOL
Automatic methods
Previously, CATH data was generated using a group of independent programs and flat files. Over the past two years we have developed an update protocol for CATH that is driven by a suite of programs with a central library and a PostgreSQL database system. A classification pipeline has been established which links, in a completely automated fashion, the different programs that analyse the sequences and structures of both protein chains and domains. The CATH update protocol can essentially be divided into two parts: domain boundary assignment and domain homology classification (see Figure 3). The aim of the protocol is to minimise manual assignment and provide as much support as possible when manual validation is necessary.
Processing of both parts of the classification protocol is similar, requiring related meta-data and the triggering of the same automated algorithms. Methods include pairwise sequence similarity comparisons and scans by other homologue detection or fold recognition algorithms, such as HMMscan and CATHEDRAL, that provide data for either manual or automated assignment. Many of the automated steps in the protocol have been established as web services, and the pipeline integrates the automated steps together with 'holding stages' in which domains are held prior to processing and await the completion of manual validation of predictions (see below).
Web pages to support manual validation
For each manual stage in the classification (domain boundary assignment, DomChop, and homologue recognition, HomCheck) we have developed a suite of web pages bringing together all available meta-data from prediction algorithms (e.g. DBS, CATHEDRAL, HMMscan for DomChop; CATHEDRAL, HMMscan for HomCheck), information from the literature, and information from other family classifications with relevant data (e.g. Pfam). For each protein or domain shown on the pages, information on the statistical significance of matches is presented. The web pages will shortly be made viewable and will provide interim data on protein chains and domains not fully classified in CATH for biologists interested in any entries pending classification.
OVERVIEW OF THE CURRENT RELEASE (VERSION 3.0)
Assigning domain boundaries and relationships between protein structures is computationally challenging. Since the last CATH release (version 2.6), the number of domains in the CATH database has increased by 20% in version 3.0 and now totals 86 151. This is a more than 10-fold increase in the number of domains classified in CATH since its creation. Improvements in automation, and also in the web-based resources used to aid manual validation, have allowed us to increase the proportion of hard-to-classify structures processed in CATH, and this is reflected in a significant increase in the proportion of new folds in the database (the total is now more than 1000). The detailed breakdown of numbers of domains in the nine CATH levels is given in Table 1. We conducted an analysis of the domains in version 3.0 and have derived statistics for several fundamental features.
Percentage of new topologies
An analysis of the percentage of new folds arising from the early 1970s to the present is shown in Figure 1. The number of new folds has been decreasing over time with respect to the number of new structures being deposited, and it can be seen that currently approximately 2% of new structures classified in CATH are observed to be novel folds. For comparison, the number of domain structures solved over time is also graphically represented in Figure 1.
Number of domains within a protein chain
Integral to the construction of the CATH database is designating domain boundaries. We conducted an analysis of the number of chains versus the number of domains in a chain. It is interesting to note that 64% of all protein structures currently solved and classified in CATH are single-domain chains (data not shown). The next most prevalent are two-domain chains (27%), and following this we find that the number of chains containing three or more domains rapidly decreases. The average size of the single-domain chains is 159 residues.
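A tally of this kind is straightforward once each chain has a domain count. The following sketch assumes a plain dictionary from chain identifier to number of domains; the function name and input layout are illustrative only.

```python
from collections import Counter


def domain_count_distribution(domains_per_chain):
    """Map number of domains to the percentage of chains with that many domains.

    `domains_per_chain` maps a chain identifier to its number of domains.
    """
    counts = Counter(domains_per_chain.values())
    total = sum(counts.values())
    return {n: 100.0 * c / total for n, c in sorted(counts.items())}
```

Applied to the full set of classified chains, the entries for one and two domains would correspond to the 64% and 27% figures quoted above.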
The CATH Dictionary of Homologous Superfamilies (CATH-DHS)
The CATH-DHS has also been recently updated. Data on structural similarity and superfamily variability is presented as a significant update to the Dictionary of Homologous Superfamilies (DHS) web resource (14). The DHS also provides functional annotations of domains within each H-level (superfamily) in CATH v2.5.1.
For each superfamily, pair-wise structural similarity scores between relatives, measured by SSAP, are presented. The DHS now contains 3307 multiple structural alignments for 1459 superfamilies. For each superfamily, multiple alignments are generated for all the relatives and also for subgroups of structurally similar relatives and sequence-similar relatives. Alignments are performed using the residue-based CORA algorithm (15) and presented both as CORAPLOTS (14) and in the form of a 2DSEC diagram (16), alongside co-ordinate data of the superposed structures in PDB format. Sequence representations of the alignments are available to download in FASTA format. In the CORAPLOT images of the multiple alignment, residues in each domain are coloured according to ligand binding and residue type. EquivSEC plots are also shown that describe the variability in orientation and packing between equivalent secondary structures (16).
To identify sequence relatives for CATH superfamilies, sequences from UniProt (17) were scanned against HMMs of all CATH domains (12). Homologous sequences were identified as those hits with an E-value < 0.01 and a 60% residue overlap with the CATH domain. This protocol recognised over one million domain sequences in UniProt which could be integrated in the CATH-DHS. The harvested sequences in each superfamily were compared against other relatives by BLAST (18) to determine the pair-wise sequence identity and then clustered at appropriate levels of sequence identity (35% and 95%) using multi-linkage clustering. Information and links to other functional databases, ENZYME (19), GO (Gene Ontology Consortium, 2000), KEGG (20), COG (21) and SWISSPROT (22), are also included by BLASTing the sequences from each superfamily against sequences provided by these resources. Only hits with 95% sequence identity and an 80% residue overlap were used to annotate sequences.
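The hit-selection rule can be sketched as a two-condition filter. The defaults below follow our reading of the thresholds in the text; the `Hit` record, its field names, and the definition of overlap as the fraction of the CATH domain covered by the alignment are assumptions of this sketch.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One HMM match of a sequence against a CATH domain (hypothetical record)."""
    sequence_id: str
    e_value: float
    aligned_residues: int  # domain residues covered by the match
    domain_length: int     # length of the CATH domain


def is_homologous(hit: Hit, max_e_value: float = 0.01, min_overlap: float = 0.60) -> bool:
    """Accept a hit if it is significant and covers enough of the domain."""
    overlap = hit.aligned_residues / hit.domain_length
    return hit.e_value < max_e_value and overlap >= min_overlap
```

The same function shape would serve for the annotation step by passing stricter values, with sequence identity added as a further condition.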
Recent analysis of structural and functional divergence in highly populated CATH superfamilies (>5 structural relatives with <35% sequence identity) has been undertaken using data from the DHS. The 2DSEC algorithm was used to analyse multiple structural alignments of families and identify highly conserved structural cores and secondary structure embellishments, or decorations, to the common core. In some large superfamilies extensive embellishments were observed outside the core, and although these secondary structure insertions were frequently discontinuous in the protein chain, they were often co-located in 3D space (16). In many cases manual inspection revealed that the embellishment had aggregated to form a larger structural feature that was modifying the active site of the domain or creating new surfaces for domain or protein interactions. Data collected in the DHS clearly shows a relationship between structural divergence within a superfamily, sequence divergence of this superfamily amongst predicted domains in the genomes, and the number of distinct functional groups that can be identified for the superfamily (see Figure 4).
Figure 4. Relationship between sequence variability, structural variability and functional diversity in CATH superfamilies. Structural variation in a CATH superfamily as measured by the number of diverse structural subgroups (SSAP score <80 between groups)
LATERAL LINKS ACROSS THE CATH HIERARCHY TO CAPTURE EVOLUTIONARY DIVERGENCE AND EXPLORE THE STRUCTURAL CONTINUUM
Our analysis of structural divergence in CATH superfamilies (16) has revealed families where significant changes in the structures had occurred. In some cases 5-fold differences in the sizes of domains were identified, and sometimes it was apparent that the 'folds' of these very diverse relatives had effectively changed. Therefore, in these superfamilies more than one fold group can be identified, effectively breaking the hierarchical nature of the CATH classification, which implies that each relative within a CATH homologous superfamily should belong to the same CAT fold group.
In addition, an 'all versus all' HMM–HMM scan between all superfamily representatives revealed several cases of extremely remote homologues which had been classified into separate superfamilies and yet match with significant E-values. Structure comparison had failed to detect the relationship between these superfamilies because the structural divergence of the relatives was so extreme, sometimes constituting a change in architecture as well as fold group. In these cases homology was only suggested by the HMM-based scans and then manually validated by considering functional information and detailed evidence from the literature. In order to capture information on these distant homologies, links have been created between the superfamilies, both on our web pages and in the CATH database. The data can now be found as a link from the CATH homepage (http://www.cathdb.info).
In the near future we also plan to provide web pages presenting cases of significant structural overlaps between superfamilies or fold groups. For these cases we are not currently able to find any additional evidence to support a distant evolutionary relationship, and these examples highlight the recurrence of large structural motifs between some folds and the existence of a structural continuum in some regions of fold space.
FEATURES
The CATH database can be accessed at http://www.cathdb.info. The web interface may be browsed or, alternatively, searched with PDB codes or CATH domain identifiers. There is also a facility for keyword searches. With the version 3.0 release we now make the raw and processed data files available, which include, for example, CATH domain PDB files, sequences and dssp files; they can be accessed through the CATH database main page. The Gene3D resource can be accessed through the CATH database or directly at http://www.cathdb.info/Gene3D. The DHS can be accessed through the CATH database or directly at http://www.cathdb.info/bsm/dhs.
Structural similarity is assessed using an automatic method (SSAP) (3,4), which scores 100 for identical proteins and generally returns scores above 80 for homologous proteins. More distantly related folds generally give scores above 70 (Topology or fold level), though in the absence of any sequence or functional similarity this may simply represent examples of convergent evolution, reinforcing the hypothesis that there exists a limited number of folds in nature (5,6).
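These score bands can be written down as a small lookup. The sketch below simply encodes the three cut-offs stated above; the function name and the wording of the returned labels are illustrative, and the word "generally" in the text means the bands are guidelines rather than hard rules.

```python
def interpret_ssap(score: float) -> str:
    """Give a rough reading of an SSAP structural similarity score (0 to 100)."""
    if not 0 <= score <= 100:
        raise ValueError(f"SSAP score out of range: {score}")
    if score == 100:
        return "identical"
    if score > 80:
        return "likely homologous"
    if score > 70:
        return "similar fold (Topology level)"
    return "no fold-level similarity indicated"
```

A score in the 70 to 80 band on its own does not establish common ancestry; as noted above, it may reflect convergence unless sequence or functional similarity supports it.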
Abstract
We report the latest release (version 1.4) of the CATH protein domains database (http://www.biochem.ucl.ac.uk/bsm/cath). This is a hierarchical classification of 13 359 protein domain structures into evolutionary families and structural groupings. We currently identify 827 homologous families in which the proteins have both structural similarity and sequence and/or functional similarity. These can be further clustered into 593 fold groups and 32 distinct architectures. Using our structural classification and associated data on protein functions stored in the database (EC identifiers, SWISS-PROT keywords and information from the Enzyme database and literature), we have been able to analyse the correlation between the 3D structure and function. More than 96% of folds in the PDB are associated with a single homologous family. However, within the superfolds, three or more different functions are observed. Considering enzyme functions, more than 95% of clearly homologous families exhibit either single or closely related functions, as demonstrated by the EC identifiers of their relatives. Our analysis supports the view that determining structures, for example as part of a 'structural genomics' initiative, will make a major contribution to interpreting genome data.
Introduction FOR CATH
The CATH classification of protein domain structures was established in 1993 (1) as a hierarchical clustering of protein domain structures into evolutionary families and structural groupings, depending on sequence and structure similarity. There are four major levels, corresponding to protein class, architecture, topology or fold, and homologous family (Fig. 1). Since 1995, information about these structural groups and protein families has been accessible over the Web (http://www.biochem.ucl.ac.uk/bsm/cath), together with summary information about each individual protein structure (PDBsum) (2).
CATH consists of both phylogenetic and phenetic descriptors for protein domain relationships. At the lowest levels in the hierarchy, proteins are grouped into evolutionary families (Homologous families) for having either significant sequence similarity (≥35% identity) or high structural similarity and some sequence similarity (≥20% identity).
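The grouping rule amounts to two alternative conditions. The sketch below encodes it with the two identity cut-offs from the text; the text does not quantify "high structural similarity" at this point, so the SSAP cut-off of 80 used as a default here, like the function name, is an assumption.

```python
def same_homologous_family(sequence_identity: float, ssap_score: float,
                           ssap_cutoff: float = 80.0) -> bool:
    """Decide whether two domains would be grouped as homologues.

    Either the sequence identity (percent) alone is significant, or high
    structural similarity is backed by some sequence similarity.
    `ssap_cutoff` is an assumed stand-in for "high structural similarity".
    """
    if sequence_identity >= 35:
        return True
    return ssap_score >= ssap_cutoff and sequence_identity >= 20
```

Pairs that fail both conditions may still share a fold; they would simply not be placed in the same homologous family on this evidence.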
The CATH database of protein structures contains approximately 18000 domains organized according to their (C)lass, (A)rchitecture, (T)opology and (H)omologous superfamily. Relationships between evolutionarily related structures (homologues) within the database have been used to test the sensitivity of various sequence search methods in order to identify relatives in GenBank and other sequence databases. Subsequent application of the most sensitive and efficient algorithms, gapped BLAST and the profile-based method Position-Specific Iterated Basic Local Alignment Search Tool (PSI-BLAST), could be used to assign structural data to between 22% and 36% of microbial genomes in order to improve functional annotation and enhance understanding of biological mechanism. However, on a cautionary note, an analysis of functional conservation within fold groups and homologous superfamilies in the CATH database revealed that whilst function was conserved in nearly 55% of enzyme families, function had diverged considerably in some highly populated families. In these families, functional properties should be inherited far more cautiously, and the probable effects of substitutions in key functional residues must be carefully assessed.
Figure 1
Schematic representation of the (C)lass, (A)rchitecture and (T)opology/fold levels in the CATH database.
Figure 2
Snapshot of a web page showing data available in the CATH dictionary of homologous superfamilies for the subtilisin family (CATH id 3.40.50.200). Tables display the PDB codes for non-identical relatives in the family, together with EC identifier codes and information about the enzyme reactions. The multiple structural alignment shown has been coloured according to secondary structure assignments (red for helix, blue for strands).
The Architecture level in CATH groups proteins whose folds have similar 3D arrangements of secondary structures (e.g. barrel, sandwich or propellor), regardless of their connectivity, whilst the top level, Class, simply reflects the proportion of α-helix or β-strand secondary structures. Three major classes are recognised, mainly-α, mainly-β and α-β, since analysis revealed considerable overlap between the α+β and alternating α/β classes originally described by Levitt and Chothia (7).
Before classification, multidomain proteins are first separated into their constituent folds using a consensus method which seeks agreement between three independent algorithms (8). Whilst the protocol for updating CATH is largely automatic (9), several stages require manual validation, in particular establishing domain boundaries in proteins for which no consensus could be reached and checking the relationships of very distant homologues and proteins having borderline fold similarity. Although there are plans to assign the more regular architectures automatically, all architecture groupings are currently assigned manually.
A homologous family Dictionary is now available within CATH which contains functional data, where available, for each protein within a homologous family. This includes EC identifiers, SWISS-PROT keywords and information from the Enzyme database or the literature (Fig. 2). Multiple structure-based alignments are also available, coloured according to secondary structure assignments or residue properties, and there are schematic plots showing domain representations annotated by protein-ligand interactions (DOMPLOTS) (A.E. Todd, C.A. Orengo and J.M. Thornton, submitted to Protein Engng). The topology of each domain is illustrated by schematic TOPS diagrams (http://www3.ebi.ac.uk/tops; 10).
Figure 3
CATH wheel plot showing the population of homologous families in different fold groups, architectures and classes. The wheel is coloured according to protein class (red, mainly-α; green, mainly-β; yellow, α-β; blue, few secondary structures). The size of the outer wheel represents the number of homologous families in CATH, whilst each band in the outer wheel corresponds to a single fold family. The size of each fold 'band' therefore reflects the number of homologous families having that fold. It can be seen that most fold families contain a single homologous family. The superfold families are shown as paler bands containing many homologous families. The inner wheel shows the population of homologous families in the different architectures.
We have also recently set up a Web Server (11) which enables the user to scan the CATH database with a newly determined protein structure and identify possible fold similarities or evolutionary relationships. There are also plans to incorporate sequence searches (using BLAST or PSI-BLAST) (12) to identify a probable fold for a new sequence.
The latest release of CATH (version 1.4, April 1998) contains 9342 protein chains from the PDB (13), which divide into 13 359 domain folds. Currently 32 different architectures are recognised. Since the last release, three new architectures have been described, including the five-bladed α-β propellor. Grouping proteins on the basis of sequence, structure and functional similarity gives 827 evolutionary homologous families (H-level), whilst recognising more distant structural similarity with no accompanying sequence or function similarity gives rise to 593 different fold groups (T-level).
The population of the different levels in the CATH hierarchy is illustrated by the CATH wheel shown in Figure 3. It can be seen that several highly populated fold families, which we describe as superfolds (6) as they support a diverse range of sequences and more than three different functions, still account for nearly 30% of non-homologous structures.
Implications for Structural Genomics
As the sequence databases grow rapidly, the need to interpret these sequences and assign functions to specific genes becomes increasingly important. Many techniques exist for matching protein sequences and thereby inheriting functional information. However, for very distant homologues there is often no detectable sequence similarity, despite conservation of 3D structure and function. For these cases, evolutionary relationships, and thereby functions, can only be assigned by comparing the structures. Therefore a number of structural genomics initiatives are being proposed (14) which aim to identify all the folds in nature, with the ultimate goal of being able to predict the function of a new protein from its known or probable structure. The important questions to ask are: how many more folds do we need to determine before we have the complete set, and how confident can we be in assigning function between proteins having similar structures?
In the current genomes, on average only 30–46% of sequences can be assigned to a structural family by recognising sequence similarity to a protein of known structure (15,16). With only ∼600 unique structures currently in the PDB compared with ∼20 000 sequence families, it is clear that we still need to determine many more structures if we are to understand biology at the molecular level.
However, analysis of recently deposited structural data is very revealing. Figure 4a illustrates the distribution of 2159 new structural domains classified in the 10 months from June 1997 to March 1998. A large proportion of these (79%) were clearly homologous (≥30% identity) to proteins of known structure.
Of the remaining 443 structures (Fig. 4b), corresponding to 'new' sequences, we found only 8% were novel folds, the remainder resembling a previously determined structure. Many of these, 199 (45%), could be identified as clear homologues by having significant structure and sequence similarity (SSAP ≥80 and ≥20% sequence identity). A further 169 (38%) were probable homologues as, although the sequence identity was below 20%, they had functional similarity and/or gave significant scores using sequence search methods designed to detect very distant homologues (PSI-BLAST) (12). There remained a further 40 (9%) proteins which were analogous, i.e. they had the same fold as a previous entry but neither the sequence nor the function gave definite evidence of a common ancestor.
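The breakdown of the 443 structures can be checked with a few lines of arithmetic. The sketch below uses only the counts quoted above; the number of novel folds is not stated in the text, so it is inferred here as the remainder, which is an assumption.

```python
# Counts quoted in the text for the 443 structures with 'new' sequences (Fig. 4b).
TOTAL_NEW = 443
counts = {
    "clear homologue": 199,
    "probable homologue": 169,
    "analogue": 40,
}
# Assumption: the novel-fold count is not stated, so it is taken as the remainder.
counts["novel fold"] = TOTAL_NEW - sum(counts.values())


def percent(part, total):
    """Percentage of `total` made up by `part`, rounded to the nearest integer."""
    return round(100 * part / total)


shares = {name: percent(n, TOTAL_NEW) for name, n in counts.items()}
homologue_share = percent(
    counts["clear homologue"] + counts["probable homologue"], TOTAL_NEW
)

for name, n in counts.items():
    print(f"{name}: {n} of {TOTAL_NEW} ({shares[name]}%)")
print(f"all homologues: {homologue_share}%")
```

Under that assumption the remainder is 35 structures, which rounds to the 8 quoted for novel folds, and clear plus probable homologues together come to 83 in every hundred.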
At the homologous superfamily level in CATH, a more detailed analysis of enzyme functions showed that the majority of homologous enzyme families in CATH (>90%) contained proteins for which the first three EC identifiers were the same. Considering those families where homologues have significant sequence identity (≥20%) after structural alignment, 95% were found to have a single EC identifier, whilst for families where proteins have more than 30% sequence similarity we observed that 98% had a single EC code.
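The test of whether the members of a family share their first three EC identifiers can be sketched as below. This is a minimal illustration rather than the method used in the analysis; the function names are hypothetical, and the EC numbers in the example (trypsin 3.4.21.4, chymotrypsin 3.4.21.1, papain 3.4.22.2) are used only as sample inputs.

```python
def ec_prefix(ec_number, levels=3):
    """Return the first `levels` fields of an EC number such as '3.4.21.4'."""
    fields = ec_number.strip().split(".")
    if len(fields) < levels:
        raise ValueError(f"EC number {ec_number!r} has fewer than {levels} fields")
    return tuple(fields[:levels])


def shares_ec_prefix(ec_numbers, levels=3):
    """True if every EC number in the collection agrees on its first `levels` fields."""
    return len({ec_prefix(ec, levels) for ec in ec_numbers}) == 1


# Sample inputs only; they are not taken from the CATH analysis.
print(shares_ec_prefix(["3.4.21.4", "3.4.21.1"]))            # same first three fields
print(shares_ec_prefix(["3.4.21.4", "3.4.22.2"]))            # differ at the third field
print(shares_ec_prefix(["3.4.21.4", "3.4.22.2"], levels=2))  # agree at two fields
```

Comparing prefixes rather than whole codes is what lets two enzymes acting on different but related substrates count as functionally consistent.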
Although assigning function on the basis of homology is common practice, it is clear that some caution should be exercised, particularly where there is little or no sequence similarity. There are also some clear examples where homologues with significant sequence similarity perform different functions. The role of 'gene recruitment' is especially clear in the eye lens proteins, which function as enzymes in other cellular environments but which are used as structural proteins in this context (17). The extent of such 'gene recruitment' and context-sensitive function is really not known at this time.
For enzymes it is clear that catalytic function can change and evolve, usually to act on a different but related substrate. Similarly, within the lipocalin family (CATH id 2.40.130.10) several proteins are found with very similar structures which bind different fatty acids in the same region at the base of the β-barrel (e.g. retinol, bilin, biotin).
Nearly half of the homologous families where two or more different EC numbers were observed belong to the superfolds. This suggests that if a new protein is assigned to a superfold family, more caution should be used when inheriting functional information, as there appears to be greater tolerance to changes in sequence, and ultimately function, for these families. However, it is interesting to note that many of these were TIM barrel or Rossmann folds. These are superfolds in which the substrate or ligand commonly binds in the same place. This is in the base of the β-barrel for the TIMs and at the crossover of the polypeptide chain for the doubly wound Rossmann structures.
Assignment of Function Through Structure
One of the reasons for determining structures is to derive more information to facilitate the assignment of function. From our analysis of proteins in CATH we suggest that structural data can help to assign function in several ways:
i. The structural data allow recognition of more distant homologues compared with sequence data; in our analysis 83% of structures with novel sequences could be assigned as homologues in this way (note that such assignment of function is again subject to the caveats imposed by 'gene recruitment' discussed above).
ii. The structural data allow detailed inspection of the functional site, to suggest if and how the function may have evolved. For example, if an enzyme has evolved to act on a different substrate, the binding site may reveal, or at least suggest, possible changes in the substrate.
iii. For the superfolds, similarity of structure does not necessarily mean similarity of function. However, the active site/binding sites are often conserved, e.g. in the TIM barrel or Rossmann fold structures the ligand always binds at the same end of the barrel or sheet.
iv. Some methods have already been developed, and will increasingly be the focus of attention over the next few years, which aim to predict function ab initio from structure. For example, enzymes can often be identified by the presence of a major cleft which also locates the active site (18). Similarly, critical surface patches which are used for molecular recognition in binding other proteins or ligands may be identified using knowledge-based approaches (19,20).
In summary, extrapolating the data from Figure 4 to a new genome, we can expect that of the 54–70% of sequences which currently have no obvious sequence matches in the PDB, we will find nearly 80–90% to be homologous to a known family using the structural data alone. For the singlet folds this will almost certainly reveal some clues to the function. For the superfolds, some folds will reveal information on the functional class (e.g. enzyme for TIM barrels) or the location of the active site, if not the specific function. Only 10–20% will be expected to be 'novel' folds. For these, the ab initio methods referred to above may provide some clues to guide experiments. Therefore it is clear that determining structures, as part of a 'structural genomics' initiative for example, will make a major contribution to interpreting genome data.
JAIL
Just Another Interface Library
Interfaces of macromolecules are a valuable basis to analyse the process of molecular recognition. JAIL classifies not only the interfaces between domain architectures but also those between protein chains and those between proteins and nucleic acids.
Figure: Gt-alpha/Gi-alpha chimera (PDB-ID 1GOT) and the interfaces of 1GOT
Interacting proteins are difficult to crystallize and are rarely present within the Protein Data Bank. Nevertheless, it is essential to analyse the interacting parts of the proteins to understand the process of protein-protein docking.
To overcome this problem we have built up the JAIL database. Since interacting domains exhibit structural features similar to those of proteins, all known interfaces between interacting domains of the SCOP database were extracted and classified in JAIL.
Only a part of all protein structures are included in SCOP. In particular, new PDB entries are not yet annotated. To overcome this problem, all interfaces between protein chains were additionally calculated and included in the database. This type of interface also comprises the interacting parts of the assumed biological units. The last important type of interface provided here is composed of the interacting parts between proteins and nucleic acids. Overall the data set consists of about 180000 interfaces.
JAIL is a convenient tool to browse through the interface library and to analyze single interfaces. However, more general questions require large-scale analysis. For this purpose a detailed form enables the compiling of comprehensive non-redundant data sets for download.
How is an interface defined?
A complete residue is part of an interface if at least one atom of the amino acid is located within a range of 4.5 Ångström of any atom of the interacting domain or chain. One part of an interface must consist of at least 5 C-alpha atoms in the case of protein chains. In the case of nucleic acids, the relevant atoms are calculated on the basis of the phosphorus atoms of the RNA/DNA backbone.
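The distance criterion above can be sketched as follows. This is a minimal illustration, not JAIL's own code: the atom layout (residue id, atom name, coordinates), the function names and the toy coordinates are all assumptions, and the cutoff is read as 4.5 Ångström.

```python
from math import dist

CUTOFF = 4.5  # Angstrom; the distance given in the interface definition
MIN_CA = 5    # minimum number of C-alpha atoms on one side (protein chains)


def interface_residues(chain_a, chain_b, cutoff=CUTOFF):
    """Residues of chain_a with at least one atom within `cutoff` of any atom of chain_b.

    Each atom is a tuple (residue_id, atom_name, (x, y, z)); this layout is an
    assumption made for the sketch.
    """
    hits = set()
    for res_id, _name, xyz in chain_a:
        if res_id in hits:
            continue  # the whole residue already counts as interface
        if any(dist(xyz, other) <= cutoff for _r, _n, other in chain_b):
            hits.add(res_id)
    return hits


def has_enough_ca(chain, residues, min_ca=MIN_CA):
    """True if the selected residues contribute at least `min_ca` C-alpha atoms."""
    n_ca = sum(1 for res_id, name, _ in chain if name == "CA" and res_id in residues)
    return n_ca >= min_ca


# Toy coordinates, invented for illustration.
chain_a = [(1, "CA", (0.0, 0.0, 0.0)), (1, "CB", (1.5, 0.0, 0.0)), (2, "CA", (10.0, 0.0, 0.0))]
chain_b = [(101, "CA", (4.0, 0.0, 0.0))]

side_a = interface_residues(chain_a, chain_b)
print(side_a, has_enough_ca(chain_a, side_a))
```

The all-against-all loop is quadratic in the number of atoms; for whole PDB entries a spatial grid or k-d tree would be the usual replacement, but the criterion itself stays the same.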
What are biounits?
The primary coordinate file deposited in the PDB generally contains one asymmetric unit. The asymmetric unit is the smallest portion of a crystal structure to which crystallographic symmetry can be applied to generate one unit cell. The biological molecule (biounit) is believed to be the functional unit of the protein. Frequently those units can be assumed or calculated when additional information is available. The biological units of many proteins are deposited in a separate section of the PDB database and can be used for interface calculations.
In which way are redundant interfaces excluded in the download section?
Redundancy is excluded in two different ways: by structure and by sequence. The sequential clustering is based on the CD-HIT program. The structural clustering is defined by the protein families and superfamilies of the SCOP classification. The database classifies proteins by domain architecture.
Which settings in the download section are best for my own research?
The selection of the datasets depends on the type of interactions (protein-protein or protein-nucleic acid) and the level of diversity that is desired. A maximal sequence identity of 50% results in a higher diversity than a setting of 95%. The default settings include interfaces of domain-domain interactions as well as interfaces between interacting chains. All interfaces of chains that were already covered by the SCOP domain interfaces are excluded by default. This procedure results in a high number of interfaces that are still diverse enough for statistical analysis.
What is meant by 'show conservation' in Jmol?
The conservation of protein sequences is defined by the mutation rates at each amino acid position. For JAIL this information was retrieved from ConSurf. ConSurf is a derived database merging structural and sequence information.
Database scheme
Search for:
PDB-Id, e.g. 1aay
SCOP-Id, e.g. d1az0a_
EC-Number, e.g. 311
Accession number, e.g. P03697
Protein name, e.g. capsid protein
Search in the following interface types:
Domain/Domain (SCOP)
Chain/Chain
Protein/Nucleic
Biounit/Biounit
None
A full-text keyword search and a SCOP text keyword search are also offered. Results can be restricted to interfaces having a given location: intra, inter, or don't care.
MMDB
Experimentally resolved structures of proteins, RNA and DNA, derived from the Protein Data Bank (PDB), with value-added features such as explicit chemical graphs, computationally identified 3D domains (compact substructures) that are used to identify similar 3D structures, as well as links to literature, similar sequences, information about chemicals bound to the structures, and more. These connections make it possible, for example, to find 3D structures for homologs of a protein sequence of interest, then interactively view the sequence-structure relationships, active sites, bound chemicals, journal articles and more.
Three-dimensional structures are now known within many protein families, and it is quite likely, in searching a sequence database, that one will encounter a homolog with known structure. The goal of Entrez's 3D-structure database is to make this information, and the functional annotation it can provide, easily accessible to molecular biologists. To this end Entrez's search engine provides three powerful features: (i) Sequence and structure neighbors; one may select all sequences similar to one of interest, for example, and link to any known 3D structures. (ii) Links between databases; one may search by term matching in MEDLINE, for example, and link to 3D structures reported in these articles. (iii) Sequence and structure visualization; identifying a homolog with known structure, one may view molecular-graphic and alignment displays to infer approximate 3D structure. In this article we focus on two features of Entrez's Molecular Modeling Database (MMDB) not described previously: links from individual biopolymer chains within 3D structures to a systematic taxonomy of organisms represented in molecular databases, and links from individual chains (and compact 3D domains within them) to structure neighbors, that is, other chains (and 3D domains) with similar 3D structure. MMDB may be accessed at http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Structure
SUPERFAMILY is a database of structural and functional annotation for all proteins and genomes.[1][2][3][4][5] The SUPERFAMILY annotation is based on a collection of hidden Markov models, which represent structural protein domains at the SCOP superfamily level.[6] A superfamily groups together domains which have an evolutionary relationship. The annotation is produced by scanning protein sequences from completely sequenced genomes against the hidden Markov models.
For each protein you can:
Submit sequences for SCOP classification
View domain organisation, sequence alignments and protein sequence details
For each genome you can:
Examine superfamily assignments, phylogenetic trees, domain organisation lists and networks
Check for over- and under-represented superfamilies within a genome
For each superfamily you can:
Inspect SCOP classification, functional annotation, Gene Ontology annotation, InterPro abstract and genome assignments
Explore taxonomic distribution of a superfamily across the tree of life
All annotation, models and the database dump are freely available for download to everyone.
Purpose
SUPERFAMILY classifies amino acid sequences into known structural domains, especially into SCOP superfamilies. The superfamilies are groups of proteins which have structural evidence to support a common evolutionary ancestor but may not have detectable sequence homology.
Major Features
Sequence search: Submit your protein or DNA sequence for SCOP superfamily and family level classification.
Keyword search: Search for superfamily, family or species names, plus sequence, SCOP, PDB or hidden Markov model IDs.
Domain assignments: Domain assignments, alignments and architectures for completely sequenced eukaryotic and prokaryotic organisms, plus sequence collections.
Comparative genomics tools: Browse unusual (over- and under-represented) superfamilies and families, adjacent domain pair lists and graphs, unique domain pairs, domain combinations, domain architecture co-occurrence networks and domain distribution across taxonomic kingdoms for each organism.
Genome statistics: For each genome: number of sequences, number of sequences with assignment, percentage of sequences with assignment, percentage total sequence coverage, number of domains assigned, number of superfamilies assigned, number of families assigned, average superfamily size, percentage produced by duplication, average sequence length, average length matched, number of domain pairs and number of unique domain architectures.
Gene Ontology: Domain-centric Gene Ontology (GO), automatically annotated, by Hai Fang.
Phenotype Ontology: Domain-centric phenotype/anatomy ontology including Disease Ontology, Human Phenotype, Mouse Phenotype, Worm Phenotype, Yeast Phenotype, Fly Phenotype, Fly Anatomy, Zebrafish Anatomy, Xenopus Anatomy, Arabidopsis Plant.
Superfamily annotation: InterPro abstracts for 1052 superfamilies and Gene Ontology (GO) annotation for 763 superfamilies.
Functional annotation: Functional annotation of SCOP 1.73 superfamilies, by Christine Vogel.
Phylogenetic trees: Trees are generated using heuristic parsimony methods and are based on protein domain architecture data for all genomes in SUPERFAMILY. Genome combinations or specific clades can be displayed as individual trees.
Similar domain architectures: Find the 10 domain architectures which are most similar to a domain architecture of interest.
Hidden Markov models: Produce SCOP domain assignments for your sequences using the SUPERFAMILY models. HMM visualisation by Martin Madera, e.g. model 0045110.
Profile comparison: Find remote domain matches when the HMM search fails to find a significant match. Profile comparison (PRC) for aligning and scoring two profile hidden Markov models, by Martin Madera.
Web services: Distributed Annotation Server and linking to SUPERFAMILY.
Downloads: Sequences, assignments, models, MySQL database and scripts, updated weekly.
The SUPERFAMILY database provides protein domain assignments, at the SCOP superfamily level, for the predicted protein sequences in over 400 completed genomes. A superfamily groups together domains of different families which have a common evolutionary ancestor based on structural, functional and sequence data. SUPERFAMILY domain assignments are generated using an expert curated set of profile hidden Markov models. All models and structural assignments are available for browsing and download from http://supfam.org. The web interface includes services such as domain architectures and alignment details for all protein assignments, searchable domain combinations, domain occurrence network visualization, detection of over- or under-represented superfamilies for a given genome by comparison with other genomes, assignment of manually submitted sequences and keyword searches. In this update we describe the SUPERFAMILY database and outline two major developments: (i) incorporation of family level assignments and (ii) a superfamily-level functional annotation. The SUPERFAMILY database can be used for general protein evolution and superfamily-specific studies, genomic annotation, and structural genomics target suggestion and assessment.
A vertical line at the end of the protein is not an actual intron, but a mark to show that intron mapping was performed. If that is the only line, there are no introns annotated. If there is no line at all, there is no data available in Ensembl for that particular sequence.
You can switch off intron display on your SMART preferences page.
Alternative splicing information
Since SMART now incorporates Ensembl genomes, the 'Additional information' page shows a list of alternative splices of the gene encoding the analyzed protein (if there are any). It is possible to either display SMART protein annotation for any of the alternative splices, or get a graphical multiple sequence alignment of all of them.
Orthology information
There are 2 separate sets of orthologs for each Ensembl protein: 1:1 reciprocal best matches in other genomes, and orthologous groups with reciprocal best hits from all genomes analyzed (i.e. each of these proteins has exactly one ortholog in all 6 genomes). This data is displayed on the 'Additional information' page.
Graphical multiple sequence alignments
Orthologous groups and different alternative splices can be displayed as graphical multiple sequence alignments. Proteins are aligned using ClustalW. Domains, intrinsic features and introns are mapped onto the alignment, with their positions adjusted according to gaps (black boxes).
Changes from version 3.4 to 3.5
Features of all Ensembl genomes are stored in SMART
You can use standard Ensembl protein identifiers (for example ENSP00000264122) and sequences in all queries.
Batch access
Quickly display results for hundreds of sequences. Currently limited to 500 sequences per access, 2000 per day.
Smarter handling of identical sequences
If there are multiple IDs associated with a particular sequence, an extra table will be displayed showing all of them.
Changes from version 3.3 to 3.4
Search structure based profiles using RPS-Blast
Clicking on the 'search schnipsel and structures' checkbox will now also initiate a search of profiles based on SCOP domain families using RPS-Blast. These profiles were kindly provided by Steffen Schmidt (see Schmidt et al., J Chem Inf Comput Sci 2002 (42) 405-7).
Improved Architecture analysis
In addition to standard Domain selection querying, it is now possible to do queries based on GO (Gene Ontology) terms associated with domains. In the first step you get a list of domains matching the GO terms entered. After selecting the domains of interest from the list, proteins containing those domains are displayed. Use the 'Taxonomic selection' box to limit the results based on taxonomic ranges.
Pfam domains are stored in the database
The SMART database now contains precomputed results for all Pfam domains. To use Pfam domains in the architecture queries, prepend the domain name with 'Pfam:' (for example: TyrKc AND Pfam:Fz AND TRANS).
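How an AND-only architecture query of this kind might be matched against a protein's annotated features is sketched below. This is not SMART's implementation: the function names are hypothetical, and treating repeated names as required copy counts with case-insensitive matching is an assumption modelled on the behaviour the changelog describes for selective SMART.

```python
from collections import Counter


def parse_query(query):
    """Split an AND-only query into a case-insensitive multiset of required names."""
    terms = [term.strip().lower() for term in query.split(" AND ")]
    return Counter(term for term in terms if term)


def matches(query, features):
    """True if the protein's features cover every required name, copy for copy."""
    required = parse_query(query)
    present = Counter(feature.lower() for feature in features)
    return all(present[name] >= n for name, n in required.items())


# Hypothetical feature lists for two proteins.
protein_1 = ["SIGNAL", "Pfam:Fz", "TRANS", "TyrKc"]
protein_2 = ["SH3", "SH3", "TyrKc"]

print(matches("TyrKc AND Pfam:Fz AND TRANS", protein_1))
print(matches("TyrKc AND Pfam:Fz AND TRANS", protein_2))
print(matches("Sh3 AND sH3", protein_2))
```

A `Counter` is used instead of a plain set so that a query naming the same domain several times only matches proteins carrying at least that many copies.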
Try our SMART Toolbar for the Mozilla web browser.
Changes from version 3.2 to 3.3
Fantastic new protein picture generator
Proteins are now displayed as dynamically generated PNG images. This means that you can download the entire protein representation as a single image. You have our permission to do this and use these diagrams in any way that you like; do acknowledge us, though. All domain bubbles have been script-generated using The Gimp and its Perl-Fu extension. The script is GPL'd and is available for download.
Fancy colour alignments
All alignments we generate are coloured in using Leo Goodstadt's excellent CHROMA program. CHROMA is available for download, and it's great. You may experience some problems if you're using a klunky browser. This will be fixed when you change your browser.
Excellent transmembrane domains
These are now calculated using the fine TMHMM2 program, kindly provided by Anders Krogh and co-workers.
Selective SMART
We now store intrinsic features, such as transmembrane domains and signal peptides, in our database. This means that these can be queried for (e.g. SIGNAL AND TyrKc). The feature names are: signal peptide, SIGNAL; transmembrane domain, TRANS; coiled-coil, COIL.
Technical changes
SMART code has been modified to run under the Apache mod_perl module. SMART is now using Apache::DBI for persistent database connections. The database engine has been updated to PostgreSQL 7.1. These changes resulted in significant speed improvements.
Bug fixes, as usual
The show_many_proteins script now uses the POST method, so there is no longer a limit on the number of proteins you can display. Fixed a display problem with proteins having thousands of different representations (if you try to display those, make sure you have a good browser (like Mozilla or Opera :-) and a bunch of RAM).
Changes from version 3.1 to 3.2
Start page
There is an indicator of the current database status in the header now. If the database is down or is being updated, you'll know immediately.
Literature
Changed literature identifiers to PMID. New secondary literature generator that parses all neighbouring papers, not just the first 100. Numerous small bug fixes and improvements.
Changes from version 3.0 to 3.1
Startup page
The start page now includes selective SMART and allows searching for keywords in the annotation of domains.
Schnipsel Blast
The results of a schnipsel Blast search are included in the bubble diagram. Additionally, PDB is searched.
Taxonomic breakdown
When selecting multiple proteins, e.g. via selective SMART or the annotation pages, an overview of the taxonomy of all species ('tax break') is offered.
Links
Links don't contain version numbers. This allows stable links from external sources.
Selective SMART
Allows searching for multiple copies of domains and is case insensitive. You can now search for e.g. Sh3 AND sH3 AND sh3.
Annotation
You can align your query sequence to the SMART alignment using hmmalign.
Update of underlying database
SMART now uses PostgreSQL 6.5.2.
Changes from version 2.0 to 3.0
Digest output
SMART now only produces a single diagram representing a best interpretation of all the annotation that has been performed. A comprehensive summary of the results is also provided in table format.
Selective SMART
Selective SMART allows you to look for proteins with combinations of specific domains in different species or taxonomic ranges.
Alert SMART
The SMART database gets updated about once a week. If you are interested in specific domains, or combinations of them, in specific taxonomic ranges, you can use the SMART alerting service. This provides the identities of newly-deposited proteins that match your query.
Domain queries
You can ask for proteins having the same domain order/composition as your query protein.
SMARTed Genomes
You can get the result of SMARTing the genomes of Caenorhabditis elegans and Saccharomyces cerevisiae via the annotation pages.
Faster PFAM searches
The PFAM searches now run on a PVM cluster.
Changes from version 1.03 to 2.0
Domain coverage
The original set of signalling domains has now been extended to include extracellular domains.
Default search method is now HMMer
The previously-used underlying methodology of SMART (i.e. SWise) tended to over-extend gaps, leading to problems in defining domain borders. Although a SWise search of the SMART database remains possible (using the 'Wise searching SMART database' option available from the Home Page), the default searching method is now HMMer2. This now allows improved statistical estimates (Expectation- or E-values) of the significance of a domain hit.
Complete rewrite of the annotation pages
As with the previous version (1.03), annotation pages provide information concerning domain functions and include hyperlinks to PubMed. However, we now offer automatically-derived data showing the taxonomic range and the predicted cellular localisation of proteins containing the domain in question.
Literature database
In addition to the manually derived literature sources given in version 1.03, we now provide automatically-derived secondary literature from the annotation pages. These are extracted from PubMed and are cross-linked to additional abstracts.
The CATH domain structure database: new protocols and classification levels give a more comprehensive resource for exploring evolution
Lesley H. Greene, Tony E. Lewis, Sarah Addou, Alison Cuff, Tim Dallman, Mark Dibley, Oliver Redfern, Frances Pearl, Rekha Nambudiry, Adam Reid, Ian Sillitoe, Corin Yeats, Janet M. Thornton and Christine A. Orengo
WHAT'S NEW
We report the latest release (version 3.0) of the CATH protein domain database (http://www.cathdb.info). There has been a 20% increase in the number of structural domains classified in CATH, up to 86 151 domains. Release 3.0 comprises 1110 fold groups and 2147 homologous superfamilies. To cope with the increases in diverse structural homologues being determined by the structural genomics initiatives, more sensitive methods have been developed for identifying boundaries in multi-domain proteins and for recognising homologues. The CATH classification update is now being driven by an integrated pipeline that links these automated procedures with validation steps, which have been made easier by the provision of information-rich web pages summarising comparison scores and relevant links to external sites for each domain being classified. An analysis of the population of domains in the CATH hierarchy and several domain characteristics are presented for version 3.0. We also report an update of the CATH Dictionary of homologous structures (CATH-DHS), which now contains multiple structural alignments, consensus information and functional annotations for 1459 well populated superfamilies in CATH. CATH is directly linked to the Gene3D database, which is a projection of CATH structural data onto ∼2 million sequences in completed genomes and UniProt.
GENERAL INTRODUCTION
The number of new structures being deposited in the Protein Data Bank (PDB) continues to grow at a considerable rate. In addition, structures being targeted by worldwide structural genomics initiatives are more likely to be novel or only very remotely related to domains previously classified. Only 2% of structures currently solved by conventional crystallography or NMR are likely to adopt novel folds (see Figures 1 and 2). A higher proportion of new folds is expected among structural genomics structures. Although the influx of more diverse structures and subsequent analysis will inform our understanding of how domains evolve, it has resulted in increasing lags between the numbers of structures being deposited and classified. In response to this situation we have significantly improved our automated and manual protocols for domain boundary assignment and homologue recognition.
Figure 1. Annual decrease in the percentage of new structures classified in CATH which are observed to possess a novel fold. The raw data for years 1972–2005 was fit to a single exponential equation by nonlinear regression using Sigma Plot (SPSS Version
Figure 2. Annual proportion of protein structures deposited in the PDB which are classified in CATH, rejected or pending classification. The colour scheme reflects different categories of PDB chains. Black: not accepted by the CATH criteria. Red: unprocessed chains
Significant changes have been implemented in the CATH classification protocol to achieve a more highly automated system. A seamless flow of structures between the constituent programs has been achieved by building a pipeline which integrates web services for each major comparison stage in the classification (see Figure 3). Secondly, completely automatic decisions are now being made for new protein chains with close relatives already assigned in the CATH database. There are two situations that preclude the CATH update process from being fully automated. We rely on expert manual curation for particularly challenging protein domain boundary assignments (DomChop stage) and also for classifications of remote folds and homologues (HomCheck stage). These two manual stages will remain an integral part of the system (Figure 3).
Figure 3. Flow diagram of the CATH classification pipeline. This schematic illustrates the processes involved in classifying newly determined structures in CATH. The CATH update protocol workflow, from new chain to assigned domain, is split into two main processes
In this paper we report our ongoing development of the automated proceduresThese critical new features should better enable CATH to keep pace with the PDB (3)
and facilitate its development Key statistics on the domain structure populationsand characteristics are also presented
A REVISED CATH CLASSIFICATION HIERARCHY: CATHSOLID
In order to provide information on the sequence diversity between superfamily members, we have introduced additional levels into the CATH hierarchy. The CATH hierarchical classification scheme now consists of nine levels. Class is derived from secondary structure content and Architecture describes the gross orientation of secondary structures, independent of connectivity. The Topology level clusters structures into fold groups according to their topological connections and numbers of secondary structures. The Homologous superfamilies cluster proteins with highly similar structures, sequences and/or functions (4,5). The new extension of the CATH classification system now includes five 'SOLID' sequence levels. The S, O, L and I levels further divide domains within the H-level using multi-linkage clustering based on similarities in sequence identity (35, 60, 95 and 100%) (see Table 1). The D-level acts as a counter within the I-level and is appended to the classification hierarchy to ensure that every domain in CATH has a unique CATHsolid identification code (see Table 1). Specific details on the nature of the SOLID-levels can be found in the 'General Information' section of the CATH website, http://www.cathdb.info. CATH only includes experimentally determined protein structures with a 4 Å resolution or better, 40 residues in length or longer and having 70% or more side chains resolved.
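The nine-level code lends itself to simple programmatic handling. The following is a minimal Python sketch, assuming that a CATHsolid code is written as nine dot-separated integers in the order C.A.T.H.S.O.L.I.D. The function names and the example codes are hypothetical and are not taken from the CATH software.

```python
# Level names follow the text: C, A, T, H plus the five SOLID sequence levels.
LEVELS = ("C", "A", "T", "H", "S", "O", "L", "I", "D")


def parse_cathsolid(identifier):
    """Split a dot-separated code into a dict of its nine named levels."""
    parts = identifier.strip().split(".")
    if len(parts) != len(LEVELS):
        raise ValueError(f"expected {len(LEVELS)} levels, got {len(parts)}")
    return {level: int(value) for level, value in zip(LEVELS, parts)}


def same_cluster(code_a, code_b, level):
    """True if two domains agree on every level down to and including `level`."""
    depth = LEVELS.index(level) + 1
    a, b = parse_cathsolid(code_a), parse_cathsolid(code_b)
    return all(a[name] == b[name] for name in LEVELS[:depth])


# Hypothetical codes for two domains in the same superfamily.
first = "3.40.50.200.1.2.3.4.5"
second = "3.40.50.200.1.9.9.9.9"
print(same_cluster(first, second, "H"))  # same homologous superfamily
print(same_cluster(first, second, "S"))  # same S-level (35% identity) cluster
print(same_cluster(first, second, "O"))  # differ at the O-level (60% identity)
```

Under this reading, two domains that agree down to the S-level fall in the same 35% sequence identity cluster, and the D-level counter only distinguishes domains that are otherwise identical at the I-level.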
Table 1. CATH version 3.0 statistics
DOMAIN BOUNDARY ASSIGNMENTS
We have further improved our automated domain boundary prediction method, CATHEDRAL (4). This is used to search a newly determined multi-domain structure against a library of representative structures from different fold groups in the CATH database to recognise constituent domains. CATHEDRAL performs an initial rapid secondary structure comparison between structures, using graph theory to identify putative fold matches, which are then more carefully aligned using a slower, more accurate dynamic programming method. A new scoring scheme has been implemented which combines information on the class of the domains being compared, their sizes, similarity in the structural environments and the number of equivalent residues. A support vector machine is used to combine the different scores and select the best fold match for each putative domain in a new multi-domain protein.
Benchmarking against a set of 964 'difficult' multi-domain chains, whose 1593 constituent domains were remotely related to folds in CATH (<35% sequence identity) and originated from 245 distinct fold groups and 462 superfamilies, showed that 90% of domains within these chains could be assigned to the correct fold group, and for 78% of them the domain boundaries were within ±15 residues of boundaries assigned by careful manual validation. Larger variations in domain boundaries are often due to the fact that in many families significant structural variation can occur during evolution, so that distant relatives vary considerably in size. If no close relative has been classified in CATH, it is likely that the only CATHEDRAL match will be to a relative with significant structural embellishments, thus making it harder to determine the correct boundaries.
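The ±15 residue tolerance used in the benchmark can be illustrated with a short check. This is only a sketch of one plausible reading: the text does not say how boundaries were paired or scored, so the pairing by sorted position, the function name and the example residue numbers are all assumptions.

```python
def boundaries_within_tolerance(predicted, reference, tolerance=15):
    """Compare two lists of domain boundary positions (residue numbers).

    Returns True only when both assignments contain the same number of
    boundaries and each predicted boundary lies within `tolerance`
    residues of the corresponding reference boundary.
    """
    if len(predicted) != len(reference):
        return False
    pairs = zip(sorted(predicted), sorted(reference))
    return all(abs(p - r) <= tolerance for p, r in pairs)


# Hypothetical three-domain chain with two manually validated boundaries.
manual = [112, 245]
print(boundaries_within_tolerance([120, 238], manual))  # True: offsets 8 and 7
print(boundaries_within_tolerance([140, 238], manual))  # False: offset 28
```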
Since domain boundary assignment of remote homologues is one of the most time-consuming stages in the classification, we combine multiple sources of information for each new structure on a web page to guide manual curation. Pages display scores from a range of algorithms, which include structure-based methods (CATHEDRAL (4), SSAP (6,7), DETECTIVE (8), PUU (9), DOMAK (10)), sequence-based methods such as hidden Markov models (HMMs) (11), and relevant literature. These pages are now viewable for information on putative boundaries for new multi-domain structures currently being classified in CATH (e.g. http://www.cathdb.info/cgi-bin/cath/Chain.pl?chain_id=2g3aA).
For protein chains which are closely related to chains that are already chopped in CATH, an automated protocol has been developed (ChopClose). ChopClose identifies any previously chopped chains that have sufficiently high sequence identity and overlap with the query chain. Using SSAP (6), the query is aligned against each of these chains in turn, and in each case the domain boundaries are inherited across the alignment. The process of inheriting the boundaries often requires some adjustments to be made to account for insertions, deletions or unresolved residues. If the inheritance from one of the chains meets various criteria (SSAP score ≥80, sequence identity >80%, RMSD ≤6.0 Å, longest end extension ≤10 residues, etc.), then the resulting boundaries are used to chop the chain automatically (AutoChop). For cases where ChopClose's best result does not meet all the criteria for automatic chopping, it is provided as support information for a manual domain boundary assignment.
Refer to Figure 3 for the location of AutoChop/ChopClose within the CATH update protocol.
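The AutoChop decision can be sketched as a threshold test. The example below uses only the four criteria named in the text (the text ends the list with "etc.", so the real protocol applies further criteria that are not modelled here). The class name, field names and the choice to rank candidates by SSAP score are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class InheritedChopping:
    """Result of inheriting boundaries from one previously chopped chain."""
    ssap_score: float
    sequence_identity: float    # percent
    rmsd: float                 # in Å
    longest_end_extension: int  # residues


def passes_autochop(c):
    """Apply the four thresholds quoted in the text."""
    return (c.ssap_score >= 80
            and c.sequence_identity > 80
            and c.rmsd <= 6.0
            and c.longest_end_extension <= 10)


def choose_action(candidates):
    """Return ('AutoChop', best) if any candidate passes, else ('manual', best).

    Ranking candidates by SSAP score is an assumption, not the CATH rule.
    """
    if not candidates:
        return ("manual", None)
    passing = [c for c in candidates if passes_autochop(c)]
    best = max(passing or candidates, key=lambda c: c.ssap_score)
    return ("AutoChop" if passing else "manual", best)


# Hypothetical candidates for one query chain.
close = InheritedChopping(92.5, 96.0, 1.2, 3)
distant = InheritedChopping(81.0, 62.0, 3.5, 14)
print(choose_action([distant, close])[0])  # AutoChop
print(choose_action([distant])[0])         # manual
```

When no candidate passes, the best result is still returned so that it can be shown to the curator, mirroring the "support information" role described above.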
NEW HOMOLOGUE RECOGNITION METHODS
We have assessed a number of HMM-based protocols for improving homologue recognition. A new protocol (Samosa), exploiting models built using multiple structure alignments to improve accuracy, gives some improvements in sensitivity (4–5%). However, a protocol exploiting an 8-fold expanded HMM library based on sequence relatives of structural domains gives an increase of nearly 10% in sensitivity (12). In addition, HMM–HMM-based approaches have been implemented using the PRC protocol of Madera and co-workers (http://supfam.org/PRC). These allow recognition of extremely remote homologues, some of which are not easily detected by the structure comparison methods (discussed further below). HMM-based database scans developed for the CATH classification protocol are collectively referred to as HMMscan below.
For some very remotely related homologues, confidence in an assignment can be improved by combining information from multiple prediction methods. We have investigated the benefits of using machine learning methods to do this automatically. A neural network was trained using a dataset of 14 000 diverse homologues (<35% sequence identity) and 14 000 non-homologous pairs, with data from different homologue comparison methods including structure comparison (CATHEDRAL, SSAP), sequence comparison (HMM–HMM) and information on functional similarity. The latter was obtained by comparing EC classification codes between close relatives of the distant homologues and using a semantic similarity scoring scheme for comparing GO terms based on a method developed by Lord et al. (13). On a separate validation set of 14 000 homologous pairs and 14 000 non-homologous pairs, 97% of the homologues can be recognised at an error rate of <4%.
NEW UPDATE PROTOCOL
Automatic methods
Previously, CATH data was generated using a group of independent
programs and flat files. Over the past two years we have developed an
update protocol for CATH that is driven by a suite of programs with a
central library and a PostgreSQL database system. A classification
pipeline has been established which links, in a completely automated
fashion, the different programs that analyse the sequences and structures
of both protein chains and domains. The CATH update protocol can
essentially be divided into two parts: domain boundary assignment and
domain homology classification (see Figure 3). The aim of the protocol is
to minimise manual assignment and provide as much support as
possible when manual validation is necessary.
Processing in both parts of the classification protocol is similar, requiring related meta-data and the triggering of the same automated algorithms. Methods include pairwise sequence similarity comparisons and scans by other homologue detection or fold recognition algorithms, such as HMMscan and CATHEDRAL, that provide data for either manual or automated assignment. Many of the automated steps in the protocol have been established as web services, and the pipeline integrates the automated steps with 'holding stages' in which domains are held prior to processing and await the completion of manual validation of predictions (see below).
Web pages to support manual validation
For each manual stage in the classification (domain boundary assignment, DomChop, and homologue recognition, HomCheck) we have developed a suite of web pages bringing together all available meta-data from prediction algorithms (e.g. DBS, CATHEDRAL and HMMscan for DomChop; CATHEDRAL and HMMscan for HomCheck), information from the literature and information from other family classifications with relevant data (e.g. Pfam). For each protein or domain shown on the pages, information on the statistical significance of matches is presented. The web pages will shortly be made viewable and will provide interim data on protein chains and domains not fully classified in CATH for biologists interested in any entries pending classification.
OVERVIEW OF THE CURRENT RELEASE (VERSION 3.0)
Assigning domain boundaries and relationships between protein structures is computationally challenging. Since the last CATH release (version 2.6), the number of domains in the CATH database has increased by 20% in version 3.0 and now totals 86 151. This is a more than 10-fold increase in the number of domains classified in CATH since its creation. Improvements in automation, and also in the web-based resources used to aid manual validation, have allowed us to increase the proportion of hard-to-classify structures processed in CATH, and this is reflected in a significant increase in the proportion of new folds in the database, now more than 1000. The detailed breakdown of numbers of domains in the nine CATH levels is given in Table 1. We conducted an analysis of the domains in version 3.0 and have derived statistics for several fundamental features.
Percentage of new topologies
An analysis of the percentage of new folds arising from the early 1970s to the present is shown in Figure 1. The number of new folds has been decreasing over time with respect to the number of new structures being deposited, and it can be seen that currently approximately 2% of new structures classified in CATH are observed to be novel folds. For comparison, the number of domain structures solved over time is also graphically represented in Figure 1.
Number of domains within a protein chain
Integral to the construction of the CATH database is designating domain boundaries. We conducted an analysis of the number of chains versus number of domains in a chain. It is interesting to note that 64% of all protein structures currently solved and classified in CATH are single domain chains (data not shown). The next most prevalent are two domain chains (27%), and following this we find that the number of chains containing three or more domains rapidly decreases. The average size of the single domain chains is 159 residues.
The CATH Dictionary of Homologous Superfamilies (CATH-DHS)
The CATH-DHS has also been recently updated. Data on structural similarity and superfamily variability is presented as a significant update to the Dictionary of Homologous Superfamilies (DHS) web-resource (14). The DHS also provides functional annotations of domains within each H-level (superfamily) in CATH v2.5.1.
For each superfamily, pair-wise structural similarity scores between relatives, measured by SSAP, are presented. The DHS now contains 3307 multiple structural alignments for 1459 superfamilies. For each superfamily, multiple alignments are generated for all the relatives and also for subgroups of structurally similar relatives and sequence similar relatives. Alignments are performed using the residue-based CORA algorithm (15) and presented both as CORAPLOTS (14) and in the form of a 2DSEC diagram (16), alongside co-ordinate data of the superposed structures in PDB format. Sequence representations of the alignments are available to download in FASTA format. In the CORAPLOT images of the multiple alignment, residues in each domain are coloured according to ligand binding and residue type. EquivSEC plots are also shown that describe the variability in orientation and packing between equivalent secondary structures (16).
To identify sequence relatives for CATH superfamilies, sequences from UniProt (17) were scanned against HMMs of all CATH domains (12). Homologous sequences were identified as those hits with an E-value <0.01 and a 60% residue overlap with the CATH domain. This protocol recognised over one million domain sequences in UniProt which could be integrated in the CATH-DHS. The harvested sequences in each superfamily were compared against other relatives by BLAST (18) to determine the pair-wise sequence identity and then clustered at appropriate levels of sequence identity (35 and 95%) using multi-linkage clustering. Information and links to other functional databases, ENZYME (19), GO (Gene Ontology Consortium, 2000), KEGG (20), COG (21) and SWISS-PROT (22), are also included by BLASTing the sequences from each superfamily against sequences provided by these resources. Only hits with 95% sequence identity and an 80% residue overlap were used to annotate sequences.
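The two filters described above, one for harvesting homologous sequences and a stricter one for transferring annotations, can be sketched as follows. This is an illustrative sketch only: the class and function names are hypothetical, overlap is expressed as a fraction of the CATH domain covered, and treating the overlap and identity cut-offs as "at least" is an assumption, since the text gives the values without stating the comparison.

```python
from dataclasses import dataclass


@dataclass
class HmmHit:
    """A UniProt sequence matched by a CATH domain HMM (hypothetical record)."""
    e_value: float
    overlap: float  # fraction of the CATH domain covered, 0.0 to 1.0


@dataclass
class BlastHit:
    """A BLAST match against a functional database (hypothetical record)."""
    identity: float  # percent sequence identity
    overlap: float   # fraction of residues overlapping, 0.0 to 1.0


def is_homologue(hit, max_e_value=0.01, min_overlap=0.60):
    """Harvesting filter: significant E-value and sufficient domain coverage."""
    return hit.e_value < max_e_value and hit.overlap >= min_overlap


def can_transfer_annotation(hit, min_identity=95.0, min_overlap=0.80):
    """Annotation filter: near-identical sequences with high overlap only."""
    return hit.identity >= min_identity and hit.overlap >= min_overlap


hits = [HmmHit(1e-12, 0.92), HmmHit(1e-5, 0.40), HmmHit(0.5, 0.95)]
print([is_homologue(h) for h in hits])                 # [True, False, False]
print(can_transfer_annotation(BlastHit(97.0, 0.85)))   # True
print(can_transfer_annotation(BlastHit(97.0, 0.70)))   # False
```

Keeping the thresholds as keyword arguments makes it easy to see how the harvested set changes if a different E-value or overlap cut-off is preferred.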
Recent analysis of structural and functional divergence in highly populated CATH superfamilies (>5 structural relatives with <35% sequence identity) has been undertaken using data from the DHS. The 2DSEC algorithm was used to analyse multiple structural alignments of families and identify highly conserved structural cores and secondary structure embellishments or decorations to the common core. In some large superfamilies, extensive embellishments were observed outside the core, and although these secondary structure insertions were frequently discontinuous in the protein chain, they were often co-located in 3D space (16). In many cases, manual inspection revealed that the embellishments had aggregated to form a larger structural feature that was modifying the active site of the domain or creating new surfaces for domain or protein interactions. Data collected in the DHS clearly shows a relationship between structural divergence within a superfamily, sequence divergence of this superfamily amongst predicted domains in the genomes, and the number of distinct functional groups that can be identified for the superfamily (see Figure 4).
Figure 4. Relationship between sequence variability, structural variability and functional diversity in CATH superfamilies. Structural variation in a CATH superfamily, as measured by the number of diverse structural subgroups (SSAP score <80 between groups)
LATERAL LINKS ACROSS THE CATH HIERARCHY TO CAPTURE EVOLUTIONARY DIVERGENCE AND EXPLORE THE STRUCTURAL CONTINUUM
Our analysis of structural divergence in CATH superfamilies (16) has revealed families where significant changes in the structures had occurred. In some cases 5-fold differences in the sizes of domains were identified, and sometimes it was apparent that the 'folds' of these very diverse relatives had effectively changed. Therefore, in these superfamilies more than one fold group can be identified, effectively breaking the hierarchical nature of the CATH classification, which implies that each relative within a CATH homologous superfamily should belong to the same CAT fold group.
In addition, an 'all versus all' HMM–HMM scan between all superfamily representatives revealed several cases of extremely remote homologues which had been classified into separate superfamilies and yet match with significant E-values. Structure comparison had failed to detect the relationship between these superfamilies because the structural divergence of the relatives was so extreme, sometimes constituting a change in architecture as well as fold group. In these cases homology was only suggested by the HMM-based scans and then manually validated by considering functional information and detailed evidence from literature. In order to capture information on these distant homologies, links have been created between the superfamilies, both on our web pages and in the CATH database. The data can now be found as a link from the CATH homepage (http://www.cathdb.info).
In the near future we also plan to provide web pages presenting cases of significant structural overlaps between superfamilies or fold groups. For these cases we are not currently able to find any additional evidence to support a distant evolutionary relationship, and these examples highlight the recurrence of large structural motifs between some folds and the existence of a structural continuum in some regions of fold space.
FEATURES
The CATH database can be accessed at http://www.cathdb.info. The web interface may be browsed or, alternatively, searched with PDB codes or CATH domain identifiers. There is also a facility for keyword searches. With the version 3.0 release we now make the raw and processed data files available (for example CATH domain PDB files, sequences and DSSP files), and they can be accessed through the CATH database main page. The Gene3D resource can be accessed through the CATH database or directly at http://www.cathdb.info/Gene3D. The DHS can be accessed through the CATH database or directly at http://www.cathdb.info/bsm/dhs.
Structural similarity is assessed using an automatic method (SSAP) (3,4), which scores 100 for identical proteins and generally returns scores above 80 for homologous proteins. More distantly related folds generally give scores above 70 (Topology or fold level), though in the absence of any sequence or functional similarity this may simply represent examples of convergent evolution, reinforcing the hypothesis that there exists a limited number of folds in nature (5,6).
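The score ranges quoted above can be turned into a rough lookup. The sketch below is a rule-of-thumb reading of the text only: the function name and category labels are hypothetical, and because the text says scores "generally" fall in these ranges, the output should be treated as a hint rather than a classification.

```python
def interpret_ssap(score):
    """Map an SSAP score (0 to 100) to the rough category described in the text."""
    if not 0 <= score <= 100:
        raise ValueError("SSAP scores are expected to lie between 0 and 100")
    if score == 100:
        return "identical"
    if score > 80:
        return "likely homologous"
    if score > 70:
        return "similar fold (possibly convergent)"
    return "no clear structural relationship"


for s in (100, 86.5, 74.0, 55.0):
    print(s, "->", interpret_ssap(s))
```

A score in the 70 to 80 band needs supporting sequence or functional evidence before homology is inferred, which is why the third label carries the convergence caveat.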
Abstract
We report the latest release (version 1.4) of the CATH protein domains database (http://www.biochem.ucl.ac.uk/bsm/cath). This is a hierarchical classification of 13 359 protein domain structures into evolutionary families and structural groupings. We currently identify 827 homologous families in which the proteins have both structural similarity and sequence and/or functional similarity. These can be further clustered into 593 fold groups and 32 distinct architectures. Using our structural classification and associated data on protein functions stored in the database (EC identifiers, SWISS-PROT keywords and information from the Enzyme database and literature), we have been able to analyse the correlation between the 3D structure and function. More than 96% of folds in the PDB are associated with a single homologous family. However, within the superfolds three or more different functions are observed. Considering enzyme functions, more than 95% of clearly homologous families exhibit either single or closely related functions, as demonstrated by the EC identifiers of their relatives. Our analysis supports the view that determining structures, for example as part of a 'structural genomics' initiative, will make a major contribution to interpreting genome data.
Introduction FOR CATH
The CATH classification of protein domain structures was established in 1993 (1) as a
hierarchical clustering of protein domain structures into evolutionary families and structural
groupings, depending on sequence and structure similarity. There are four major levels,
corresponding to protein class, architecture, topology or fold, and homologous family (Fig. 1).
Since 1995, information about these structural groups and protein families has been accessible
over the Web (http://www.biochem.ucl.ac.uk/bsm/cath), together with summary information
about each individual protein structure (PDBsum) (2).
CATH consists of both phylogenetic and phenetic descriptors for protein domain relationships.
At the lowest levels in the hierarchy, proteins are grouped into evolutionary families
(Homologous families) for having either significant sequence similarity (≥35% identity) or high
structural similarity and some sequence similarity (≥20% identity).
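The grouping rule for homologous families combines a sequence criterion with a joint structure and sequence criterion. A minimal sketch is given below. The passage itself does not define "high structural similarity", so the sketch borrows the SSAP ≥80 cut-off that the same text uses for clear homologues elsewhere; that choice, and the function name, are assumptions.

```python
def same_homologous_family(sequence_identity, ssap_score,
                           high_structural_similarity=80.0):
    """Decide whether two domains meet the homologous family criteria.

    sequence_identity is in percent; ssap_score is on the 0 to 100 scale.
    """
    # Route 1: significant sequence similarity on its own.
    if sequence_identity >= 35:
        return True
    # Route 2: high structural similarity plus some sequence similarity.
    return (ssap_score >= high_structural_similarity
            and sequence_identity >= 20)


print(same_homologous_family(42.0, 65.0))  # True: sequence alone suffices
print(same_homologous_family(24.0, 88.0))  # True: structure plus some sequence
print(same_homologous_family(24.0, 72.0))  # False: structure not similar enough
print(same_homologous_family(12.0, 91.0))  # False: sequence identity too low
```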
The CATH database of protein structures contains approximately 18000 domains
organized according to their (C)lass, (A)rchitecture, (T)opology and (H)omologous
superfamily. Relationships between evolutionarily related structures (homologues) within the
database have been used to test the sensitivity of various sequence search methods in order to
identify relatives in GenBank and other sequence databases. Subsequent application of the most
sensitive and efficient algorithms, gapped BLAST and the profile-based method Position-Specific
Iterated Basic Local Alignment Search Tool (PSI-BLAST), could be used to assign structural data to
between 22 and 36% of microbial genomes in order to improve functional annotation and
enhance understanding of biological mechanism. However, on a cautionary note, an analysis of
functional conservation within fold groups and homologous superfamilies in the CATH database
revealed that whilst function was conserved in nearly 55% of enzyme families, function had
diverged considerably in some highly populated families. In these families, functional properties
should be inherited far more cautiously and the probable effects of substitutions in key functional
residues must be carefully assessed.
Figure 1
Schematic representation of the (C)lass, (A)rchitecture and (T)opology/fold levels in the CATH
database.
Figure 2
Snapshot of a web page showing data available in the CATH dictionary of homologous superfamilies
for the subtilisin family (CATH id 3.40.50.200). Tables display the PDB codes for non-identical relatives in the family, together with EC identifier codes and information about the enzyme reactions.
The multiple structural alignment shown has been coloured according to secondary structure
assignments (red for helix, blue for strands).
The Architecture level in CATH groups proteins whose folds have similar 3D arrangements of secondary structures (e.g. barrel, sandwich or propellor), regardless of their connectivity, whilst the top level, Class, simply reflects the proportion of α-helix or β-strand secondary structures. Three major classes are recognised, mainly-α, mainly-β and α−β, since analysis revealed considerable overlap between the α+β and alternating α/β classes originally described by Levitt and Chothia (7).
Before classification, multidomain proteins are first separated into their constituent folds using a consensus method which seeks agreement between three independent algorithms (8). Whilst the protocol for updating CATH is largely automatic (9), several stages require manual validation, in particular establishing domain boundaries in proteins for which no consensus could be reached and checking the relationships of very distant homologues and proteins having borderline fold similarity.
Although there are plans to assign the more regular architectures automatically, all architecture groupings are currently assigned manually.
A homologous family Dictionary is now available within CATH which contains functional data, where available, for each protein within a homologous family. This includes EC identifiers, SWISS-PROT keywords and information from the Enzyme database or the literature (Fig. 2). Multiple structure-based alignments are also available, coloured according to secondary structure assignments or residue properties, and there are schematic plots showing domain representations annotated by protein-ligand interactions (DOMPLOTS) (A.E. Todd, C.A. Orengo and J.M. Thornton, submitted to Protein Engng). The topology of each domain is illustrated by schematic TOPS diagrams (http://www3.ebi.ac.uk/tops; 10).
Figure 3
CATH wheel plot showing the population of homologous families in different fold groups, architectures and classes. The wheel is coloured according to protein class (red, mainly-α; green, mainly-β; yellow, αβ; blue, few secondary structures). The size of the outer wheel represents the number of homologous families in CATH, whilst each band in the outer wheel corresponds to a single fold family. The size of each fold 'band' therefore reflects the number of homologous families having that fold. It can be seen that most fold families contain a single homologous family. The superfold families are shown as paler bands containing many homologous families. The inner wheel shows the population of homologous families in the different architectures.
We have also recently set up a Web Server (11) which enables the user to scan the CATH database with a newly determined protein structure and identify possible fold similarities or evolutionary relationships. There are also plans to incorporate sequence searches (using BLAST or PSI-BLAST) (12) to identify a probable fold for a new sequence.
The latest release of CATH (version 1.4, April 1998) contains 9342 protein chains from the PDB (13), which divide into 13 359 domain folds. Currently 32 different architectures are recognised. Since the last release three new architectures have been described, including the five-bladed α-β propellor. Grouping proteins on the basis of sequence, structure and functional similarity gives 827 evolutionary homologous families (H-level), whilst recognising more distant structural similarity with no accompanying sequence or function similarity gives rise to 593 different fold groups (T-level).
The population of the different levels in the CATH hierarchy is illustrated by the CATH wheel shown in Figure 3. It can be seen that several highly populated fold families, which we describe as superfolds (6) as they support a diverse range of sequences and more than three different functions, still account for nearly 30% of non-homologous structures.
Implications for Structural Genomics
As the sequence databases grow rapidly, the need to interpret these sequences and assign functions to specific genes becomes increasingly important. Many techniques exist for matching protein sequences and thereby inheriting functional information. However, for very distant homologues there is often no detectable sequence similarity, despite conservation of 3D structure and function. For these cases, evolutionary relationships, and thereby functions, can only be assigned by comparing the structures.
Therefore a number of structural genomics initiatives are being proposed (14) which aim to identify all the folds in nature, with the ultimate goal of being able to predict the function of a new protein from its known or probable structure. The important questions to ask are: how many more folds do we need to determine before we have the complete set, and how confident can we be in assigning function between proteins having similar structures?
In the current genomes, on average only 30–46% of sequences can be assigned to a structural family by recognising sequence similarity to a protein of known structure (15,16). With only ∼600 unique structures currently in the PDB, compared with ∼20 000 sequence families, it is clear that we still need to determine many more structures if we are to understand biology at the molecular level.
However, analysis of recently deposited structural data is very revealing. Figure 4a illustrates the distribution of 2159 new structural domains classified in the 10 months from June 1997 to March 1998. A large proportion of these (79%) were clearly homologous (≥30% identity) to proteins of known structure.
Of the remaining 443 structures (Fig. 4b), corresponding to 'new' sequences, we found only 8% were novel folds, the remainder resembling a previously determined structure. Many of these, 199 (45%), could be identified as clear homologues by having significant structure and sequence similarity (SSAP ≥80 and ≥20% sequence identity). A further 169 (38%) were probable homologues as, although the sequence identity was below 20%, they had functional similarity and/or gave significant scores using sequence search methods designed to detect very distant homologues (PSI-BLAST) (12). There remained a further 40 (9%) proteins which were analogous, i.e. they had the same fold as a previous entry but neither the sequence nor the function gave definite evidence of a common ancestor.
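The percentages in this breakdown can be checked against the quoted counts with a few lines of arithmetic. In the sketch below the counts are taken from the text; the number of novel folds is not stated there and is inferred as the remainder, which is an assumption, and the variable names are illustrative.

```python
new_domains = 2159     # domains classified from June 1997 to March 1998
new_sequences = 443    # those not clearly homologous to a known structure

breakdown = {
    "clear homologues": 199,
    "probable homologues": 169,
    "analogues": 40,
}
# Inferred, not quoted: whatever is left over is counted as novel folds.
breakdown["novel folds"] = new_sequences - sum(breakdown.values())


def percent(part, whole):
    """Percentage rounded to the nearest whole number."""
    return round(100 * part / whole)


for name, count in breakdown.items():
    print(f"{name}: {count} ({percent(count, new_sequences)}%)")

# Share of all new domains that were clearly homologous to known structures.
print(percent(new_domains - new_sequences, new_domains))  # 79

# Clear plus probable homologues among the 'new' sequences.
homologues = breakdown["clear homologues"] + breakdown["probable homologues"]
print(percent(homologues, new_sequences))  # 83
```

The remainder works out at 35 structures, about 8% of 443, and the clear and probable homologues together come to 83%, the figure quoted later for structures with novel sequences that could be assigned as homologues.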
At the homologous superfamily level in CATH, a more detailed analysis of enzyme functions showed that the majority of homologous enzyme families in CATH (>90%) contained proteins for which the first three EC identifiers were the same. Considering those families where homologues have significant sequence identity (≥20%) after structural alignment, 95% were found to have a single EC identifier, whilst for families where proteins have more than 30% sequence similarity we observed that 98% had a single EC code.
Although assigning function on the basis of homology is common practice, it is clear that some caution should be exercised, particularly where there is little or no sequence similarity. There are also some clear examples where homologues with significant sequence similarity perform different functions. The role of 'gene recruitment' is especially clear in the eye lens proteins, which function as enzymes in other cellular environments but which are used as structural proteins in this context (17). The extent of such 'gene recruitment' and context-sensitive function is really not known at this time. For enzymes it is clear that catalytic function can change and evolve, usually to act on a different but related substrate. Similarly, within the lipocalin family (CATH id 2.40.130.10), several proteins are found with very similar structures which bind different fatty acids in the same region at the base of the β-barrel (e.g. retinol, bilin, biotin).
Nearly half of the homologous families where two or more different EC numbers were observed belong to the superfolds. This suggests that if a new protein is assigned to a superfold family, more caution should be used when inheriting functional information, as there appears to be greater tolerance to changes in sequence, and ultimately function, for these families. However, it is interesting to note that many of these were TIM barrel or Rossmann folds. These are superfolds in which the substrate or ligand commonly binds in the same place: in the base of the β-barrel for the TIMs and at the crossover of the polypeptide chain for the doubly wound Rossmann structures.
Assignment of Function Through Structure
One of the reasons for determining structures is to derive more information to facilitate the assignment of function. From our analysis of proteins in CATH, we suggest that structural data can help to assign function in several ways:
i. The structural data allow recognition of more distant homologues compared with sequence data; in our analysis 83% of structures with novel sequences could be assigned as homologues in this way (note that such assignment of function is again subject to the caveats imposed by 'gene recruitment' discussed above).
ii. The structural data allow detailed inspection of the functional site, to suggest if and how the function may have evolved. For example, if an enzyme has evolved to act on a different substrate, the binding site may reveal, or at least suggest, possible changes in the substrate.
iii. For the superfolds, similarity of structure does not necessarily mean similarity of function. However, the active site/binding sites are often conserved, e.g. in the TIM barrel or Rossmann fold structures the ligand always binds at the same end of the barrel or sheet.
iv. Some methods have already been developed, and will increasingly be the focus of attention over the next few years, which aim to predict function ab initio from structure. For example, enzymes can often be identified by the presence of a major cleft which also locates the active site (18). Similarly, critical surface patches which are used for molecular recognition in binding other proteins or ligands may be identified using knowledge-based approaches (19,20).
In summary, extrapolating the data from Figure 4 to a new genome, we can expect that of the 54–70% of sequences which currently have no obvious sequence matches in the PDB, we will find nearly 80–90% to be homologous to a known family using the structural data alone. For the singlet folds, this will almost certainly reveal some clues to the function. For the superfolds, some folds will reveal information on the functional class (e.g. enzyme for TIM barrels) or the location of the active site, if not the specific function. Only 10–20% will be expected to be 'novel' folds. For these, the ab initio methods referred to above may provide some clues to guide experiments. Therefore, it is clear that determining structures, as part of a 'structural genomics' initiative for example, will make a major contribution to interpreting genome data.
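The percentages in this summary are fractions of a fraction, so it can help to convert them into shares of the whole genome. The sketch below does only that arithmetic; it assumes the 80–90% and 10–20% figures both refer to the unmatched 54–70% of sequences, and the function and constant names are illustrative.

```python
def genome_wide_range(unmatched, share):
    """Multiply the low ends and the high ends of two fractional ranges."""
    return (unmatched[0] * share[0], unmatched[1] * share[1])


# Ranges quoted in the summary above, written as fractions.
UNMATCHED = (0.54, 0.70)   # sequences with no obvious sequence match in the PDB
HOMOLOGOUS = (0.80, 0.90)  # of those, assignable to a known family by structure
NOVEL = (0.10, 0.20)       # of those, expected to be novel folds

for label, share in (("homologous via structure", HOMOLOGOUS), ("novel fold", NOVEL)):
    low, high = genome_wide_range(UNMATCHED, share)
    print(f"{label}: {low:.1%} to {high:.1%} of all sequences")
```

Under these assumptions, roughly two fifths to three fifths of a genome would gain a family assignment from structure alone.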
JAIL
Just Another Interface Library
Interfaces of macromolecules are a valuable basis for analysing the process of molecular recognition. JAIL classifies not only the interfaces between domain architectures but also those between protein chains and those between proteins and nucleic acids.
Figure: Gt-alpha/Gi-alpha chimera (PDB-ID 1GOT) and the interfaces of 1GOT
Interacting proteins are difficult to crystallize and are rarely present in the Protein Data Bank (PDB). Nevertheless, it is essential to analyse the interacting parts of proteins to understand the process of protein-protein docking.
To overcome this problem, we have built the JAIL database. Since interacting domains exhibit structural features similar to those of interacting proteins, all known interfaces between interacting domains of the SCOP database were extracted and classified in JAIL.
Only a part of all protein structures is included in SCOP; in particular, new PDB entries are not yet annotated. To address this, all interfaces between protein chains were additionally calculated and included in the database. This type of interface also comprises the interacting parts of the assumed biological units. The last important type of interface provided here is composed of the interacting parts between proteins and nucleic acids. Overall, the data set consists of about 180,000 interfaces.
JAIL is a convenient tool for browsing the interface library and analysing single interfaces. However, more general questions require large-scale analysis. For this purpose, a detailed form enables the compilation of comprehensive non-redundant data sets for download.
How is an interface defined?
A complete residue is part of an interface if at least one atom of the amino acid is located within 4.5 Ångström of any atom of the interacting domain or chain. In the case of protein chains, each side of an interface must consist of at least 5 C-alpha atoms. In the case of nucleic acids, the relevant atoms are calculated on the basis of the phosphorus atoms of the RNA/DNA backbone.
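The distance rule above can be sketched in a few lines. This is a minimal illustration and not JAIL's actual code: the `Atom` record, the function names, and the reading of "at least 5 C-alpha atoms" as "at least 5 residues on that side" are all assumptions made for the example.

```python
import math
from collections import namedtuple

# Hypothetical minimal atom record: chain ID, residue number, coordinates in Ångström.
Atom = namedtuple("Atom", "chain resid x y z")


def interface_residues(side_a, side_b, cutoff=4.5):
    """Residues of side_a having at least one atom within `cutoff` of any atom of side_b."""
    hits = set()
    for a in side_a:
        key = (a.chain, a.resid)
        if key in hits:
            continue  # the whole residue already counts as interface
        for b in side_b:
            if math.dist((a.x, a.y, a.z), (b.x, b.y, b.z)) <= cutoff:
                hits.add(key)
                break
    return hits


def side_is_large_enough(residues, min_residues=5):
    """Assumed size filter: one C-alpha per residue, so count residues."""
    return len(residues) >= min_residues
```

The double loop is quadratic in the number of atoms; for whole-database work a spatial grid or k-d tree would be the usual replacement.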
What are biounits?
The primary coordinate file deposited in the PDB generally contains one asymmetric unit. The asymmetric unit is the smallest portion of a crystal structure to which crystallographic symmetry can be applied to generate the unit cell. The biological molecule (biounit) is believed to be the functional unit of the protein. Frequently, these units can be assumed or calculated when additional information is available. The biological units of many proteins are deposited in a separate section of the PDB and can be used for interface calculations.
How are redundant interfaces excluded in the download section?
Redundancy is excluded in two different ways: by structure and by sequence. The sequence clustering is based on the CD-HIT program. The structural clustering is defined by the protein families and superfamilies of the SCOP classification. The database classifies proteins by domain architecture.
Which settings in the download section are best for my own research?
The selection of the data sets depends on the type of interaction (protein-protein or protein-nucleic acid) and the level of diversity that is desired. A maximum sequence identity of 50% results in higher diversity than a setting of 95%. The default settings include interfaces of domain-domain interactions as well as interfaces between interacting chains. All interfaces of chains that are already covered by the SCOP domain interfaces are excluded by default. This procedure results in a high number of interfaces that are still diverse enough for statistical analysis.
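To see why a 50% identity limit yields a more diverse (and smaller) set than 95%, here is a toy greedy clustering in the spirit of sequence-based redundancy removal. It is not CD-HIT: the `identity` function uses Python's `difflib` ratio as a crude stand-in for an alignment-based identity, and all names are illustrative.

```python
from difflib import SequenceMatcher


def identity(a, b):
    """Crude similarity in [0, 1]; a stand-in for a real alignment-based identity."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def pick_representatives(seqs, max_identity):
    """Greedy incremental clustering: longest first, keep a sequence only if it is
    below `max_identity` to every representative kept so far."""
    reps = []
    for seq in sorted(seqs, key=len, reverse=True):
        if all(identity(seq, rep) < max_identity for rep in reps):
            reps.append(seq)
    return reps


seqs = ["ACDEFGHIKL", "ACDEFGHIKV", "WWWWWWWWWW"]
print(len(pick_representatives(seqs, 0.95)), "representatives at 95%")
print(len(pick_representatives(seqs, 0.50)), "representatives at 50%")
```

The first two toy sequences differ at one position in ten, so they survive together at the 95% limit but collapse into one representative at 50%.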
What is meant by "show conservation" in Jmol?
The conservation of protein sequences is defined by the mutation rates at each amino acid position. For JAIL, this information was retrieved from ConSurf, a derived database merging structural and sequence information.
Database scheme
Search for:
PDB-Id (e.g. 1aay)
SCOP-Id (e.g. d1az0a_)
EC-Number (e.g. 3.1.1)
Accession number (e.g. P03697)
Protein name (e.g. capsid protein)
Search in the following interface types: Domain/Domain (SCOP), Chain/Chain, Protein/Nucleic, Biounit/Biounit, or none.
Full-text search and SCOP text search: by keyword.
Consider only interfaces having the following location: intra, inter, or don't care.
MMDB
MMDB contains experimentally resolved structures of proteins, RNA, and DNA, derived from the Protein Data Bank (PDB), with value-added features such as explicit chemical graphs, computationally identified 3D domains (compact substructures) that are used to identify similar 3D structures, as well as links to literature, similar sequences, information about chemicals bound to the structures, and more. These connections make it possible, for example, to find 3D structures for homologs of a protein sequence of interest, then interactively view the sequence-structure relationships, active sites, bound chemicals, journal articles, and more.
Three-dimensional structures are now known within many protein families, and it is quite likely, in searching a sequence database, that one will encounter a homolog with known structure. The goal of Entrez's 3D-structure database is to make this information, and the functional annotation it can provide, easily accessible to molecular biologists. To this end, Entrez's search engine provides three powerful features: (i) Sequence and structure neighbors: one may select all sequences similar to one of interest, for example, and link to any known 3D structures. (ii) Links between databases: one may search by term matching in MEDLINE, for example, and link to 3D structures reported in these articles. (iii) Sequence and structure visualization: identifying a homolog with known structure, one may view molecular-graphic and alignment displays to infer approximate 3D structure. In this article we focus on two features of Entrez's Molecular Modeling Database (MMDB) not described previously: links from individual biopolymer chains within 3D structures to a systematic taxonomy of organisms represented in molecular databases, and links from individual chains (and compact 3D domains within them) to structure neighbors, that is, other chains (and 3D domains) with similar 3D structure. MMDB may be accessed at http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Structure
SUPERFAMILY is a database of structural and functional annotation for all proteins and genomes.[1][2][3][4][5]
The SUPERFAMILY annotation is based on a collection of hidden Markov models, which represent structural protein domains at the SCOP superfamily level.[6] A superfamily groups together domains which have an evolutionary relationship. The annotation is produced by scanning protein sequences from completely sequenced genomes against the hidden Markov models.
For each protein you can:
Submit sequences for SCOP classification
View domain organisation, sequence alignments and protein sequence details
For each genome you can:
Examine superfamily assignments, phylogenetic trees, domain organisation lists and networks
Check for over- and under-represented superfamilies within a genome
For each superfamily you can:
Inspect SCOP classification, functional annotation, Gene Ontology annotation, InterPro abstract and genome assignments
Explore taxonomic distribution of a superfamily across the tree of life
All annotation, models and the database dump are freely available for download to everyone.
Purpose
SUPERFAMILY classifies amino acid sequences into known structural domains, especially into SCOP superfamilies. The superfamilies are groups of proteins which have structural evidence to support a common evolutionary ancestor but may not have detectable sequence homology.
Major Features
Sequence search: Submit your protein or DNA sequence for SCOP superfamily and family level classification.
Keyword search: Search for superfamily, family, or species names, plus sequence, SCOP, PDB, or hidden Markov model IDs.
Domain assignments: Domain assignments, alignments, and architectures for completely sequenced eukaryotic and prokaryotic organisms, plus sequence collections.
Comparative genomics tools: Browse unusual (over- and under-represented) superfamilies and families, adjacent domain pair lists and graphs, unique domain pairs, domain combinations, domain architecture co-occurrence networks, and domain distribution across taxonomic kingdoms for each organism.
Genome statistics: For each genome: number of sequences, number of sequences with assignment, percentage of sequences with assignment, percentage total sequence coverage, number of domains assigned, number of superfamilies assigned, number of families assigned, average superfamily size, percentage produced by duplication, average sequence length, average length matched, number of domain pairs, and number of unique domain architectures.
Gene Ontology: Domain-centric Gene Ontology (GO), automatically annotated by Hai Fang.
Phenotype Ontology: Domain-centric phenotype/anatomy ontology, including Disease Ontology, Human Phenotype, Mouse Phenotype, Worm Phenotype, Yeast Phenotype, Fly Phenotype, Fly Anatomy, Zebrafish Anatomy, Xenopus Anatomy, and Arabidopsis Plant.
Superfamily annotation: InterPro abstracts for 1052 superfamilies and Gene Ontology (GO) annotation for 763 superfamilies.
Functional annotation: Functional annotation of SCOP 1.73 superfamilies by Christine Vogel.
Phylogenetic trees: Trees are generated using heuristic parsimony methods and are based on protein domain architecture data for all genomes in SUPERFAMILY. Genome combinations or specific clades can be displayed as individual trees.
Similar domain architectures: Find the 10 domain architectures which are most similar to a domain architecture of interest.
Hidden Markov models: Produce SCOP domain assignments for your sequences using the SUPERFAMILY models. HMM visualisation by Martin Madera (e.g. model 0045110).
Profile comparison: Find remote domain matches when the HMM search fails to find a significant match. Profile comparison (PRC), for aligning and scoring two profile hidden Markov models, by Martin Madera.
Web services: Distributed Annotation Server and linking to SUPERFAMILY.
Downloads: Sequences, assignments, models, MySQL database, and scripts, updated weekly.
The SUPERFAMILY database provides protein domain assignments, at the SCOP superfamily level, for the predicted protein sequences in over 400 completed genomes. A superfamily groups together domains of different families which have a common evolutionary ancestor based on structural, functional and sequence data. SUPERFAMILY domain assignments are generated using an expert curated set of profile hidden Markov models. All models and structural assignments are available for browsing and download from http://supfam.org. The web interface includes services such as domain architectures and alignment details for all protein assignments, searchable domain combinations, domain occurrence network visualization, detection of over- or under-represented superfamilies for a given genome by comparison with other genomes, assignment of manually submitted sequences, and keyword searches. In this update we describe the SUPERFAMILY database and outline two major developments: (i) incorporation of family level assignments and (ii) a superfamily-level functional annotation. The SUPERFAMILY database can be used for general protein evolution and superfamily-specific studies, genomic annotation, and structural genomics target suggestion and assessment.
Improved architecture analysis
In addition to standard domain selection querying, it is now possible to run queries based on GO (Gene Ontology) terms associated with domains. In the first step, you get a list of domains matching the GO terms entered. After you select the domains of interest from the list, proteins containing those domains are displayed. Use the taxonomic selection box to limit the results to taxonomic ranges.
Pfam domains are stored in the database
The SMART database now contains precomputed results for all Pfam domains. To use Pfam domains in architecture queries, prepend the domain name with "Pfam:" (for example, TyrKc AND Pfam:Fz AND TRANS).
A SMART Toolbar is available for the Mozilla web browser.
Changes from version 3.2 to 3.3
Fantastic new protein picture generator
Proteins are now displayed as dynamically generated PNG images. This means that you can download the entire protein representation as a single image. You have our permission to do this and to use these diagrams in any way that you like; do acknowledge us, though. All domain bubbles have been script-generated using The Gimp and its Perl-Fu extension. The script is GPL'd and is available for download.
Fancy colour alignments
All alignments we generate are coloured in using Leo Goodstadt's excellent CHROMA program. CHROMA is available online, and it's great. You may experience some problems if you're using a klunky browser. This will be fixed when you change your browser.
Excellent transmembrane domains
These are now calculated using the fine TMHMM2 program, kindly provided by Anders Krogh and co-workers.
Selective SMART
We now store intrinsic features, such as transmembrane domains and signal peptides, in our database. This means that these can be queried for (e.g. SIGNAL AND TyrKc). The feature names are: signal peptide, SIGNAL; transmembrane domain, TRANS; coiled coil, COIL.
Technical changes
SMART code has been modified to run under the Apache mod_perl module. SMART is now using Apache::DBI for persistent database connections. The database engine has been updated to PostgreSQL 7.1. These changes resulted in significant speed improvements.
Bug fixes
As usual. The show_many_proteins script now uses the POST method, so there is no longer a limit on the number of proteins you can display. Fixed a display problem with proteins having thousands of different representations (if you try to display those, make sure you have a good browser, like Mozilla or Opera, and a bunch of RAM).
Changes from version 3.1 to 3.2
Start page
There is now an indicator of the current database status in the header. If the database is down or is being updated, you'll know immediately.
Literature
Changed literature identifiers to PMID. New secondary literature generator that parses all neighbouring papers, not just the first 100. Numerous small bug fixes and improvements.
Changes from version 3.0 to 3.1
Startup page
The start page now includes selective SMART and allows searching for keywords in the annotation of domains.
Schnipsel Blast
The results of a schnipsel Blast search are included in the bubble diagram. Additionally, the PDB is searched.
Taxonomic breakdown
When selecting multiple proteins, e.g. via selective SMART or the annotation pages, an overview of the taxonomy of all species (tax break) is offered.
Links
Links don't contain version numbers. This allows stable links from external sources.
Selective SMART
Selective SMART now allows searching for multiple copies of domains and is case insensitive. You can now search for e.g. Sh3 AND sH3 AND sh3.
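One way to picture how such AND queries behave is the toy matcher below. It is a sketch and not SMART's implementation: it assumes that repeating a domain name N times means "at least N copies", that matching ignores case, and that a protein is represented simply as a list of domain and feature names. The function name is hypothetical.

```python
import re
from collections import Counter


def matches_architecture(query, domains):
    """True if `domains` satisfies every AND term of `query`.

    Repeated terms are read as a minimum copy number; names are compared
    case-insensitively.
    """
    terms = [t.strip().lower() for t in re.split(r"\s+AND\s+", query.strip())]
    required = Counter(terms)
    present = Counter(d.lower() for d in domains)
    return all(present[name] >= count for name, count in required.items())


# A hypothetical protein with three SH3 domains and one SH2 domain.
protein = ["SH3", "SH2", "SH3", "SH3"]
print(matches_architecture("Sh3 AND sH3 AND sh3", protein))
print(matches_architecture("SIGNAL AND TyrKc", protein))
```

Counting terms with `Counter` keeps the copy-number check and the case folding in one place, which is why the query string is normalised before comparison.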
Annotation
You can align your query sequence to the SMART alignment using hmmalign.
Update of underlying database
SMART now uses PostgreSQL 6.5.2.
Changes from version 2.0 to 3.0
Digest output
SMART now only produces a single diagram representing a best interpretation of all the annotation that has been performed. A comprehensive summary of the results is also provided in table format.
Selective SMART
Selective SMART allows you to look for proteins with combinations of specific domains in different species or taxonomic ranges.
Alert SMART
The SMART database is updated about once a week. If you are interested in specific domains, or combinations of them, in specific taxonomic ranges, you can use the SMART alerting service. This provides the identities of newly deposited proteins that match your query.
Domain queries
You can ask for proteins having the same domain order or composition as your query protein.
SMARTed genomes
You can get the result of SMARTing the genomes of Caenorhabditis elegans and Saccharomyces cerevisiae via the annotation pages.
Faster Pfam searches
The Pfam searches now run on a PVM cluster.
Changes from version 1.03 to 2.0
Domain coverage
The original set of signalling domains has now been extended to include extracellular domains.
Default search method is now HMMer
The previously used underlying methodology of SMART (i.e. SWise) tended to over-extend gaps, leading to problems in defining domain borders. Although a SWise search of the SMART database remains possible (using the "Wise searching SMART database" option available from the Home Page), the default searching method is now HMMer2. This allows improved statistical estimates (Expectation- or E-values) of the significance of a domain hit.
Complete rewrite of the annotation pages
As with the previous version (1.03), annotation pages provide information concerning domain functions and include hyperlinks to PubMed. However, we now offer automatically derived data showing the taxonomic range and the predicted cellular localisation of proteins containing the domain in question.
Literature database
In addition to the manually derived literature sources given in version 1.03, we now provide automatically derived secondary literature from the annotation pages. These are extracted from PubMed and are cross-linked to additional abstracts.
The CATH domain structure database: new protocols and classification levels give a more comprehensive resource for exploring evolution
Lesley H. Greene, Tony E. Lewis, Sarah Addou, Alison Cuff, Tim Dallman, Mark Dibley, Oliver Redfern, Frances Pearl, Rekha Nambudiry, Adam Reid, Ian Sillitoe, Corin Yeats, Janet M. Thornton and Christine A. Orengo
WHAT'S NEW
We report the latest release (version 3.0) of the CATH protein domain database (http://www.cathdb.info). There has been a 20% increase in the number of structural domains classified in CATH, up to 86 151 domains. Release 3.0 comprises 1110 fold groups and 2147 homologous superfamilies. To cope with the increases in diverse structural homologues being determined by the structural genomics initiatives, more sensitive methods have been developed for identifying boundaries in multi-domain proteins and for recognising homologues. The CATH classification update is now being driven by an integrated pipeline that links these automated procedures with validation steps that have been made easier by the provision of information-rich web pages summarising comparison scores and relevant links to external sites for each domain being classified. An analysis of the population of domains in the CATH hierarchy and several domain characteristics are presented for version 3.0. We also report an update of the CATH Dictionary of homologous structures (CATH-DHS), which now contains multiple structural alignments, consensus information and functional annotations for 1459 well populated superfamilies in CATH. CATH is directly linked to the Gene3D database, which is a projection of CATH structural data onto ∼2 million sequences in completed genomes and UniProt.
GENERAL INTRODUCTION
The number of new structures being deposited in the Protein Data Bank (PDB) continues to grow at a considerable rate. In addition, structures being targeted by worldwide structural genomics initiatives are more likely to be novel or only very remotely related to domains previously classified. Only 2% of structures currently solved by conventional crystallography or NMR are likely to adopt novel folds (see Figures 1 and 2). A higher proportion of new folds is expected among structures solved by structural genomics. Although the influx of more diverse structures and subsequent analysis will inform our understanding of how domains evolve, it has resulted in increasing lags between the numbers of structures being deposited and classified. In response to this situation, we have significantly improved our automated and manual protocols for domain boundary assignment and homologue recognition.
Figure 1. Annual decrease in the percentage of new structures classified in CATH which are observed to possess a novel fold. The raw data for years 1972–2005 was fit to a single exponential equation by nonlinear regression using Sigma Plot (SPSS Version
Figure 2. Annual proportion of protein structures deposited in the PDB which are classified in CATH, rejected or pending classification. The colour scheme reflects different categories of PDB chains. Black: not accepted by the CATH criteria. Red: unprocessed chains
Significant changes have been implemented in the CATH classification protocol to achieve a more highly automated system. First, a seamless flow of structures between the constituent programs has been achieved by building a pipeline which integrates web services for each major comparison stage in the classification (see Figure 3). Secondly, completely automatic decisions are now being made for new protein chains with close relatives already assigned in the CATH database. There are two situations that preclude the CATH update process from being fully automated. We rely on expert manual curation for particularly challenging protein domain boundary assignments (DomChop stage) and also for classifications of remote folds and homologues (HomCheck stage). These two manual stages will remain an integral part of the system (Figure 3).
Figure 3. Flow diagram of the CATH classification pipeline. This schematic illustrates the processes involved in classifying newly determined structures in CATH. The CATH update protocol workflow, from new chain to assigned domain, is split into two main processes
In this paper we report our ongoing development of the automated procedures. These critical new features should better enable CATH to keep pace with the PDB (3) and facilitate its development. Key statistics on the domain structure populations and characteristics are also presented.
A REVISED CATH CLASSIFICATION HIERARCHY: CATHSOLID
In order to provide information on the sequence diversity between superfamily members, we have introduced additional levels into the CATH hierarchy. The CATH hierarchical classification scheme now consists of nine levels. Class is derived from secondary structure content, and Architecture describes the gross orientation of secondary structures, independent of connectivity. The Topology level clusters structures into fold groups according to their topological connections and numbers of secondary structures. The Homologous superfamilies cluster proteins with highly similar structures, sequences and/or functions (4,5). The new extension of the CATH classification system now includes five 'SOLID' sequence levels. S, O, L and I further divide domains within the H-level using multi-linkage clustering based on similarities in sequence identity (35, 60, 95 and 100%) (see Table 1). The D-level acts as a counter within the I-level and is appended to the classification hierarchy to ensure that every domain in CATH has a unique CATHsolid identification code (see Table 1). Specific details on the nature of the SOLID-levels can be found in the 'General Information' section of the CATH website, http://www.cathdb.info. CATH only includes experimentally determined protein structures with a resolution of 4 Å or better, 40 residues in length or longer, and having 70% or more side chains resolved.
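The nine-level code can be treated as a dotted string with one integer per level. The sketch below assumes that dotted layout and uses a made-up example identifier: the first four numbers, 3.40.50.300, are a real CATH superfamily, while the five SOLID numbers are invented for illustration. The function names are hypothetical.

```python
# Level letters in hierarchy order: Class, Architecture, Topology, Homologous
# superfamily, then the S, O, L, I sequence clusters and the D counter.
LEVELS = ("C", "A", "T", "H", "S", "O", "L", "I", "D")
# Sequence identity thresholds (percent) stated for the S, O, L and I levels.
IDENTITY_THRESHOLDS = {"S": 35, "O": 60, "L": 95, "I": 100}


def parse_cathsolid(code):
    """Split a dotted nine-part CATHsolid code into a {level: number} mapping."""
    parts = code.split(".")
    if len(parts) != len(LEVELS):
        raise ValueError(f"expected {len(LEVELS)} levels, got {len(parts)}")
    return dict(zip(LEVELS, (int(p) for p in parts)))


def superfamily_of(code):
    """Return the C.A.T.H prefix that names the homologous superfamily."""
    return ".".join(code.split(".")[:4])


example = "3.40.50.300.1.2.3.4.5"  # SOLID part is hypothetical
print(parse_cathsolid(example))
print(superfamily_of(example))
```

Two domains share a level exactly when their codes agree on every number up to and including that level, so prefix comparison is enough to ask whether they fall in the same 35% or 95% sequence cluster.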
Table 1. CATH version 3.0 statistics
DOMAIN BOUNDARY ASSIGNMENTS
We have further improved our automated domain boundary prediction method, CATHEDRAL (4). This is used to search a newly determined multi-domain structure against a library of representative structures from different fold groups in the CATH database to recognise constituent domains. CATHEDRAL performs an initial rapid secondary structure comparison between structures, using graph theory to identify putative fold matches, which are then more carefully aligned using a slower, more accurate dynamic programming method. A new scoring scheme has been implemented which combines information on the class of the domains being compared, their sizes, similarity in the structural environments and the number of equivalent residues. A support vector machine is used to combine the different scores and select the best fold match for each putative domain in a new multi-domain protein.
Benchmarking against a set of 964 'difficult' multi-domain chains, whose 1593 constituent domains were remotely related to folds in CATH (<35% sequence identity) and originated from 245 distinct fold groups and 462 superfamilies, showed that 90% of domains within these chains could be assigned to the correct fold group, and for 78% of them the domain boundaries were within ±15 residues of boundaries assigned by careful manual validation. Larger variations in domain boundaries are often due to the fact that in many families significant structural variation can occur during evolution, so that distant relatives vary considerably in size. If no close relative has been classified in CATH, it is likely that the only CATHEDRAL match will be to a relative with significant structural embellishments, thus making it harder to determine the correct boundaries.
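The ±15 residue criterion used in this benchmark can be expressed as a simple comparison. The sketch below is one possible reading, not the authors' evaluation code: it assumes each domain is a single contiguous (start, end) segment, that predicted and reference domains are listed in the same order, and that both ends must fall within the tolerance. The function name is illustrative.

```python
def boundaries_agree(predicted, reference, tolerance=15):
    """True if every predicted (start, end) lies within `tolerance` residues
    of the matching reference (start, end)."""
    if len(predicted) != len(reference):
        return False  # a different number of domains cannot agree
    return all(
        abs(p_start - r_start) <= tolerance and abs(p_end - r_end) <= tolerance
        for (p_start, p_end), (r_start, r_end) in zip(predicted, reference)
    )


# Hypothetical two-domain chain: automated prediction versus manual curation.
predicted = [(1, 100), (101, 250)]
curated = [(1, 110), (111, 250)]
print(boundaries_agree(predicted, curated))
```

Real CATH domains can be discontinuous, so a fuller version would compare lists of segments per domain rather than a single range.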
Since domain boundary assignment of remote homologues is one of the most time-consuming stages in the classification, we combine multiple sources of information for each new structure on a web page to guide manual curation. Pages display scores from a range of algorithms, which include structure-based methods, CATHEDRAL (4), SSAP (6,7), DETECTIVE (8), PUU (9), DOMAK (10); sequence-based methods such as hidden Markov models (HMMs) (11); and relevant literature. These pages are now viewable for information on putative boundaries for new multi-domain structures currently being classified in CATH (e.g. http://www.cathdb.info/cgi-bin/cath/Chain.pl?chain_id=2g3aA).
For protein chains which are closely related to chains that are already chopped in CATH, an automated protocol has been developed (ChopClose). ChopClose identifies any previously chopped chains that have sufficiently high sequence identity and overlap with the query chain. Using SSAP (6), the query is aligned against each of these chains in turn, and in each case the domain boundaries are inherited across the alignment. The process of inheriting the boundaries often requires some adjustments to be made to account for insertions, deletions or unresolved residues. If the inheritance from one of the chains meets various criteria (SSAP score ≥80, sequence identity >80%, RMSD ≤6.0 Å, longest end extension ≤10 residues, etc.), then the resulting boundaries are used to chop the chain automatically (AutoChop). For cases where ChopClose's best result does not meet all the criteria for automatic chopping, it is provided as support information for a manual domain boundary assignment.
Refer to Figure 3 for the location of AutoChop/ChopClose within the CATH update protocol.
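The AutoChop decision amounts to a conjunction of thresholds. The sketch below encodes only the four criteria listed in the text; the source ends that list with "etc.", so the real protocol applies further checks that are not reproduced here. The class, field and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class InheritanceResult:
    """Hypothetical summary of boundaries inherited from one chopped relative."""
    ssap_score: float
    seq_identity_pct: float
    rmsd_angstrom: float
    longest_end_extension: int  # residues


def can_autochop(r):
    """Apply the four thresholds quoted in the text (the full list is longer)."""
    return (
        r.ssap_score >= 80
        and r.seq_identity_pct > 80
        and r.rmsd_angstrom <= 6.0
        and r.longest_end_extension <= 10
    )


def route(r):
    """Automatic chopping if all criteria pass, otherwise manual curation."""
    return "AutoChop" if can_autochop(r) else "manual DomChop (with ChopClose hint)"


print(route(InheritanceResult(92.5, 97.0, 1.8, 3)))
print(route(InheritanceResult(85.0, 80.0, 2.5, 4)))
```

Note that the identity test is strict (greater than 80%) while the others are inclusive, following the inequality signs as given in the text.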
NEW HOMOLOGUE RECOGNITION METHODS
We have assessed a number of HMM-based protocols for improving homologue recognition. A new protocol (Samosa), exploiting models built using multiple structure alignments to improve accuracy, gives some improvements in sensitivity (4–5%). However, a protocol exploiting an 8-fold expanded HMM library based on sequence relatives of structural domains gives an increase of nearly 10% in sensitivity (12). In addition, HMM–HMM-based approaches have been implemented using the PRC protocol of Madera and co-workers (http://supfam.org/PRC). These allow recognition of extremely remote homologues, some of which are not easily detected by the structure comparison methods (discussed further below). HMM-based database scans developed for the CATH classification protocol are collectively referred to as HMMscan below.
For some very remotely related homologues, confidence in an assignment can be improved by combining information from multiple prediction methods. We have investigated the benefits of using machine learning methods to do this automatically. A neural network was trained using a dataset of 14 000 diverse homologues (<35% sequence identity) and 14 000 non-homologous pairs, with data from different homologue comparison methods, including structure comparison (CATHEDRAL, SSAP), sequence comparison (HMM–HMM) and information on functional similarity. The latter was obtained by comparing EC classification codes between close relatives of the distant homologues and by using a semantic similarity scoring scheme for comparing GO terms, based on a method developed by Lord et al. (13). On a separate validation set of 14 000 homologous pairs and 14 000 non-homologous pairs, 97% of the homologues can be recognised at an error rate of <4%.
NEW UPDATE PROTOCOL
Automatic methods
Previously CATH data was generated using a group of independent
programs and flat files Over the past two years we have developed an
update protocol for CATH that is driven by a suite of programs with a
central library and a PostgreSQL database system A classification
pipeline has been established which links in a completely automated
fashion the different programs that analyse the sequences and structuresof both protein chains and domains The CATH update protocol can
essentially be divided into two parts domain boundary assignment and
domain homology classification (see Figure 3) The aim of the protocol is
to minimise manual assignment and provide as much support as
possible when manual validation is necessary
Processing of both parts of the classification protocol is similar, requiring related meta-data and the triggering of the same automated algorithms. Methods include pairwise sequence similarity comparisons and scans by other homologue detection or fold recognition algorithms, such as HMMscan and CATHEDRAL, that provide data for either manual or automated assignment. Many of the automated steps in the protocol have been established as a web service, and the pipeline integrates automated steps together with 'holding stages' in which domains are held prior to processing and await the completion of manual validation of predictions (see below).
Web pages to support manual validation
For each manual stage in the classification (domain boundary assignment, DomChop, and homologue recognition, HomCheck) we have developed a suite of web pages bringing together all available meta-data from prediction algorithms (e.g. DBS, CATHEDRAL and HMMscan for DomChop; CATHEDRAL and HMMscan for HomCheck) and information from the literature and from other family classifications with relevant data (e.g. Pfam). For each protein or domain shown on the pages, information on the statistical significance of matches is presented. The web pages will shortly be made viewable and will provide interim data on protein chains and domains not fully classified in CATH, for biologists interested in any entries pending classification.
OVERVIEW OF THE CURRENT RELEASE (VERSION 3.0)
Assigning domain boundaries and relationships between protein structures is computationally challenging. Since the last CATH release (version 2.6), the number of domains in the CATH database has increased by 20% in version 3.0 and now totals 86 151. This is a more than 10-fold increase in the number of domains classified in CATH since its creation. Improvements in automation, and also in the web-based resources used to aid manual validation, have allowed us to increase the proportion of hard-to-classify structures processed in CATH, and this is reflected in a significant increase in the proportion of new folds in the database, now more than 1000. The detailed breakdown of numbers of domains in the nine CATH levels is given in Table 1. We conducted an analysis of the domains in version 3.0 and have derived statistics for several fundamental features.
Percentage of new topologies
An analysis of the percentage of new folds arising from the early 1970s to the present is shown in Figure 1. The number of new folds has been decreasing over time with respect to the number of new structures being deposited, and it can be seen that currently approximately 2% of new structures classified in CATH are observed to be novel folds. For comparison, the number of domain structures solved over time is also graphically represented in Figure 1.
Number of domains within a protein chain
Integral to the construction of the CATH database is designating domain boundaries. We conducted an analysis of the number of chains versus the number of domains in a chain. It is interesting to note that 64% of all protein structures currently solved and classified in CATH are single-domain chains (data not shown). The next most prevalent are two-domain chains (27%), and beyond this the number of chains containing three or more domains rapidly decreases. The average size of the single-domain chains is 159 residues.
The CATH Dictionary of Homologous Superfamilies (CATH-DHS)
The CATH-DHS has also been recently updated. Data on structural similarity and superfamily variability is presented as a significant update to the Dictionary of Homologous Superfamilies (DHS) web resource (14). The DHS also provides functional annotations of domains within each H-level (superfamily) in CATH v2.5.1.
For each superfamily, pair-wise structural similarity scores between relatives, measured by SSAP, are presented. The DHS now contains 3307 multiple structural alignments for 1459 superfamilies. For each superfamily, multiple alignments are generated for all the relatives and also for subgroups of structurally similar relatives and sequence-similar relatives. Alignments are performed using the residue-based CORA algorithm (15) and presented both as CORAPLOTS (14) and in the form of a 2DSEC diagram (16), alongside co-ordinate data of the superposed structures in PDB format. Sequence representations of the alignments are available to download in FASTA format. In the CORAPLOT images of the multiple alignment, residues in each domain are coloured according to ligand binding and residue type. EquivSEC plots are also shown that describe the variability in orientation and packing between equivalent secondary structures (16).
To identify sequence relatives for CATH superfamilies, sequences from UniProt (17) were scanned against HMMs of all CATH domains (12). Homologous sequences were identified as those hits with an E-value < 0.01 and a 60% residue overlap with the CATH domain. This protocol recognised over one million domain sequences in UniProt which could be integrated in the CATH-DHS. The harvested sequences in each superfamily were compared against other relatives by BLAST (18) to determine the pair-wise sequence identity and then clustered at appropriate levels of sequence identity (35% and 95%) using multi-linkage clustering. Information and links to other functional databases, ENZYME (19), GO (Gene Ontology Consortium, 2000), KEGG (20), COG (21) and SWISSPROT (22), are also included by BLASTing the sequences from each superfamily against sequences provided by these resources. Only hits with 95% sequence identity and an 80% residue overlap were used to annotate sequences.
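The selection rule for homologous sequences can be sketched in Python as follows. This is a minimal illustration rather than the CATH pipeline itself: the `HmmHit` record, its field names and the way overlap is computed (matched domain residues divided by domain length) are assumptions made for the example, and the E-value cut-off is read as 0.01. Only the two cut-offs are taken from the text.

```python
from dataclasses import dataclass

E_VALUE_CUTOFF = 0.01   # text: hits must have an E-value below this value
MIN_OVERLAP = 0.60      # text: 60% residue overlap with the CATH domain


@dataclass(frozen=True)
class HmmHit:
    """Hypothetical record for one sequence-versus-HMM match."""
    sequence_id: str
    domain_id: str
    e_value: float
    matched_residues: int   # domain positions covered by the match
    domain_length: int      # total number of residues in the CATH domain


def overlap_fraction(hit: HmmHit) -> float:
    """Fraction of the domain covered by the match (assumed definition)."""
    return hit.matched_residues / hit.domain_length


def is_homologue(hit: HmmHit) -> bool:
    """Apply both cut-offs: significant E-value and sufficient overlap."""
    return hit.e_value < E_VALUE_CUTOFF and overlap_fraction(hit) >= MIN_OVERLAP


def filter_hits(hits):
    """Keep only the hits that pass both cut-offs."""
    return [hit for hit in hits if is_homologue(hit)]
```

A hit covering 80 of 100 domain residues with an E-value of 1e-5 would pass; the same hit covering only 50 residues, or with an E-value of 0.5, would be rejected.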
Recent analysis of structural and functional divergence in highly populated CATH superfamilies (>5 structural relatives with <35% sequence identity) has been undertaken using data from the DHS. The 2DSEC algorithm was used to analyse multiple structural alignments of families and identify highly conserved structural cores and secondary structure embellishments or decorations to the common core. In some large superfamilies, extensive embellishments were observed outside the core, and although these secondary structure insertions were frequently discontinuous in the protein chain, they were often co-located in 3D space (16). In many cases manual inspection revealed that the embellishments had aggregated to form a larger structural feature that was modifying the active site of the domain or creating new surfaces for domain or protein interactions. Data collected in the DHS clearly show a relationship between structural divergence within a superfamily, sequence divergence of this superfamily amongst predicted domains in the genomes, and the number of distinct functional groups that can be identified for the superfamily (see Figure 4).
Figure 4. Relationship between sequence variability, structural variability and functional diversity in CATH superfamilies. Structural variation in a CATH superfamily as measured by the number of diverse structural subgroups (SSAP score <80 between groups).
LATERAL LINKS ACROSS THE CATH HIERARCHY TO CAPTURE EVOLUTIONARY DIVERGENCE AND EXPLORE THE STRUCTURAL CONTINUUM
Our analysis of structural divergence in CATH superfamilies (16) has revealed families where significant changes in the structures had occurred. In some cases 5-fold differences in the sizes of domains were identified, and sometimes it was apparent that the 'folds' of these very diverse relatives had effectively changed. Therefore, in these superfamilies more than one fold group can be identified, effectively breaking the hierarchical nature of the CATH classification, which implies that each relative within a CATH homologous superfamily should belong to the same CAT fold group.
In addition, an 'all versus all' HMM–HMM scan between all superfamily representatives revealed several cases of extremely remote homologues which had been classified into separate superfamilies and yet match with significant E-values. Structure comparison had failed to detect the relationship between these superfamilies because the structural divergence of the relatives was so extreme, sometimes constituting a change in architecture as well as fold group. In these cases homology was only suggested by the HMM-based scans and then manually validated by considering functional information and detailed evidence from the literature. In order to capture information on these distant homologies, links have been created between the superfamilies, both on our web pages and in the CATH database. The data can now be found as a link from the CATH homepage (http://www.cathdb.info).
In the near future we also plan to provide web pages presenting cases of significant structural overlaps between superfamilies or fold groups. For these cases we are not currently able to find any additional evidence to support a distant evolutionary relationship, and these examples highlight the recurrence of large structural motifs between some folds and the existence of a structural continuum in some regions of fold space.
FEATURES
The CATH database can be accessed at http://www.cathdb.info. The web interface may be browsed or, alternatively, searched with PDB codes or CATH domain identifiers. There is also a facility for keyword searches. With the version 3.0 release we now make the raw and processed data files available, which include, for example, CATH domain PDB files, sequences and dssp files, and they can be accessed through the CATH database main page. The Gene3D resource can be accessed through the CATH database or directly at http://www.cathdb.info/Gene3D. The DHS can be accessed through the CATH database or directly at http://www.cathdb.info/bsm/dhs.
Structural similarity is assessed using an automatic method (SSAP) (34), which scores 100 for
identical proteins and generally returns scores above 80 for homologous proteins. More distantly
related folds generally give scores above 70 (Topology or fold level), though in the absence of any
sequence or functional similarity this may simply represent examples of convergent evolution,
reinforcing the hypothesis that there exists a limited number of folds in nature (56).
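As a rough illustration of how these SSAP score bands might be read, the sketch below maps a score to a label. The function name and the label strings are hypothetical, and the text says the thresholds hold only "generally", so the result is an indication to be checked against sequence and functional evidence, not a classification.

```python
def interpret_ssap(score: float) -> str:
    """Map an SSAP score (0-100) to the indicative band described in the text."""
    if not 0 <= score <= 100:
        raise ValueError("SSAP scores are expected to lie between 0 and 100")
    if score == 100:
        return "identical"
    if score > 80:
        return "likely homologous"
    if score > 70:
        # May be a distant relative or a case of convergent evolution.
        return "similar fold"
    return "no clear structural similarity"
```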
Abstract
We report the latest release (version 1.4) of the CATH protein domains database
(http://www.biochem.ucl.ac.uk/bsm/cath). This is a hierarchical classification of 13 359 protein
domain structures into evolutionary families and structural groupings. We currently identify 827
homologous families in which the proteins have both structural similarity and sequence and/or
functional similarity. These can be further clustered into 593 fold groups and 32 distinct architectures.
Using our structural classification and associated data on protein functions stored in the database (EC
identifiers, SWISS-PROT keywords and information from the Enzyme database and literature) we
have been able to analyse the correlation between the 3D structure and function. More than 96% of
folds in the PDB are associated with a single homologous family. However, within the superfolds
three or more different functions are observed. Considering enzyme functions, more than 95% of
clearly homologous families exhibit either single or closely related functions, as demonstrated by the
EC identifiers of their relatives. Our analysis supports the view that determining structures, for
example as part of a 'structural genomics' initiative, will make a major contribution to interpreting
genome data.
Introduction FOR CATH
The CATH classification of protein domain structures was established in 1993 (1) as a
hierarchical clustering of protein domain structures into evolutionary families and structural
groupings, depending on sequence and structure similarity. There are four major levels,
corresponding to protein class, architecture, topology or fold, and homologous family (Fig. 1).
Since 1995, information about these structural groups and protein families has been accessible
over the Web (http://www.biochem.ucl.ac.uk/bsm/cath), together with summary information
about each individual protein structure (PDBsum) (2).
CATH consists of both phylogenetic and phenetic descriptors for protein domain relationships.
At the lowest levels in the hierarchy, proteins are grouped into evolutionary families
(Homologous families) for having either significant sequence similarity (≥35% identity) or high
structural similarity and some sequence similarity (≥20% identity).
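The grouping rule for homologous families can be written as a small predicate. This is a sketch under one stated assumption: "high structural similarity" is taken to mean an SSAP score of at least 80, the threshold these notes quote elsewhere for clear homologues. The function and parameter names are hypothetical.

```python
def same_homologous_family(seq_identity_pct: float,
                           ssap_score: float,
                           high_structural_similarity: float = 80.0) -> bool:
    """Return True if a pair of domains meets either grouping criterion.

    Criterion 1: significant sequence similarity (>= 35% identity).
    Criterion 2: high structural similarity (assumed SSAP >= 80) together
                 with some sequence similarity (>= 20% identity).
    """
    if seq_identity_pct >= 35:
        return True
    return ssap_score >= high_structural_similarity and seq_identity_pct >= 20
```

For example, a pair with 25% identity and an SSAP score of 85 would be grouped together, whereas the same identity with an SSAP score of 72 would not.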
The CATH database of protein structures contains approximately 18000 domains
organized according to their (C)lass, (A)rchitecture, (T)opology and (H)omologous
superfamily. Relationships between evolutionarily related structures (homologues) within the
database have been used to test the sensitivity of various sequence search methods in order to
identify relatives in GenBank and other sequence databases. Subsequent application of the most
sensitive and efficient algorithms, gapped BLAST and the profile-based method Position Specific
Iterated Basic Local Alignment Tool (PSI-BLAST), could be used to assign structural data to
between 22% and 36% of microbial genomes in order to improve functional annotation and
enhance understanding of biological mechanism. However, on a cautionary note, an analysis of
functional conservation within fold groups and homologous superfamilies in the CATH database
revealed that whilst function was conserved in nearly 55% of enzyme families, function had
diverged considerably in some highly populated families. In these families, functional properties
should be inherited far more cautiously, and the probable effects of substitutions in key functional
residues must be carefully assessed.
Figure 1
Schematic representation of the (C)lass, (A)rchitecture and (T)opology/fold levels in the CATH
database.
Figure 2
Snapshot of a web page showing data available in the CATH dictionary of homologous superfamilies
for the subtilisin family (CATH id 34050200). Tables display the PDB codes for non-identical
relatives in the family, together with EC identifier codes and information about the enzyme reactions.
The multiple structural alignment shown has been coloured according to secondary structure
assignments (red for helix, blue for strands).
The Architecture level in CATH groups proteins whose folds have similar 3D arrangements of
secondary structures (e.g. barrel, sandwich or propellor), regardless of their connectivity, whilst the
top level, Class, simply reflects the proportion of α-helix or β-strand secondary structures. Three
major classes are recognised, mainly-α, mainly-β and α-β, since analysis revealed considerable
overlap between the α+β and alternating α/β classes originally described by Levitt and Chothia (7).
Before classification, multidomain proteins are first separated into their constituent folds using a
consensus method which seeks agreement between three independent algorithms (8). Whilst the
protocol for updating CATH is largely automatic (9), several stages require manual validation, in
particular establishing domain boundaries in proteins for which no consensus could be reached and
checking the relationships of very distant homologues and proteins having borderline fold similarity.
Although there are plans to assign the more regular architectures automatically, all architecture
groupings are currently assigned manually.
A homologous family Dictionary is now available within CATH, which contains functional data,
where available, for each protein within a homologous family. This includes EC identifiers, SWISS-PROT
keywords and information from the Enzyme database or the literature (Fig. 2). Multiple
structure-based alignments are also available, coloured according to secondary structure assignments
or residue properties, and there are schematic plots showing domain representations annotated by
protein-ligand interactions (DOMPLOTS) (A.E. Todd, C.A. Orengo and J.M. Thornton, submitted
to Protein Engng). The topology of each domain is illustrated by schematic TOPS diagrams
(http://www3.ebi.ac.uk/tops; 10).
Figure 3
CATH wheel plot showing the population of homologous families in different fold groups,
architectures and classes. The wheel is coloured according to protein class (red, mainly-α; green,
mainly-β; yellow, α-β; blue, few secondary structures). The size of the outer wheel represents the
number of homologous families in CATH, whilst each band in the outer wheel corresponds to a single
fold family. The size of each fold 'band' therefore reflects the number of homologous families having
that fold. It can be seen that most fold families contain a single homologous family. The superfold
families are shown as paler bands containing many homologous families. The inner wheel shows the
population of homologous families in the different architectures.
We have also recently set up a Web Server (11) which enables the user to scan the CATH database
with a newly determined protein structure and identify possible fold similarities or evolutionary
relationships. There are also plans to incorporate sequence searches (using BLAST or PSI-BLAST)
(12) to identify a probable fold for a new sequence.
The latest release of CATH (version 1.4, April 1998) contains 9342 protein chains from the PDB (13),
which divide into 13 359 domain folds. Currently 32 different architectures are recognised. Since the
last release three new architectures have been described, including the five-bladed α-β propellor.
Grouping proteins on the basis of sequence, structure and functional similarity gives 827 evolutionary
homologous families (H-level), whilst recognising more distant structural similarity with no
accompanying sequence or function similarity gives rise to 593 different fold groups (T-level).
The population of the different levels in the CATH hierarchy is illustrated by the CATH wheel shown
in Figure 3. It can be seen that several highly populated fold families, which we describe as superfolds
(6) as they support a diverse range of sequences and more than three different functions, still account
for nearly 30% of non-homologous structures.
Implications for Structural Genomics
As the sequence databases grow rapidly, the need to interpret these sequences and assign functions to
specific genes becomes increasingly important. Many techniques exist for matching protein sequences
and thereby inheriting functional information. However, for very distant homologues there is often no
detectable sequence similarity despite conservation of 3D structure and function. For these cases
evolutionary relationships, and thereby functions, can only be assigned by comparing the structures.
Therefore a number of structural genomics initiatives are being proposed (14) which aim to identify
all the folds in nature, with the ultimate goal of being able to predict the function of a new protein from
its known or probable structure. The important questions to ask are: how many more folds do we need
to determine before we have the complete set, and how confident can we be in assigning function
between proteins having similar structures?
In the current genomes, on average only 30–46% of sequences can be assigned to a structural family
by recognising sequence similarity to a protein of known structure (15,16). With only ~600 unique
structures currently in the PDB, compared with ~20 000 sequence families, it is clear that we still
need to determine many more structures if we are to understand biology at the molecular level.
However, analysis of recently deposited structural data is very revealing. Figure 4a illustrates the
distribution of 2159 new structural domains classified in the 10 months from June 1997 to March
1998. A large proportion of these (79%) were clearly homologous (≥30% identity) to proteins of
known structure.
Of the remaining 443 structures (Fig. 4b), corresponding to 'new' sequences, we found only 8% were
novel folds, the remainder resembling a previously determined structure. Many of these, 199 (45%),
could be identified as clear homologues by having significant structure and sequence similarity (SSAP
≥80 and ≥20% sequence identity). A further 169 (38%) were probable homologues as, although the
sequence identity was below 20%, they had functional similarity and/or gave significant scores using
sequence search methods designed to detect very distant homologues (PSI-BLAST) (12). There
remained a further 40 (9%) proteins which were analogous, i.e. they had the same fold as a previous
entry but neither the sequence nor the function gave definite evidence of a common ancestor.
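The quoted counts and percentages can be cross-checked with a few lines of arithmetic. The sketch below assumes that the three named categories plus the novel folds account for all 443 structures; the number of novel folds (35) is derived here and is not stated in the text.

```python
total_domains = 2159   # new structural domains, June 1997 to March 1998
new_sequences = 443    # structures not clearly homologous to a known structure

counts = {
    "clear homologues": 199,
    "probable homologues": 169,
    "analogues": 40,
}
# Derived, assuming the categories partition the 443 structures.
counts["novel folds"] = new_sequences - sum(counts.values())   # 35

percentages = {name: round(100 * n / new_sequences) for name, n in counts.items()}
# -> clear 45, probable 38, analogues 9, novel folds 8, matching the text

clearly_homologous_pct = round(100 * (total_domains - new_sequences) / total_domains)
# -> 79, the proportion reported for Figure 4a

homologues_by_structure_pct = (
    percentages["clear homologues"] + percentages["probable homologues"]
)
# -> 83, the share of 'new' sequences recognisable as homologues
```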
At the homologous superfamily level in CATH, a more detailed analysis of enzyme functions showed
that the majority of homologous enzyme families in CATH (>90%) contained proteins for which the
first three EC identifiers were the same. Considering those families where homologues have
significant sequence identity (≥20%) after structural alignment, 95% were found to have a single EC
identifier, whilst for families where proteins have more than 30% sequence similarity we observed
that 98% had a single EC code.
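The test of whether a family shares its first three EC identifiers can be sketched as follows. The helper names are hypothetical, EC numbers are assumed to be given as four dot-separated fields, and the EC numbers in the usage note are illustrative only, not taken from a CATH family.

```python
def ec_prefix(ec_number: str, levels: int = 3) -> tuple:
    """Return the first `levels` fields of an EC number such as '3.4.21.4'."""
    fields = ec_number.strip().split(".")
    if len(fields) != 4:
        raise ValueError(f"expected four EC fields, got {ec_number!r}")
    return tuple(fields[:levels])


def family_shares_ec(ec_numbers, levels: int = 3) -> bool:
    """True if every EC number in the family agrees on the first `levels` fields."""
    prefixes = {ec_prefix(ec, levels) for ec in ec_numbers}
    if not prefixes:
        raise ValueError("at least one EC number is required")
    return len(prefixes) == 1
```

With `levels=3`, the pair `3.4.21.4` and `3.4.21.1` counts as sharing a function class; with `levels=4` it does not, which corresponds to the stricter "single EC code" criterion.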
Although assigning function on the basis of homology is common practice, it is clear that some
caution should be exercised, particularly where there is little or no sequence similarity. There are also
some clear examples where homologues with significant sequence similarity perform different
functions. The role of 'gene recruitment' is especially clear in the eye lens proteins, which function as
enzymes in other cellular environments but which are used as structural proteins in this context (17).
The extent of such 'gene recruitment' and context-sensitive function is really not known at this time.
For enzymes it is clear that catalytic function can change and evolve, usually to act on a different but
related substrate. Similarly, within the lipocalin family (CATH id 24013010) several proteins are
found with very similar structures which bind different fatty acids in the same region at the base of
the β-barrel (e.g. retinol, bilin, biotin).
Nearly half of the homologous families where two or more different EC numbers were observed
belong to the superfolds. This suggests that if a new protein is assigned to a superfold family, more
caution should be used when inheriting functional information, as there appears to be greater tolerance
to changes in sequence, and ultimately function, for these families. However, it is interesting to note
that many of these were TIM barrel or Rossmann folds. These are superfolds in which the substrate or
ligand commonly binds in the same place. This is in the base of the β-barrel for the TIMs and at the
crossover of the polypeptide chain for the doubly wound Rossmann structures.
Assignment of Function Through Structure
One of the reasons for determining structures is to derive more information to facilitate the assignment
of function. From our analysis of proteins in CATH we suggest that structural data can help to assign
function in several ways:
i. The structural data allow recognition of more distant homologues compared with
sequence data: in our analysis 83% of structures with novel sequences could be assigned as
homologues in this way (note that such assignment of function is again subject to the caveats
imposed by 'gene recruitment' discussed above).
ii. The structural data allow detailed inspection of the functional site, to suggest if and
how the function may have evolved. For example, if an enzyme has evolved to act on a
different substrate, the binding site may reveal, or at least suggest, possible changes in the
substrate.
iii. For the superfolds, similarity of structure does not necessarily mean similarity of
function. However, the active site/binding sites are often conserved, e.g. in the TIM barrel or
Rossmann fold structures the ligand always binds at the same end of the barrel or sheet.
iv. Some methods have already been developed, and will increasingly be the focus of
attention over the next few years, which aim to predict function ab initio from structure. For
example, enzymes can often be identified by the presence of a major cleft, which also locates
the active site (18). Similarly, critical surface patches which are used for molecular
recognition in binding other proteins or ligands may be identified using knowledge-based
approaches (19,20).
In summary, extrapolating the data from Figure 4 to a new genome, we can expect that of the 54–70%
of sequences which currently have no obvious sequence matches in the PDB, we will find nearly
80–90% to be homologous to a known family using the structural data alone. For the singlet folds this
will almost certainly reveal some clues to the function. For the superfolds, some folds will reveal
information on the functional class (e.g. enzyme for TIM barrels) or the location of the active site, if
not the specific function. Only 10–20% will be expected to be 'novel' folds. For these, the ab
initio methods referred to above may provide some clues to guide experiments. Therefore it is clear
that determining structures, as part of a 'structural genomics' initiative for example, will make a
major contribution to interpreting genome data.
Jail
Just another interface library
Interfaces of macromolecules are a valuable basis for analysing the process of molecular recognition. JAIL classifies not only the interfaces between domain architectures but also those between protein chains and those between proteins and nucleic acids.
Figure: Gt-alpha/Gi-alpha chimera (PDB-ID 1GOT) and the interfaces of 1GOT.
Interacting proteins are difficult to crystallize and are rarely present within the Protein Data Bank.
Nevertheless, it is essential to analyse the interacting parts of the proteins to understand the
process of protein-protein docking.
To overcome this problem we have built up the JAIL database. Since interacting domains
exhibit structural features similar to those of interacting proteins, all known interfaces between
interacting domains of the SCOP database were extracted and classified in JAIL.
Only a part of all protein structures is included in SCOP. In particular, new PDB entries are
not yet annotated. To overcome this problem, all interfaces between protein
chains were additionally calculated and included in the database. This type of interface also comprises
the interacting parts of the assumed biological units. The last important type of interface
provided here is composed of the interacting parts between proteins and nucleic acids.
Overall, the data set consists of about 180000 interfaces.
JAIL is a convenient tool for browsing the interface library and analysing single
interfaces. However, more general questions require large-scale analysis. For this purpose
a detailed form enables the compilation of comprehensive non-redundant data sets for
download.
How is an interface defined?
A complete residue is part of an interface if at least one atom of the amino acid is located
within 4.5 Ångström of any atom of the interacting domain or chain. One part of
an interface must consist of at least 5 C-alpha atoms in the case of protein chains. In the
case of nucleic acids, the relevant atoms are calculated on the basis of the phosphorus atoms
of the RNA/DNA backbone.
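The interface definition can be illustrated with a brute-force sketch in Python. It is not the JAIL implementation: the `Atom` record and function names are hypothetical, the distance cut-off is read as 4.5 Å and treated as inclusive, and every atom pair is compared, which is only practical for small inputs.

```python
import math
from collections import namedtuple

# Hypothetical minimal atom record: residue id, atom name and coordinates in Å.
Atom = namedtuple("Atom", "residue name x y z")

CUTOFF = 4.5        # distance cut-off in Å (assumed inclusive)
MIN_CA_ATOMS = 5    # minimum C-alpha atoms for one side of a protein interface


def distance(a: Atom, b: Atom) -> float:
    """Euclidean distance between two atoms."""
    return math.dist((a.x, a.y, a.z), (b.x, b.y, b.z))


def interface_residues(side, partner, cutoff: float = CUTOFF) -> set:
    """Residues of `side` with at least one atom within `cutoff` of any partner atom."""
    return {
        atom.residue
        for atom in side
        if any(distance(atom, other) <= cutoff for other in partner)
    }


def is_valid_protein_interface(side, partner) -> bool:
    """Check that the interface part on `side` contains at least five C-alpha atoms."""
    residues = interface_residues(side, partner)
    ca_count = sum(1 for atom in side if atom.name == "CA" and atom.residue in residues)
    return ca_count >= MIN_CA_ATOMS
```

A real implementation would normally use a spatial index (for example a grid or k-d tree) instead of comparing all atom pairs.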
What are biounits?
The primary coordinate file deposited in the PDB generally contains one asymmetric unit.
The asymmetric unit is the smallest portion of a crystal structure to which crystallographic
symmetry can be applied to generate one unit cell. The biological molecule (biounit) is believed
to be the functional unit of the protein. Frequently those units can be assumed or calculated
when additional information is available. The biological units of many proteins are deposited
in a separate section of the PDB database and can be used for interface calculations.
In which way are redundant interfaces excluded in the download section?
Redundancy is excluded in two different ways: by structure and by sequence. The
sequence clustering is based on the CD-HIT program. The structural clustering is defined by
the protein families and superfamilies of the SCOP classification. The database classifies
proteins by domain architecture.
Which settings in the download section are best for my own research?
The selection of the datasets depends on the type of interactions (protein-protein or
protein-nucleic acid) and the level of diversity that is desired. A maximum sequence identity of 50%
results in higher diversity than a setting of 95%. The default settings include interfaces of
domain-domain interactions as well as interfaces between interacting chains. All interfaces of
chains that are already covered by the SCOP domain interfaces are excluded by default.
This procedure results in a high number of interfaces that are still diverse enough for
statistical analysis.
What is meant by 'show conservation' in Jmol?
The conservation of protein sequences is defined by the mutation rates at each amino acid
position. For JAIL this information was retrieved from ConSurf. ConSurf is a derived
database merging structural and sequence information.
Database scheme
Search for:
PDB-Id (e.g. 1aay)
SCOP-Id (e.g. d1az0a_)
EC-Number (e.g. 3.1.1)
Accession number (e.g. P03697)
Protein name (e.g. capsid protein)
Search in the following interface types:
Domain/Domain (SCOP)
Chain/Chain
Protein/Nucleic
Biounit/Biounit
None
Fulltext search: Keyword
SCOP text search: Keyword
Consider only interfaces having the following location: intra, inter, don't care
MMDB
Experimentally resolved structures of proteins, RNA and DNA, derived from the Protein Data Bank (PDB), with value-added features such as explicit chemical graphs, computationally identified 3D domains (compact substructures) that are used to identify similar 3D structures, as well as links to literature, similar sequences, information about chemicals bound to the structures, and more. These connections make it possible, for example, to find 3D structures for homologs of a protein sequence of interest, then interactively view the sequence-structure relationships, active sites, bound chemicals, journal articles and more.
Three-dimensional structures are now known within many protein families, and it is
quite likely, in searching a sequence database, that one will encounter a homolog
with known structure. The goal of Entrez's 3D-structure database is to make this
information, and the functional annotation it can provide, easily accessible to
molecular biologists. To this end Entrez's search engine provides three powerful
features: (i) Sequence and structure neighbors: one may select all sequences similar
to one of interest, for example, and link to any known 3D structures. (ii) Links
between databases: one may search by term matching in MEDLINE, for example, and
link to 3D structures reported in these articles. (iii) Sequence and structure
visualization: identifying a homolog with known structure, one may view
molecular-graphic and alignment displays to infer approximate 3D structure. In this article we
focus on two features of Entrez's Molecular Modeling Database (MMDB) not
described previously: links from individual biopolymer chains within 3D structures
to a systematic taxonomy of organisms represented in molecular databases, and
links from individual chains (and compact 3D domains within them) to structure
neighbors, other chains (and 3D domains) with similar 3D structure. MMDB may be
accessed at http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Structure
SUPERFAMILY is a database of structural and functional annotation for all
proteins and genomes.[1][2][3][4][5]
The SUPERFAMILY annotation is based on a collection of hidden Markov models which
represent structural protein domains at the SCOP superfamily level.[6] A superfamily groups
together domains which have an evolutionary relationship. The annotation is produced by
scanning protein sequences from completely sequenced genomes against the hidden Markov
models.
For each protein you can:
Submit sequences for SCOP classification
View domain organisation, sequence alignments and protein sequence details
For each genome you can:
Examine superfamily assignments, phylogenetic trees, domain organisation lists and networks
Check for over- and under-represented superfamilies within a genome
For each superfamily you can:
Inspect SCOP classification, functional annotation, Gene Ontology annotation, InterPro abstract and genome assignments
Explore taxonomic distribution of a superfamily across the tree of life
All annotation, models and the database dump are freely available for download to everyone.
Purpose
SUPERFAMILY classifies amino acid sequences into known structural domains, especially
into SCOP superfamilies. The superfamilies are groups of proteins which have structural
evidence to support a common evolutionary ancestor but may not have detectable
sequence homology.
Major Features
Sequence search: Submit your protein or DNA sequence for SCOP superfamily and family level classification.
Keyword search: Search for superfamily, family or species names, plus sequence, SCOP, PDB or hidden Markov model IDs.
Domain assignments: Domain assignments, alignments and architectures for completely sequenced eukaryotic and prokaryotic organisms, plus sequence collections.
Comparative genomics tools: Browse unusual (over- and under-represented) superfamilies and families, adjacent domain pair lists and graphs, unique domain pairs, domain combinations, domain architecture co-occurrence networks and domain distribution across taxonomic kingdoms for each organism.
Genome statistics: For each genome: number of sequences, number of sequences with assignment, percentage of sequences with assignment, percentage total sequence coverage, number of domains assigned, number of superfamilies assigned, number of families assigned, average superfamily size, percentage produced by duplication, average sequence length, average length matched, number of domain pairs and number of unique domain architectures.
Gene Ontology: Domain-centric Gene Ontology (GO), automatically annotated, by Hai Fang.
Phenotype Ontology: Domain-centric phenotype/anatomy ontology including Disease Ontology, Human Phenotype, Mouse Phenotype, Worm Phenotype, Yeast Phenotype, Fly Phenotype, Fly Anatomy, Zebrafish Anatomy, Xenopus Anatomy, Arabidopsis Plant.
Superfamily annotation: InterPro abstracts for 1052 superfamilies and Gene Ontology (GO) annotation for 763 superfamilies.
Functional annotation: Functional annotation of SCOP 1.73 superfamilies by Christine Vogel.
Phylogenetic trees: Trees are generated using heuristic parsimony methods and are based on protein domain architecture data for all genomes in SUPERFAMILY. Genome combinations or specific clades can be displayed as individual trees.
Similar domain architectures: Find the 10 domain architectures which are most similar to a domain architecture of interest.
Hidden Markov models: Produce SCOP domain assignments for your sequences using the SUPERFAMILY models. HMM visualisation by Martin Madera, e.g. model 0045110.
Profile comparison: Find remote domain matches when the HMM search fails to find a significant match. Profile comparison (PRC) for aligning and scoring two profile hidden Markov models, by Martin Madera.
Web services: Distributed Annotation Server and linking to SUPERFAMILY.
Downloads: Sequences, assignments, models, MySQL database and scripts, updated weekly.
The SUPERFAMILY database provides protein domain assignments, at the SCOP superfamily
level, for the predicted protein sequences in over 400 completed genomes. A superfamily groups
together domains of different families which have a common evolutionary ancestor based on
structural, functional and sequence data. SUPERFAMILY domain assignments are generated
using an expert curated set of profile hidden Markov models. All models and structural
assignments are available for browsing and download from http://supfam.org. The web interface
includes services such as domain architectures and alignment details for all protein assignments,
searchable domain combinations, domain occurrence network visualization, detection of over- or
under-represented superfamilies for a given genome by comparison with other genomes,
assignment of manually submitted sequences and keyword searches. In this update we describe
the SUPERFAMILY database and outline two major developments: (i) incorporation of family
level assignments and (ii) a superfamily-level functional annotation. The SUPERFAMILY
database can be used for general protein evolution and superfamily-specific studies, genomic
annotation, and structural genomics target suggestion and assessment.
Schnipsel Blast: The results of a Schnipsel Blast search are included in the bubble diagram. Additionally, PDB is searched.
Taxonomic breakdown: When selecting multiple proteins, e.g. via Selective SMART or the annotation pages, an overview of the taxonomy of all species (tax break) is offered.
Links: Links don't contain version numbers. This allows stable links from external sources.
Selective SMART: Allows searching for multiple copies of domains and is case insensitive. You can now search for e.g. Sh3 AND sH3 AND sh3.
Annotation: You can align your query sequence to the SMART alignment using hmmalign.
Update of underlying database: SMART now uses PostgreSQL 6.5.2.
Changes from version 2.0 to 3.0
Digest output: SMART now only produces a single diagram representing a best interpretation of all the annotation that has been performed. A comprehensive summary of the results is also provided in table format.
Selective SMART: Selective SMART allows you to look for proteins with combinations of specific domains in different species or taxonomic ranges.
Alert SMART: The SMART database gets updated about once a week. If you are interested in specific domains, or combinations of them, in specific taxonomic ranges, you can use the SMART alerting service. This provides the identities of newly deposited proteins that match your query.
Domain queries: You can ask for proteins having the same domain order and composition as your query protein.
SMARTed Genomes: You can get the result of SMARTing the genomes of Caenorhabditis elegans and Saccharomyces cerevisiae via the annotation pages.
Faster PFAM searches: The PFAM searches now run on a PVM cluster.
Changes from version 1.03 to 2.0
Domain coverage: The original set of signalling domains has now been extended to include extracellular domains.
Default search method is now HMMer: The previously used underlying methodology of SMART (i.e. SWise) tended to over-extend gaps, leading to problems in defining domain borders. Although a SWise search of the SMART database remains possible (using the 'Wise searching SMART database' option available from the Home Page), the default searching method is now HMMer2. This now allows improved statistical estimates (Expectation- or E-values) of the significance of a domain hit.
Complete rewrite of the annotation pages: As with the previous version (1.03), annotation pages provide information concerning domain functions and include hyperlinks to PubMed. However, we now offer automatically derived data showing the taxonomic range and the predicted cellular localisation of proteins containing the domain in question.
Literature database: In addition to the manually derived literature sources given in version 1.03, we now provide automatically derived secondary literature from the annotation pages. These are extracted from PubMed and are cross-linked to additional abstracts.
The CATH domain structure database: new protocols and classification levels give a more comprehensive resource for exploring evolution
Lesley H. Greene, Tony E. Lewis, Sarah Addou, Alison Cuff, Tim Dallman, Mark Dibley, Oliver Redfern, Frances Pearl, Rekha Nambudiry, Adam Reid, Ian Sillitoe, Corin Yeats, Janet M. Thornton and Christine A. Orengo
WHAT'S NEW
We report the latest release (version 3.0) of the CATH protein domain
database (http://www.cathdb.info). There has been a 20% increase in the
number of structural domains classified in CATH, up to 86 151 domains.
Release 3.0 comprises 1110 fold groups and 2147 homologous
superfamilies. To cope with the increases in diverse structural
homologues being determined by the structural genomics initiatives,
more sensitive methods have been developed for identifying boundaries
in multi-domain proteins and for recognising homologues. The CATH
classification update is now being driven by an integrated pipeline that links these
automated procedures with validation steps that have been made easier by the
provision of information-rich web pages summarising comparison scores and
relevant links to external sites for each domain being classified. An analysis of the
population of domains in the CATH hierarchy and several domain characteristics are
presented for version 3.0. We also report an update of the CATH Dictionary
of homologous structures (CATH-DHS), which now contains multiple
structural alignments, consensus information and functional
annotations for 1459 well populated superfamilies in CATH. CATH is
directly linked to the Gene3D database, which is a projection of CATH
structural data onto ~2 million sequences in completed genomes and UniProt.
GENERAL INTRODUCTION
The number of new structures being deposited in the Protein Data Bank
(PDB) continues to grow at a considerable rate. In addition, structures
being targeted by worldwide structural genomics initiatives are more
likely to be novel or only very remotely related to domains previously
classified. Only 2% of structures currently solved by conventional
crystallography or NMR are likely to adopt novel folds (see Figures 1 and 2).
A higher proportion of new folds is expected to be solved by
structural genomics. Although the influx of more diverse
structures and subsequent analysis will inform our understanding of
how domains evolve, it has resulted in increasing lags between the
numbers of structures being deposited and classified. In response to this
situation we have significantly improved our automated and manual protocols for
domain boundary assignment and homologue recognition.
Figure 1. Annual decrease in the percentage of new structures classified in CATH which are observed to possess a novel fold. The raw data for years 1972–2005 was fit to a single exponential equation by nonlinear regression using Sigma Plot (SPSS Version
Figure 2. Annual proportion of protein structures deposited in the PDB which are classified in CATH, rejected or pending classification. The colour scheme reflects different categories of PDB chains. Black: not accepted by the CATH criteria. Red: unprocessed chains
Significant changes have been implemented in the CATH classification protocol to achieve a more highly automated system. A seamless flow of structures between the constituent programs has been achieved by building a pipeline which integrates web services for each major comparison stage in the classification (see Figure 3). Secondly, completely automatic decisions are now being made for new protein chains with close relatives already assigned in the CATH database. There are two situations that preclude the CATH update process from being fully automated. We rely on expert manual curation for particularly challenging protein domain boundary assignments (DomChop stage) and also for classifications of remote folds and homologues (HomCheck stage). These two manual stages will remain an integral part of the system (Figure 3).
Figure 3. Flow diagram of the CATH classification pipeline. This schematic illustrates the processes involved in classifying newly determined structures in CATH. The CATH update protocol workflow, from new chain to assigned domain, is split into two main processes
In this paper we report our ongoing development of the automated procedures. These critical new features should better enable CATH to keep pace with the PDB (3) and facilitate its development. Key statistics on the domain structure populations and characteristics are also presented.
A REVISED CATH CLASSIFICATION HIERARCHY: CATHSOLID
In order to provide information on the sequence diversity between superfamily members, we have introduced additional levels into the CATH hierarchy. The CATH hierarchical classification scheme now consists of nine levels. Class is derived from secondary structure content, and Architecture describes the gross orientation of secondary structures, independent of connectivity. The Topology level clusters structures into fold groups according to their topological connections and numbers of secondary structures. The Homologous superfamilies cluster proteins with highly similar structures, sequences and/or functions (4,5). The new extension of the CATH classification system now includes five 'SOLID' sequence levels. S, O, L and I further divide domains within the H-level using multi-linkage clustering based on similarities in sequence identity (35%, 60%, 95% and 100%) (see Table 1). The D-level acts as a counter within the I-level and is appended to the classification hierarchy to ensure that every domain in CATH has a unique CATHSOLID identification code (see Table 1). Specific details on the nature of the SOLID-levels can be found in the 'General Information' section of the CATH website, http://www.cathdb.info. CATH only includes experimentally determined protein structures with a 4 Å resolution or better, 40 residues in length or longer, and having 70% or more side chains resolved.
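The nine-level identifier and the inclusion criteria described above can be pictured with a short Python sketch. It assumes a dot-separated code with one integer field per level; the class name, the function names and the example code string are illustrative only and are not taken from CATH itself.

```python
from dataclasses import dataclass

# The nine levels named in the text, in hierarchical order.
LEVELS = ("C", "A", "T", "H", "S", "O", "L", "I", "D")

# Sequence-identity cut-offs (percent) quoted for the S, O, L and I levels.
IDENTITY_CUTOFFS = {"S": 35, "O": 60, "L": 95, "I": 100}


@dataclass(frozen=True)
class CathSolidId:
    """Hypothetical container for a nine-level CATHSOLID code."""
    fields: tuple

    @classmethod
    def parse(cls, code: str) -> "CathSolidId":
        parts = code.split(".")
        if len(parts) != len(LEVELS):
            raise ValueError(f"expected {len(LEVELS)} levels, got {len(parts)}")
        return cls(tuple(int(p) for p in parts))

    def prefix(self, level: str) -> str:
        """Return the code truncated at a level, e.g. 'H' for the superfamily."""
        depth = LEVELS.index(level) + 1
        return ".".join(str(f) for f in self.fields[:depth])


def meets_inclusion_criteria(resolution_angstrom: float, length: int,
                             fraction_side_chains_resolved: float) -> bool:
    """Apply the three thresholds stated in the text (4 A, 40 residues, 70%)."""
    return (resolution_angstrom <= 4.0
            and length >= 40
            and fraction_side_chains_resolved >= 0.70)


# Made-up example code: only the structure of the identifier matters here.
example = CathSolidId.parse("3.40.50.300.1.2.1.4.1")
print(example.prefix("H"))  # the C.A.T.H part shared by a whole superfamily
print(example.prefix("S"))  # the same code extended to the 35% identity cluster
```

Two domains share a superfamily when their `prefix("H")` values are equal, and the D field alone distinguishes domains that are 100% identical in sequence.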
Table 1. CATH version 3.0 statistics
DOMAIN BOUNDARY ASSIGNMENTS
We have further improved our automated domain boundary prediction method, CATHEDRAL (4). This is used to search a newly determined multi-domain structure against a library of representative structures from different fold groups in the CATH database to recognise constituent domains. CATHEDRAL performs an initial rapid secondary structure comparison between structures, using graph theory to identify putative fold matches, which are then more carefully aligned using a slower, more accurate dynamic programming method. A new scoring scheme has been implemented which combines information on the class of the domains being compared, their sizes, similarity in the structural environments and the number of equivalent residues. A support vector machine is used to combine the different scores and select the best fold match for each putative domain in a new multi-domain protein.
Benchmarking against a set of 964 'difficult' multi-domain chains, whose 1593 constituent domains were remotely related to folds in CATH (<35% sequence identity) and originated from 245 distinct fold groups and 462 superfamilies, showed that 90% of domains within these chains could be assigned to the correct fold group, and for 78% of them the domain boundaries were within ±15 residues of boundaries assigned by careful manual validation. Larger variations in domain boundaries are often due to the fact that in many families significant structural variation can occur during evolution, so that distant relatives vary considerably in size. If no close relative has been classified in CATH, it is likely that the only CATHEDRAL match will be to a relative with significant structural embellishments, thus making it harder to determine the correct boundaries.
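The ±15-residue agreement used in the benchmark can be written as a small check. This is only a sketch: it assumes boundaries are given as residue positions listed in the same order for the prediction and the manual reference, and the function name and example numbers are illustrative.

```python
def boundaries_agree(predicted, reference, tolerance=15):
    """True if every predicted boundary lies within `tolerance` residues
    of the corresponding manually validated boundary."""
    if len(predicted) != len(reference):
        # A different number of boundaries means a different number of domains.
        return False
    return all(abs(p - r) <= tolerance for p, r in zip(predicted, reference))


# Illustrative positions only, not benchmark data.
print(boundaries_agree([100, 210], [110, 200]))  # both within 15 residues
print(boundaries_agree([100], [120]))            # 20 residues apart
```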
Since domain boundary assignment of remote homologues is one of the most time-consuming stages in the classification, we combine multiple information for each new structure on a web page to guide manual curation. Pages display scores from a range of algorithms, which include structure-based methods, CATHEDRAL (4), SSAP (6,7), DETECTIVE (8), PUU (9), DOMAK (10), sequence-based methods such as hidden Markov Models (HMMs) (11), and relevant literature. These pages are now viewable for information on putative boundaries for new multi-domain structures currently
being classified in CATH (e.g. http://www.cathdb.info/cgi-bin/cath/Chain.pl?chain_id=2g3aA).
For protein chains which are closely related to chains that are already chopped in CATH, an automated protocol has been developed (ChopClose). ChopClose identifies any previously chopped chains that have sufficiently high sequence identity and overlap with the query chain. Using SSAP (6), the query is aligned against each of these chains in turn, and in each case the domain boundaries are inherited across the alignment. The process of inheriting the boundaries often requires some adjustments to be made to account for insertions, deletions or unresolved residues. If the inheritance from one of the chains meets various criteria (SSAP score ≥80, sequence identity >80%, RMSD ≤6.0 Å, longest end extension ≤10 residues, etc.), then the resulting boundaries are used to chop the chain automatically (AutoChop). For cases where ChopClose's best result does not meet all the criteria for automatic chopping, it is provided as support information for a manual domain boundary assignment.
Refer to Figure 3 for the location of AutoChop/ChopClose within the CATH update protocol.
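The decision between automatic chopping and manual curation can be sketched in Python as follows. Only the four thresholds quoted above are encoded; the text says "etc.", so the real protocol applies further criteria that are not shown. The class and function names are hypothetical, and ranking candidates by SSAP score is an assumption made for this illustration.

```python
from dataclasses import dataclass


@dataclass
class Inheritance:
    """Result of inheriting boundaries from one previously chopped chain."""
    ssap_score: float
    sequence_identity: float      # percent
    rmsd: float                   # angstroms
    longest_end_extension: int    # residues


def can_autochop(r: Inheritance) -> bool:
    # The four criteria listed in the text; others exist but are not stated.
    return (r.ssap_score >= 80
            and r.sequence_identity > 80
            and r.rmsd <= 6.0
            and r.longest_end_extension <= 10)


def choose_action(results):
    """Return ('AutoChop', result) if any inheritance passes, otherwise
    ('manual', best_result) so the best attempt can support a curator."""
    passing = [r for r in results if can_autochop(r)]
    if passing:
        return "AutoChop", max(passing, key=lambda r: r.ssap_score)
    if results:
        return "manual", max(results, key=lambda r: r.ssap_score)
    return "manual", None


# Illustrative values only.
good = Inheritance(ssap_score=88.0, sequence_identity=92.0, rmsd=1.5, longest_end_extension=3)
weak = Inheritance(ssap_score=75.0, sequence_identity=85.0, rmsd=2.0, longest_end_extension=3)
print(choose_action([weak, good])[0])
print(choose_action([weak])[0])
```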
NEW HOMOLOGUE RECOGNITION METHODS
We have assessed a number of HMM-based protocols for improving homologue recognition. A new protocol (Samosa), exploiting models built using multiple structure alignments to improve accuracy, gives some improvements in sensitivity (4–5%). However, a protocol exploiting an 8-fold expanded HMM library based on sequence relatives of structural domains gives an increase of nearly 10% in sensitivity (12). In addition, HMM–HMM-based approaches have been implemented using the PRC protocol of Madera and co-workers (http://supfam.org/PRC). These allow recognition of extremely remote homologues, some of which are not easily detected by the structure comparison methods (discussed further below). HMM-based database scans developed for the CATH classification protocol are collectively referred to as HMMscan below.
For some very remotely related homologues, confidence in an assignment can be improved by combining information from multiple prediction methods. We have investigated the benefits of using machine learning methods to do this automatically. A neural network was trained using a dataset of 14 000 diverse homologues (<35% sequence identity) and 14 000 non-homologous pairs, with data from different homologue comparison methods including structure comparison (CATHEDRAL, SSAP), sequence comparison (HMM–HMM) and information on functional similarity. The latter was obtained by comparing EC classification codes between close relatives of the distant homologues and using a semantic similarity scoring
scheme for comparing GO terms based on a method developed by Lord et al. (13). On a separate validation set of 14 000 homologous pairs and 14 000 non-homologous pairs, 97% of the homologues can be recognised at an error rate of <4%.
NEW UPDATE PROTOCOL
Automatic methods
Previously, CATH data was generated using a group of independent
programs and flat files. Over the past two years we have developed an
update protocol for CATH that is driven by a suite of programs with a
central library and a PostgreSQL database system. A classification
pipeline has been established which links, in a completely automated
fashion, the different programs that analyse the sequences and structures
of both protein chains and domains. The CATH update protocol can
essentially be divided into two parts: domain boundary assignment and
domain homology classification (see Figure 3). The aim of the protocol is
to minimise manual assignment and provide as much support as
possible when manual validation is necessary.
Processing of both parts of the classification protocol is similar, requiring related meta-data and the triggering of the same automated algorithms. Methods include pairwise sequence similarity comparisons and scans by other homologue detection or fold recognition algorithms, such as HMMscan and CATHEDRAL, that provide data for either manual or automated assignment. Many of the automated steps in the protocol have been established as a web service, and the pipeline integrates automated steps together with 'holding stages' in which domains are held prior to processing and await the completion of manual validation of predictions (see below).
Web pages to support manual validation
For each manual stage in the classification (domain boundary assignment, DomChop, and homologue recognition, HomCheck) we have developed a suite of web pages bringing together all available meta-data from prediction algorithms (e.g. DBS, CATHEDRAL, HMMscan for DomChop; CATHEDRAL, HMMscan for HomCheck) and information from the literature and from other family classifications with relevant data (e.g. Pfam). For each protein or domain shown on the pages, information on the statistical significance of matches is presented. The web pages will shortly be made viewable and will provide interim data on protein chains and domains not fully classified in CATH, for biologists interested in any entries pending classification.
OVERVIEW OF THE CURRENT RELEASE (VERSION 3.0)
Assigning domain boundaries and relationships between protein structures is computationally challenging. Since the last CATH release (version 2.6), the number of domains in the CATH database has increased by 20% in version 3.0 and now totals 86 151. This is a more than 10-fold increase in the number of domains classified in CATH since its creation. Improvements in automation, and also in the web-based resources used to aid manual validation, have allowed us to increase the proportion of hard-to-classify structures processed in CATH, and this is reflected in a significant increase in the number of new folds in the database, now more than 1000. The detailed breakdown of numbers of domains in the nine CATH levels is given in Table 1. We conducted an analysis of the domains in version 3.0 and have derived statistics for several fundamental features.
Percentage of new topologies
An analysis of the percentage of new folds arising from the early 1970s to the present is shown in Figure 1. The number of new folds has been decreasing over time with respect to the number of new structures being deposited, and it can be seen that currently approximately 2% of new structures classified in CATH are observed to be novel folds. For comparison, the number of domain structures solved over time is also graphically represented in Figure 1.
Number of domains within a protein chain
Integral to the construction of the CATH database is designating domain boundaries. We conducted an analysis of the number of chains versus number of domains in a chain. It is interesting to note that 64% of all protein structures currently solved and classified in CATH are single domain chains (data not shown). The next most prevalent are two domain chains (27%), and following this we find that the number of chains containing three or more domains rapidly decreases. The average size of the single domain chains is 159 residues in length.
The CATH Dictionary of Homologous Superfamilies (CATH-DHS)
The CATH-DHS has also been recently updated. Data on structural similarity and superfamily variability is presented as a significant update to the Dictionary of Homologous Superfamilies (DHS) web resource (14). The DHS also provides functional annotations of domains within each H-level (superfamily) in CATH v2.5.1.
For each superfamily, pair-wise structural similarity scores between relatives, measured by SSAP, are presented. The DHS now contains 3307 multiple structural alignments for 1459 superfamilies. For each superfamily, multiple alignments are generated for all the relatives and also for subgroups of structurally similar relatives and sequence similar relatives. Alignments are performed using the residue-based CORA algorithm (15) and presented both as CORAPLOTS (14) and in the form of a 2DSEC diagram (16), alongside co-ordinate data of the superposed structures in PDB format. Sequence representations of the alignments are available to download in FASTA format. In the CORAPLOT images of the multiple alignment, residues in each domain are coloured according to ligand binding and residue type. EquivSEC plots are also shown that describe the variability in orientation and packing between equivalent secondary structures (16).
To identify sequence relatives for CATH superfamilies, sequences from UniProt (17) were scanned against HMMs of all CATH domains (12). Homologous sequences were identified as those hits with an E-value < 0.01 and a 60% residue overlap with the CATH domain. This protocol recognised over one million domain sequences in UniProt which could be integrated in the CATH-DHS. The harvested sequences in each superfamily were compared against other relatives by BLAST (18) to determine the pair-wise sequence identity and then clustered at appropriate levels of sequence identity (35% and 95%) using multi-linkage clustering. Information and links to other functional databases, ENZYME (19), GO (Gene Ontology Consortium, 2000), KEGG (20), COG (21) and SWISSPROT (22), are also included by BLASTing the sequences from each superfamily against sequences provided by these resources. Only hits with 95% sequence identity and an 80% residue overlap were used to annotate sequences.
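The two filters in this paragraph (accepting an HMM hit as a homologue, and accepting a BLAST hit as a source of annotation) can be sketched as simple predicates. The thresholds are read from the text above, where the E-value cut-off appears without its decimal point, so it is left as an adjustable parameter. The class and function names are hypothetical, and treating the overlap and identity cut-offs as "at least" is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One database hit; the fields are illustrative."""
    sequence_id: str
    e_value: float
    overlap: float            # fraction of the domain covered, 0.0 to 1.0
    identity: float = 0.0     # percent sequence identity (used for BLAST hits)


def is_homologue(hit: Hit, max_e_value: float = 0.01, min_overlap: float = 0.60) -> bool:
    """HMM hit accepted as a sequence relative of a CATH domain."""
    return hit.e_value < max_e_value and hit.overlap >= min_overlap


def can_transfer_annotation(hit: Hit, min_identity: float = 95.0,
                            min_overlap: float = 0.80) -> bool:
    """BLAST hit close enough to inherit functional annotation."""
    return hit.identity >= min_identity and hit.overlap >= min_overlap


# Made-up hits, for illustration only.
hits = [
    Hit("seqA", e_value=1e-20, overlap=0.90, identity=97.0),
    Hit("seqB", e_value=1e-5, overlap=0.65, identity=40.0),
    Hit("seqC", e_value=0.5, overlap=0.95, identity=30.0),
]
print([h.sequence_id for h in hits if is_homologue(h)])
print([h.sequence_id for h in hits if can_transfer_annotation(h)])
```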
Recent analysis of structural and functional divergence in highly populated CATH superfamilies (>5 structural relatives with <35% sequence identity) has been undertaken using data from the DHS. The 2DSEC algorithm was used to analyse multiple structural alignments of families and identify highly conserved structural cores and secondary structure embellishments, or decorations, to the common core. In some large superfamilies extensive embellishments were observed outside the core, and although these secondary structure insertions were frequently discontinuous in the protein chain, they were often co-located in 3D space (16). In many cases manual inspection revealed that the embellishment had aggregated to form a larger structural feature that was modifying the active site of the domain or creating new surfaces for domain or protein interactions. Data collected in the DHS clearly shows a relationship between structural divergence within a superfamily, sequence divergence of this superfamily amongst predicted domains in the genomes, and the number of distinct functional groups that can be identified for the superfamily (see Figure 4).
Figure 4. Relationship between sequence variability, structural variability and functional diversity in CATH superfamilies. Structural variation in a CATH superfamily, as measured by the number of diverse structural subgroups (SSAP score <80 between groups)
LATERAL LINKS ACROSS THE CATH HIERARCHY TO CAPTURE EVOLUTIONARY DIVERGENCE AND EXPLORE THE STRUCTURAL CONTINUUM
Our analysis of structural divergence in CATH superfamilies (16) has revealed families where significant changes in the structures had occurred. In some cases 5-fold differences in the sizes of domains were identified, and sometimes it was apparent that the 'folds' of these very diverse relatives had effectively changed. Therefore, in these superfamilies more than one fold group can be identified, effectively breaking the hierarchical nature of the CATH classification, which implies that each relative within a CATH homologous superfamily should belong to the same CAT fold group.
In addition, an 'all versus all' HMM–HMM scan between all superfamily representatives revealed several cases of extremely remote homologues which had been classified into separate superfamilies and yet match with significant E-values. Structure comparison had failed to detect the relationship between these superfamilies because the structural divergence of the relatives was so extreme, sometimes constituting a change in architecture as well as fold group. In these cases homology was only suggested by the HMM-based scans and then manually validated by considering functional information and detailed evidence from literature. In order to capture information on these distant homologies, links have been created between the superfamilies, both on our web pages and in the CATH database. The data can now be found as a link from the CATH homepage (http://www.cathdb.info).
In the near future we also plan to provide web pages presenting cases of significant structural overlaps between superfamilies or fold groups. For these cases we are not currently able to find any additional evidence to support a distant evolutionary relationship, and these examples highlight the recurrence of large structural motifs between some folds and the existence of a structural continuum in some regions of fold space.
FEATURES
The CATH database can be accessed at http://www.cathdb.info. The web interface may be browsed or, alternatively, searched with PDB codes or CATH domain identifiers. There is also a facility for keyword searches. With the version 3.0 release we now make the raw and processed data files available, which include, for example, CATH domain PDB files, sequences and dssp files, and they can be accessed through the CATH database main page. The Gene3D resource can be accessed through the CATH database or directly at http://www.cathdb.info/Gene3D. The DHS can be accessed through the CATH database or directly at http://www.cathdb.info/bsm/dhs.
Structural similarity is assessed using an automatic method (SSAP) (3,4), which scores 100 for
identical proteins and generally returns scores above 80 for homologous proteins. More distantly
related folds generally give scores above 70 (Topology or fold level), though in the absence of any
sequence or functional similarity this may simply represent examples of convergent evolution,
reinforcing the hypothesis that there exists a limited number of folds in nature (5,6).
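The SSAP score ranges quoted above can be turned into a rough interpretation helper. This is a sketch only: the text describes what scores "generally" indicate, so the hard cut-offs and the label strings below are simplifying assumptions, not CATH's actual decision procedure.

```python
def interpret_ssap(score: float) -> str:
    """Map an SSAP score (0 to 100) onto the rough bands given in the text."""
    if not 0 <= score <= 100:
        raise ValueError("SSAP scores range from 0 to 100")
    if score == 100:
        return "identical"
    if score > 80:
        return "likely homologous"
    if score > 70:
        return "similar fold (topology level)"
    return "no clear structural similarity"


for s in (100, 85.5, 74, 60):
    print(s, "->", interpret_ssap(s))
```

A score in the 70 to 80 band needs supporting sequence or functional evidence before homology is inferred, since the text notes it may reflect convergent evolution.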
Abstract
We report the latest release (version 1.4) of the CATH protein domains database
(http://www.biochem.ucl.ac.uk/bsm/cath). This is a hierarchical classification of 13 359 protein
domain structures into evolutionary families and structural groupings. We currently identify 827
homologous families in which the proteins have both structural similarity and sequence and/or
functional similarity. These can be further clustered into 593 fold groups and 32 distinct architectures.
Using our structural classification and associated data on protein functions stored in the database (EC
identifiers, SWISS-PROT keywords and information from the Enzyme database and literature), we
have been able to analyse the correlation between the 3D structure and function. More than 96% of
folds in the PDB are associated with a single homologous family. However, within the superfolds,
three or more different functions are observed. Considering enzyme functions, more than 95% of
clearly homologous families exhibit either single or closely related functions, as demonstrated by the
EC identifiers of their relatives. Our analysis supports the view that determining structures, for
example as part of a 'structural genomics' initiative, will make a major contribution to interpreting
genome data.
Introduction FOR CATH
The CATH classification of protein domain structures was established in 1993 (1) as a hierarchical clustering of protein domain structures into evolutionary families and structural groupings, depending on sequence and structure similarity. There are four major levels, corresponding to protein class, architecture, topology or fold, and homologous family (Fig. 1). Since 1995, information about these structural groups and protein families has been accessible over the Web (http://www.biochem.ucl.ac.uk/bsm/cath), together with summary information about each individual protein structure (PDBsum) (2).
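The four levels correspond to the four dot-separated numbers of a CATH identifier, such as 3.40.50.200 (the subtilisin family id quoted in the Figure 2 caption, written with its dots). A minimal sketch of splitting such an identifier follows; the names `CathId`, `parse_cath_id` and `level_prefix` are hypothetical helpers for illustration, not part of any CATH software.

```python
from typing import NamedTuple


class CathId(NamedTuple):
    protein_class: int
    architecture: int
    topology: int
    homologous_family: int


def parse_cath_id(text: str) -> CathId:
    """Split a dotted four-part CATH identifier into its C, A, T and H numbers."""
    parts = text.strip().split(".")
    if len(parts) != 4:
        raise ValueError(f"expected four dot-separated numbers, got {text!r}")
    return CathId(*(int(p) for p in parts))


def level_prefix(cath_id: CathId, level: str) -> str:
    """Return the identifier truncated at one level, given as 'C', 'A', 'T' or 'H'."""
    if len(level) != 1 or level.upper() not in "CATH":
        raise ValueError("level must be one of C, A, T, H")
    depth = "CATH".index(level.upper()) + 1
    return ".".join(str(n) for n in cath_id[:depth])
```

Two domains share a fold group when their T-level prefixes match, for example `level_prefix(parse_cath_id("3.40.50.200"), "T")` gives `"3.40.50"`.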
CATH consists of both phylogenetic and phenetic descriptors for protein domain relationships. At the lowest levels in the hierarchy, proteins are grouped into evolutionary families (homologous families) for having either significant sequence similarity (≥35% identity) or high structural similarity and some sequence similarity (≥20% identity).
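The grouping rule above can be sketched as a predicate. The text here says only "high structural similarity"; the SSAP cutoff of 80 used below is an assumption borrowed from the later passage that defines clear homologues as SSAP ≥80 with ≥20% sequence identity. The function name is hypothetical, and in practice borderline cases are checked manually.

```python
def in_same_homologous_family(seq_identity_pct: float,
                              ssap_score: float,
                              ssap_cutoff: float = 80.0) -> bool:
    """Apply the two alternative criteria for grouping into a homologous family.

    ssap_cutoff is an assumed stand-in for "high structural similarity".
    """
    # Criterion 1: significant sequence similarity on its own.
    if seq_identity_pct >= 35.0:
        return True
    # Criterion 2: high structural similarity plus some sequence similarity.
    return ssap_score >= ssap_cutoff and seq_identity_pct >= 20.0
```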
The CATH database of protein structures contains approximately 18 000 domains organized according to their (C)lass, (A)rchitecture, (T)opology and (H)omologous superfamily. Relationships between evolutionarily related structures (homologues) within the database have been used to test the sensitivity of various sequence search methods in order to identify relatives in GenBank and other sequence databases. Subsequent application of the most sensitive and efficient algorithms, gapped BLAST and the profile-based method Position Specific Iterated Basic Local Alignment Tool (PSI-BLAST), could be used to assign structural data to between 22 and 36% of microbial genomes in order to improve functional annotation and enhance understanding of biological mechanism. However, on a cautionary note, an analysis of functional conservation within fold groups and homologous superfamilies in the CATH database revealed that whilst function was conserved in nearly 55% of enzyme families, function had diverged considerably in some highly populated families. In these families, functional properties should be inherited far more cautiously, and the probable effects of substitutions in key functional residues must be carefully assessed.
Figure 1
Schematic representation of the (C)lass, (A)rchitecture and (T)opology/fold levels in the CATH database.
Figure 2
Snapshot of a web page showing data available in the CATH dictionary of homologous superfamilies for the subtilisin family (CATH id 3.40.50.200). Tables display the PDB codes for non-identical relatives in the family, together with EC identifier codes and information about the enzyme reactions. The multiple structural alignment shown has been coloured according to secondary structure assignments (red for helix, blue for strands).
The Architecture level in CATH groups proteins whose folds have similar 3D arrangements of secondary structures (e.g. barrel, sandwich or propeller), regardless of their connectivity, whilst the top level, Class, simply reflects the proportion of α-helix or β-strand secondary structures. Three major classes are recognised, mainly-α, mainly-β and α-β, since analysis revealed considerable overlap between the α+β and alternating α/β classes originally described by Levitt and Chothia (7).
Before classification, multidomain proteins are first separated into their constituent folds using a consensus method which seeks agreement between three independent algorithms (8). Whilst the protocol for updating CATH is largely automatic (9), several stages require manual validation, in particular establishing domain boundaries in proteins for which no consensus could be reached and checking the relationships of very distant homologues and proteins having borderline fold similarity. Although there are plans to assign the more regular architectures automatically, all architecture groupings are currently assigned manually.
A homologous family Dictionary is now available within CATH, which contains functional data, where available, for each protein within a homologous family. This includes EC identifiers, SWISS-PROT keywords and information from the Enzyme database or the literature (Fig. 2). Multiple structure-based alignments are also available, coloured according to secondary structure assignments or residue properties, and there are schematic plots showing domain representations annotated by protein-ligand interactions (DOMPLOTS) (A.E. Todd, C.A. Orengo and J.M. Thornton, submitted to Protein Engng). The topology of each domain is illustrated by schematic TOPS diagrams (http://www3.ebi.ac.uk/tops; 10).
Figure 3
CATH wheel plot showing the population of homologous families in different fold groups, architectures and classes. The wheel is coloured according to protein class (red, mainly-α; green, mainly-β; yellow, α-β; blue, few secondary structures). The size of the outer wheel represents the number of homologous families in CATH, whilst each band in the outer wheel corresponds to a single fold family. The size of each fold 'band' therefore reflects the number of homologous families having that fold. It can be seen that most fold families contain a single homologous family. The superfold families are shown as paler bands containing many homologous families. The inner wheel shows the population of homologous families in the different architectures.
We have also recently set up a Web Server (11) which enables the user to scan the CATH database with a newly determined protein structure and identify possible fold similarities or evolutionary relationships. There are also plans to incorporate sequence searches (using BLAST or PSI-BLAST) (12) to identify a probable fold for a new sequence.
The latest release of CATH (version 1.4, April 1998) contains 9342 protein chains from the PDB (13), which divide into 13 359 domain folds. Currently 32 different architectures are recognised. Since the last release, three new architectures have been described, including the five-bladed α-β propeller. Grouping proteins on the basis of sequence, structure and functional similarity gives 827 evolutionary homologous families (H-level), whilst recognising more distant structural similarity with no accompanying sequence or function similarity gives rise to 593 different fold groups (T-level).
The population of the different levels in the CATH hierarchy is illustrated by the CATH wheel shown in Figure 3. It can be seen that several highly populated fold families, which we describe as superfolds (6) as they support a diverse range of sequences and more than three different functions, still account for nearly 30% of non-homologous structures.
Implications for Structural Genomics
As the sequence databases grow rapidly, the need to interpret these sequences and assign functions to specific genes becomes increasingly important. Many techniques exist for matching protein sequences and thereby inheriting functional information. However, for very distant homologues there is often no detectable sequence similarity, despite conservation of 3D structure and function. For these cases, evolutionary relationships, and thereby functions, can only be assigned by comparing the structures. Therefore a number of structural genomics initiatives are being proposed (14) which aim to identify all the folds in nature, with the ultimate goal of being able to predict the function of a new protein from its known or probable structure. The important questions to ask are: how many more folds do we need to determine before we have the complete set, and how confident can we be in assigning function between proteins having similar structures?
In the current genomes, on average only 30–46% of sequences can be assigned to a structural family by recognising sequence similarity to a protein of known structure (15,16). With only ~600 unique structures currently in the PDB, compared with ~20 000 sequence families, it is clear that we still need to determine many more structures if we are to understand biology at the molecular level. However, analysis of recently deposited structural data is very revealing. Figure 4a illustrates the distribution of 2159 new structural domains classified in the 10 months from June 1997 to March 1998. A large proportion of these (79%) were clearly homologous (≥30% identity) to proteins of known structure.
Of the remaining 443 structures (Fig. 4b), corresponding to 'new' sequences, we found only 8% were novel folds, the remainder resembling a previously determined structure. Many of these, 199 (45%), could be identified as clear homologues by having significant structure and sequence similarity (SSAP ≥80 and ≥20% sequence identity). A further 169 (38%) were probable homologues as, although the sequence identity was below 20%, they had functional similarity and/or gave significant scores using sequence search methods designed to detect very distant homologues (PSI-BLAST) (12). There remained a further 40 (9%) proteins which were analogous, i.e. they had the same fold as a previous entry but neither the sequence nor the function gave definite evidence of a common ancestor.
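The percentages in this breakdown can be checked against the quoted counts with a few lines of arithmetic. The count of novel folds is not stated in the text; the sketch below derives it on the assumption that the four categories account for all 443 structures.

```python
total_new_domains = 2159
new_sequence_structures = 443

counts = {
    "clear homologues": 199,
    "probable homologues": 169,
    "analogues": 40,
}
# Assumption: whatever is left over corresponds to the novel folds.
counts["novel folds"] = new_sequence_structures - sum(counts.values())


def pct(part: int, whole: int) -> int:
    """Percentage of `whole` made up by `part`, rounded to the nearest integer."""
    return round(100 * part / whole)


shares = {name: pct(n, new_sequence_structures) for name, n in counts.items()}
homologous_by_structure = shares["clear homologues"] + shares["probable homologues"]
```

Under that assumption the leftover is 35 structures, which rounds to the quoted 8%, and the clear plus probable homologues add up to 83% of the structures with new sequences.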
At the homologous superfamily level in CATH, a more detailed analysis of enzyme functions showed that the majority of homologous enzyme families in CATH (>90%) contained proteins for which the first three EC identifiers were the same. Considering those families where homologues have significant sequence identity (≥20%) after structural alignment, 95% were found to have a single EC identifier, whilst for families where proteins have more than 30% sequence similarity we observed that 98% had a single EC code.
Although assigning function on the basis of homology is common practice, it is clear that some caution should be exercised, particularly where there is little or no sequence similarity. There are also some clear examples where homologues with significant sequence similarity perform different functions. The role of 'gene recruitment' is especially clear in the eye lens proteins, which function as enzymes in other cellular environments but which are used as structural proteins in this context (17). The extent of such 'gene recruitment' and context-sensitive function is really not known at this time. For enzymes, it is clear that catalytic function can change and evolve, usually to act on a different but related substrate. Similarly, within the lipocalin family (CATH id 2.40.130.10), several proteins are found with very similar structures which bind different fatty acids in the same region at the base of the β-barrel (e.g. retinol, bilin, biotin).
Nearly half of the homologous families where two or more different EC numbers were observed belong to the superfolds. This suggests that if a new protein is assigned to a superfold family, more caution should be used when inheriting functional information, as there appears to be greater tolerance to changes in sequence, and ultimately function, for these families. However, it is interesting to note that many of these were TIM barrel or Rossmann folds. These are superfolds in which the substrate or ligand commonly binds in the same place: in the base of the β-barrel for the TIMs and at the crossover of the polypeptide chain for the doubly wound Rossmann structures.
Assignment of Function Through Structure
One of the reasons for determining structures is to derive more information to facilitate the assignment of function. From our analysis of proteins in CATH we suggest that structural data can help to assign function in several ways:
i. The structural data allow recognition of more distant homologues compared with sequence data. In our analysis, 83% of structures with novel sequences could be assigned as homologues in this way (note that such assignment of function is again subject to the caveats imposed by 'gene recruitment' discussed above).
ii. The structural data allow detailed inspection of the functional site, to suggest if and how the function may have evolved. For example, if an enzyme has evolved to act on a different substrate, the binding site may reveal or at least suggest possible changes in the substrate.
iii. For the superfolds, similarity of structure does not necessarily mean similarity of function. However, the active site/binding sites are often conserved, e.g. in the TIM barrel or Rossmann fold structures the ligand always binds at the same end of the barrel or sheet.
iv. Some methods have already been developed, and will increasingly be the focus of attention over the next few years, which aim to predict function ab initio from structure. For example, enzymes can often be identified by the presence of a major cleft, which also locates the active site (18). Similarly, critical surface patches which are used for molecular recognition in binding other proteins or ligands may be identified using knowledge-based approaches (19,20).
In summary, extrapolating the data from Figure 4 to a new genome, we can expect that, of the 54–70% of sequences which currently have no obvious sequence matches in the PDB, we will find nearly 80–90% to be homologous to a known family using the structural data alone. For the singlet folds this will almost certainly reveal some clues to the function. For the superfolds, some folds will reveal information on the functional class (e.g. enzyme for TIM barrels) or the location of the active site, if not the specific function. Only 10–20% will be expected to be 'novel' folds. For these, the ab initio methods referred to above may provide some clues to guide experiments. Therefore it is clear that determining structures, as part of a 'structural genomics' initiative for example, will make a major contribution to interpreting genome data.
JAIL
Just Another Interface Library
Interfaces of macromolecules are a valuable basis for analysing the process of molecular recognition. JAIL classifies not only the interfaces between domain architectures but also those between protein chains and those between proteins and nucleic acids.
Gt-alpha/Gi-alpha chimera (PDB-ID 1GOT). Interfaces of 1GOT.
Interacting proteins are difficult to crystallize and are rarely present in the Protein Data Bank. Nevertheless, it is essential to analyse the interacting parts of proteins to understand the process of protein-protein docking.
To overcome this problem we have built the JAIL database. Since interacting domains exhibit structural features similar to those of proteins, all known interfaces between interacting domains of the SCOP database were extracted and classified in JAIL.
Only a part of all protein structures is included in SCOP; in particular, new PDB entries are not yet annotated. To overcome this problem, all interfaces between protein chains were additionally calculated and included in the database. This type of interface also comprises the interacting parts of the assumed biological units. The last important type of interface provided here is composed of the interacting parts between proteins and nucleic acids. Overall, the data set consists of about 180 000 interfaces.
JAIL is a convenient tool for browsing the interface library and analysing single interfaces. However, more general questions require large-scale analysis. For this purpose, a detailed form enables the compilation of comprehensive non-redundant data sets for download.
How is an interface defined?
A complete residue is part of an interface if at least one atom of the amino acid is located within 4.5 Ångström of any atom of the interacting domain or chain. One part of an interface must consist of at least 5 C-alpha atoms in the case of protein chains. In the case of nucleic acids, the relevant atoms are calculated on the basis of the phosphorus atoms of the RNA/DNA backbone.
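The distance rule above can be sketched for two protein chains as follows. This is a brute-force illustration, not JAIL's implementation: the `Atom` record and both function names are hypothetical, and coordinates are assumed to be in Ångström.

```python
import math
from typing import List, NamedTuple, Set


class Atom(NamedTuple):
    residue_id: int
    name: str      # atom name, e.g. "CA" for the C-alpha atom
    x: float
    y: float
    z: float


def interface_residues(part: List[Atom], partner: List[Atom],
                       cutoff: float = 4.5) -> Set[int]:
    """Residues of `part` with at least one atom within `cutoff` of any partner atom."""
    found: Set[int] = set()
    for atom in part:
        if atom.residue_id in found:
            continue  # the whole residue already counts as interface
        for other in partner:
            if math.dist(atom[2:], other[2:]) <= cutoff:
                found.add(atom.residue_id)
                break
    return found


def has_enough_c_alpha(part: List[Atom], residues: Set[int], minimum: int = 5) -> bool:
    """Check that one side of the interface contributes at least `minimum` C-alpha atoms."""
    c_alpha = {a.residue_id for a in part if a.name == "CA" and a.residue_id in residues}
    return len(c_alpha) >= minimum
```

The nested loop compares every atom pair, which is fine for a single complex; a large-scale survey would normally use a spatial index such as a grid or k-d tree instead.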
What are biounits?
The primary coordinate file deposited in the PDB generally contains one asymmetric unit. The asymmetric unit is the smallest portion of a crystal structure to which crystallographic symmetry can be applied to generate one unit cell. The biological molecule (biounit) is believed to be the functional unit of the protein. Frequently, those units can be assumed or calculated when additional information is available. The biological units of many proteins are deposited in a separate section of the PDB database and can be used for interface calculations.
In which way are redundant interfaces excluded in the download section?
Redundancy is excluded in two different ways: by structure and by sequence. The sequence clustering is based on the CD-HIT program. The structural clustering is defined by the protein families and superfamilies of the SCOP classification. The database classifies proteins by domain architecture.
Which settings in the download section are best for my own research?
The selection of the data sets depends on the type of interactions (protein-protein or protein-nucleic acid) and the level of diversity that is desired. A maximum sequence identity of 50% results in a higher diversity than a setting of 95%. The default settings include interfaces of domain-domain interactions as well as interfaces between interacting chains. All interfaces of chains that are already covered by the SCOP domain interfaces are excluded by default. This procedure results in a high number of interfaces that are still diverse enough for statistical analysis.
What is meant by "show conservation" in Jmol?
The conservation of protein sequences is defined by the mutation rates at each amino acid position. For JAIL, this information was retrieved from ConSurf. ConSurf is a derived database merging structural and sequence information.
Search options
Search for:
PDB-Id, e.g. 1aay
SCOP-Id, e.g. d1az0a_
EC-Number, e.g. 3.1.1
Accession number, e.g. P03697
Protein name, e.g. capsid protein
Interface types that can be searched:
Domain/Domain (SCOP)
Chain/Chain
Protein/Nucleic
Biounit/Biounit
Full-text and SCOP text searches by keyword are also available, and results can be restricted by interface location (intra, inter or don't care).
MMDB
Experimentally resolved structures of proteins, RNA and DNA, derived from the Protein Data Bank (PDB), with value-added features such as explicit chemical graphs and computationally identified 3D domains (compact substructures) that are used to identify similar 3D structures, as well as links to literature, similar sequences, information about chemicals bound to the structures, and more. These connections make it possible, for example, to find 3D structures for homologs of a protein sequence of interest, then interactively view the sequence-structure relationships, active sites, bound chemicals, journal articles, and more.
Three-dimensional structures are now known within many protein families, and it is quite likely, in searching a sequence database, that one will encounter a homolog with known structure. The goal of Entrez's 3D-structure database is to make this information, and the functional annotation it can provide, easily accessible to molecular biologists. To this end Entrez's search engine provides three powerful features: (i) Sequence and structure neighbors: one may select all sequences similar to one of interest, for example, and link to any known 3D structures. (ii) Links between databases: one may search by term matching in MEDLINE, for example, and link to 3D structures reported in these articles. (iii) Sequence and structure visualization: identifying a homolog with known structure, one may view molecular-graphic and alignment displays to infer approximate 3D structure. In this article we focus on two features of Entrez's Molecular Modeling Database (MMDB) not described previously: links from individual biopolymer chains within 3D structures to a systematic taxonomy of organisms represented in molecular databases, and links from individual chains (and compact 3D domains within them) to structure neighbors, other chains (and 3D domains) with similar 3D structure. MMDB may be accessed at http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Structure.
SUPERFAMILY is a database of structural and functional annotation for all proteins and genomes.[1][2][3][4][5]
The SUPERFAMILY annotation is based on a collection of hidden Markov models which represent structural protein domains at the SCOP superfamily level.[6] A superfamily groups together domains which have an evolutionary relationship. The annotation is produced by scanning protein sequences from completely sequenced genomes against the hidden Markov models.
For each protein you can:
Submit sequences for SCOP classification
View domain organisation, sequence alignments and protein sequence details
For each genome you can:
Examine superfamily assignments, phylogenetic trees, domain organisation lists and networks
Check for over- and under-represented superfamilies within a genome
For each superfamily you can:
Inspect SCOP classification, functional annotation, Gene Ontology annotation, InterPro abstract and genome assignments
Explore taxonomic distribution of a superfamily across the tree of life
All annotation, models and the database dump are freely available for download to everyone.
Purpose
SUPERFAMILY classifies amino acid sequences into known structural domains, especially into SCOP superfamilies. The superfamilies are groups of proteins which have structural evidence to support a common evolutionary ancestor but may not have detectable sequence homology.
Major Features
Sequence search: Submit your protein or DNA sequence for SCOP superfamily and family level classification.
Keyword search: Search for superfamily, family or species names, plus sequence, SCOP, PDB or hidden Markov model IDs.
Domain assignments: Domain assignments, alignments and architectures for completely sequenced eukaryotic and prokaryotic organisms, plus sequence collections.
Comparative genomics tools: Browse unusual (over- and under-represented) superfamilies and families, adjacent domain pair lists and graphs, unique domain pairs, domain combinations, domain architecture co-occurrence networks and domain distribution across taxonomic kingdoms for each organism.
Genome statistics: For each genome: number of sequences, number of sequences with assignment, percentage of sequences with assignment, percentage total sequence coverage, number of domains assigned, number of superfamilies assigned, number of families assigned, average superfamily size, percentage produced by duplication, average sequence length, average length matched, number of domain pairs and number of unique domain architectures.
Gene Ontology: Domain-centric Gene Ontology (GO) automatically annotated, by Hai Fang.
Phenotype Ontology: Domain-centric phenotype/anatomy ontology including Disease Ontology, Human Phenotype, Mouse Phenotype, Worm Phenotype, Yeast Phenotype, Fly Phenotype, Fly Anatomy, Zebrafish Anatomy, Xenopus Anatomy, Arabidopsis Plant.
Superfamily annotation: InterPro abstracts for 1052 superfamilies and Gene Ontology (GO) annotation for 763 superfamilies.
Functional annotation: Functional annotation of SCOP 1.73 superfamilies, by Christine Vogel.
Phylogenetic trees: Trees are generated using heuristic parsimony methods and are based on protein domain architecture data for all genomes in SUPERFAMILY. Genome combinations or specific clades can be displayed as individual trees.
Similar domain architectures: Find the 10 domain architectures which are most similar to a domain architecture of interest.
Hidden Markov models: Produce SCOP domain assignments for your sequences using the SUPERFAMILY models. HMM visualisation by Martin Madera, e.g. model 0045110.
Profile comparison: Find remote domain matches when the HMM search fails to find a significant match. Profile comparison (PRC) for aligning and scoring two profile hidden Markov models, by Martin Madera.
Web services: Distributed Annotation Server and linking to SUPERFAMILY.
Downloads: Sequences, assignments, models, MySQL database and scripts, updated weekly.
The SUPERFAMILY database provides protein domain assignments, at the SCOP superfamily level, for the predicted protein sequences in over 400 completed genomes. A superfamily groups together domains of different families which have a common evolutionary ancestor based on structural, functional and sequence data. SUPERFAMILY domain assignments are generated using an expert curated set of profile hidden Markov models. All models and structural assignments are available for browsing and download from http://supfam.org. The web interface includes services such as domain architectures and alignment details for all protein assignments, searchable domain combinations, domain occurrence network visualization, detection of over- or under-represented superfamilies for a given genome by comparison with other genomes, assignment of manually submitted sequences and keyword searches. In this update we describe the SUPERFAMILY database and outline two major developments: (i) incorporation of family level assignments and (ii) a superfamily-level functional annotation. The SUPERFAMILY database can be used for general protein evolution and superfamily-specific studies, genomic annotation, and structural genomics target suggestion and assessment.
The CATH domain structure database: new protocols and classification levels give a more comprehensive resource for exploring evolution
Lesley H. Greene, Tony E. Lewis, Sarah Addou, Alison Cuff, Tim Dallman, Mark Dibley, Oliver Redfern, Frances Pearl, Rekha Nambudiry, Adam Reid, Ian Sillitoe, Corin Yeats, Janet M. Thornton and Christine A. Orengo
WHAT'S NEW
We report the latest release (version 3.0) of the CATH protein domain database (http://www.cathdb.info). There has been a 20% increase in the number of structural domains classified in CATH, up to 86 151 domains. Release 3.0 comprises 1110 fold groups and 2147 homologous superfamilies. To cope with the increases in diverse structural homologues being determined by the structural genomics initiatives, more sensitive methods have been developed for identifying boundaries in multi-domain proteins and for recognising homologues. The CATH classification update is now being driven by an integrated pipeline that links these automated procedures with validation steps that have been made easier by the provision of information-rich web pages summarising comparison scores and relevant links to external sites for each domain being classified. An analysis of the population of domains in the CATH hierarchy and several domain characteristics are presented for version 3.0. We also report an update of the CATH Dictionary of homologous structures (CATH-DHS), which now contains multiple structural alignments, consensus information and functional annotations for 1459 well populated superfamilies in CATH. CATH is directly linked to the Gene3D database, which is a projection of CATH structural data onto ~2 million sequences in completed genomes and UniProt.
GENERAL INTRODUCTION
The number of new structures being deposited in the Protein Data Bank (PDB) continues to grow at a considerable rate. In addition, structures being targeted by worldwide structural genomics initiatives are more likely to be novel or only very remotely related to domains previously classified. Only 2% of structures currently solved by conventional crystallography or NMR are likely to adopt novel folds (see Figures 1 and 2). A higher proportion of new folds is expected among structural genomics structures. Although the influx of more diverse structures and subsequent analysis will inform our understanding of how domains evolve, it has resulted in increasing lags between the numbers of structures being deposited and classified. In response to this situation we have significantly improved our automated and manual protocols for domain boundary assignment and homologue recognition.
Figure 1. Annual decrease in the percentage of new structures classified in CATH which are observed to possess a novel fold. The raw data for years 1972–2005 was fit to a single exponential equation by nonlinear regression using Sigma Plot (SPSS Version
Figure 2. Annual proportion of protein structures deposited in the PDB which are classified in CATH, rejected or pending classification. The colour scheme reflects different categories of PDB chains. Black: not accepted by the CATH criteria. Red: unprocessed chains
Significant changes have been implemented in the CATH classification protocol to achieve a more highly automated system. Firstly, a seamless flow of structures between the constituent programs has been achieved by building a pipeline which integrates web services for each major comparison stage in the classification (see Figure 3). Secondly, completely automatic decisions are now being made for new protein chains with close relatives already assigned in the CATH database. There are two situations that preclude the CATH update process from being fully automated. We rely on expert manual curation for particularly challenging protein domain boundary assignments (DomChop stage) and also for classifications of remote folds and homologues (HomCheck stage). These two manual stages will remain an integral part of the system (Figure 3).
Figure 3. Flow diagram of the CATH classification pipeline. This schematic illustrates the processes involved in classifying newly determined structures in CATH. The CATH update protocol workflow from new chain to assigned domain is split into two main processes
In this paper we report our ongoing development of the automated procedures. These critical new features should better enable CATH to keep pace with the PDB (3) and facilitate its development. Key statistics on the domain structure populations and characteristics are also presented.
A REVISED CATH CLASSIFICATION HIERARCHY: CATHSOLID
In order to provide information on the sequence diversity between superfamily members, we have introduced additional levels into the CATH hierarchy. The CATH hierarchical classification scheme now consists of nine levels. Class is derived from secondary structure content, and Architecture describes the gross orientation of secondary structures, independent of connectivity. The Topology level clusters structures into fold groups according to their topological connections and numbers of secondary structures. The Homologous superfamily level clusters proteins with highly similar structures, sequences and/or functions (4,5). The new extension of the CATH classification system now includes five 'SOLID' sequence levels. The S, O, L and I levels further divide domains within the H-level using multi-linkage clustering based on similarities in sequence identity (35%, 60%, 95% and 100%) (see Table 1). The D-level acts as a counter within the I-level and is appended to the classification hierarchy to ensure that every domain in CATH has a unique CATHsolid identification code (see Table 1). Specific details on the nature of the SOLID-levels can be found in the 'General Information' section of the CATH website, http://www.cathdb.info. CATH only includes experimentally determined protein structures with a resolution of 4 Å or better, 40 residues in length or longer, and having 70% or more side chains resolved.
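The nine-level code and the inclusion thresholds described above can be illustrated with a small sketch. It assumes that a CATHsolid code is written as nine dot-separated integers in the order C, A, T, H, S, O, L, I, D; the function names and the example code used when calling them are hypothetical and are not part of the CATH software.

```python
from typing import Dict

# Level names in hierarchy order, as listed in the text.
CATHSOLID_LEVELS = ("C", "A", "T", "H", "S", "O", "L", "I", "D")


def parse_cathsolid(code: str) -> Dict[str, int]:
    """Split a dot-separated nine-level code into named integer levels.

    Assumption: the code is written as nine integers joined by dots.
    """
    parts = code.split(".")
    if len(parts) != len(CATHSOLID_LEVELS):
        raise ValueError(f"expected 9 levels, got {len(parts)}: {code!r}")
    return {level: int(value) for level, value in zip(CATHSOLID_LEVELS, parts)}


def superfamily_id(code: str) -> str:
    """Return the four-level C.A.T.H prefix of a CATHsolid code."""
    return ".".join(code.split(".")[:4])


def meets_inclusion_criteria(resolution_angstrom: float,
                             n_residues: int,
                             fraction_side_chains_resolved: float) -> bool:
    """Apply the three structure-quality thresholds quoted in the text:
    4 Angstrom or better, at least 40 residues, at least 70% side chains."""
    return (resolution_angstrom <= 4.0
            and n_residues >= 40
            and fraction_side_chains_resolved >= 0.70)
```

For example, parse_cathsolid("3.40.50.200.1.2.3.4.5") would map "H" to 200, and superfamily_id of the same made-up code would give "3.40.50.200".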
Table 1. CATH version 3.0 statistics
DOMAIN BOUNDARY ASSIGNMENTS
We have further improved our automated domain boundary prediction method, CATHEDRAL (4). This is used to search a newly determined multi-domain structure against a library of representative structures from different fold groups in the CATH database to recognise constituent domains. CATHEDRAL performs an initial rapid secondary structure comparison between structures, using graph theory to identify putative fold matches, which are then more carefully aligned using a slower, more accurate dynamic programming method. A new scoring scheme has been implemented which combines information on the class of the domains being compared, their sizes, similarity in the structural environments and the number of equivalent residues. A support vector machine is used to combine the different scores and select the best fold match for each putative domain in a new multi-domain protein.
Benchmarking against a set of 964 'difficult' multi-domain chains, whose 1593 constituent domains were remotely related to folds in CATH (<35% sequence identity) and originated from 245 distinct fold groups and 462 superfamilies, showed that 90% of domains within these chains could be assigned to the correct fold group, and for 78% of them the domain boundaries were within ±15 residues of boundaries assigned by careful manual validation. Larger variations in domain boundaries are often due to the fact that in many families significant structural variation can occur during evolution, so that distant relatives vary considerably in size. If no close relative has been classified in CATH, it is likely that the only CATHEDRAL match will be to a relative with significant structural embellishments, thus making it harder to determine the correct boundaries.
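The ±15 residue criterion used in this benchmark can be sketched as a simple comparison of predicted and manually assigned boundaries. This is only an illustration: it assumes each boundary is a single residue position and that predicted and reference boundaries are paired in sequence order, neither of which is specified in the text, and the function names are hypothetical.

```python
from typing import Iterable, Sequence, Tuple


def boundaries_agree(predicted: Sequence[int],
                     reference: Sequence[int],
                     tolerance: int = 15) -> bool:
    """True if every predicted boundary lies within `tolerance` residues
    of the corresponding manually assigned boundary.

    Assumption: boundaries are paired in sequence order, and a different
    number of boundaries counts as disagreement.
    """
    if len(predicted) != len(reference):
        return False
    return all(abs(p - r) <= tolerance
               for p, r in zip(sorted(predicted), sorted(reference)))


def fraction_agreeing(pairs: Iterable[Tuple[Sequence[int], Sequence[int]]],
                      tolerance: int = 15) -> float:
    """Fraction of (predicted, reference) pairs whose boundaries agree."""
    results = [boundaries_agree(p, r, tolerance) for p, r in pairs]
    return sum(results) / len(results) if results else 0.0
```

A chain predicted to be cut at residues 100 and 250, with manual boundaries at 110 and 240, would count as agreeing; a single boundary predicted at 100 against a manual boundary at 120 would not.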
Since domain boundary assignment of remote homologues is one of the most time-consuming stages in the classification, we combine multiple sources of information for each new structure on a web page to guide manual curation. Pages display scores from a range of algorithms, which include structure-based methods (CATHEDRAL (4), SSAP (6,7), DETECTIVE (8), PUU (9) and DOMAK (10)), sequence-based methods such as hidden Markov models (HMMs) (11), and relevant literature. These pages are now viewable for information on putative boundaries for new multi-domain structures currently being classified in CATH (e.g. http://www.cathdb.info/cgi-bin/cath/Chain.pl?chain_id=2g3aA).
For protein chains which are closely related to chains that are already chopped in CATH, an automated protocol has been developed (ChopClose). ChopClose identifies any previously chopped chains that have sufficiently high sequence identity and overlap with the query chain. Using SSAP (6), the query is aligned against each of these chains in turn, and in each case the domain boundaries are inherited across the alignment. The process of inheriting the boundaries often requires some adjustments to be made to account for insertions, deletions or unresolved residues. If the inheritance from one of the chains meets various criteria (SSAP score ≥80, sequence identity >80%, RMSD ≤6.0 Å, longest end extension ≤10 residues, etc.), then the resulting boundaries are used to chop the chain automatically (AutoChop). For cases where ChopClose's best result does not meet all the criteria for automatic chopping, it is provided as support information for a manual domain boundary assignment. Refer to Figure 3 for the location of AutoChop/ChopClose within the CATH update protocol.
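The routing between AutoChop and manual curation might be sketched as follows. The sketch covers only the four thresholds that are listed explicitly; the text indicates that further criteria exist ("etc."), and these are not modelled. The class and function names are hypothetical, and picking the "best" result by highest SSAP score is an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class InheritedChop:
    """Scores for boundaries inherited from one previously chopped chain."""
    template_chain: str         # hypothetical identifier of the chopped relative
    ssap_score: float           # SSAP structural similarity score (0-100)
    sequence_identity: float    # percent identity to the query chain
    rmsd: float                 # RMSD of the alignment, in Angstroms
    longest_end_extension: int  # longest unaligned end, in residues


def passes_autochop(r: InheritedChop) -> bool:
    """Check the four explicitly listed criteria (others are not modelled)."""
    return (r.ssap_score >= 80.0
            and r.sequence_identity > 80.0
            and r.rmsd <= 6.0
            and r.longest_end_extension <= 10)


def route_chain(results: List[InheritedChop]) -> Tuple[str, Optional[InheritedChop]]:
    """Return ("AutoChop", result) if any inheritance passes, otherwise
    ("manual", best result) so the best hit can support manual curation.

    Assumption: "best" means the highest SSAP score.
    """
    if not results:
        return "manual", None
    passing = [r for r in results if passes_autochop(r)]
    if passing:
        return "AutoChop", max(passing, key=lambda r: r.ssap_score)
    return "manual", max(results, key=lambda r: r.ssap_score)
```

Keeping the thresholds in a single predicate makes it easy to see which results are chopped automatically and which are passed to a curator as supporting evidence.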
NEW HOMOLOGUE RECOGNITION METHODS
We have assessed a number of HMM-based protocols for improving homologue recognition. A new protocol (Samosa), exploiting models built using multiple structure alignments to improve accuracy, gives some improvements in sensitivity (4-5%). However, a protocol exploiting an 8-fold expanded HMM library based on sequence relatives of structural domains gives an increase of nearly 10% in sensitivity (12). In addition, HMM-HMM-based approaches have been implemented using the PRC protocol of Madera and co-workers (http://supfam.org/PRC). These allow recognition of extremely remote homologues, some of which are not easily detected by the structure comparison methods (discussed further below). HMM-based database scans developed for the CATH classification protocol are collectively referred to as HMMscan below.
For some very remotely related homologues, confidence in an assignment can be improved by combining information from multiple prediction methods. We have investigated the benefits of using machine learning methods to do this automatically. A neural network was trained using a dataset of 14 000 diverse homologues (<35% sequence identity) and 14 000 non-homologous pairs, with data from different homologue comparison methods including structure comparison (CATHEDRAL, SSAP), sequence comparison (HMM-HMM) and information on functional similarity. The latter was obtained by comparing EC classification codes between close relatives of the distant homologues and using a semantic similarity scoring scheme for comparing GO terms, based on a method developed by Lord et al. (13). On a separate validation set of 14 000 homologous pairs and 14 000 non-homologous pairs, 97% of the homologues can be recognised at an error rate of <4%.
NEW UPDATE PROTOCOL
Automatic methods
Previously, CATH data was generated using a group of independent programs and flat files. Over the past two years we have developed an update protocol for CATH that is driven by a suite of programs with a central library and a PostgreSQL database system. A classification pipeline has been established which links, in a completely automated fashion, the different programs that analyse the sequences and structures of both protein chains and domains. The CATH update protocol can essentially be divided into two parts: domain boundary assignment and domain homology classification (see Figure 3). The aim of the protocol is to minimise manual assignment and provide as much support as possible when manual validation is necessary.
Processing in both parts of the classification protocol is similar, requiring related meta-data and the triggering of the same automated algorithms. Methods include pairwise sequence similarity comparisons and scans by other homologue detection or fold recognition algorithms, such as HMMscan and CATHEDRAL, that provide data for either manual or automated assignment. Many of the automated steps in the protocol have been established as web services, and the pipeline integrates the automated steps together with 'holding stages' in which domains are held prior to processing and await the completion of manual validation of predictions (see below).
Web pages to support manual validation
For each manual stage in the classification (domain boundary assignment, DomChop, and homologue recognition, HomCheck), we have developed a suite of web pages bringing together all available meta-data from prediction algorithms (e.g. DBS, CATHEDRAL and HMMscan for DomChop; CATHEDRAL and HMMscan for HomCheck), information from the literature, and information from other family classifications with relevant data (e.g. Pfam). For each protein or domain shown on the pages, information on the statistical significance of matches is presented. The web pages will shortly be made viewable and will provide interim data on protein chains and domains not fully classified in CATH, for biologists interested in any entries pending classification.
OVERVIEW OF THE CURRENT RELEASE (VERSION 3.0)
Assigning domain boundaries and relationships between protein structures is computationally challenging. Since the last CATH release (version 2.6), the number of domains in the CATH database has increased by 20% in version 3.0 and now totals 86 151. This is a more than 10-fold increase in the number of domains classified in CATH since its creation. Improvements in automation, and also in the web-based resources used to aid manual validation, have allowed us to increase the proportion of hard-to-classify structures processed in CATH, and this is reflected in a significant increase in the proportion of new folds in the database (now more than 1000 folds). The detailed breakdown of numbers of domains in the nine CATH levels is given in Table 1. We conducted an analysis of the domains in version 3.0 and have derived statistics for several fundamental features.
Percentage of new topologies
An analysis of the percentage of new folds arising from the early 1970s to the present is shown in Figure 1. The number of new folds has been decreasing over time with respect to the number of new structures being deposited, and it can be seen that currently approximately 2% of new structures classified in CATH are observed to be novel folds. For comparison, the number of domain structures solved over time is also represented graphically in Figure 1.
Number of domains within a protein chain
Designating domain boundaries is integral to the construction of the CATH database. We conducted an analysis of the number of chains versus the number of domains in a chain. It is interesting to note that 64% of all protein structures currently solved and classified in CATH are single-domain chains (data not shown). The next most prevalent are two-domain chains (27%), and following this we find that the number of chains containing three or more domains rapidly decreases. The average size of the single-domain chains is 159 residues.
The CATH Dictionary of Homologous Superfamilies (CATH-DHS)
The CATH-DHS has also been recently updated. Data on structural similarity and superfamily variability is presented as a significant update to the Dictionary of Homologous Superfamilies (DHS) web resource (14). The DHS also provides functional annotations of domains within each H-level (superfamily) in CATH v2.5.1.
For each superfamily, pair-wise structural similarity scores between relatives, measured by SSAP, are presented. The DHS now contains 3307 multiple structural alignments for 1459 superfamilies. For each superfamily, multiple alignments are generated for all the relatives and also for subgroups of structurally similar relatives and sequence-similar relatives. Alignments are performed using the residue-based CORA algorithm (15) and presented both as CORAPLOTS (14) and in the form of a 2DSEC diagram (16), alongside co-ordinate data of the superposed structures in PDB format. Sequence representations of the alignments are available to download in FASTA format. In the CORAPLOT images of the multiple alignment, residues in each domain are coloured according to ligand binding and residue type. EquivSEC plots are also shown that describe the variability in orientation and packing between equivalent secondary structures (16).
To identify sequence relatives for CATH superfamilies, sequences from UniProt (17) were scanned against HMMs of all CATH domains (12). Homologous sequences were identified as those hits with an E-value <0.01 and a 60% residue overlap with the CATH domain. This protocol recognised over one million domain sequences in UniProt which could be integrated in the CATH-DHS. The harvested sequences in each superfamily were compared against other relatives by BLAST (18) to determine the pair-wise sequence identity and then clustered at appropriate levels of sequence identity (35% and 95%) using multi-linkage clustering. Information and links to other functional databases, ENZYME (19), GO (Gene Ontology Consortium 2000), KEGG (20), COG (21) and SWISSPROT (22), are also included by BLASTing the sequences from each superfamily against sequences provided by these resources. Only hits with 95% sequence identity and an 80% residue overlap were used to annotate sequences.
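The two filters described in this paragraph, one for accepting an HMM hit as a homologue and one for transferring annotation from a BLAST hit, can be sketched as below. The record layout is hypothetical, and defining overlap as aligned residues divided by the length of the CATH domain is an assumption; the thresholds are exposed as parameters so they can be adjusted.

```python
from dataclasses import dataclass


@dataclass
class HmmHit:
    """One sequence hit against a CATH domain HMM (hypothetical layout)."""
    sequence_id: str
    evalue: float
    aligned_residues: int  # residues of the domain covered by the match
    domain_length: int     # length of the CATH domain


def is_homologue(hit: HmmHit,
                 max_evalue: float = 0.01,
                 min_overlap: float = 0.60) -> bool:
    """Accept hits below the E-value cutoff that cover enough of the domain.

    Assumption: overlap = aligned residues / domain length.
    """
    overlap = hit.aligned_residues / hit.domain_length
    return hit.evalue < max_evalue and overlap >= min_overlap


def can_transfer_annotation(percent_identity: float,
                            overlap: float,
                            min_identity: float = 95.0,
                            min_overlap: float = 0.80) -> bool:
    """Stricter filter used before copying functional annotation."""
    return percent_identity >= min_identity and overlap >= min_overlap
```

The looser filter is suited to harvesting relatives, while the stricter one limits annotation transfer to near-identical sequences.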
Recent analysis of structural and functional divergence in highly populated CATH superfamilies (>5 structural relatives with <35% sequence identity) has been undertaken using data from the DHS. The 2DSEC algorithm was used to analyse multiple structural alignments of families and identify highly conserved structural cores and secondary structure embellishments or decorations to the common core. In some large superfamilies, extensive embellishments were observed outside the core, and although these secondary structure insertions were frequently discontinuous in the protein chain, they were often co-located in 3D space (16). In many cases manual inspection revealed that the embellishments had aggregated to form a larger structural feature that was modifying the active site of the domain or creating new surfaces for domain or protein interactions. Data collected in the DHS clearly show a relationship between structural divergence within a superfamily, sequence divergence of this superfamily amongst predicted domains in the genomes, and the number of distinct functional groups that can be identified for the superfamily (see Figure 4).
Figure 4. Relationship between sequence variability, structural variability and functional diversity in CATH superfamilies. Structural variation in a CATH superfamily, as measured by the number of diverse structural subgroups (SSAP score <80 between groups).
LATERAL LINKS ACROSS THE CATH HIERARCHY TO CAPTURE EVOLUTIONARY DIVERGENCE AND EXPLORE THE STRUCTURAL CONTINUUM
Our analysis of structural divergence in CATH superfamilies (16) has revealed families where significant changes in the structures had occurred. In some cases 5-fold differences in the sizes of domains were identified, and sometimes it was apparent that the 'folds' of these very diverse relatives had effectively changed. Therefore, in these superfamilies more than one fold group can be identified, effectively breaking the hierarchical nature of the CATH classification, which implies that each relative within a CATH homologous superfamily should belong to the same CAT fold group.
In addition, an 'all versus all' HMM-HMM scan between all superfamily representatives revealed several cases of extremely remote homologues which had been classified into separate superfamilies and yet match with significant E-values. Structure comparison had failed to detect the relationship between these superfamilies because the structural divergence of the relatives was so extreme, sometimes constituting a change in architecture as well as fold group. In these cases homology was only suggested by the HMM-based scans and then manually validated by considering functional information and detailed evidence from the literature. In order to capture information on these distant homologies, links have been created between the superfamilies, both on our web pages and in the CATH database. The data can now be found as a link from the CATH homepage (http://www.cathdb.info).
In the near future we also plan to provide web pages presenting cases of significant structural overlaps between superfamilies or fold groups. For these cases we are not currently able to find any additional evidence to support a distant evolutionary relationship, and these examples highlight the recurrence of large structural motifs between some folds and the existence of a structural continuum in some regions of fold space.
FEATURES
The CATH database can be accessed at http://www.cathdb.info. The web interface may be browsed or alternatively searched with PDB codes or CATH domain identifiers. There is also a facility for keyword searches. With the version 3.0 release we now make the raw and processed data files available, which include, for example, CATH domain PDB files, sequences and DSSP files, and they can be accessed through the CATH database main page. The Gene3D resource can be accessed through the CATH database or directly at http://www.cathdb.info/Gene3D. The DHS can be accessed through the CATH database or directly at http://www.cathdb.info/bsm/dhs.
Structural similarity is assessed using an automatic method (SSAP) (3,4), which scores 100 for identical proteins and generally returns scores above 80 for homologous proteins. More distantly related folds generally give scores above 70 (Topology or fold level), though in the absence of any sequence or functional similarity this may simply represent examples of convergent evolution, reinforcing the hypothesis that there exists a limited number of folds in nature (5,6).
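As a rough illustration of these score bands, the helper below maps an SSAP score to the interpretation suggested in the text. The function name and the returned labels are hypothetical, and since the text says the bands hold only "generally", the result is a hint rather than a classification.

```python
def interpret_ssap(score: float) -> str:
    """Map an SSAP score (0-100) to the rough interpretation in the text.

    The bands are indicative only: a score above 70 without sequence or
    functional similarity may reflect convergent evolution, not homology.
    """
    if not 0.0 <= score <= 100.0:
        raise ValueError(f"SSAP score out of range: {score}")
    if score == 100.0:
        return "identical"
    if score > 80.0:
        return "likely homologous"
    if score > 70.0:
        return "similar fold"
    return "no clear structural similarity"
```

A score of 85 would be reported as "likely homologous" and a score of 75 as "similar fold".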
Abstract
We report the latest release (version 1.4) of the CATH protein domains database (http://www.biochem.ucl.ac.uk/bsm/cath). This is a hierarchical classification of 13 359 protein domain structures into evolutionary families and structural groupings. We currently identify 827 homologous families in which the proteins have both structural similarity and sequence and/or functional similarity. These can be further clustered into 593 fold groups and 32 distinct architectures. Using our structural classification and associated data on protein functions stored in the database (EC identifiers, SWISS-PROT keywords and information from the Enzyme database and literature), we have been able to analyse the correlation between 3D structure and function. More than 96% of folds in the PDB are associated with a single homologous family. However, within the superfolds, three or more different functions are observed. Considering enzyme functions, more than 95% of clearly homologous families exhibit either single or closely related functions, as demonstrated by the EC identifiers of their relatives. Our analysis supports the view that determining structures, for example as part of a 'structural genomics' initiative, will make a major contribution to interpreting genome data.
Introduction FOR CATH
The CATH classification of protein domain structures was established in 1993 (1) as a hierarchical clustering of protein domain structures into evolutionary families and structural groupings, depending on sequence and structure similarity. There are four major levels, corresponding to protein class, architecture, topology or fold, and homologous family (Fig. 1). Since 1995, information about these structural groups and protein families has been accessible over the Web (http://www.biochem.ucl.ac.uk/bsm/cath), together with summary information about each individual protein structure (PDBsum) (2).
CATH consists of both phylogenetic and phenetic descriptors for protein domain relationships. At the lowest levels in the hierarchy, proteins are grouped into evolutionary families (Homologous families) for having either significant sequence similarity (≥35% identity) or high structural similarity and some sequence similarity (≥20% identity).
The CATH database of protein structures contains approximately 18 000 domains organized according to their (C)lass, (A)rchitecture, (T)opology and (H)omologous superfamily. Relationships between evolutionarily related structures (homologues) within the database have been used to test the sensitivity of various sequence search methods in order to identify relatives in GenBank and other sequence databases. Subsequent application of the most sensitive and efficient algorithms, gapped BLAST and the profile-based method Position-Specific Iterated Basic Local Alignment Search Tool (PSI-BLAST), could be used to assign structural data to between 22% and 36% of microbial genomes in order to improve functional annotation and enhance understanding of biological mechanism. However, on a cautionary note, an analysis of functional conservation within fold groups and homologous superfamilies in the CATH database revealed that whilst function was conserved in nearly 55% of enzyme families, function had diverged considerably in some highly populated families. In these families, functional properties should be inherited far more cautiously and the probable effects of substitutions in key functional residues must be carefully assessed.
Figure 1
Schematic representation of the (C)lass, (A)rchitecture and (T)opology/fold levels in the CATH database.
Figure 2
Snapshot of a web page showing data available in the CATH dictionary of homologous superfamilies for the subtilisin family (CATH id 3.40.50.200). Tables display the PDB codes for non-identical relatives in the family, together with EC identifier codes and information about the enzyme reactions. The multiple structural alignment shown has been coloured according to secondary structure assignments (red for helix, blue for strands).
The Architecture level in CATH groups proteins whose folds have similar 3D arrangements of secondary structures (e.g. barrel, sandwich or propellor), regardless of their connectivity, whilst the top level, Class, simply reflects the proportion of α-helix or β-strand secondary structures. Three major classes are recognised, mainly-α, mainly-β and α-β, since analysis revealed considerable overlap between the α+β and alternating α/β classes originally described by Levitt and Chothia (7).
Before classification, multidomain proteins are first separated into their constituent folds using a consensus method which seeks agreement between three independent algorithms (8). Whilst the protocol for updating CATH is largely automatic (9), several stages require manual validation, in particular establishing domain boundaries in proteins for which no consensus could be reached and checking the relationships of very distant homologues and proteins having borderline fold similarity. Although there are plans to assign the more regular architectures automatically, all architecture groupings are currently assigned manually.
A homologous family Dictionary is now available within CATH, which contains functional data, where available, for each protein within a homologous family. This includes EC identifiers, SWISS-PROT keywords and information from the Enzyme database or the literature (Fig. 2). Multiple structure-based alignments are also available, coloured according to secondary structure assignments or residue properties, and there are schematic plots showing domain representations annotated by protein-ligand interactions (DOMPLOTS) (A.E. Todd, C.A. Orengo and J.M. Thornton, submitted to Protein Engng). The topology of each domain is illustrated by schematic TOPS diagrams (http://www3.ebi.ac.uk/tops; 10).
Figure 3
CATH wheel plot showing the population of homologous families in different fold groups, architectures and classes. The wheel is coloured according to protein class (red, mainly-α; green, mainly-β; yellow, α-β; blue, few secondary structures). The size of the outer wheel represents the number of homologous families in CATH, whilst each band in the outer wheel corresponds to a single fold family. The size of each fold 'band' therefore reflects the number of homologous families having that fold. It can be seen that most fold families contain a single homologous family. The superfold families are shown as paler bands containing many homologous families. The inner wheel shows the population of homologous families in the different architectures.
We have also recently set up a Web Server (11) which enables the user to scan the CATH database with a newly determined protein structure and identify possible fold similarities or evolutionary relationships. There are also plans to incorporate sequence searches (using BLAST or PSI-BLAST) (12) to identify a probable fold for a new sequence.
The latest release of CATH (version 1.4, April 1998) contains 9342 protein chains from the PDB (13), which divide into 13 359 domain folds. Currently 32 different architectures are recognised. Since the last release, three new architectures have been described, including the five-bladed α-β propellor. Grouping proteins on the basis of sequence, structure and functional similarity gives 827 evolutionary homologous families (H-level), whilst recognising more distant structural similarity with no accompanying sequence or function similarity gives rise to 593 different fold groups (T-level).
The population of the different levels in the CATH hierarchy is illustrated by the CATH wheel shown in Figure 3. It can be seen that several highly populated fold families, which we describe as superfolds (6) as they support a diverse range of sequences and more than three different functions, still account for nearly 30% of non-homologous structures.
Implications for Structural Genomics
As the sequence databases grow rapidly, the need to interpret these sequences and assign functions to specific genes becomes increasingly important. Many techniques exist for matching protein sequences and thereby inheriting functional information. However, for very distant homologues there is often no detectable sequence similarity despite conservation of 3D structure and function. For these cases, evolutionary relationships, and thereby functions, can only be assigned by comparing the structures. Therefore a number of structural genomics initiatives are being proposed (14) which aim to identify all the folds in nature, with the ultimate goal of being able to predict the function of a new protein from its known or probable structure. The important questions to ask are: how many more folds do we need to determine before we have the complete set, and how confident can we be in assigning function between proteins having similar structures?
In the current genomes, on average only 30-46% of sequences can be assigned to a structural family by recognising sequence similarity to a protein of known structure (15,16). With only ~600 unique structures currently in the PDB, compared with ~20 000 sequence families, it is clear that we still need to determine many more structures if we are to understand biology at the molecular level. However, analysis of recently deposited structural data is very revealing. Figure 4a illustrates the distribution of 2159 new structural domains classified in the 10 months from June 1997 to March 1998. A large proportion of these (79%) were clearly homologous (≥30% identity) to proteins of known structure.
Of the remaining 443 structures (Fig. 4b), corresponding to 'new' sequences, we found only 8% were novel folds, the remainder resembling a previously determined structure. Many of these, 199 (45%), could be identified as clear homologues by having significant structure and sequence similarity (SSAP ≥80 and ≥20% sequence identity). A further 169 (38%) were probable homologues as, although the sequence identity was below 20%, they had functional similarity and/or gave significant scores using sequence search methods designed to detect very distant homologues (PSI-BLAST) (12). There remained a further 40 (9%) proteins which were analogous, i.e. they had the same fold as a previous entry but neither the sequence nor the function gave definite evidence of a common ancestor.
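The percentages quoted for Figure 4 can be checked with a few lines of arithmetic. The counts 2159, 443, 199, 169 and 40 are taken from the text; the number of novel folds is not stated as a count, so the value derived below (the remainder of the 443) is an inference, and the variable names are chosen only for this sketch.

```python
total_new_domains = 2159   # domains classified June 1997 to March 1998
new_sequences = 443        # domains not clearly homologous by sequence

# Domains clearly homologous to a known structure by sequence alone.
clearly_homologous = total_new_domains - new_sequences

# Breakdown of the 443 'new' sequences; counts taken from the text.
categories = {
    "clear homologues": 199,
    "probable homologues": 169,
    "analogues": 40,
}
# Inferred: whatever is left over must be the novel folds.
categories["novel folds"] = new_sequences - sum(categories.values())


def percent(part: int, whole: int) -> int:
    """Percentage of `whole` represented by `part`, rounded to an integer."""
    return round(100 * part / whole)


for name, count in categories.items():
    print(f"{name}: {count} ({percent(count, new_sequences)}%)")
print(f"clearly homologous by sequence: {percent(clearly_homologous, total_new_domains)}%")
```

Under these assumptions the remainder is 35 domains, which rounds to the 8% quoted for novel folds, and the clear plus probable homologues (368 of 443) account for about 83% of the new sequences.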
At the homologous superfamily level in CATH, a more detailed analysis of enzyme functions showed that the majority of homologous enzyme families in CATH (>90%) contained proteins for which the first three EC identifiers were the same. Considering those families where homologues have significant sequence identity (≥20%) after structural alignment, 95% were found to have a single EC identifier, whilst for families where proteins have more than 30% sequence similarity we observed that 98% had a single EC code.
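Comparing "the first three EC identifiers" amounts to comparing the leading fields of dot-separated EC numbers. The sketch below shows one way to do this; the function names are hypothetical, and treating an unspecified field ("-") as a mismatch is an assumption rather than something stated in the text.

```python
from typing import Sequence


def ec_levels_shared(ec_a: str, ec_b: str) -> int:
    """Number of leading EC fields that two EC numbers have in common.

    Assumption: an unspecified field ("-") never counts as a match.
    """
    shared = 0
    for a, b in zip(ec_a.split("."), ec_b.split(".")):
        if a != b or a == "-":
            break
        shared += 1
    return shared


def family_conserved(ec_numbers: Sequence[str], levels: int = 3) -> bool:
    """True if all EC numbers in a family agree on the first `levels` fields.

    Comparing every member with the first one is sufficient, because
    sharing a common prefix is transitive.
    """
    if not ec_numbers:
        return True
    first = ec_numbers[0]
    return all(ec_levels_shared(first, ec) >= levels for ec in ec_numbers[1:])
```

For instance, the EC numbers 3.4.21.62 and 3.4.21.64 share their first three fields, so a family containing only these two would count as conserved at three levels but not at four.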
Although assigning function on the basis of homology is common practice, it is clear that some caution should be exercised, particularly where there is little or no sequence similarity. There are also some clear examples where homologues with significant sequence similarity perform different functions. The role of 'gene recruitment' is especially clear in the eye lens proteins, which function as enzymes in other cellular environments but which are used as structural proteins in this context (17). The extent of such 'gene recruitment' and context-sensitive function is really not known at this time. For enzymes, it is clear that catalytic function can change and evolve, usually to act on a different but related substrate. Similarly, within the lipocalin family (CATH id 24013010) several proteins are found with very similar structures which bind different fatty acids in the same region at the base of the β-barrel (e.g. retinol, bilin, biotin).
Nearly half of the homologous families where two or more different EC numbers were observed belong to the superfolds. This suggests that if a new protein is assigned to a superfold family, more caution should be used when inheriting functional information, as there appears to be greater tolerance to changes in sequence, and ultimately function, for these families. However, it is interesting to note that many of these were TIM barrel or Rossmann folds. These are superfolds in which the substrate or ligand commonly binds in the same place: in the base of the β-barrel for the TIMs and at the crossover of the polypeptide chain for the doubly wound Rossmann structures.
Assignment of Function Through Structure
One of the reasons for determining structures is to derive more information to facilitate the assignment of function. From our analysis of proteins in CATH, we suggest that structural data can help to assign function in several ways.
i. The structural data allow recognition of more distant homologues compared with sequence data. In our analysis, 83% of structures with novel sequences could be assigned as homologues in this way (note that such assignment of function is again subject to the caveats imposed by 'gene recruitment' discussed above).
ii. The structural data allow detailed inspection of the functional site, to suggest if and how the function may have evolved. For example, if an enzyme has evolved to act on a different substrate, the binding site may reveal, or at least suggest, possible changes in the substrate.
iii. For the superfolds, similarity of structure does not necessarily mean similarity of function. However, the active site/binding sites are often conserved, e.g. in the TIM barrel or Rossmann fold structures the ligand always binds at the same end of the barrel or sheet.
iv. Some methods have already been developed, and will increasingly be the focus of attention over the next few years, which aim to predict function ab initio from structure. For example, enzymes can often be identified by the presence of a major cleft which also locates the active site (18). Similarly, critical surface patches which are used for molecular recognition in binding other proteins or ligands may be identified using knowledge-based approaches (19,20).
In summary, extrapolating the data from Figure 4 to a new genome, we can expect that of the 54–70%
of sequences which currently have no obvious sequence matches in the PDB, we will find nearly 80–90% to be homologous to a known family using the structural data alone. For the singlet folds this
will almost certainly reveal some clues to the function. For the superfolds, some folds will reveal
information on the functional class (e.g. enzyme for TIM barrels) or the location of the active site, if
not the specific function. Only 10–20% will be expected to be 'novel' folds. For these, the ab
initio methods referred to above may provide some clues to guide experiments. Therefore it is clear
that determining structures, as part of a 'structural genomics' initiative for example, will make a
major contribution to interpreting genome data.
JAIL
Just Another Interface Library
Interfaces of macromolecules are a valuable basis for analysing the process of molecular recognition. JAIL classifies not only the interfaces between domain architectures but also those between protein chains and those between proteins and nucleic acids.
Figure: Gt-alpha/Gi-alpha chimera (PDB-ID 1GOT); interfaces of 1GOT
Interacting proteins are difficult to crystallize and are rarely present in the Protein Data Bank (PDB).
Nevertheless, it is essential to analyse the interacting parts of proteins in order to understand the
process of protein-protein docking.
To overcome this problem, we have built the JAIL database. Since interacting domains
exhibit structural features similar to those of interacting proteins, all known interfaces between interacting
domains of the SCOP database were extracted and classified in JAIL.
Only a part of all protein structures is included in SCOP. In particular, new PDB entries are
not yet annotated. To overcome this problem, all interfaces between protein
chains were additionally calculated and included in the database. This type of interface also comprises
the interacting parts of the assumed biological units. The last important type of interface
provided here is composed of the interacting parts between proteins and nucleic acids.
Overall, the data set consists of about 180,000 interfaces.
JAIL is a convenient tool for browsing the interface library and analysing single
interfaces. However, more general questions require large-scale analysis. For this purpose,
a detailed form allows comprehensive non-redundant data sets to be compiled for
download.
How is an interface defined?
A complete residue is part of an interface if at least one atom of the amino acid is located
within 4.5 Ångström of any atom of the interacting domain or chain. In the case of protein chains, one part of
an interface must consist of at least 5 C-alpha atoms. In the case of nucleic acids, the relevant atoms are calculated on the basis of the phosphorus atoms
of the RNA/DNA backbone.
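The distance rule above can be sketched in a few lines of Python. This is a minimal illustration rather than JAIL's actual implementation: the input layout (a dict mapping a residue id to a list of atom name and coordinate pairs) and the function names are assumptions made for the example, and the brute-force distance loop ignores the spatial indexing a real tool would use.

```python
from math import dist

CUTOFF_ANGSTROM = 4.5  # distance cutoff quoted in the definition above
MIN_CA_ATOMS = 5       # minimum number of C-alpha atoms on a protein side


def interface_residues(side_a, side_b, cutoff=CUTOFF_ANGSTROM):
    """Return ids of residues in side_a that have at least one atom within
    `cutoff` of any atom in side_b.

    Each side is a dict (hypothetical layout):
        residue_id -> list of (atom_name, (x, y, z))
    """
    atoms_b = [xyz for atoms in side_b.values() for _, xyz in atoms]
    selected = set()
    for res_id, atoms in side_a.items():
        # A whole residue is included as soon as one of its atoms is close enough.
        if any(dist(xyz, other) <= cutoff for _, xyz in atoms for other in atoms_b):
            selected.add(res_id)
    return selected


def is_valid_protein_side(side, residue_ids, min_ca=MIN_CA_ATOMS):
    """Check the minimum-size rule: at least `min_ca` C-alpha atoms on this side."""
    ca_count = sum(1 for r in residue_ids for name, _ in side[r] if name == "CA")
    return ca_count >= min_ca
```

Calling `interface_residues(chain_a, chain_b)` and then `interface_residues(chain_b, chain_a)` gives the two sides of one interface; each side would then be checked with `is_valid_protein_side`.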
What are biounits?
The primary coordinate file deposited in the PDB generally contains one asymmetric unit.
The asymmetric unit is the smallest portion of a crystal structure to which crystallographic
symmetry can be applied to generate one unit cell. The biological molecule (biounit) is believed
to be the functional unit of the protein. Frequently, those units can be assumed or calculated when additional information is available. The biological units of many proteins are deposited
in a separate section of the PDB database and can be used for interface calculations.
How are redundant interfaces excluded in the download section?
Redundancy is removed in two different ways: by structure and by sequence. The
sequence clustering is based on the CD-HIT program. The structural clustering is defined by
the protein families and superfamilies of the SCOP classification, which classifies
proteins by domain architecture.
Which settings in the download section are best for my own research?
The selection of the datasets depends on the type of interaction (protein-protein or protein-
nucleic acid) and the level of diversity that is desired. A maximum sequence identity of 50%
results in a higher diversity than a setting of 95%. The default settings include interfaces of
domain-domain interactions as well as interfaces between interacting chains. All interfaces of
chains that are already covered by the SCOP domain interfaces are excluded by default.
This procedure results in a high number of interfaces that are still diverse enough for
statistical analysis.
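The effect of the identity threshold can be illustrated with a small greedy filter. This is only a sketch of the general idea behind identity-based redundancy removal, not CD-HIT itself and not JAIL's pipeline: the identity measure below is a crude stand-in built on Python's `difflib`, whereas real tools compute identity from sequence alignments, and both function names are invented for the example.

```python
from difflib import SequenceMatcher


def naive_identity(a, b):
    """Crude similarity in [0, 1]; a stand-in for alignment-based sequence identity."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def greedy_representatives(sequences, max_identity):
    """Keep a sequence only if it is at most `max_identity` similar to every
    representative kept so far. Longer sequences are considered first."""
    representatives = []
    for seq in sorted(sequences, key=len, reverse=True):
        if all(naive_identity(seq, rep) <= max_identity for rep in representatives):
            representatives.append(seq)
    return representatives
```

With a lower threshold (for example 0.5), near-duplicates are dropped and the resulting set is smaller but more diverse; with 0.95, only almost identical sequences are collapsed.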
What is meant by 'show conservation' in Jmol?
The conservation of protein sequences is defined by the mutation rates at each amino acid
position. For JAIL, this information was retrieved from ConSurf, a derived
database that merges structural and sequence information.
Database scheme
Search for:
PDB-Id, e.g. 1aay
SCOP-Id, e.g. d1az0a_
EC-Number, e.g. 3.1.1
Accession number, e.g. P03697
Protein name, e.g. capsid protein
Search in the following interface types:
Domain/Domain (SCOP)
Chain/Chain
Protein/Nucleic
Biounit/Biounit
None
Full-text search: by keyword
SCOP text search: by keyword
Consider only interfaces having the following location: intra, inter, or don't care
MMDB
Experimentally resolved structures of proteins, RNA and DNA, derived from the Protein Data Bank (PDB), with value-added features such as explicit chemical graphs, computationally
identified 3D domains (compact substructures) that are used to identify similar 3D structures, as well as links to literature, similar sequences, information about chemicals bound to the structures, and more. These connections make it possible, for example, to find 3D structures for homologs of a protein sequence of interest, then interactively view the sequence-structure relationships, active sites, bound chemicals, journal articles, and more.
Three-dimensional structures are now known within many protein families, and it is
quite likely, in searching a sequence database, that one will encounter a homolog
with known structure. The goal of Entrez's 3D-structure database is to make this
information, and the functional annotation it can provide, easily accessible to molecular biologists. To this end, Entrez's search engine provides three powerful
features: (i) Sequence and structure neighbors: one may select all sequences similar
to one of interest, for example, and link to any known 3D structures. (ii) Links
between databases: one may search by term matching in MEDLINE, for example, and
link to 3D structures reported in these articles. (iii) Sequence and structure
visualization: identifying a homolog with known structure, one may view molecular-
graphic and alignment displays to infer approximate 3D structure. In this article we
focus on two features of Entrez's Molecular Modeling Database (MMDB) not
described previously: links from individual biopolymer chains within 3D structures to a systematic taxonomy of organisms represented in molecular databases, and
links from individual chains (and compact 3D domains within them) to structure
neighbors, other chains (and 3D domains) with similar 3D structure. MMDB may be
accessed at http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Structure
SUPERFAMILY is a database of structural and functional annotation for all
proteins and genomes.[1][2][3][4][5]
The SUPERFAMILY annotation is based on a collection of hidden Markov models, which
represent structural protein domains at the SCOP superfamily level.[6]
A superfamily groups
together domains which have an evolutionary relationship. The annotation is produced by
scanning protein sequences from completely sequenced genomes against the hidden Markov
models.
For each protein you can:
Submit sequences for SCOP classification
View domain organisation, sequence alignments and protein sequence details
For each genome you can:
Examine superfamily assignments, phylogenetic trees, domain organisation lists and
networks
Check for over- and under-represented superfamilies within a genome
For each superfamily you can:
Inspect SCOP classification, functional annotation, Gene Ontology annotation, InterPro
abstract and genome assignments
Explore taxonomic distribution of a superfamily across the tree of life
All annotation, models and the database dump are freely available for download to everyone.
Purpose
SUPERFAMILY classifies amino acid sequences into known structural domains, especially
into SCOP superfamilies. The superfamilies are groups of proteins which have structural
evidence to support a common evolutionary ancestor but may not have detectable
sequence homology.
Major Features
Sequence search: Submit your protein or DNA sequence for SCOP superfamily and family level classification.
Keyword search: Search for superfamily, family or species names, plus sequence, SCOP, PDB or hidden Markov model IDs.
Domain assignments: Domain assignments, alignments and architectures for completely sequenced eukaryotic and prokaryotic organisms, plus sequence collections.
Comparative genomics tools: Browse unusual (over- and under-represented) superfamilies and families, adjacent domain pair lists and graphs, unique domain pairs, domain combinations, domain architecture co-occurrence networks and domain distribution across taxonomic kingdoms for each organism.
Genome statistics: For each genome: number of sequences, number of sequences with assignment, percentage of sequences with assignment, percentage total sequence coverage, number of domains assigned, number of superfamilies assigned, number of families assigned, average superfamily size, percentage produced by duplication, average sequence length, average length matched, number of domain pairs and number of unique domain architectures.
Gene Ontology: Domain-centric Gene Ontology (GO) automatically annotated, by Hai Fang.
Phenotype Ontology: Domain-centric phenotype/anatomy ontology, including Disease Ontology, Human Phenotype, Mouse Phenotype, Worm Phenotype, Yeast Phenotype, Fly Phenotype, Fly Anatomy, Zebrafish Anatomy, Xenopus Anatomy and Arabidopsis Plant.
Superfamily annotation: InterPro abstracts for 1052 superfamilies and Gene Ontology (GO) annotation for 763 superfamilies.
Functional annotation: Functional annotation of SCOP 1.73 superfamilies, by Christine Vogel.
Phylogenetic trees: Trees are generated using heuristic parsimony methods and are based on protein domain architecture data for all genomes in SUPERFAMILY. Genome combinations or specific clades can be displayed as individual trees.
Similar domain architectures: Find the 10 domain architectures which are most similar to a domain architecture of interest.
Hidden Markov models: Produce SCOP domain assignments for your sequences using the SUPERFAMILY models. HMM visualisation by Martin Madera, e.g. model 0045110.
Profile comparison: Find remote domain matches when the HMM search fails to find a significant match. Profile comparison (PRC) for aligning and scoring two profile hidden Markov models, by Martin Madera.
Web services: Distributed Annotation Server and linking to SUPERFAMILY.
Downloads: Sequences, assignments, models, MySQL database and scripts, updated weekly.
The SUPERFAMILY database provides protein domain assignments, at the SCOP superfamily
level, for the predicted protein sequences in over 400 completed genomes. A superfamily groups
together domains of different families which have a common evolutionary ancestor, based on
structural, functional and sequence data. SUPERFAMILY domain assignments are generated
using an expert-curated set of profile hidden Markov models. All models and structural
assignments are available for browsing and download from http://supfam.org. The web interface
includes services such as domain architectures and alignment details for all protein assignments,
searchable domain combinations, domain occurrence network visualization, detection of over- or
under-represented superfamilies for a given genome by comparison with other genomes,
assignment of manually submitted sequences, and keyword searches. In this update we describe
the SUPERFAMILY database and outline two major developments: (i) incorporation of family
level assignments and (ii) a superfamily-level functional annotation. The SUPERFAMILY
database can be used for general protein evolution and superfamily-specific studies, genomic
annotation, and structural genomics target suggestion and assessment.
The number of new structures being deposited in the Protein Data Bank
(PDB) continues to grow at a considerable rate. In addition, structures
being targeted by worldwide structural genomics initiatives are more
likely to be novel, or only very remotely related to domains previously
classified. Only 2% of structures currently solved by conventional
crystallography or NMR are likely to adopt novel folds (see Figures 1 and
2). A higher proportion of new folds is expected among structures solved by
structural genomics initiatives. Although the influx of more diverse
structures and subsequent analysis will inform our understanding of
how domains evolve, it has resulted in increasing lags between the
numbers of structures being deposited and classified. In response to this situation, we have significantly improved our automated and manual protocols for domain boundary assignment and homologue recognition.
Figure 1 Annual decrease in the percentage of new structures classified in CATH which are observed to possess a novel fold. The raw data for years 1972–2005 was fit to a single exponential equation by nonlinear regression using Sigma Plot (SPSS Version
Figure 2 Annual proportion of protein structures deposited in the PDB which are classified in CATH, rejected or pending classification. The colour scheme reflects different categories of PDB chains. Black: not accepted by the CATH criteria. Red: unprocessed chains
Significant changes have been implemented in the CATH classification protocol to achieve a more highly automated system. Firstly, a seamless flow of structures between the constituent programs has been achieved by building a pipeline which integrates web services for each major comparison stage in the classification (see Figure 3). Secondly, completely automatic decisions are now being made for new protein chains with close relatives already assigned in the CATH database. There are two situations that preclude the CATH update process from being fully automated: we rely on expert manual curation for particularly challenging protein domain boundary assignments (DomChop stage) and also for classifications of remote folds and homologues (HomCheck stage). These two manual stages will remain an integral part of the system (Figure 3).
Figure 3 Flow diagram of the CATH classification pipeline. This schematic illustrates the processes involved in classifying newly determined structures in CATH. The CATH update protocol workflow, from new chain to assigned domain, is split into two main processes
In this paper we report our ongoing development of the automated procedures. These critical new features should better enable CATH to keep pace with the PDB (3)
and facilitate its development. Key statistics on the domain structure populations and characteristics are also presented.
A REVISED CATH CLASSIFICATION HIERARCHY: CATHSOLID
In order to provide information on the sequence diversity between superfamily members, we have introduced additional levels into the CATH hierarchy. The
CATH hierarchical classification scheme now consists of nine levels. Class is derived from secondary structure content, and Architecture describes the gross
orientation of secondary structures, independent of connectivity. The Topology level clusters structures into fold groups according to their topological connections and numbers of secondary structures. The Homologous superfamilies cluster proteins with highly similar structures, sequences and/or functions (4,5). The new extension of the CATH classification system now includes five 'SOLID' sequence levels. S, O, L and I further divide domains within the H-level using multi-linkage clustering based on similarities in sequence identity (35%, 60%, 95% and 100%) (see Table 1). The D-level acts as a counter within the I-level and is appended to the classification hierarchy to ensure that every domain in CATH has a unique CATHSOLID identification code (see Table 1). Specific details on the nature of the SOLID levels can be found in the 'General Information' section of the CATH website, http://www.cathdb.info. CATH only includes experimentally determined protein structures with a resolution of 4 Å or better, a length of 40 residues or more, and 70% or more of side chains resolved.
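The nine levels and the inclusion criteria can be written down compactly. The following is a minimal sketch under stated assumptions: the dotted nine-field form of the identification code, the example code itself and the function names are illustrative and are not taken from the CATH documentation, and structures without a resolution value (for example from NMR) are not handled.

```python
# Sequence identity (percent) used to cluster domains at each sequence level.
SOLI_IDENTITY_THRESHOLDS = {"S": 35, "O": 60, "L": 95, "I": 100}

# Level order of the hierarchy; D is a counter within the I-level.
LEVELS = ["C", "A", "T", "H", "S", "O", "L", "I", "D"]


def meets_cath_inclusion(resolution_angstrom, n_residues, percent_side_chains_resolved):
    """Apply the three inclusion criteria quoted in the text above."""
    return (
        resolution_angstrom <= 4.0
        and n_residues >= 40
        and percent_side_chains_resolved >= 70
    )


def label_cathsolid(code):
    """Split a dotted nine-field code (assumed format) into named levels."""
    parts = code.split(".")
    if len(parts) != len(LEVELS):
        raise ValueError(f"expected {len(LEVELS)} fields, got {len(parts)}")
    return dict(zip(LEVELS, (int(p) for p in parts)))
```

For instance, `label_cathsolid("1.10.8.10.1.1.1.1.1")` (a made-up code) returns a dict whose `"H"` entry is 10 and whose `"D"` entry is 1.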
Table 1 CATH version 3.0 statistics
DOMAIN BOUNDARY ASSIGNMENTS
We have further improved our automated domain boundary prediction method, CATHEDRAL (4). This is used to search a newly determined multi-domain structure against a library of representative structures from different fold groups in the CATH database, to recognise constituent domains. CATHEDRAL performs an initial rapid
secondary structure comparison between structures, using graph theory to identify putative fold matches, which are then more carefully aligned using a slower, more accurate dynamic programming method. A new scoring scheme has been implemented which combines information on the class of the domains being compared, their sizes, similarity in the structural environments and the number of equivalent residues. A support vector machine is used to combine the different scores and select the best fold match for each putative domain in a new multi-domain protein.
Benchmarking against a set of 964 'difficult' multi-domain chains, whose 1593 constituent domains were remotely related to folds in CATH (<35% sequence identity) and originated from 245 distinct fold groups and 462 superfamilies, showed that 90% of domains within these chains could be assigned to the correct fold group, and for 78% of them the domain boundaries were within ±15 residues of boundaries assigned by careful manual validation. Larger variations in domain boundaries are often due to the fact that in many families significant structural variation can occur during evolution, so that distant relatives vary considerably in size. If no close relative has been classified in CATH, it is likely that the only CATHEDRAL match will be to a relative with significant structural embellishments, thus making it harder to determine the correct boundaries.
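The ±15 residue tolerance used in the benchmark can be expressed as a simple comparison. The text does not say exactly how agreement was scored, so this is one plausible reading written as a sketch: a domain is described by a list of (start, end) segments, and every segment end point of the prediction must lie within the tolerance of the manually assigned one. The function name and data layout are assumptions.

```python
def boundaries_agree(predicted, reference, tolerance=15):
    """Return True if every predicted segment boundary lies within `tolerance`
    residues of the corresponding reference boundary.

    Both arguments are lists of (start, end) residue numbers, in the same order.
    """
    if len(predicted) != len(reference):
        return False  # a different number of segments is counted as disagreement
    return all(
        abs(p_start - r_start) <= tolerance and abs(p_end - r_end) <= tolerance
        for (p_start, p_end), (r_start, r_end) in zip(predicted, reference)
    )
```

A benchmark score would then be the fraction of domains for which `boundaries_agree` returns True.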
Since domain boundary assignment of remote homologues is one of the most time-consuming stages in the classification, we combine multiple sources of information for each new structure on a web page to guide manual curation. Pages display scores from a range of algorithms, which include structure-based methods: CATHEDRAL (4), SSAP (6,7), DETECTIVE (8), PUU (9), DOMAK (10); sequence-based methods such as hidden Markov Models (HMMs) (11); and relevant literature. These pages are now viewable for information on putative boundaries for new multi-domain structures currently
being classified in CATH (e.g. http://www.cathdb.info/cgi-bin/cath/Chain.pl?chain_id=2g3aA).
For protein chains which are closely related to chains that are already chopped in CATH, an automated protocol has been developed (ChopClose). ChopClose identifies
any previously chopped chains that have sufficiently high sequence identity and overlap with the query chain. Using SSAP (6), the query is aligned against each of these chains in turn, and in each case the domain boundaries are inherited across the alignment. The process of inheriting the boundaries often requires some adjustments to be made, to account for insertions, deletions or unresolved residues. If the inheritance from one of the chains meets various criteria (SSAP score ≥80, sequence identity >80%, RMSD ≤6.0 Å, longest end extension ≤10 residues, etc.), then the resulting boundaries are used to chop the chain automatically (AutoChop). For cases where ChopClose's best result does not meet all the criteria for automatic chopping, it is provided as support information for a manual domain boundary assignment.
Refer to Figure 3 for the location of AutoChop/ChopClose within the CATH update protocol.
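The decision between automatic chopping and manual curation can be sketched as a threshold check. Only the four criteria named in the text are encoded here; the text ends its list with "etc.", so the real protocol applies further conditions that are not reproduced. The class and function names are hypothetical, and choosing the "best" result by SSAP score is an assumption made for the example.

```python
from dataclasses import dataclass


@dataclass
class Inheritance:
    """Result of inheriting boundaries from one previously chopped chain."""
    ssap_score: float
    sequence_identity: float    # percent
    rmsd: float                 # in Ångström
    longest_end_extension: int  # in residues


def passes_listed_criteria(r):
    """Check the four thresholds quoted in the text (the full list is longer)."""
    return (
        r.ssap_score >= 80
        and r.sequence_identity > 80
        and r.rmsd <= 6.0
        and r.longest_end_extension <= 10
    )


def route_chain(results):
    """Return ("AutoChop", result) if any result passes, otherwise
    ("manual", best_result) so the best attempt can support manual curation."""
    passing = [r for r in results if passes_listed_criteria(r)]
    if passing:
        return "AutoChop", max(passing, key=lambda r: r.ssap_score)
    if results:
        return "manual", max(results, key=lambda r: r.ssap_score)
    return "manual", None
```

Note that the sequence identity test is strict (greater than 80%), while the other three thresholds are inclusive, following the wording of the criteria.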
NEW HOMOLOGUE RECOGNITION METHODS
We have assessed a number of HMM-based protocols for improving homologue recognition. A new protocol (Samosa), exploiting models built using multiple structure alignments to improve accuracy, gives some improvements in sensitivity (4–5%). However, a protocol exploiting an 8-fold expanded HMM library, based on sequence relatives of structural domains, gives an increase of nearly 10% in sensitivity (12). In addition, HMM–HMM-based approaches have been implemented using the PRC protocol of Madera and co-workers (http://supfam.org/PRC). These allow recognition of extremely remote homologues, some of which are not easily detected by the structure comparison methods (discussed further below). HMM-based database scans developed for the CATH classification protocol are collectively referred to as HMMscan below.
For some very remotely related homologues, confidence in an assignment can be improved by combining information from multiple prediction methods. We have
investigated the benefits of using machine learning methods to do this automatically. A neural network was trained using a dataset of 14 000 diverse homologues (<35% sequence identity) and 14 000 non-homologous pairs, with data from different homologue comparison methods, including structure comparison (CATHEDRAL, SSAP), sequence comparison (HMM–HMM) and information on functional similarity. The latter was obtained by comparing EC classification codes between close relatives of the distant homologues, and using a semantic similarity scoring
scheme for comparing GO terms based on a method developed by Lord et al. (13). On a separate validation set of 14 000 homologous pairs and 14 000 non-homologous pairs, 97% of the homologues can be recognised at an error rate of <4%.
NEW UPDATE PROTOCOL
Automatic methods
Previously, CATH data was generated using a group of independent
programs and flat files. Over the past two years we have developed an
update protocol for CATH that is driven by a suite of programs with a
central library and a PostgreSQL database system. A classification
pipeline has been established which links, in a completely automated
fashion, the different programs that analyse the sequences and structures of both protein chains and domains. The CATH update protocol can
essentially be divided into two parts: domain boundary assignment and
domain homology classification (see Figure 3). The aim of the protocol is
to minimise manual assignment and provide as much support as
possible when manual validation is necessary.
Processing in both parts of the classification protocol is similar, requiring related meta-data and the triggering of the same automated algorithms. Methods include pairwise sequence similarity comparisons and scans by other homologue detection or
fold recognition algorithms, such as HMMscan and CATHEDRAL, that provide data for either manual or automated assignment. Many of the automated steps in the protocol have been established as a web service, and the pipeline integrates automated steps together with 'holding stages' in which domains are held prior to processing and await the completion of manual validation of predictions (see below).
Web pages to support manual validation
For each manual stage in the classification (domain boundary assignment, DomChop, and homologue recognition, HomCheck), we have developed a suite of web pages bringing together all available meta-data from prediction algorithms (e.g. DBS,
CATHEDRAL and HMMscan for DomChop; CATHEDRAL and HMMscan for HomCheck), and information from the literature and from other family classifications with relevant data (e.g. Pfam). For each protein or domain shown on the pages, information on the statistical significance of matches is presented. The web pages will shortly be made viewable and will provide interim data on protein chains and domains not fully classified in CATH, for biologists interested in any entries pending classification.
OVERVIEW OF THE CURRENT RELEASE (VERSION 3.0)
Assigning domain boundaries and relationships between protein structures is
computationally challenging. Since the last CATH release (version 2.6), the number of domains in the CATH database has increased by 20% in version 3.0 and now totals 86 151. This is a more than 10-fold increase in the number of domains classified in CATH since its creation. Improvements in automation, and also in the web-based resources used to aid manual validation, have allowed us to increase the proportion of hard-to-classify structures processed in CATH, and this is reflected in a significant increase in the proportion of new folds in the database (now more than 1000). The detailed breakdown of numbers of domains in the nine CATH levels is given in Table 1. We conducted an analysis of the domains in version 3.0 and have derived statistics for several fundamental features.
Percentage of new topologies
An analysis of the percentage of new folds arising from the early 1970s to the present is shown in Figure 1. The number of new folds has been decreasing over time with respect to the number of new structures being deposited, and it can be seen that currently approximately 2% of new structures classified in CATH are observed to be novel folds. For comparison, the number of domain structures solved over time is also graphically represented in Figure 1.
Number of domains within a protein chain
Integral to the construction of the CATH database is designating domain boundaries. We conducted an analysis of the number of chains versus the number of domains in a chain. It is interesting to note that 64% of all protein structures currently solved and classified in CATH are single-domain chains (data not shown). The next most prevalent are two-domain chains (27%), and following this we find that the number of chains containing three or more domains rapidly decreases. The average size of the single-domain chains is 159 residues.
The CATH Dictionary of Homologous Superfamilies (CATH-DHS)
The CATH-DHS has also been recently updated. Data on structural similarity and superfamily variability is presented as a significant update to the Dictionary of Homologous Superfamilies (DHS) web resource (14). The DHS also provides functional annotations of domains within each H-level (superfamily) in CATH v2.5.1.
For each superfamily, pair-wise structural similarity scores between relatives, measured by SSAP, are presented. The DHS now contains 3307 multiple structural alignments for 1459 superfamilies. For each superfamily, multiple alignments are generated for all the relatives and also for subgroups of structurally similar relatives and sequence-similar relatives. Alignments are performed using the residue-based CORA algorithm (15) and presented both as CORAPLOTS (14) and in the form of a 2DSEC diagram (16), alongside co-ordinate data of the superposed structures in PDB format. Sequence representations of the alignments are available to download in FASTA format. In the CORAPLOT images of the multiple alignment, residues in each domain are coloured according to ligand binding and residue type. EquivSEC plots are also shown that describe the variability in orientation and packing between equivalent secondary structures (16).
To identify sequence relatives for CATH superfamilies, sequences from UniProt (17) were scanned against HMMs of all CATH domains (12). Homologous sequences were
identified as those hits with an E-value < 0.01 and a 60% residue overlap with the CATH domain. This protocol recognised over one million domain sequences in UniProt which could be integrated in the CATH-DHS. The harvested sequences in each superfamily were compared against other relatives by BLAST (18) to determine the pair-wise sequence identity, and then clustered at appropriate levels of sequence identity (35% and 95%) using multi-linkage clustering. Information and links to other functional databases, ENZYME (19), GO (Gene Ontology Consortium, 2000), KEGG (20), COG (21) and SWISSPROT (22), are also included by BLASTing the sequences from each superfamily against sequences provided by these resources. Only hits with 95% sequence identity and an 80% residue overlap were used to annotate sequences.
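The two filtering steps described above reduce to threshold tests on each hit. The sketch below takes its cutoffs from the paragraph; treating the overlap and identity values as inclusive lower bounds is an assumption, as are the function names and the idea of passing plain numbers rather than parsed HMM or BLAST output.

```python
def is_homologous_hit(e_value, overlap_percent, max_e_value=0.01, min_overlap=60.0):
    """HMM scan filter: keep hits with a small E-value and sufficient overlap
    with the CATH domain."""
    return e_value < max_e_value and overlap_percent >= min_overlap


def can_transfer_annotation(identity_percent, overlap_percent,
                            min_identity=95.0, min_overlap=80.0):
    """BLAST filter for functional annotation: only near-identical hits with
    high overlap are used to annotate a sequence."""
    return identity_percent >= min_identity and overlap_percent >= min_overlap
```

The annotation filter is deliberately much stricter than the homologue filter, since it transfers functional labels rather than superfamily membership.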
Recent analysis of structural and functional divergence in highly populated CATH superfamilies (>5 structural relatives with <35% sequence identity) has been undertaken using data from the DHS. The 2DSEC algorithm was used to analyse multiple structural alignments of families and identify highly conserved structural cores and secondary structure embellishments, or decorations, to the common core. In some large superfamilies, extensive embellishments were observed outside the core, and although these secondary structure insertions were frequently discontinuous in the protein chain, they were often co-located in 3D space (16). In many cases, manual inspection revealed that the embellishment had aggregated to form a larger structural feature that was modifying the active site of the domain or creating new surfaces for domain or protein interactions. Data collected in the DHS clearly shows a relationship between structural divergence within a superfamily, sequence divergence of this superfamily amongst predicted domains in the genomes, and the number of distinct functional groups that can be identified for the superfamily (see Figure 4).
Figure 4 Relationship between sequence variability, structural variability and functional diversity in CATH superfamilies. Structural variation in a CATH superfamily, as measured by the number of diverse structural subgroups (SSAP score <80 between groups)
LATERAL LINKS ACROSS THE CATH HIERARCHY TO CAPTURE EVOLUTIONARY DIVERGENCE AND EXPLORE THE STRUCTURAL CONTINUUM
Our analysis of structural divergence in CATH superfamilies (16) has revealed families where significant changes in the structures had occurred. In some cases, 5-fold differences in the sizes of domains were identified, and sometimes it was apparent that the 'folds' of these very diverse relatives had effectively changed. Therefore, in these superfamilies more than one fold group can be identified, effectively breaking the hierarchical nature of the CATH classification, which implies that each relative within a CATH homologous superfamily should belong to the same CAT fold group.
In addition, an 'all versus all' HMM–HMM scan between all superfamily
representatives revealed several cases of extremely remote homologues which had been classified into separate superfamilies and yet match with significant E-values. Structure comparison had failed to detect the relationship between these superfamilies because the structural divergence of the relatives was so extreme, sometimes constituting a change in architecture as well as fold group. In these cases, homology was only suggested by the HMM-based scans and then manually validated by considering functional information and detailed evidence from the literature. In order to capture information on these distant homologies, links have been created between the superfamilies, both on our web pages and in the CATH database. The data can
now be found as a link from the CATH homepage (http://www.cathdb.info).
In the near future we also plan to provide web pages presenting cases of significant structural overlaps between superfamilies or fold groups. For these cases we are not currently able to find any additional evidence to support a distant evolutionary relationship, and these examples highlight the recurrence of large structural motifs between some folds and the existence of a structural continuum in some regions of fold space.
FEATURES
The CATH database can be accessed at http://www.cathdb.info. The web interface may be browsed or, alternatively, searched with PDB codes or CATH domain identifiers. There is also a facility for keyword searches. With the version 3.0 release, we now make the raw and processed data files available, including, for example, CATH domain PDB files, sequences and DSSP files; these can be accessed through the CATH database main page. The Gene3D resource can be accessed through the CATH database or directly at http://www.cathdb.info/Gene3D. The DHS can be accessed through the CATH database or directly at http://www.cathdb.info/bsm/dhs.
Structural similarity is assessed using an automatic method (SSAP) (3,4), which scores 100 for identical proteins and generally returns scores above 80 for homologous proteins. More distantly related folds generally give scores above 70 (Topology or fold level), though in the absence of any sequence or functional similarity this may simply represent examples of convergent evolution, reinforcing the hypothesis that there exists a limited number of folds in nature (5,6).
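The SSAP thresholds quoted above can be turned into a rough lookup. The following is only a sketch: the function name and the wording of the returned labels are illustrative assumptions, and the text itself stresses that a score above 70 without sequence or functional similarity may reflect convergence rather than homology.

```python
def ssap_relationship(score: float) -> str:
    """Map an SSAP score (0-100) to the coarse interpretation given in the text.

    The cut-offs (80 and 70) come from the passage above; the label strings
    are illustrative, not CATH terminology.
    """
    if not 0 <= score <= 100:
        raise ValueError("SSAP scores range from 0 to 100")
    if score > 80:
        return "likely homologous"
    if score > 70:
        return "similar fold (topology level)"
    return "no clear fold similarity"


for s in (92.0, 75.5, 60.0):
    print(s, "->", ssap_relationship(s))
```

A score alone does not settle the question; sequence and functional evidence still have to be weighed alongside it.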
Abstract
We report the latest release (version 1.4) of the CATH protein domains database (http://www.biochem.ucl.ac.uk/bsm/cath). This is a hierarchical classification of 13 359 protein domain structures into evolutionary families and structural groupings. We currently identify 827 homologous families in which the proteins have both structural similarity and sequence and/or functional similarity. These can be further clustered into 593 fold groups and 32 distinct architectures. Using our structural classification and associated data on protein functions stored in the database (EC identifiers, SWISS-PROT keywords and information from the Enzyme database and literature), we have been able to analyse the correlation between the 3D structure and function. More than 96% of folds in the PDB are associated with a single homologous family. However, within the superfolds, three or more different functions are observed. Considering enzyme functions, more than 95% of clearly homologous families exhibit either single or closely related functions, as demonstrated by the EC identifiers of their relatives. Our analysis supports the view that determining structures, for example as part of a 'structural genomics' initiative, will make a major contribution to interpreting genome data.
Introduction FOR CATH
The CATH classification of protein domain structures was established in 1993 (1) as a
hierarchical clustering of protein domain structures into evolutionary families and structural
groupings depending on sequence and structure similarity. There are four major levels,
corresponding to protein class, architecture, topology or fold, and homologous family (Fig. 1).
Since 1995, information about these structural groups and protein families has been accessible
over the Web (http://www.biochem.ucl.ac.uk/bsm/cath), together with summary information
about each individual protein structure (PDBsum) (2).
CATH consists of both phylogenetic and phenetic descriptors for protein domain relationships. At the lowest levels in the hierarchy, proteins are grouped into evolutionary families (homologous families) for having either significant sequence similarity (≥35% identity) or high structural similarity and some sequence similarity (≥20% identity).
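The grouping rule just stated can be written as a small predicate. This is a minimal sketch under stated assumptions: the passage does not say what counts as "high" structural similarity, so the SSAP cut-off of 80 used below is an assumption, and the function name is hypothetical.

```python
def same_homologous_family(seq_identity_pct: float,
                           ssap_score: float,
                           high_structure_cutoff: float = 80.0) -> bool:
    """Return True if two domains meet either criterion from the text.

    Criterion 1: significant sequence similarity (>= 35% identity).
    Criterion 2: high structural similarity plus some sequence
                 similarity (>= 20% identity).
    high_structure_cutoff is an assumed SSAP threshold, not stated here.
    """
    if seq_identity_pct >= 35.0:
        return True
    return ssap_score >= high_structure_cutoff and seq_identity_pct >= 20.0


print(same_homologous_family(40.0, 65.0))  # sequence alone is enough
print(same_homologous_family(22.0, 84.0))  # structure plus some sequence
print(same_homologous_family(15.0, 90.0))  # structure alone is not enough
```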
The CATH database of protein structures contains approximately 18000 domains organized according to their (C)lass, (A)rchitecture, (T)opology and (H)omologous superfamily. Relationships between evolutionarily related structures (homologues) within the database have been used to test the sensitivity of various sequence search methods in order to identify relatives in GenBank and other sequence databases. Subsequent application of the most sensitive and efficient algorithms, gapped BLAST and the profile-based method Position-Specific Iterated Basic Local Alignment Search Tool (PSI-BLAST), could be used to assign structural data to between 22% and 36% of microbial genomes in order to improve functional annotation and enhance understanding of biological mechanism. However, on a cautionary note, an analysis of functional conservation within fold groups and homologous superfamilies in the CATH database revealed that whilst function was conserved in nearly 55% of enzyme families, function had diverged considerably in some highly populated families. In these families, functional properties should be inherited far more cautiously, and the probable effects of substitutions in key functional residues must be carefully assessed.
Figure 1
Schematic representation of the (C)lass, (A)rchitecture and (T)opology/fold levels in the CATH database.
Figure 2
Snapshot of a web page showing data available in the CATH dictionary of homologous superfamilies for the subtilisin family (CATH id 3.40.50.200). Tables display the PDB codes for non-identical relatives in the family, together with EC identifier codes and information about the enzyme reactions. The multiple structural alignment shown has been coloured according to secondary structure assignments (red for helix, blue for strands).
The Architecture level in CATH groups proteins whose folds have similar 3D arrangements of secondary structures (e.g. barrel, sandwich or propellor), regardless of their connectivity, whilst the top level, Class, simply reflects the proportion of α-helix or β-strand secondary structures. Three major classes are recognised (mainly-α, mainly-β and α–β), since analysis revealed considerable overlap between the α+β and alternating α/β classes originally described by Levitt and Chothia (7).
Before classification, multidomain proteins are first separated into their constituent folds using a consensus method which seeks agreement between three independent algorithms (8). Whilst the protocol for updating CATH is largely automatic (9), several stages require manual validation, in particular establishing domain boundaries in proteins for which no consensus could be reached, and checking the relationships of very distant homologues and proteins having borderline fold similarity. Although there are plans to assign the more regular architectures automatically, all architecture groupings are currently assigned manually.
A homologous family Dictionary is now available within CATH, which contains functional data, where available, for each protein within a homologous family. This includes EC identifiers, SWISS-PROT keywords and information from the Enzyme database or the literature (Fig. 2). Multiple structure-based alignments are also available, coloured according to secondary structure assignments or residue properties, and there are schematic plots showing domain representations annotated by protein-ligand interactions (DOMPLOTS) (A.E. Todd, C.A. Orengo and J.M. Thornton, submitted to Protein Engng). The topology of each domain is illustrated by schematic TOPS diagrams (http://www3.ebi.ac.uk/tops; 10).
Figure 3
CATH wheel plot showing the population of homologous families in different fold groups, architectures and classes. The wheel is coloured according to protein class (red, mainly-α; green, mainly-β; yellow, αβ; blue, few secondary structures). The size of the outer wheel represents the number of homologous families in CATH, whilst each band in the outer wheel corresponds to a single fold family. The size of each fold 'band' therefore reflects the number of homologous families having that fold. It can be seen that most fold families contain a single homologous family. The superfold families are shown as paler bands containing many homologous families. The inner wheel shows the population of homologous families in the different architectures.
We have also recently set up a Web Server (11) which enables the user to scan the CATH database with a newly determined protein structure and identify possible fold similarities or evolutionary relationships. There are also plans to incorporate sequence searches (using BLAST or PSI-BLAST) (12) to identify a probable fold for a new sequence.
The latest release of CATH (version 1.4, April 1998) contains 9342 protein chains from the PDB (13), which divide into 13 359 domain folds. Currently, 32 different architectures are recognised. Since the last release, three new architectures have been described, including the five-bladed α-β propellor. Grouping proteins on the basis of sequence, structure and functional similarity gives 827 evolutionary homologous families (H-level), whilst recognising more distant structural similarity, with no accompanying sequence or function similarity, gives rise to 593 different fold groups (T-level).
The population of the different levels in the CATH hierarchy is illustrated by the CATH wheel shown in Figure 3. It can be seen that several highly populated fold families, which we describe as superfolds (6) as they support a diverse range of sequences and more than three different functions, still account for nearly 30% of non-homologous structures.
Implications for Structural Genomics
As the sequence databases grow rapidly, the need to interpret these sequences and assign functions to specific genes becomes increasingly important. Many techniques exist for matching protein sequences and thereby inheriting functional information. However, for very distant homologues there is often no detectable sequence similarity, despite conservation of 3D structure and function. For these cases, evolutionary relationships, and thereby functions, can only be assigned by comparing the structures. Therefore, a number of structural genomics initiatives are being proposed (14) which aim to identify all the folds in nature, with the ultimate goal of being able to predict the function of a new protein from its known or probable structure. The important questions to ask are: how many more folds do we need to determine before we have the complete set, and how confident can we be in assigning function between proteins having similar structures?
In the current genomes, on average only 30–46% of sequences can be assigned to a structural family by recognising sequence similarity to a protein of known structure (15,16). With only ~600 unique structures currently in the PDB, compared with ~20 000 sequence families, it is clear that we still need to determine many more structures if we are to understand biology at the molecular level. However, analysis of recently deposited structural data is very revealing.
Figure 4a illustrates the distribution of 2159 new structural domains classified in the 10 months from June 1997 to March 1998. A large proportion of these (79%) were clearly homologous (≥30% identity) to proteins of known structure.
Of the remaining 443 structures (Fig. 4b), corresponding to 'new' sequences, we found only 8% were novel folds, the remainder resembling a previously determined structure. Many of these, 199 (45%), could be identified as clear homologues by having significant structure and sequence similarity (SSAP ≥80 and ≥20% sequence identity). A further 169 (38%) were probable homologues: although the sequence identity was below 20%, they had functional similarity and/or gave significant scores using sequence search methods designed to detect very distant homologues (PSI-BLAST) (12). There remained a further 40 (9%) proteins which were analogous, i.e. they had the same fold as a previous entry but neither the sequence nor the function gave definite evidence of a common ancestor.
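The percentages in this breakdown can be checked with a few lines of arithmetic. In the sketch below the three category counts (199, 169, 40) and the total of 443 are taken from the text; the novel-fold count is not stated and is derived here by subtraction, which is an assumption about how the categories partition the set.

```python
total_new_sequences = 443  # structures with 'new' sequences (from the text)

counts = {
    "clear homologue": 199,
    "probable homologue": 169,
    "analogue": 40,
}
# Assumption: whatever is left over corresponds to the novel folds.
counts["novel fold"] = total_new_sequences - sum(counts.values())

shares = {name: round(100 * n / total_new_sequences)
          for name, n in counts.items()}

for name, n in counts.items():
    print(f"{name:20s} {n:4d}  {shares[name]:3d}%")

# Homologues recognised with the help of structure (clear + probable).
homologue_share = shares["clear homologue"] + shares["probable homologue"]
print("homologues in total:", homologue_share, "%")
```

The combined share for clear and probable homologues matches the 83% figure quoted later in the section on assigning function through structure.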
At the homologous superfamily level in CATH, a more detailed analysis of enzyme functions showed that the majority of homologous enzyme families in CATH (>90%) contained proteins for which the first three EC identifiers were the same. Considering those families where homologues have significant sequence identity (≥20%) after structural alignment, 95% were found to have a single EC identifier, whilst for families where proteins have more than 30% sequence similarity we observed that 98% had a single EC code.
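Comparing "the first three EC identifiers" amounts to counting how many leading fields two EC numbers share. The following sketch assumes EC numbers are written as four dot-separated fields with "-" for an unassigned field; the function name and the example EC numbers are illustrative and not taken from the analysis above.

```python
def shared_ec_levels(ec_a: str, ec_b: str) -> int:
    """Count how many leading EC fields two EC numbers have in common.

    A '-' (unassigned field) is never counted as a match.
    """
    shared = 0
    for a, b in zip(ec_a.split("."), ec_b.split(".")):
        if a != b or a == "-":
            break
        shared += 1
    return shared


def closely_related_function(ec_a: str, ec_b: str) -> bool:
    """True if the first three EC fields agree, as in the text's criterion."""
    return shared_ec_levels(ec_a, ec_b) >= 3


print(shared_ec_levels("3.4.21.62", "3.4.21.4"))
print(closely_related_function("3.4.21.62", "3.4.21.4"))
print(closely_related_function("3.4.21.62", "3.1.1.3"))
```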
Although assigning function on the basis of homology is common practice, it is clear that some caution should be exercised, particularly where there is little or no sequence similarity. There are also some clear examples where homologues with significant sequence similarity perform different functions. The role of 'gene recruitment' is especially clear in the eye lens proteins, which function as enzymes in other cellular environments but which are used as structural proteins in this context (17). The extent of such 'gene recruitment' and context-sensitive function is really not known at this time. For enzymes, it is clear that catalytic function can change and evolve, usually to act on a different but related substrate. Similarly, within the lipocalin family (CATH id 2.40.130.10), several proteins are found with very similar structures which bind different fatty acids in the same region at the base of the β-barrel (e.g. retinol, bilin, biotin).
Nearly half of the homologous families where two or more different EC numbers were observed belong to the superfolds. This suggests that if a new protein is assigned to a superfold family, more caution should be used when inheriting functional information, as there appears to be greater tolerance to changes in sequence, and ultimately function, for these families. However, it is interesting to note that many of these were TIM barrel or Rossmann folds. These are superfolds in which the substrate or ligand commonly binds in the same place: in the base of the β-barrel for the TIMs, and at the crossover of the polypeptide chain for the doubly wound Rossmann structures.
Assignment of Function Through Structure
One of the reasons for determining structures is to derive more information to facilitate the assignment of function. From our analysis of proteins in CATH, we suggest that structural data can help to assign function in several ways:
i. The structural data allow recognition of more distant homologues compared with sequence data. In our analysis, 83% of structures with novel sequences could be assigned as homologues in this way (note that such assignment of function is again subject to the caveats imposed by 'gene recruitment' discussed above).
ii. The structural data allow detailed inspection of the functional site, to suggest if and how the function may have evolved. For example, if an enzyme has evolved to act on a different substrate, the binding site may reveal, or at least suggest, possible changes in the substrate.
iii. For the superfolds, similarity of structure does not necessarily mean similarity of function. However, the active site/binding sites are often conserved, e.g. in the TIM barrel or Rossmann fold structures the ligand always binds at the same end of the barrel or sheet.
iv. Some methods have already been developed, and will increasingly be the focus of attention over the next few years, which aim to predict function ab initio from structure. For example, enzymes can often be identified by the presence of a major cleft, which also locates the active site (18). Similarly, critical surface patches which are used for molecular recognition in binding other proteins or ligands may be identified using knowledge-based approaches (19,20).
In summary, extrapolating the data from Figure 4 to a new genome, we can expect that of the 54–70% of sequences which currently have no obvious sequence matches in the PDB, we will find nearly 80–90% to be homologous to a known family using the structural data alone. For the singlet folds, this will almost certainly reveal some clues to the function. For the superfolds, some folds will reveal information on the functional class (e.g. enzyme for TIM barrels) or the location of the active site, if not the specific function. Only 10–20% will be expected to be 'novel' folds. For these, the ab initio methods referred to above may provide some clues to guide experiments. Therefore, it is clear that determining structures, as part of a 'structural genomics' initiative for example, will make a major contribution to interpreting genome data.
Jail
Just another interface library
Interfaces of macromolecules are a valuable basis to analyse the process of molecular recognition. JAIL classifies not only the interfaces between domain architectures but also those between protein chains and those between proteins and nucleic acids.
Of course not
Gt-alpha/Gi-alpha chimera (PDB-ID 1GOT). Interfaces of 1GOT.
Interacting proteins are difficult to crystallize and are rarely present within the Protein Data Bank. Nevertheless, it is essential to analyse the interacting parts of the proteins to understand the process of protein-protein docking.
To overcome this problem we have built up the JAIL database. Since interacting domains exhibit structural features similar to those of proteins, all known interfaces between interacting domains of the SCOP database were extracted and classified in JAIL.
Only a part of all protein structures is included in SCOP. In particular, new PDB entries are not yet annotated. To overcome this problem, all interfaces between protein chains were additionally calculated and included in the database. This type of interface also comprises the interacting parts of the assumed biological units. The last important type of interface provided here is composed of the interacting parts between proteins and nucleic acids. Overall, the data set consists of about 180000 interfaces.
JAIL is a convenient tool to browse through the interface library and to analyze single interfaces. However, more general questions require large-scale analysis. For this purpose, a detailed form enables the compiling of comprehensive non-redundant data sets for download.
How is an interface defined?
A complete residue is part of an interface if at least one atom of the amino acid is located within a range of 4.5 Ångström of any atom of the interacting domain or chain. One part of an interface must consist of at least 5 C-alpha atoms in the case of protein chains. In the case of nucleic acids, the relevant atoms are calculated on the basis of the phosphorus atoms of the RNA/DNA backbone.
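The distance criterion above can be illustrated with plain Python. This is a minimal sketch, not JAIL's implementation: the data layout (a dictionary from residue identifier to a list of atom coordinates), the function names and the toy coordinates are all assumptions, and a brute-force all-against-all distance check is used for clarity rather than speed.

```python
import math


def interface_residues(side_a, side_b, cutoff=4.5):
    """Return residues of side_a with at least one atom within `cutoff`
    Angstrom of any atom of side_b.

    side_a, side_b: dict mapping a residue id to a list of (x, y, z) tuples.
    """
    atoms_b = [atom for atoms in side_b.values() for atom in atoms]
    hits = set()
    for res_id, atoms in side_a.items():
        if any(math.dist(a, b) <= cutoff for a in atoms for b in atoms_b):
            hits.add(res_id)
    return hits


def is_valid_interface_part(residues, min_residues=5):
    """Apply the minimum-size rule (at least 5 C-alpha atoms, i.e. residues)."""
    return len(residues) >= min_residues


# Toy coordinates, invented for illustration only.
chain_a = {"A:10": [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)], "A:11": [(20.0, 0.0, 0.0)]}
chain_b = {"B:5": [(4.0, 0.0, 0.0)]}

part_a = interface_residues(chain_a, chain_b)
print(part_a, is_valid_interface_part(part_a))
```

For real structures a spatial index (for example a grid or k-d tree) would replace the nested loop, but the selection rule stays the same.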
What are biounits?
The primary coordinate file deposited in the PDB generally contains one asymmetric unit. The asymmetric unit is the smallest portion of a crystal structure to which crystallographic symmetry can be applied to generate one cell. The biological molecule (biounit) is believed to be the functional unit of the protein. Frequently, those units can be assumed or calculated when additional information is available. The biological units of many proteins are deposited in a separate section of the PDB database and can be used for interface calculations.
In which way are redundant interfaces excluded in the download section?
The redundancy is excluded in two different ways: by structure and by sequence. The sequential clustering is based on the CD-HIT program. The structural clustering is defined by the protein families and superfamilies of the SCOP classification. The database classifies proteins by domain architecture.
Which settings in the download section are best for my own research?
The selection of the datasets depends on the type of interactions (protein-protein or protein-nucleic acid) and the level of diversity that is desired. A maximum sequence identity of 50% results in a higher diversity than a setting of 95%. The default settings include interfaces of domain-domain interactions as well as interfaces between interacting chains. All interfaces of chains that were already treated by the SCOP domain interfaces are excluded by default. This procedure results in a high number of interfaces that are still diverse enough for statistical analysis.
What is meant by 'show conservation' in Jmol?
The conservation of protein sequences is defined by the mutation rates at each amino acid position. For JAIL, this information was retrieved from ConSurf. ConSurf is a derived database merging structural and sequence information.
Database scheme
Search for:
PDB-Id, e.g. 1aay
SCOP-Id, e.g. d1az0a_
EC-Number, e.g. 3.1.1
Accession number, e.g. P03697
Protein name, e.g. capsid protein
Search in the following interface types: Domain/Domain (SCOP), Chain/Chain, Protein/Nucleic, Biounit/Biounit, None
Fulltext search: Keyword
SCOP text search: Keyword
Consider only interfaces having the following location: intra, inter, don't care
MMDB
Experimentally resolved structures of proteins, RNA and DNA, derived from the Protein Data Bank (PDB), with value-added features such as explicit chemical graphs, computationally identified 3D domains (compact substructures) that are used to identify similar 3D structures, as well as links to literature, similar sequences, information about chemicals bound to the structures, and more. These connections make it possible, for example, to find 3D structures for homologs of a protein sequence of interest, then interactively view the sequence-structure relationships, active sites, bound chemicals, journal articles, and more.
Three-dimensional structures are now known within many protein families, and it is quite likely, in searching a sequence database, that one will encounter a homolog with known structure. The goal of Entrez's 3D-structure database is to make this information, and the functional annotation it can provide, easily accessible to molecular biologists. To this end, Entrez's search engine provides three powerful features: (i) Sequence and structure neighbors: one may select all sequences similar to one of interest, for example, and link to any known 3D structures. (ii) Links between databases: one may search by term matching in MEDLINE, for example, and link to 3D structures reported in these articles. (iii) Sequence and structure visualization: identifying a homolog with known structure, one may view molecular-graphic and alignment displays to infer approximate 3D structure. In this article we focus on two features of Entrez's Molecular Modeling Database (MMDB) not described previously: links from individual biopolymer chains within 3D structures to a systematic taxonomy of organisms represented in molecular databases, and links from individual chains (and compact 3D domains within them) to structure neighbors, other chains (and 3D domains) with similar 3D structure. MMDB may be accessed at http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Structure
SUPERFAMILY is a database of structural and functional annotation for all proteins and genomes.[1][2][3][4][5]
The SUPERFAMILY annotation is based on a collection of hidden Markov models, which represent structural protein domains at the SCOP superfamily level.[6] A superfamily groups together domains which have an evolutionary relationship. The annotation is produced by scanning protein sequences from completely sequenced genomes against the hidden Markov models.
For each protein you can:
Submit sequences for SCOP classification
View domain organisation, sequence alignments and protein sequence details
For each genome you can:
Examine superfamily assignments, phylogenetic trees, domain organisation lists and networks
Check for over- and under-represented superfamilies within a genome
For each superfamily you can:
Inspect SCOP classification, functional annotation, Gene Ontology annotation, InterPro abstract and genome assignments
Explore taxonomic distribution of a superfamily across the tree of life
All annotation, models and the database dump are freely available for download to everyone.
Purpose
SUPERFAMILY classifies amino acid sequences into known structural domains, especially into SCOP superfamilies. The superfamilies are groups of proteins which have structural evidence to support a common evolutionary ancestor but may not have detectable sequence homology.
Major Features
Sequence search: Submit your protein or DNA sequence for SCOP superfamily and family level classification.
Keyword search: Search for superfamily, family or species names, plus sequence, SCOP, PDB or hidden Markov model IDs.
Domain assignments: Domain assignments, alignments and architectures for completely sequenced eukaryotic and prokaryotic organisms, plus sequence collections.
Comparative genomics tools: Browse unusual (over- and under-represented) superfamilies and families, adjacent domain pair lists and graphs, unique domain pairs, domain combinations, domain architecture co-occurrence networks and domain distribution across taxonomic kingdoms for each organism.
Genome statistics: For each genome: number of sequences, number of sequences with assignment, percentage of sequences with assignment, percentage total sequence coverage, number of domains assigned, number of superfamilies assigned, number of families assigned, average superfamily size, percentage produced by duplication, average sequence length, average length matched, number of domain pairs and number of unique domain architectures.
Gene Ontology: Domain-centric Gene Ontology (GO) automatically annotated by Hai Fang.
Phenotype Ontology: Domain-centric phenotype/anatomy ontology, including Disease Ontology, Human Phenotype, Mouse Phenotype, Worm Phenotype, Yeast Phenotype, Fly Phenotype, Fly Anatomy, Zebrafish Anatomy, Xenopus Anatomy, Arabidopsis Plant.
Superfamily annotation: InterPro abstracts for 1052 superfamilies and Gene Ontology (GO) annotation for 763 superfamilies.
Functional annotation: Functional annotation of SCOP 1.73 superfamilies by Christine Vogel.
Phylogenetic trees: Trees are generated using heuristic parsimony methods and are based on protein domain architecture data for all genomes in SUPERFAMILY. Genome combinations or specific clades can be displayed as individual trees.
Similar domain architectures: Find the 10 domain architectures which are most similar to a domain architecture of interest.
Hidden Markov models: Produce SCOP domain assignments for your sequences using the SUPERFAMILY models. HMM visualisation by Martin Madera, eg model 0045110.
Profile comparison: Find remote domain matches when the HMM search fails to find a significant match. Profile comparison (PRC) for aligning and scoring two profile hidden Markov models, by Martin Madera.
Web services: Distributed Annotation Server and linking to SUPERFAMILY.
Downloads: Sequences, assignments, models, MySQL database and scripts, updated weekly.
The SUPERFAMILY database provides protein domain assignments, at the SCOP superfamily level, for the predicted protein sequences in over 400 completed genomes. A superfamily groups together domains of different families which have a common evolutionary ancestor based on structural, functional and sequence data. SUPERFAMILY domain assignments are generated using an expert curated set of profile hidden Markov models. All models and structural assignments are available for browsing and download from http://supfam.org. The web interface includes services such as domain architectures and alignment details for all protein assignments, searchable domain combinations, domain occurrence network visualization, detection of over- or under-represented superfamilies for a given genome by comparison with other genomes, assignment of manually submitted sequences and keyword searches. In this update we describe the SUPERFAMILY database and outline two major developments: (i) incorporation of family level assignments and (ii) a superfamily-level functional annotation. The SUPERFAMILY database can be used for general protein evolution and superfamily-specific studies, genomic annotation, and structural genomics target suggestion and assessment.
Figure 3. Flow diagram of the CATH classification pipeline. This schematic illustrates the processes involved in classifying newly determined structures in CATH. The CATH update protocol workflow, from new chain to assigned domain, is split into two main processes.
In this paper we report our ongoing development of the automated procedures. These critical new features should better enable CATH to keep pace with the PDB (3) and facilitate its development. Key statistics on the domain structure populations and characteristics are also presented.
A REVISED CATH CLASSIFICATION HIERARCHY: CATHSOLID
In order to provide information on the sequence diversity between superfamily members, we have introduced additional levels into the CATH hierarchy. The CATH hierarchical classification scheme now consists of nine levels. Class is derived from secondary structure content, and Architecture describes the gross orientation of secondary structures, independent of connectivity. The Topology level clusters structures into fold groups according to their topological connections and numbers of secondary structures. The Homologous superfamilies cluster proteins with highly similar structures, sequences and/or functions (4,5). The new extension of the CATH classification system now includes five 'SOLID' sequence levels. S, O, L and I further divide domains within the H-level using multi-linkage clustering based on similarities in sequence identity (35%, 60%, 95% and 100%) (see Table 1). The D-level acts as a counter within the I-level and is appended to the classification hierarchy to ensure that every domain in CATH has a unique CATHSOLID identification code (see Table 1). Specific details on the nature of the SOLID levels can be found in the 'General Information' section of the CATH website, http://www.cathdb.info. CATH only includes experimentally determined protein structures with a resolution of 4 Å or better, that are 40 residues or longer, and that have 70% or more of their side chains resolved.
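The nine-level scheme lends itself to a simple parser. The sketch below assumes that a CATHSOLID code is written as nine dot-separated integers in the order C.A.T.H.S.O.L.I.D; that serialisation, the function names and the example identifier are assumptions made for illustration, not details confirmed by the text.

```python
CATHSOLID_LEVELS = ("C", "A", "T", "H", "S", "O", "L", "I", "D")

# Sequence-identity cut-offs for the S, O, L and I clusters, as listed in the text.
SEQUENCE_CUTOFFS_PCT = {"S": 35, "O": 60, "L": 95, "I": 100}


def parse_cathsolid(identifier: str) -> dict:
    """Split a nine-part CATHSOLID code into named integer levels."""
    parts = identifier.strip().split(".")
    if len(parts) != len(CATHSOLID_LEVELS) or not all(p.isdigit() for p in parts):
        raise ValueError(f"expected nine dot-separated integers, got {identifier!r}")
    return dict(zip(CATHSOLID_LEVELS, (int(p) for p in parts)))


def superfamily_code(identifier: str) -> str:
    """Return the leading C.A.T.H part, i.e. the homologous superfamily."""
    return ".".join(identifier.strip().split(".")[:4])


example = "3.40.50.200.1.1.1.1.1"  # hypothetical identifier
levels = parse_cathsolid(example)
print(levels)
print(superfamily_code(example))
```

Two domains sharing the first five fields would, under this reading, fall in the same 35% sequence-identity cluster of the same superfamily.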
Table 1. CATH version 3.0 statistics
DOMAIN BOUNDARY ASSIGNMENTS
We have further improved our automated domain boundary prediction method, CATHEDRAL (4). This is used to search a newly determined multi-domain structure against a library of representative structures from different fold groups in the CATH database to recognise constituent domains. CATHEDRAL performs an initial rapid secondary structure comparison between structures, using graph theory to identify putative fold matches, which are then more carefully aligned using a slower, more accurate dynamic programming method. A new scoring scheme has been implemented which combines information on the class of the domains being compared, their sizes, similarity in the structural environments and the number of equivalent residues. A support vector machine is used to combine the different scores and select the best fold match for each putative domain in a new multi-domain protein.
Benchmarking against a set of 964 'difficult' multi-domain chains, whose 1593 constituent domains were remotely related to folds in CATH (<35% sequence identity) and originated from 245 distinct fold groups and 462 superfamilies, showed that 90% of domains within these chains could be assigned to the correct fold group, and for 78% of them the domain boundaries were within ±15 residues of boundaries assigned by careful manual validation. Larger variations in domain boundaries are often due to the fact that in many families significant structural variation can occur during evolution, so that distant relatives vary considerably in size. If no close relative has been classified in CATH, it is likely that the only CATHEDRAL match will be to a relative with significant structural embellishments, thus making it harder to determine the correct boundaries.
Since domain boundary assignment of remote homologues is one of the most time-consuming stages in the classification, we combine multiple sources of information for each new structure on a web page to guide manual curation. Pages display scores from a range of algorithms, which include structure-based methods (CATHEDRAL (4), SSAP (6,7), DETECTIVE (8), PUU (9), DOMAK (10)), sequence-based methods such as hidden Markov models (HMMs) (11), and relevant literature. These pages are now viewable for information on putative boundaries for new multi-domain structures currently
being classified in CATH (e.g. http://www.cathdb.info/cgi-bin/cath/Chain.pl?chain_id=2g3aA).
For protein chains which are closely related to chains that are already chopped in CATH, an automated protocol has been developed (ChopClose). ChopClose identifies any previously chopped chains that have sufficiently high sequence identity and overlap with the query chain. Using SSAP (6), the query is aligned against each of these chains in turn, and in each case the domain boundaries are inherited across the alignment. The process of inheriting the boundaries often requires some adjustments to account for insertions, deletions or unresolved residues. If the inheritance from one of the chains meets various criteria (SSAP score ≥80, sequence identity >80%, RMSD ≤6.0 Å, longest end extension ≤10 residues, etc.), then the resulting boundaries are used to chop the chain automatically (AutoChop). For cases where ChopClose's best result does not meet all the criteria for automatic chopping, it is provided as support information for a manual domain boundary assignment.
Refer to Figure 3 for the location of AutoChop/ChopClose within the CATH update protocol.
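The acceptance test that separates automatic chopping from manual review can be sketched in a few lines of Python. This is only an illustration of the four thresholds quoted above. The class and function names are hypothetical, and the text's "etc." indicates that the real protocol applies further criteria that are not listed here.

```python
from dataclasses import dataclass


@dataclass
class InheritedChopping:
    """Quality measures for boundaries inherited from one chopped relative.

    Field names are illustrative, not taken from the CATH code base.
    """
    ssap_score: float             # SSAP structural similarity, 0-100
    sequence_identity: float      # percent identity to the chopped chain
    rmsd: float                   # RMSD of the alignment, in angstroms
    longest_end_extension: int    # residues overhanging the alignment


def meets_autochop_criteria(result: InheritedChopping) -> bool:
    """Apply only the four thresholds quoted in the text.

    The real protocol uses additional, unlisted criteria.
    """
    return (
        result.ssap_score >= 80.0
        and result.sequence_identity > 80.0
        and result.rmsd <= 6.0
        and result.longest_end_extension <= 10
    )


good = InheritedChopping(ssap_score=88.0, sequence_identity=92.0,
                         rmsd=1.8, longest_end_extension=4)
borderline = InheritedChopping(ssap_score=88.0, sequence_identity=80.0,
                               rmsd=1.8, longest_end_extension=4)

print(meets_autochop_criteria(good))        # passes all four thresholds
print(meets_autochop_criteria(borderline))  # identity must exceed 80%
```

A result that fails the test is not discarded: as described above, it is passed on as supporting information for manual boundary assignment.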
NEW HOMOLOGUE RECOGNITION METHODS
We have assessed a number of HMM-based protocols for improving homologue recognition. A new protocol (Samosa), exploiting models built using multiple structure alignments to improve accuracy, gives some improvement in sensitivity (4–5%). However, a protocol exploiting an 8-fold expanded HMM library based on sequence relatives of structural domains gives an increase of nearly 10% in sensitivity (12). In addition, HMM–HMM-based approaches have been implemented using the PRC protocol of Madera and co-workers (http://supfam.org/PRC). These allow recognition of extremely remote homologues, some of which are not easily detected by the structure comparison methods (discussed further below). HMM-based database scans developed for the CATH classification protocol are collectively referred to as HMMscan below.
For some very remotely related homologues, confidence in an assignment can be improved by combining information from multiple prediction methods. We have investigated the benefits of using machine learning methods to do this automatically. A neural network was trained using a dataset of 14 000 diverse homologues (<35% sequence identity) and 14 000 non-homologous pairs, with data from different homologue comparison methods including structure comparison (CATHEDRAL, SSAP), sequence comparison (HMM–HMM) and information on functional similarity. The latter was obtained by comparing EC classification codes between close relatives of the distant homologues and using a semantic similarity scoring
scheme for comparing GO terms, based on a method developed by Lord et al. (13). On a separate validation set of 14 000 homologous pairs and 14 000 non-homologous pairs, 97% of the homologues can be recognised at an error rate of <4%.
NEW UPDATE PROTOCOL
Automatic methods
Previously, CATH data was generated using a group of independent
programs and flat files. Over the past two years we have developed an
update protocol for CATH that is driven by a suite of programs with a
central library and a PostgreSQL database system. A classification
pipeline has been established which links, in a completely automated
fashion, the different programs that analyse the sequences and structures
of both protein chains and domains. The CATH update protocol can
essentially be divided into two parts: domain boundary assignment and
domain homology classification (see Figure 3). The aim of the protocol is
to minimise manual assignment and to provide as much support as
possible when manual validation is necessary.
Processing of both parts of the classification protocol is similar, requiring related meta-data and the triggering of the same automated algorithms. Methods include pairwise sequence similarity comparisons and scans by other homologue detection or fold recognition algorithms, such as HMMscan and CATHEDRAL, that provide data for either manual or automated assignment. Many of the automated steps in the protocol have been established as a web service, and the pipeline integrates the automated steps with 'holding stages' in which domains are held prior to processing and await the completion of manual validation of predictions (see below).
Web pages to support manual validation
For each manual stage in the classification (domain boundary assignment, DomChop, and homologue recognition, HomCheck) we have developed a suite of web pages bringing together all available meta-data from prediction algorithms (e.g. DBS, CATHEDRAL, HMMscan for DomChop; CATHEDRAL, HMMscan for HomCheck), information from the literature and information from other family classifications with relevant data (e.g. Pfam). For each protein or domain shown on the pages, information on the statistical significance of matches is presented. The web pages will shortly be made viewable and will provide interim data on protein chains and domains not fully classified in CATH, for biologists interested in any entries pending classification.
OVERVIEW OF THE CURRENT RELEASE (VERSION 3.0)
Assigning domain boundaries and relationships between protein structures is computationally challenging. Since the last CATH release (version 2.6), the number of domains in the CATH database has increased by 20% in version 3.0 and now totals 86 151. This is a more than 10-fold increase in the number of domains classified in CATH since its creation. Improvements in automation, and also in the web-based resources used to aid manual validation, have allowed us to increase the proportion of hard-to-classify structures processed in CATH, and this is reflected in a significant increase in the proportion of new folds in the database, now more than 1000. The detailed breakdown of numbers of domains in the nine CATH levels is given in Table 1. We conducted an analysis of the domains in version 3.0 and have derived statistics for several fundamental features.
Percentage of new topologies
An analysis of the percentage of new folds arising from the early 1970s to the present is shown in Figure 1. The number of new folds has been decreasing over time with respect to the number of new structures being deposited, and it can be seen that currently approximately 2% of new structures classified in CATH are observed to be novel folds. For comparison, the number of domain structures solved over time is also graphically represented in Figure 1.
Number of domains within a protein chain
Integral to the construction of the CATH database is designating domain boundaries. We conducted an analysis of the number of chains versus the number of domains in a chain. It is interesting to note that 64% of all protein structures currently solved and classified in CATH are single-domain chains (data not shown). The next most prevalent are two-domain chains (27%), and beyond this the number of chains containing three or more domains rapidly decreases. The average size of the single-domain chains is 159 residues.
The CATH Dictionary of Homologous Superfamilies (CATH-DHS)
The CATH-DHS has also been recently updated. Data on structural similarity and superfamily variability are presented as a significant update to the Dictionary of Homologous Superfamilies (DHS) web resource (14). The DHS also provides functional annotations of domains within each H-level (superfamily) in CATH v2.5.1.
For each superfamily, pair-wise structural similarity scores between relatives, measured by SSAP, are presented. The DHS now contains 3307 multiple structural alignments for 1459 superfamilies. For each superfamily, multiple alignments are generated for all the relatives and also for subgroups of structurally similar relatives and sequence-similar relatives. Alignments are performed using the residue-based CORA algorithm (15) and presented both as CORAPLOTS (14) and in the form of a 2DSEC diagram (16), alongside co-ordinate data of the superposed structures in PDB format. Sequence representations of the alignments are available to download in FASTA format. In the CORAPLOT images of the multiple alignment, residues in each domain are coloured according to ligand binding and residue type. EquivSEC plots are also shown that describe the variability in orientation and packing between equivalent secondary structures (16).
To identify sequence relatives for CATH superfamilies, sequences from UniProt (17) were scanned against HMMs of all CATH domains (12). Homologous sequences were identified as those hits with an E-value < 0.01 and a 60% residue overlap with the CATH domain. This protocol recognised over one million domain sequences in UniProt which could be integrated in the CATH-DHS. The harvested sequences in each superfamily were compared against other relatives by BLAST (18) to determine the pair-wise sequence identity and then clustered at appropriate levels of sequence identity (35% and 95%) using multi-linkage clustering. Information and links to other functional databases, ENZYME (19), GO (Gene Ontology Consortium 2000), KEGG (20), COG (21) and SWISS-PROT (22), are also included by BLASTing the sequences from each superfamily against sequences provided by these resources. Only hits with 95% sequence identity and an 80% residue overlap were used to annotate sequences.
Recent analysis of structural and functional divergence in highly populated CATH superfamilies (>5 structural relatives with <35% sequence identity) has been undertaken using data from the DHS. The 2DSEC algorithm was used to analyse multiple structural alignments of families and identify highly conserved structural cores and secondary structure embellishments or decorations to the common core. In some large superfamilies, extensive embellishments were observed outside the core, and although these secondary structure insertions were frequently discontinuous in the protein chain, they were often co-located in 3D space (16). In many cases, manual inspection revealed that the embellishments had aggregated to form a larger structural feature that was modifying the active site of the domain or creating new surfaces for domain or protein interactions. Data collected in the DHS clearly show a relationship between structural divergence within a superfamily, sequence divergence of this superfamily amongst predicted domains in the genomes, and the number of distinct functional groups that can be identified for the superfamily (see Figure 4).
Figure 4. Relationship between sequence variability, structural variability and functional diversity in CATH superfamilies. Structural variation in a CATH superfamily, as measured by the number of diverse structural subgroups (SSAP score <80 between groups)
LATERAL LINKS ACROSS THE CATH HIERARCHY TO CAPTURE EVOLUTIONARY DIVERGENCE AND EXPLORE THE STRUCTURAL CONTINUUM
Our analysis of structural divergence in CATH superfamilies (16) has revealed families where significant changes in the structures had occurred. In some cases 5-fold differences in the sizes of domains were identified, and sometimes it was apparent that the 'folds' of these very diverse relatives had effectively changed. Therefore, in these superfamilies more than one fold group can be identified, effectively breaking the hierarchical nature of the CATH classification, which implies that each relative within a CATH homologous superfamily should belong to the same CAT fold group.
In addition, an 'all versus all' HMM–HMM scan between all superfamily representatives revealed several cases of extremely remote homologues which had been classified into separate superfamilies and yet match with significant E-values. Structure comparison had failed to detect the relationship between these superfamilies because the structural divergence of the relatives was so extreme, sometimes constituting a change in architecture as well as fold group. In these cases homology was only suggested by the HMM-based scans and then manually validated by considering functional information and detailed evidence from the literature. In order to capture information on these distant homologies, links have been created between the superfamilies, both on our web pages and in the CATH database. The data can now be found as a link from the CATH homepage (http://www.cathdb.info).
In the near future we also plan to provide web pages presenting cases of significant structural overlaps between superfamilies or fold groups. For these cases we are not currently able to find any additional evidence to support a distant evolutionary relationship, and these examples highlight the recurrence of large structural motifs between some folds and the existence of a structural continuum in some regions of fold space.
FEATURES
The CATH database can be accessed at http://www.cathdb.info. The web interface may be browsed or, alternatively, searched with PDB codes or CATH domain identifiers. There is also a facility for keyword searches. With the version 3.0 release we now make the raw and processed data files available, which include, for example, CATH domain PDB files, sequences and DSSP files; they can be accessed through the CATH database main page. The Gene3D resource can be accessed through the CATH database or directly at http://www.cathdb.info/Gene3D. The DHS can be accessed through the CATH database or directly at http://www.cathdb.info/bsm/dhs.
Structural similarity is assessed using an automatic method (SSAP) (3,4), which scores 100 for
identical proteins and generally returns scores above 80 for homologous proteins. More distantly
related folds generally give scores above 70 (Topology or fold level), though in the absence of any
sequence or functional similarity this may simply represent examples of convergent evolution, reinforcing the hypothesis that there exists a limited number of folds in nature (5,6).
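As a rough reading aid, the SSAP score bands described above can be written as a small lookup. This is a minimal sketch: the function name and the returned labels are illustrative assumptions, and the text itself says the cut-offs only hold "generally", so a score alone never proves homology.

```python
def interpret_ssap(score: float) -> str:
    """Map an SSAP score (0-100) to the rough bands quoted in the text.

    The labels are illustrative; the thresholds are rules of thumb only.
    """
    if not 0 <= score <= 100:
        raise ValueError("SSAP scores lie between 0 and 100")
    if score == 100:
        return "identical structures"
    if score > 80:
        return "likely homologous"
    if score > 70:
        # Same fold (T-level); without sequence or functional similarity
        # this may reflect convergent evolution rather than common ancestry.
        return "similar fold"
    return "no clear structural relationship"


for s in (100, 86.5, 74, 52):
    print(s, "->", interpret_ssap(s))
```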
Abstract
We report the latest release (version 1.4) of the CATH protein domains database
(http://www.biochem.ucl.ac.uk/bsm/cath). This is a hierarchical classification of 13 359 protein domain structures into evolutionary families and structural groupings. We currently identify 827
homologous families in which the proteins have both structural similarity and sequence and/or
functional similarity. These can be further clustered into 593 fold groups and 32 distinct architectures.
Using our structural classification and associated data on protein functions stored in the database (EC
identifiers, SWISS-PROT keywords and information from the Enzyme database and literature), we have been able to analyse the correlation between 3D structure and function. More than 96% of
folds in the PDB are associated with a single homologous family. However, within the superfolds,
three or more different functions are observed. Considering enzyme functions, more than 95% of clearly homologous families exhibit either single or closely related functions, as demonstrated by the
EC identifiers of their relatives. Our analysis supports the view that determining structures, for
example as part of a 'structural genomics' initiative, will make a major contribution to interpreting genome data.
Introduction FOR CATH
The CATH classification of protein domain structures was established in 1993 (1) as a
hierarchical clustering of protein domain structures into evolutionary families and structural
groupings, depending on sequence and structure similarity. There are four major levels,
corresponding to protein class, architecture, topology or fold, and homologous family (Fig. 1).
Since 1995, information about these structural groups and protein families has been accessible
over the Web (http://www.biochem.ucl.ac.uk/bsm/cath), together with summary information
about each individual protein structure (PDBsum) (2).
CATH consists of both phylogenetic and phenetic descriptors for protein domain relationships.
At the lowest levels in the hierarchy, proteins are grouped into evolutionary families
(Homologous families) for having either significant sequence similarity (≥35% identity) or high
structural similarity and some sequence similarity (≥20% identity).
The CATH database of protein structures contains approximately 18 000 domains
organized according to their (C)lass, (A)rchitecture, (T)opology and (H)omologous
superfamily. Relationships between evolutionarily related structures (homologues) within the
database have been used to test the sensitivity of various sequence search methods in order to
identify relatives in GenBank and other sequence databases. Subsequent application of the most
sensitive and efficient algorithms, gapped BLAST and the profile-based method Position-Specific
Iterated Basic Local Alignment Search Tool (PSI-BLAST), could be used to assign structural data to
between 22% and 36% of microbial genomes in order to improve functional annotation and
enhance understanding of biological mechanism. However, on a cautionary note, an analysis of
functional conservation within fold groups and homologous superfamilies in the CATH database
revealed that, whilst function was conserved in nearly 55% of enzyme families, function had
diverged considerably in some highly populated families. In these families functional properties
should be inherited far more cautiously, and the probable effects of substitutions in key functional
residues must be carefully assessed.
Figure 1
Schematic representation of the (C)lass, (A)rchitecture and (T)opology/fold levels in the CATH
database.
Figure 2
Snapshot of a web page showing data available in the CATH dictionary of homologous superfamilies
for the subtilisin family (CATH id 3.40.50.200). Tables display the PDB codes for non-identical relatives in the family, together with EC identifier codes and information about the enzyme reactions.
The multiple structural alignment shown has been coloured according to secondary structure
assignments (red for helix, blue for strands).
The Architecture level in CATH groups proteins whose folds have similar 3D arrangements of secondary structures (e.g. barrel, sandwich or propellor), regardless of their connectivity, whilst the
top level, Class, simply reflects the proportion of α-helix or β-strand secondary structures. Three
major classes are recognised, mainly-α, mainly-β and α−β, since analysis revealed considerable overlap between the α+β and alternating α/β classes originally described by Levitt and Chothia (7).
Before classification, multidomain proteins are first separated into their constituent folds using a
consensus method which seeks agreement between three independent algorithms (8). Whilst the protocol for updating CATH is largely automatic (9), several stages require manual validation, in
particular establishing domain boundaries in proteins for which no consensus could be reached, and checking the relationships of very distant homologues and proteins having borderline fold similarity.
Although there are plans to assign the more regular architectures automatically, all architecture
groupings are currently assigned manually.
A homologous family Dictionary is now available within CATH which contains functional data, where available, for each protein within a homologous family. This includes EC identifiers, SWISS-PROT keywords and information from the Enzyme database or the literature (Fig. 2). Multiple
structure-based alignments are also available, coloured according to secondary structure assignments
or residue properties, and there are schematic plots showing domain representations annotated by protein-ligand interactions (DOMPLOTS) (A.E. Todd, C.A. Orengo and J.M. Thornton, submitted
to Protein Engng). The topology of each domain is illustrated by schematic TOPS diagrams
(http://www3.ebi.ac.uk/tops) (10).
Figure 3
CATH wheel plot showing the population of homologous families in different fold groups, architectures and classes. The wheel is coloured according to protein class (red, mainly-α; green,
mainly-β; yellow, α−β; blue, few secondary structures). The size of the outer wheel represents the
number of homologous families in CATH, whilst each band in the outer wheel corresponds to a single
fold family. The size of each fold 'band' therefore reflects the number of homologous families having
that fold. It can be seen that most fold families contain a single homologous family. The superfold families are shown as paler bands containing many homologous families. The inner wheel shows the
population of homologous families in the different architectures.
We have also recently set up a Web Server (11) which enables the user to scan the CATH database with a newly determined protein structure and identify possible fold similarities or evolutionary relationships. There are also plans to incorporate sequence searches (using BLAST or PSI-BLAST)
(12) to identify a probable fold for a new sequence.
The latest release of CATH (version 1.4, April 1998) contains 9342 protein chains from the PDB (13),
which divide into 13 359 domain folds. Currently 32 different architectures are recognised. Since the
last release, three new architectures have been described, including the five-bladed α-β propellor. Grouping proteins on the basis of sequence, structure and functional similarity gives 827 evolutionary
homologous families (H-level), whilst recognising more distant structural similarity with no
accompanying sequence or function similarity gives rise to 593 different fold groups (T-level).
The population of the different levels in the CATH hierarchy is illustrated by the CATH wheel shown
in Figure 3. It can be seen that several highly populated fold families, which we describe as superfolds (6) as they support a diverse range of sequences and more than three different functions, still account
for nearly 30% of non-homologous structures.
Implications for Structural Genomics
As the sequence databases grow rapidly, the need to interpret these sequences and assign functions to
specific genes becomes increasingly important. Many techniques exist for matching protein sequences and thereby inheriting functional information. However, for very distant homologues there is often no
detectable sequence similarity despite conservation of 3D structure and function. For these cases,
evolutionary relationships, and thereby functions, can only be assigned by comparing the structures.
Therefore a number of structural genomics initiatives are being proposed (14) which aim to identify
all the folds in nature, with the ultimate goal of being able to predict the function of a new protein from
its known or probable structure. The important questions to ask are: how many more folds do we need
to determine before we have the complete set, and how confident can we be in assigning function between proteins having similar structures?
In the current genomes, on average only 30–46% of sequences can be assigned to a structural family
by recognising sequence similarity to a protein of known structure (15,16). With only ~600 unique
structures currently in the PDB, compared with ~20 000 sequence families, it is clear that we still
need to determine many more structures if we are to understand biology at the molecular level.
However, analysis of recently deposited structural data is very revealing. Figure 4a illustrates the
distribution of 2159 new structural domains classified in the 10 months from June 1997 to March
1998. A large proportion of these (79%) were clearly homologous (≥30% identity) to proteins of known structure.
Of the remaining 443 structures (Fig. 4b), corresponding to 'new' sequences, we found only 8% were
novel folds, the remainder resembling a previously determined structure. Many of these, 199 (45%),
could be identified as clear homologues by having significant structure and sequence similarity (SSAP
≥80 and ≥20% sequence identity). A further 169 (38%) were probable homologues: although the sequence identity was below 20%, they had functional similarity and/or gave significant scores using
sequence search methods designed to detect very distant homologues (PSI-BLAST) (12). There
remained a further 40 (9%) proteins which were analogous, i.e. they had the same fold as a previous
entry but neither the sequence nor the function gave definite evidence of a common ancestor.
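The counts and percentages in this breakdown can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers quoted in the text; the variable names are illustrative, and the novel-fold count is not stated in the source but derived here as the remainder.

```python
# Figures quoted in the text for June 1997 to March 1998.
new_domains = 2159            # all newly classified structural domains
new_sequence_domains = 443    # those with no clear sequence relative
clear_homologues = 199
probable_homologues = 169
analogues = 40

# Derived, not quoted: whatever is left over must be the novel folds.
novel_folds = new_sequence_domains - (
    clear_homologues + probable_homologues + analogues
)


def percent(part: int, whole: int) -> int:
    """Percentage of `whole` rounded to the nearest integer."""
    return round(100 * part / whole)


print("clearly homologous by sequence:",
      percent(new_domains - new_sequence_domains, new_domains), "%")
for label, count in [
    ("clear homologues", clear_homologues),
    ("probable homologues", probable_homologues),
    ("analogues", analogues),
    ("novel folds", novel_folds),
]:
    print(f"{label}: {count} ({percent(count, new_sequence_domains)}%)")

# Clear plus probable homologues among the 'new' sequences.
print("homologues recognised via structure:",
      percent(clear_homologues + probable_homologues, new_sequence_domains), "%")
```

Under these assumptions the remainder is 35 domains, which rounds to the 8% of novel folds quoted above, and the combined homologue share matches the 83% used later in the text.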
At the homologous superfamily level in CATH, a more detailed analysis of enzyme functions showed
that the majority of homologous enzyme families in CATH (>90%) contained proteins for which the
first three EC identifiers were the same. Considering those families where homologues have
significant sequence identity (≥20%) after structural alignment, 95% were found to have a single EC
identifier, whilst for families where proteins have more than 30% sequence similarity we observed
that 98% had a single EC code.
Although assigning function on the basis of homology is common practice, it is clear that some
caution should be exercised, particularly where there is little or no sequence similarity. There are also
some clear examples where homologues with significant sequence similarity perform different
functions. The role of 'gene recruitment' is especially clear in the eye lens proteins, which function as
enzymes in other cellular environments but which are used as structural proteins in this context (17).
The extent of such 'gene recruitment' and context-sensitive function is really not known at this time.
For enzymes it is clear that catalytic function can change and evolve, usually to act on a different but
related substrate. Similarly, within the lipocalin family (CATH id 2.40.130.10) several proteins are found with very similar structures which bind different fatty acids in the same region at the base of
the β-barrel (e.g. retinol, bilin, biotin).
Nearly half of the homologous families where two or more different EC numbers were observed
belong to the superfolds. This suggests that if a new protein is assigned to a superfold family, more
caution should be used when inheriting functional information, as there appears to be greater tolerance to changes in sequence, and ultimately function, for these families. However, it is interesting to note
that many of these were TIM barrel or Rossmann folds. These are superfolds in which the substrate or
ligand commonly binds in the same place: in the base of the β-barrel for the TIMs and at the
crossover of the polypeptide chain for the doubly wound Rossmann structures.
Assignment of Function Through Structure
One of the reasons for determining structures is to derive more information to facilitate the assignment of function. From our analysis of proteins in CATH, we suggest that structural data can help to assign
function in several ways:
i. The structural data allow recognition of more distant homologues compared with sequence data. In our analysis, 83% of structures with novel sequences could be assigned as homologues in this way (note that such assignment of function is again subject to the caveats
imposed by 'gene recruitment', discussed above).
ii. The structural data allow detailed inspection of the functional site, to suggest if and
how the function may have evolved. For example, if an enzyme has evolved to act on a
different substrate, the binding site may reveal, or at least suggest, possible changes in the
substrate.
iii. For the superfolds, similarity of structure does not necessarily mean similarity of
function. However, the active site/binding sites are often conserved, e.g. in the TIM barrel or
Rossmann fold structures the ligand always binds at the same end of the barrel or sheet.
iv. Some methods which aim to predict function ab initio from structure have already been developed and will increasingly be the focus of attention over the next few years. For
example, enzymes can often be identified by the presence of a major cleft, which also locates
the active site (18). Similarly, critical surface patches which are used for molecular
recognition in binding other proteins or ligands may be identified using knowledge-based approaches (19,20).
In summary, extrapolating the data from Figure 4 to a new genome, we can expect that, of the 54–70%
of sequences which currently have no obvious sequence matches in the PDB, we will find nearly 80–90% to be homologous to a known family using the structural data alone. For the singlet folds this
will almost certainly reveal some clues to the function. For the superfolds, some folds will reveal
information on the functional class (e.g. enzyme for TIM barrels) or the location of the active site, if
not the specific function. Only 10–20% will be expected to be 'novel' folds. For these, the ab
initio methods referred to above may provide some clues to guide experiments. Therefore it is clear
that determining structures, as part of a 'structural genomics' initiative for example, will make a
major contribution to interpreting genome data.
Jail
Just another interface library
Interfaces of macromolecules are a valuable basis to analyse the process of molecular recognition. JAIL classifies not only the interfaces between domain architectures but also those between protein chains and those between proteins and nucleic acids.
Of course not
Gt-alphaGi-alpha chimera (PDB-ID 1GOT) Interfaces of 1GOT
Interacting proteins are difficult to crystallize and are rarely present within the Protein Data Bank.
Nevertheless, it is essential to analyse the interacting parts of the proteins to understand the
process of protein-protein docking.
To overcome this problem we have built up the JAIL database. Since interacting domains
exhibit structural features similar to those of interacting proteins, all known interfaces between interacting
domains of the SCOP database were extracted and classified in JAIL.
Only a part of all protein structures is included in SCOP. In particular, new PDB entries are
not yet annotated. To overcome this problem, all interfaces between protein
chains were additionally calculated and included in the database. This type of interface also comprises
the interacting parts of the assumed biological units. The last important type of interface
provided here is composed of the interacting parts between proteins and nucleic acids.
Overall, the data set consists of about 180 000 interfaces.
JAIL is a convenient tool for browsing through the interface library and analysing single
interfaces. However, more general questions require large-scale analysis. For this purpose,
a detailed form enables the compiling of comprehensive non-redundant data sets for
download.
How is an interface defined?
A complete residue is part of an interface if at least one atom of the amino acid is located
within 4.5 Ångström of any atom of the interacting domain or chain. One part of
an interface must consist of at least 5 C-alpha atoms in the case of protein chains. In the case of nucleic acids, the relevant atoms are calculated on the basis of the phosphorus atoms
of the RNA/DNA backbone.
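As a rough illustration of the distance criterion described above, the following Python sketch marks a residue as part of the interface when any of its atoms lies within the cutoff (taken here as 4.5 Å) of any atom of the partner, and then checks the minimum size of five residues. The `Atom` record, the function names and the brute-force search are assumptions made for illustration; they are not taken from the JAIL implementation, and the nucleic-acid case is not covered.

```python
import math
from collections import namedtuple

# Hypothetical minimal atom record: chain ID, residue number, atom name, coordinates in Å.
Atom = namedtuple("Atom", "chain resnum name x y z")


def interface_residues(atoms_a, atoms_b, cutoff=4.5):
    """Return sorted residue numbers of side A that have at least one atom
    within `cutoff` Å of any atom of side B (brute-force search)."""
    found = set()
    for a in atoms_a:
        if a.resnum in found:
            continue  # residue already counted; the whole residue is included
        for b in atoms_b:
            if math.dist((a.x, a.y, a.z), (b.x, b.y, b.z)) <= cutoff:
                found.add(a.resnum)
                break
    return sorted(found)


def is_valid_interface_side(residues, min_residues=5):
    """One side of a protein interface needs at least `min_residues` residues
    (one C-alpha atom per residue)."""
    return len(residues) >= min_residues
```

For real structures a spatial index (for example a k-d tree) would avoid the quadratic loop; the nested loop is kept here only because it mirrors the definition directly.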
What are biounits?
The primary coordinate file deposited in the PDB generally contains one asymmetric unit.
The asymmetric unit is the smallest portion of a crystal structure to which crystallographic
symmetry can be applied to generate one unit cell. The biological molecule (biounit) is believed
to be the functional unit of the protein. Frequently these units can be assumed or calculated when additional information is available. The biological units of many proteins are deposited
in a separate section of the PDB database and can be used for interface calculations.
In which way are redundant interfaces excluded in the download section?
Redundancy is excluded in two different ways: by structure and by sequence. The
sequence clustering is based on the CD-HIT program. The structural clustering is defined by
the protein families and superfamilies of the SCOP classification. The database classifies
proteins by domain architecture.
Which settings in the download section are best for my own research?
The selection of the data sets depends on the type of interactions (protein-protein or
protein-nucleic acid) and the level of diversity that is desired. A maximum sequence identity of 50%
results in higher diversity than a setting of 95%. The default settings include interfaces of
domain-domain interactions as well as interfaces between interacting chains. All chain
interfaces that are already covered by the SCOP domain interfaces are excluded by default.
This procedure results in a high number of interfaces that are still diverse enough for
statistical analysis.
What is meant by "show conservation" in Jmol?
The conservation of protein sequences is defined by the mutation rates at each amino acid
position. For JAIL this information was retrieved from ConSurf. ConSurf is a derived
database merging structural and sequence information.
Database scheme
Search for:
PDB-Id, e.g. 1aay
SCOP-Id, e.g. d1az0a_
EC-Number, e.g. 3.1.1
Accession number, e.g. P03697
Protein name, e.g. capsid protein
Search in the following interface types:
Domain/Domain (SCOP)
Chain/Chain
Protein/Nucleic
Biounit/Biounit
None
Full-text search: keyword
SCOP text search: keyword
Consider only interfaces having the following location: intra, inter, don't care
MMDB
Experimentally resolved structures of proteins, RNA and DNA, derived from the Protein Data Bank (PDB), with value-added features such as explicit chemical graphs, computationally
identified 3D domains (compact substructures) that are used to identify similar 3D structures, as well as links to literature, similar sequences, information about chemicals bound to the structures, and more. These connections make it possible, for example, to find 3D structures for homologs of a protein sequence of interest, then interactively view the sequence-structure relationships, active sites, bound chemicals, journal articles and more.
Three-dimensional structures are now known within many protein families, and it is
quite likely, in searching a sequence database, that one will encounter a homolog
with known structure. The goal of Entrez's 3D-structure database is to make this
information, and the functional annotation it can provide, easily accessible to molecular biologists. To this end, Entrez's search engine provides three powerful
features: (i) Sequence and structure neighbors: one may select all sequences similar
to one of interest, for example, and link to any known 3D structures. (ii) Links
between databases: one may search by term matching in MEDLINE, for example, and
link to 3D structures reported in these articles. (iii) Sequence and structure
visualization: identifying a homolog with known structure, one may view molecular-graphic
and alignment displays to infer approximate 3D structure. In this article we
focus on two features of Entrez's Molecular Modeling Database (MMDB) not
described previously: links from individual biopolymer chains within 3D structures to a systematic taxonomy of organisms represented in molecular databases, and
links from individual chains (and compact 3D domains within them) to structure
neighbors, that is, other chains (and 3D domains) with similar 3D structure. MMDB may be
accessed at http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Structure
SUPERFAMILY is a database of structural and functional annotation for all
proteins and genomes.[1][2][3][4][5]
The SUPERFAMILY annotation is based on a collection of hidden Markov models which
represent structural protein domains at the SCOP superfamily level.[6] A superfamily groups
together domains which have an evolutionary relationship. The annotation is produced by
scanning protein sequences from completely sequenced genomes against the hidden Markov
models.
For each protein you can:
Submit sequences for SCOP classification
View domain organisation, sequence alignments and protein sequence details
For each genome you can:
Examine superfamily assignments, phylogenetic trees, domain organisation lists and networks
Check for over- and under-represented superfamilies within a genome
For each superfamily you can:
Inspect SCOP classification, functional annotation, Gene Ontology annotation, InterPro abstract and genome assignments
Explore taxonomic distribution of a superfamily across the tree of life
All annotation, models and the database dump are freely available for download to everyone.
Purpose
SUPERFAMILY classifies amino acid sequences into known structural domains, especially
into SCOP superfamilies. The superfamilies are groups of proteins which have structural
evidence to support a common evolutionary ancestor but may not have detectable
sequence homology.
Major Features
Sequence search: Submit your protein or DNA sequence for SCOP superfamily and family level classification.
Keyword search: Search for superfamily, family or species names, plus sequence, SCOP, PDB or hidden Markov model IDs.
Domain assignments: Domain assignments, alignments and architectures for completely sequenced eukaryotic and prokaryotic organisms, plus sequence collections.
Comparative genomics tools: Browse unusual (over- and under-represented) superfamilies and families, adjacent domain pair lists and graphs, unique domain pairs, domain combinations, domain architecture co-occurrence networks and domain distribution across taxonomic kingdoms for each organism.
Genome statistics: For each genome: number of sequences, number of sequences with assignment, percentage of sequences with assignment, percentage total sequence coverage, number of domains assigned, number of superfamilies assigned, number of families assigned, average superfamily size, percentage produced by duplication, average sequence length, average length matched, number of domain pairs and number of unique domain architectures.
Gene Ontology: Domain-centric Gene Ontology (GO), automatically annotated, by Hai Fang.
Phenotype Ontology: Domain-centric phenotype/anatomy ontology, including Disease Ontology, Human Phenotype, Mouse Phenotype, Worm Phenotype, Yeast Phenotype, Fly Phenotype, Fly Anatomy, Zebrafish Anatomy, Xenopus Anatomy and Arabidopsis Plant.
Superfamily annotation: InterPro abstracts for 1052 superfamilies and Gene Ontology (GO) annotation for 763 superfamilies.
Functional annotation: Functional annotation of SCOP 1.73 superfamilies, by Christine Vogel.
Phylogenetic trees: Trees are generated using heuristic parsimony methods and are based on protein domain architecture data for all genomes in SUPERFAMILY. Genome combinations or specific clades can be displayed as individual trees.
Similar domain architectures: Find the 10 domain architectures which are most similar to a domain architecture of interest.
Hidden Markov models: Produce SCOP domain assignments for your sequences using the SUPERFAMILY models. HMM visualisation by Martin Madera, e.g. model 0045110.
Profile comparison: Find remote domain matches when the HMM search fails to find a significant match. Profile comparison (PRC), for aligning and scoring two profile hidden Markov models, by Martin Madera.
Web services: Distributed Annotation Server and linking to SUPERFAMILY.
Downloads: Sequences, assignments, models, MySQL database and scripts, updated weekly.
The SUPERFAMILY database provides protein domain assignments, at the SCOP superfamily
level, for the predicted protein sequences in over 400 completed genomes. A superfamily groups
together domains of different families which have a common evolutionary ancestor based on
structural, functional and sequence data. SUPERFAMILY domain assignments are generated
using an expert curated set of profile hidden Markov models. All models and structural
assignments are available for browsing and download from http://supfam.org. The web interface
includes services such as domain architectures and alignment details for all protein assignments,
searchable domain combinations, domain occurrence network visualization, detection of over- or
under-represented superfamilies for a given genome by comparison with other genomes,
assignment of manually submitted sequences and keyword searches. In this update we describe
the SUPERFAMILY database and outline two major developments: (i) incorporation of family
level assignments and (ii) a superfamily-level functional annotation. The SUPERFAMILY
database can be used for general protein evolution and superfamily-specific studies, genomic
annotation, and structural genomics target suggestion and assessment.
Table 1. CATH version 3.0 statistics
DOMAIN BOUNDARY ASSIGNMENTS
We have further improved our automated domain boundary prediction method, CATHEDRAL (4). This is used to search a newly determined multi-domain structure against a library of representative structures from different fold groups in the CATH database to recognise constituent domains. CATHEDRAL performs an initial rapid
secondary structure comparison between structures, using graph theory, to identify putative fold matches, which are then more carefully aligned using a slower, more accurate dynamic programming method. A new scoring scheme has been implemented which combines information on the class of the domains being compared, their sizes, similarity in the structural environments and the number of equivalent residues. A support vector machine is used to combine the different scores and select the best fold match for each putative domain in a new multi-domain protein.
Benchmarking against a set of 964 'difficult' multi-domain chains, whose 1593 constituent domains were remotely related to folds in CATH (<35% sequence identity) and originated from 245 distinct fold groups and 462 superfamilies, showed that 90% of domains within these chains could be assigned to the correct fold group, and for 78% of them the domain boundaries were within ±15 residues of boundaries assigned by careful manual validation. Larger variations in domain boundaries are often due to the fact that in many families significant structural variation can occur during evolution, so that distant relatives vary considerably in size. If no close relative has been classified in CATH, it is likely that the only CATHEDRAL match will be to a relative with significant structural embellishments, thus making it harder to determine the correct boundaries.
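The boundary agreement criterion used in the benchmark (predicted boundaries within ±15 residues of the manually validated ones) can be sketched as a simple comparison. The function name, the representation of boundaries as lists of residue positions, and the pairing of boundaries in sorted order are assumptions for illustration; this is not the published benchmarking code.

```python
def boundaries_agree(predicted, reference, tolerance=15):
    """Return True if every predicted domain boundary lies within `tolerance`
    residues of the corresponding reference boundary.

    Boundaries are given as residue positions; the two lists are paired in
    sorted order, and a differing number of boundaries counts as disagreement.
    """
    if len(predicted) != len(reference):
        return False
    return all(
        abs(p - r) <= tolerance
        for p, r in zip(sorted(predicted), sorted(reference))
    )
```

For example, a chain predicted to be cut at residues 120 and 250 agrees with a manual assignment at 110 and 262, whereas a single predicted cut at 120 does not agree with a manual cut at 140.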
Since domain boundary assignment of remote homologues is one of the most time-consuming stages in the classification, we combine multiple sources of information for each new structure on a web page to guide manual curation. Pages display scores from a range of algorithms, which include structure-based methods (CATHEDRAL (4), SSAP (6,7), DETECTIVE (8), PUU (9), DOMAK (10)), sequence-based methods such as hidden Markov Models (HMMs) (11), and relevant literature. These pages are now viewable for information on putative boundaries for new multi-domain structures currently
being classified in CATH (e.g. http://www.cathdb.info/cgi-bin/cath/Chain.pl?chain_id=2g3aA).
For protein chains which are closely related to chains that are already chopped in CATH, an automated protocol has been developed (ChopClose). ChopClose identifies
any previously chopped chains that have sufficiently high sequence identity and overlap with the query chain. Using SSAP (6), the query is aligned against each of these chains in turn, and in each case the domain boundaries are inherited across the alignment. The process of inheriting the boundaries often requires some adjustments to be made to account for insertions, deletions or unresolved residues. If the inheritance from one of the chains meets various criteria (SSAP score ≥80, sequence identity >80%, RMSD ≤6.0 Å, longest end extension ≤10 residues, etc.) then the resulting boundaries are used to chop the chain automatically (AutoChop). For cases where ChopClose's best result does not meet all the criteria for automatic chopping, it is provided as support information for a manual domain boundary assignment.
Refer to Figure 3 for the location of AutoChop/ChopClose within the CATH update protocol.
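The decision between automatic chopping and manual review can be illustrated with a small Python sketch. Only the four thresholds quoted in the text are checked; the text says "etc.", so the real protocol applies further criteria that are not modelled here. The `InheritedChop` record and the function name are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class InheritedChop:
    """Hypothetical summary of boundaries inherited from one chopped chain."""
    ssap_score: float            # SSAP structural similarity score (0-100)
    sequence_identity: float     # percent identity to the chopped chain
    rmsd: float                  # RMSD of the alignment in Å
    longest_end_extension: int   # residues extending beyond the alignment


def passes_autochop(chop: InheritedChop) -> bool:
    """Check the four criteria quoted in the text; the actual protocol
    includes additional checks that are not listed there."""
    return (
        chop.ssap_score >= 80
        and chop.sequence_identity > 80
        and chop.rmsd <= 6.0
        and chop.longest_end_extension <= 10
    )
```

A result that fails any one of these checks would, following the text, be passed on as supporting information for manual boundary assignment rather than discarded.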
NEW HOMOLOGUE RECOGNITION METHODS
We have assessed a number of HMM-based protocols for improving homologue recognition. A new protocol (Samosa), exploiting models built using multiple structure alignments to improve accuracy, gives some improvements in sensitivity (4–5%). However, a protocol exploiting an 8-fold expanded HMM library based on sequence relatives of structural domains gives an increase of nearly 10% in sensitivity (12). In addition, HMM–HMM-based approaches have been implemented using the PRC protocol of Madera and co-workers (http://supfam.org/PRC). These allow recognition of extremely remote homologues, some of which are not easily detected by the structure comparison methods (discussed further below). HMM-based database scans developed for the CATH classification protocol are collectively referred to as HMMscan below.
For some very remotely related homologues, confidence in an assignment can be improved by combining information from multiple prediction methods. We have
investigated the benefits of using machine learning methods to do this automatically. A neural network was trained using a dataset of 14 000 diverse homologues (<35% sequence identity) and 14 000 non-homologous pairs, with data from different homologue comparison methods, including structure comparison (CATHEDRAL, SSAP), sequence comparison (HMM–HMM) and information on functional similarity. The latter was obtained by comparing EC classification codes between close relatives of the distant homologues and using a semantic similarity scoring
scheme for comparing GO terms, based on a method developed by Lord et al. (13). On a separate validation set of 14 000 homologous pairs and 14 000 non-homologous pairs, 97% of the homologues can be recognised at an error rate of <4%.
NEW UPDATE PROTOCOL
Automatic methods
Previously, CATH data was generated using a group of independent
programs and flat files. Over the past two years we have developed an
update protocol for CATH that is driven by a suite of programs with a
central library and a PostgreSQL database system. A classification
pipeline has been established which links, in a completely automated
fashion, the different programs that analyse the sequences and structures of both protein chains and domains. The CATH update protocol can
essentially be divided into two parts: domain boundary assignment and
domain homology classification (see Figure 3). The aim of the protocol is
to minimise manual assignment and provide as much support as
possible when manual validation is necessary.
Processing in both parts of the classification protocol is similar, requiring related meta-data and the triggering of the same automated algorithms. Methods include pairwise sequence similarity comparisons and scans by other homologue detection or
fold recognition algorithms, such as HMMscan and CATHEDRAL, that provide data for either manual or automated assignment. Many of the automated steps in the protocol have been established as a web service, and the pipeline integrates automated steps together with 'holding stages' in which domains are held prior to processing and await the completion of manual validation of predictions (see below).
Web pages to support manual validation
For each manual stage in the classification (domain boundary assignment, DomChop, and homologue recognition, HomCheck), we have developed a suite of web pages bringing together all available meta-data from prediction algorithms (e.g. DBS,
CATHEDRAL and HMMscan for DomChop; CATHEDRAL and HMMscan for HomCheck), information from the literature and from other family classifications with relevant data (e.g. Pfam). For each protein or domain shown on the pages, information on the statistical significance of matches is presented. The web pages will shortly be made viewable and will provide interim data on protein chains and domains not fully classified in CATH, for biologists interested in any entries pending classification.
OVERVIEW OF THE CURRENT RELEASE (VERSION 3.0)
Assigning domain boundaries and relationships between protein structures is
computationally challenging. Since the last CATH release (version 2.6), the number of domains in the CATH database has increased by 20% in version 3.0 and now totals 86 151. This is a more than 10-fold increase in the number of domains classified in CATH since its creation. Improvements in automation and also in the web-based resources used to aid manual validation have allowed us to increase the proportion of hard-to-classify structures processed in CATH, and this is reflected in a significant increase in the proportion of new folds in the database (now more than 1000). The detailed breakdown of numbers of domains in the nine CATH levels is given in Table 1. We conducted an analysis of the domains in version 3.0 and have derived statistics for several fundamental features.
Percentage of new topologies
An analysis of the percentage of new folds arising from the early 1970s to the present is shown in Figure 1. The number of new folds has been decreasing over time with respect to the number of new structures being deposited, and it can be seen that currently approximately 2% of new structures classified in CATH are observed to be novel folds. For comparison, the number of domain structures solved over time is also graphically represented in Figure 1.
Number of domains within a protein chain
Integral to the construction of the CATH database is designating domain boundaries. We conducted an analysis of the number of chains versus number of domains in a chain. It is interesting to note that 64% of all protein structures currently solved and classified in CATH are single-domain chains (data not shown). The next most prevalent are two-domain chains (27%), and following this we find that the number of chains containing three or more domains rapidly decreases. The average size of the single-domain chains is 159 residues.
The CATH Dictionary of Homologous Superfamilies (CATH-DHS)
The CATH-DHS has also been recently updated. Data on structural similarity and superfamily variability is presented as a significant update to the Dictionary of Homologous Superfamilies (DHS) web resource (14). The DHS also provides functional annotations of domains within each H-level (superfamily) in CATH v2.5.1.
For each superfamily, pair-wise structural similarity scores between relatives, measured by SSAP, are presented. The DHS now contains 3307 multiple structural alignments for 1459 superfamilies. For each superfamily, multiple alignments are generated for all the relatives and also for subgroups of structurally similar relatives and sequence-similar relatives. Alignments are performed using the residue-based CORA algorithm (15) and presented both as CORAPLOTS (14) and in the form of a 2DSEC diagram (16), alongside co-ordinate data of the superposed structures in PDB format. Sequence representations of the alignments are available to download in FASTA format. In the CORAPLOT images of the multiple alignment, residues in each domain are coloured according to ligand binding and residue type. EquivSEC plots are also shown that describe the variability in orientation and packing between equivalent secondary structures (16).
To identify sequence relatives for CATH superfamilies, sequences from UniProt (17) were scanned against HMMs of all CATH domains (12). Homologous sequences were
identified as those hits with an E-value < 0.01 and a 60% residue overlap with the CATH domain. This protocol recognised over one million domain sequences in UniProt which could be integrated in the CATH-DHS. The harvested sequences in each superfamily were compared against other relatives by BLAST (18) to determine the pair-wise sequence identity and then clustered at appropriate levels of sequence identity (35% and 95%) using multi-linkage clustering. Information and links to other functional databases, ENZYME (19), GO (Gene Ontology Consortium 2000), KEGG (20), COG (21) and SWISSPROT (22), are also included by BLASTing the sequences from each superfamily against sequences provided by these resources. Only hits with 95% sequence identity and an 80% residue overlap were used to annotate sequences.
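The acceptance rule for HMM hits described above (an E-value threshold plus a minimum residue overlap with the CATH domain) can be written as a short filter. This is a sketch under stated assumptions: the `HmmHit` record is hypothetical, overlap is taken to mean the fraction of the CATH domain's residues covered by the alignment, and the default E-value threshold follows the figure as it reads in the text above.

```python
from typing import NamedTuple


class HmmHit(NamedTuple):
    """Hypothetical record for one HMM scan hit against a CATH domain model."""
    sequence_id: str
    e_value: float
    aligned_residues: int  # domain positions covered by the alignment
    domain_length: int     # length of the CATH domain behind the model


def is_homologue(hit: HmmHit, max_e_value: float = 0.01,
                 min_overlap: float = 0.60) -> bool:
    """Accept a hit only if it is significant and covers enough of the domain."""
    overlap = hit.aligned_residues / hit.domain_length
    return hit.e_value < max_e_value and overlap >= min_overlap


def filter_homologues(hits):
    """Keep the IDs of sequences whose hits pass both thresholds."""
    return [h.sequence_id for h in hits if is_homologue(h)]
```

The same pattern, with thresholds of 95% identity and 80% overlap, would apply to the BLAST-based annotation transfer mentioned at the end of the paragraph.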
Recent analysis of structural and functional divergence in highly populated CATH superfamilies (>5 structural relatives with <35% sequence identity) has been undertaken using data from the DHS. The 2DSEC algorithm was used to analyse multiple structural alignments of families and identify highly conserved structural cores and secondary structure embellishments, or decorations, to the common core. In some large superfamilies extensive embellishments were observed outside the core, and although these secondary structure insertions were frequently discontinuous in the protein chain, they were often co-located in 3D space (16). In many cases manual inspection revealed that the embellishments had aggregated to form a larger structural feature that was modifying the active site of the domain or creating new surfaces for domain or protein interactions. Data collected in the DHS clearly show a relationship between structural divergence within a superfamily, sequence divergence of this superfamily amongst predicted domains in the genomes, and the number of distinct functional groups that can be identified for the superfamily (see Figure 4).
Figure 4. Relationship between sequence variability, structural variability and functional diversity in CATH superfamilies. Structural variation in a CATH superfamily, as measured by the number of diverse structural subgroups (SSAP score <80 between groups)
LATERAL LINKS ACROSS THE CATH HIERARCHY TO CAPTURE EVOLUTIONARY DIVERGENCE AND EXPLORE THE STRUCTURAL CONTINUUM
Our analysis of structural divergence in CATH superfamilies (16) has revealed families where significant changes in the structures had occurred: in some cases 5-fold differences in the sizes of domains were identified, and sometimes it was apparent that the 'folds' of these very diverse relatives had effectively changed. Therefore, in these superfamilies more than one fold group can be identified, effectively breaking the hierarchical nature of the CATH classification, which implies that each relative within a CATH homologous superfamily should belong to the same CAT fold group.
In addition, an 'all versus all' HMM–HMM scan between all superfamily
representatives revealed several cases of extremely remote homologues which had been classified into separate superfamilies and yet match with significant E-values. Structure comparison had failed to detect the relationship between these superfamilies because the structural divergence of the relatives was so extreme, sometimes constituting a change in architecture as well as fold group. In these cases homology was only suggested by the HMM-based scans and then manually validated by considering functional information and detailed evidence from the literature. In order to capture information on these distant homologies, links have been created between the superfamilies both on our web pages and in the CATH database. The data can
now be found as a link from the CATH homepage (http://www.cathdb.info).
In the near future we also plan to provide web pages presenting cases of significant structural overlaps between superfamilies or fold groups. For these cases we are not currently able to find any additional evidence to support a distant evolutionary relationship, and these examples highlight the recurrence of large structural motifs between some folds and the existence of a structural continuum in some regions of fold space.
FEATURES
The CATH database can be accessed at http://www.cathdb.info. The web interface
may be browsed or alternatively searched with PDB codes or CATH domain identifiers. There is also a facility for keyword searches. With the version 3.0 release we now make the raw and processed data files available, which include, for example, CATH domain PDB files, sequences and DSSP files; they can be accessed through the CATH database main page. The Gene3D resource can be accessed through the CATH database or directly at http://www.cathdb.info/Gene3D. The DHS can be accessed through the CATH database or directly at http://www.cathdb.info/bsm/dhs.
Structural similarity is assessed using an automatic method (SSAP) (3,4) which scores 100 for
identical proteins and generally returns scores above 80 for homologous proteins. More distantly
related folds generally give scores above 70 (Topology or fold level), though in the absence of any
sequence or functional similarity this may simply represent examples of convergent evolution, reinforcing the hypothesis that there exists a limited number of folds in nature (5,6).
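The SSAP score ranges quoted above can be turned into a small lookup, shown here as a Python sketch. The function name and the returned labels are assumptions for illustration, and the cutoffs are rules of thumb from the text ("generally"), not hard classification boundaries.

```python
def interpret_ssap(score: float) -> str:
    """Map an SSAP score (0-100) to the rough similarity ranges quoted in the text."""
    if not 0 <= score <= 100:
        raise ValueError("SSAP scores are expected to lie between 0 and 100")
    if score == 100:
        return "identical"
    if score > 80:
        return "homologous range"
    if score > 70:
        return "fold (topology) range"
    return "below fold range"
```

As the text cautions, a score in the fold range without supporting sequence or functional similarity may reflect convergence rather than common ancestry, so such a label should be read as a hint and not as evidence of homology.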
Abstract
We report the latest release (version 1.4) of the CATH protein domains database
(http://www.biochem.ucl.ac.uk/bsm/cath). This is a hierarchical classification of 13 359 protein domain structures into evolutionary families and structural groupings. We currently identify 827
homologous families in which the proteins have both structural similarity and sequence and/or
functional similarity. These can be further clustered into 593 fold groups and 32 distinct architectures.
Using our structural classification and associated data on protein functions stored in the database (EC
identifiers, SWISS-PROT keywords and information from the Enzyme database and literature) we have been able to analyse the correlation between the 3D structure and function. More than 96% of
folds in the PDB are associated with a single homologous family. However, within the superfolds,
three or more different functions are observed. Considering enzyme functions, more than 95% of clearly homologous families exhibit either single or closely related functions, as demonstrated by the
EC identifiers of their relatives. Our analysis supports the view that determining structures, for
example as part of a 'structural genomics' initiative, will make a major contribution to interpreting genome data.
Introduction FOR CATH
The CATH classification of protein domain structures was established in 1993 (1) as a
hierarchical clustering of protein domain structures into evolutionary families and structural
groupings, depending on sequence and structure similarity. There are four major levels,
corresponding to protein class, architecture, topology or fold, and homologous family (Fig. 1).
Since 1995, information about these structural groups and protein families has been accessible
over the Web (http://www.biochem.ucl.ac.uk/bsm/cath), together with summary information
about each individual protein structure (PDBsum) (2).
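The four levels map directly onto the four numeric fields of a CATH identifier, which is conventionally written in dotted form (for example 3.40.50.200). The following Python sketch splits such an identifier into its levels; the `CathCode` type and the function names are assumptions for illustration and are not part of any CATH software.

```python
from typing import NamedTuple


class CathCode(NamedTuple):
    """The four major CATH levels as integers."""
    class_: int
    architecture: int
    topology: int
    homologous_family: int

    def fold_group(self) -> str:
        """Identifier down to the topology (fold) level, e.g. '3.40.50'."""
        return f"{self.class_}.{self.architecture}.{self.topology}"


def parse_cath_code(code: str) -> CathCode:
    """Parse a dotted four-field CATH identifier such as '3.40.50.200'."""
    parts = code.strip().split(".")
    if len(parts) != 4 or not all(p.isdigit() for p in parts):
        raise ValueError(f"expected four dot-separated numbers, got {code!r}")
    return CathCode(*(int(p) for p in parts))
```

Two domains share a fold group when their identifiers agree on the first three fields, and belong to the same homologous family only when all four fields match.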
CATH consists of both phylogenetic and phenetic descriptors for protein domain relationships.
At the lowest levels in the hierarchy, proteins are grouped into evolutionary families
(homologous families) for having either significant sequence similarity (≥35% identity) or high
structural similarity and some sequence similarity (≥20% identity).
The CATH database of protein structures contains approximately 18000 domains
organized according to their (C)lass, (A)rchitecture, (T)opology and (H)omologous
superfamily. Relationships between evolutionarily related structures (homologues) within the
database have been used to test the sensitivity of various sequence search methods in order to
identify relatives in GenBank and other sequence databases. Subsequent application of the most
sensitive and efficient algorithms, gapped BLAST and the profile-based method Position-Specific
Iterated Basic Local Alignment Search Tool (PSI-BLAST), could be used to assign structural data to
between 22% and 36% of microbial genomes in order to improve functional annotation and
enhance understanding of biological mechanism. However, on a cautionary note, an analysis of
functional conservation within fold groups and homologous superfamilies in the CATH database
revealed that, whilst function was conserved in nearly 55% of enzyme families, function had
diverged considerably in some highly populated families. In these families functional properties
should be inherited far more cautiously, and the probable effects of substitutions in key functional
residues must be carefully assessed.
Figure 1
Schematic representation of the (C)lass, (A)rchitecture and (T)opology/fold levels in the CATH
database.
Figure 2
Snapshot of a web page showing data available in the CATH dictionary of homologous superfamilies
for the subtilisin family (CATH id 3.40.50.200). Tables display the PDB codes for non-identical relatives in the family together with EC identifier codes and information about the enzyme reactions.
The multiple structural alignment shown has been coloured according to secondary structure
assignments (red for helix, blue for strands).
The Architecture level in CATH groups proteins whose folds have similar 3D arrangements of secondary structures (e.g. barrel, sandwich or propeller) regardless of their connectivity, whilst the
top level, Class, simply reflects the proportion of α-helix or β-strand secondary structures. Three
major classes are recognised: mainly-α, mainly-β and α−β, since analysis revealed considerable overlap between the α+β and alternating α/β classes originally described by Levitt and Chothia (7).
Before classification, multidomain proteins are first separated into their constituent folds using a
consensus method which seeks agreement between three independent algorithms (8). Whilst the protocol for updating CATH is largely automatic (9), several stages require manual validation, in
particular establishing domain boundaries in proteins for which no consensus could be reached and checking the relationships of very distant homologues and proteins having borderline fold similarity.
Although there are plans to assign the more regular architectures automatically, all architecture
groupings are currently assigned manually.
A homologous family Dictionary is now available within CATH which contains functional datawhere available for each protein within a homologous family This includes EC identifiers SWISS-PROT keywords and information from the Enzyme database or the literature (Fig 2) Multiple
structure based alignments are also available coloured according to secondary structure assignments
or residue properties and there are schematic plots showing domain representations annotated by protein-ligand interactions (DOMPLOTS) (AETodd CAOrengo and JMThornton submitted
to Protein Engng ) The topology of each domain is illustrated by schematic TOPS diagrams
(httpwww3ebiacuktops 10)
Figure 3. CATH wheel plot showing the population of homologous families in different fold groups, architectures and classes. The wheel is coloured according to protein class (red, mainly-α; green, mainly-β; yellow, αβ; blue, few secondary structures). The size of the outer wheel represents the number of homologous families in CATH, whilst each band in the outer wheel corresponds to a single fold family. The size of each fold 'band' therefore reflects the number of homologous families having that fold. It can be seen that most fold families contain a single homologous family. The superfold families are shown as paler bands containing many homologous families. The inner wheel shows the population of homologous families in the different architectures.
We have also recently set up a Web Server (11) which enables the user to scan the CATH database with a newly determined protein structure and identify possible fold similarities or evolutionary relationships. There are also plans to incorporate sequence searches (using BLAST or PSI-BLAST) (12) to identify a probable fold for a new sequence.
The latest release of CATH (version 1.4, April 1998) contains 9342 protein chains from the PDB (13), which divide into 13 359 domain folds. Currently 32 different architectures are recognised. Since the last release three new architectures have been described, including the five-bladed α-β propellor. Grouping proteins on the basis of sequence, structure and functional similarity gives 827 evolutionary homologous families (H-level), whilst recognising more distant structural similarity, with no accompanying sequence or function similarity, gives rise to 593 different fold groups (T-level).
The population of the different levels in the CATH hierarchy is illustrated by the CATH wheel shown in Figure 3. It can be seen that several highly populated fold families, which we describe as superfolds (6) as they support a diverse range of sequences and more than three different functions, still account for nearly 30% of non-homologous structures.
Implications for Structural Genomics
As the sequence databases grow rapidly, the need to interpret these sequences and assign functions to specific genes becomes increasingly important. Many techniques exist for matching protein sequences and thereby inheriting functional information. However, for very distant homologues there is often no detectable sequence similarity despite conservation of 3D structure and function. For these cases evolutionary relationships, and thereby functions, can only be assigned by comparing the structures. Therefore a number of structural genomics initiatives are being proposed (14) which aim to identify all the folds in nature, with the ultimate goal of being able to predict the function of a new protein from its known or probable structure. The important questions to ask are: how many more folds do we need to determine before we have the complete set, and how confident can we be in assigning function between proteins having similar structures?
In the current genomes, on average only 30-46% of sequences can be assigned to a structural family by recognising sequence similarity to a protein of known structure (15,16). With only ~600 unique structures currently in the PDB, compared with ~20 000 sequence families, it is clear that we still need to determine many more structures if we are to understand biology at the molecular level.
However, analysis of recently deposited structural data is very revealing. Figure 4a illustrates the distribution of 2159 new structural domains classified in the 10 months from June 1997 to March 1998. A large proportion of these (79%) were clearly homologous (≥30% identity) to proteins of known structure.
Of the remaining 443 structures (Fig. 4b), corresponding to 'new' sequences, we found only 8% were novel folds, the remainder resembling a previously determined structure. Many of these, 199 (45%), could be identified as clear homologues by having significant structure and sequence similarity (SSAP ≥80 and ≥20% sequence identity). A further 169 (38%) were probable homologues: although the sequence identity was below 20%, they had functional similarity and/or gave significant scores using sequence search methods designed to detect very distant homologues (PSI-BLAST) (12). There remained a further 40 (9%) proteins which were analogous, i.e. they had the same fold as a previous entry but neither the sequence nor the function gave definite evidence of a common ancestor.
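The percentages quoted for the 443 new-sequence structures can be checked with a few lines of arithmetic. The sketch below uses only the counts given in the text; the count of novel folds is not stated there and is derived here as the remainder, which is an assumption, and the variable names are illustrative.

```python
# Counts quoted in the text for the 443 structures with 'new' sequences.
total = 443
counts = {
    "clear homologues": 199,
    "probable homologues": 169,
    "analogues": 40,
}
# Assumption: novel folds are whatever remains once the three groups are removed.
counts["novel folds"] = total - sum(counts.values())


def percentages(counts, total):
    """Return each count as a whole-number percentage of the total."""
    return {name: round(100 * n / total) for name, n in counts.items()}


shares = percentages(counts, total)

# Clear plus probable homologues: the share recognisable as homologous.
homologue_share = round(
    100 * (counts["clear homologues"] + counts["probable homologues"]) / total
)

for name, pct in shares.items():
    print(f"{name}: {counts[name]} of {total} = {pct}%")
print(f"homologues in total: {homologue_share}%")
```

The combined homologue share is the figure reused later in the text when structural data are said to identify most novel sequences as homologues.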
At the homologous superfamily level in CATH, a more detailed analysis of enzyme functions showed that the majority of homologous enzyme families in CATH (>90%) contained proteins for which the first three EC identifiers were the same. Considering those families where homologues have significant sequence identity (≥20%) after structural alignment, 95% were found to have a single EC identifier, whilst for families where proteins have more than 30% sequence similarity we observed that 98% had a single EC code.
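The comparison of "the first three EC identifiers" can be made concrete with a short sketch. The helper names below are hypothetical and the EC numbers are illustrative inputs only, not data from the analysis; the code simply assumes EC numbers are written as four dot-separated fields.

```python
def ec_prefix(ec_number, levels=3):
    """Return the first `levels` fields of a dotted EC number as a tuple."""
    return tuple(ec_number.split(".")[:levels])


def family_shares_ec(ec_numbers, levels=3):
    """True if all EC numbers in a family agree on their first `levels` fields."""
    return len({ec_prefix(ec, levels) for ec in ec_numbers}) == 1


# Illustrative family: two serine endopeptidases that differ only in the
# fourth field, so they agree at three levels but not at four.
family = ["3.4.21.62", "3.4.21.64"]
print(family_shares_ec(family, levels=3))
print(family_shares_ec(family, levels=4))
```

Counting the families for which `family_shares_ec` holds, and dividing by the number of enzyme families, would give the kind of percentage reported above.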
Although assigning function on the basis of homology is common practice, it is clear that some caution should be exercised, particularly where there is little or no sequence similarity. There are also some clear examples where homologues with significant sequence similarity perform different functions. The role of 'gene recruitment' is especially clear in the eye lens proteins, which function as enzymes in other cellular environments but which are used as structural proteins in this context (17). The extent of such 'gene recruitment' and context-sensitive function is really not known at this time. For enzymes it is clear that catalytic function can change and evolve, usually to act on a different but related substrate. Similarly, within the lipocalin family (CATH id 2.40.130.10) several proteins are found with very similar structures which bind different fatty acids in the same region at the base of the β-barrel (e.g. retinol, bilin, biotin).
Nearly half of the homologous families where two or more different EC numbers were observed belong to the superfolds. This suggests that if a new protein is assigned to a superfold family, more caution should be used when inheriting functional information, as there appears to be greater tolerance to changes in sequence, and ultimately function, for these families. However, it is interesting to note that many of these were TIM barrel or Rossmann folds. These are superfolds in which the substrate or ligand commonly binds in the same place: in the base of the β-barrel for the TIMs and at the crossover of the polypeptide chain for the doubly wound Rossmann structures.
Assignment of Function Through Structure
One of the reasons for determining structures is to derive more information to facilitate the assignment of function. From our analysis of proteins in CATH we suggest that structural data can help to assign function in several ways:
i. The structural data allow recognition of more distant homologues compared with sequence data; in our analysis 83% of structures with novel sequences could be assigned as homologues in this way (note that such assignment of function is again subject to the caveats imposed by 'gene recruitment', discussed above).
ii. The structural data allow detailed inspection of the functional site, to suggest if and how the function may have evolved. For example, if an enzyme has evolved to act on a different substrate, the binding site may reveal, or at least suggest, possible changes in the substrate.
iii. For the superfolds, similarity of structure does not necessarily mean similarity of function. However, the active site/binding sites are often conserved, e.g. in the TIM barrel or Rossmann fold structures the ligand always binds at the same end of the barrel or sheet.
iv. Some methods have already been developed, and will increasingly be the focus of attention over the next few years, which aim to predict function ab initio from structure. For example, enzymes can often be identified by the presence of a major cleft, which also locates the active site (18). Similarly, critical surface patches which are used for molecular recognition in binding other proteins or ligands may be identified using knowledge-based approaches (19,20).
In summary, extrapolating the data from Figure 4 to a new genome, we can expect that of the 54-70% of sequences which currently have no obvious sequence matches in the PDB, we will find nearly 80-90% to be homologous to a known family using the structural data alone. For the singlet folds this will almost certainly reveal some clues to the function. For the superfolds, some folds will reveal information on the functional class (e.g. enzyme for TIM barrels) or the location of the active site, if not the specific function. Only 10-20% will be expected to be 'novel' folds. For these the ab initio methods referred to above may provide some clues to guide experiments. Therefore it is clear that determining structures, as part of a 'structural genomics' initiative for example, will make a major contribution to interpreting genome data.
JAIL
Just Another Interface Library
Interfaces of macromolecules are a valuable basis for analysing the process of molecular recognition. JAIL classifies not only the interfaces between domain architectures but also those between protein chains and those between proteins and nucleic acids.
Figure: Gt-alpha/Gi-alpha chimera (PDB-ID 1GOT) and the interfaces of 1GOT.
Interacting proteins are difficult to crystallize and are rarely present within the Protein Data Bank. Nevertheless, it is essential to analyse the interacting parts of the proteins to understand the process of protein-protein docking.
To overcome this problem we have built up the JAIL database. Since interacting domains exhibit structural features similar to those of interacting proteins, all known interfaces between interacting domains of the SCOP database were extracted and classified in JAIL.
Only a part of all protein structures is included in SCOP. In particular, new PDB entries are not yet annotated. To overcome this problem, all interfaces between protein chains were additionally calculated and included in the database. This type of interface also comprises the interacting parts of the assumed biological units. The last important type of interface provided here is composed of the interacting parts between proteins and nucleic acids. Overall the data set consists of about 180,000 interfaces.
JAIL is a convenient tool for browsing the interface library and analysing single interfaces. However, more general questions require large-scale analysis. For this purpose a detailed form enables the compilation of comprehensive non-redundant data sets for download.
How is an interface defined?
A complete residue is part of an interface if at least one atom of the amino acid is located within a range of 4.5 Ångström of any atom of the interacting domain or chain. One part of an interface must consist of at least 5 C-alpha atoms in the case of protein chains. In the case of nucleic acids the relevant atoms are calculated on the basis of the phosphorus atoms of the RNA/DNA backbone.
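The distance criterion above can be sketched in a few lines. This is an illustration only, not JAIL's implementation: it reads the cutoff as 4.5 Å, represents each chain as a hypothetical dictionary mapping a residue id to a list of atom coordinates, and counts one C-alpha atom per residue.

```python
from math import dist

CUTOFF = 4.5  # Å; assumed reading of the cutoff given in the text
MIN_CA = 5    # minimum number of C-alpha atoms for one part of an interface


def interface_residues(chain_a, chain_b, cutoff=CUTOFF):
    """Return ids of residues in chain_a with any atom within cutoff of chain_b.

    Each chain is a dict: residue id -> list of (x, y, z) atom coordinates.
    """
    atoms_b = [xyz for atoms in chain_b.values() for xyz in atoms]
    return {
        res_id
        for res_id, atoms in chain_a.items()
        if any(dist(a, b) <= cutoff for a in atoms for b in atoms_b)
    }


def is_valid_interface_part(residues, min_ca=MIN_CA):
    """One C-alpha per residue, so the residue count is the C-alpha count."""
    return len(residues) >= min_ca


# Toy example: residue 1 lies 3 Å from chain B, residue 2 lies 7 Å away.
chain_a = {1: [(0.0, 0.0, 0.0)], 2: [(10.0, 0.0, 0.0)]}
chain_b = {1: [(3.0, 0.0, 0.0)]}
print(interface_residues(chain_a, chain_b))
```

The all-against-all distance loop is quadratic in the number of atoms; for real structures a spatial index (for example a k-d tree) would be the usual choice, but the brute-force form keeps the criterion easy to read.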
What are biounits?
The primary coordinate file deposited in the PDB generally contains one asymmetric unit. The asymmetric unit is the smallest portion of a crystal structure to which crystallographic symmetry can be applied to generate one unit cell. The biological molecule (biounit) is believed to be the functional unit of the protein. Frequently those units can be assumed or calculated when additional information is available. The biological units of many proteins are deposited in a separate section of the PDB database and can be used for interface calculations.
In which way are redundant interfaces excluded in the download section?
Redundancy is excluded in two different ways: by structure and by sequence. The sequence clustering is based on the CD-HIT program. The structural clustering is defined by the protein families and superfamilies of the SCOP classification. The database classifies proteins by domain architecture.
Which settings in the download section are best for my own research?
The selection of the datasets depends on the type of interactions (protein-protein or protein-nucleic acid) and the level of diversity that is desired. A maximum sequence identity of 50% results in a higher diversity than a setting of 95%. The default settings include interfaces of domain-domain interactions as well as interfaces between interacting chains. All interfaces of chains that were already treated by the SCOP domain interfaces are excluded by default. This procedure results in a high number of interfaces that are still diverse enough for statistical analysis.
What is meant by 'show conservation' in Jmol?
The conservation of protein sequences is defined by the mutation rates at each amino acid position. For JAIL this information was retrieved from ConSurf. ConSurf is a derived database merging structural and sequence information.
Database scheme
The search form accepts the following identifiers:
PDB-Id, e.g. 1aay
SCOP-Id, e.g. d1az0a_
EC-Number, e.g. 3.1.1
Accession number, e.g. P03697
Protein name, e.g. capsid protein
The search can be restricted to the following interface types: Domain/Domain (SCOP), Chain/Chain, Protein/Nucleic, Biounit/Biounit, or none. Full-text and SCOP text searches by keyword are also available, and results can be limited by interface location (intra, inter, or don't care).
MMDB
Experimentally resolved structures of proteins, RNA and DNA, derived from the Protein Data Bank (PDB), with value-added features such as explicit chemical graphs, computationally identified 3D domains (compact substructures) that are used to identify similar 3D structures, as well as links to literature, similar sequences, information about chemicals bound to the structures, and more. These connections make it possible, for example, to find 3D structures for homologs of a protein sequence of interest, then interactively view the sequence-structure relationships, active sites, bound chemicals, journal articles and more.
Three-dimensional structures are now known within many protein families, and it is quite likely, in searching a sequence database, that one will encounter a homolog with known structure. The goal of Entrez's 3D-structure database is to make this information, and the functional annotation it can provide, easily accessible to molecular biologists. To this end Entrez's search engine provides three powerful features: (i) Sequence and structure neighbors: one may select all sequences similar to one of interest, for example, and link to any known 3D structures. (ii) Links between databases: one may search by term matching in MEDLINE, for example, and link to 3D structures reported in these articles. (iii) Sequence and structure visualization: identifying a homolog with known structure, one may view molecular-graphic and alignment displays to infer approximate 3D structure. In this article we focus on two features of Entrez's Molecular Modeling Database (MMDB) not described previously: links from individual biopolymer chains within 3D structures to a systematic taxonomy of organisms represented in molecular databases, and links from individual chains (and compact 3D domains within them) to structure neighbors, other chains (and 3D domains) with similar 3D structure. MMDB may be accessed at http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Structure
SUPERFAMILY is a database of structural and functional annotation for all proteins and genomes.[1][2][3][4][5]
The SUPERFAMILY annotation is based on a collection of hidden Markov models which represent structural protein domains at the SCOP superfamily level.[6] A superfamily groups together domains which have an evolutionary relationship. The annotation is produced by scanning protein sequences from completely sequenced genomes against the hidden Markov models.
For each protein you can:
Submit sequences for SCOP classification
View domain organisation, sequence alignments and protein sequence details
For each genome you can:
Examine superfamily assignments, phylogenetic trees, domain organisation lists and networks
Check for over- and under-represented superfamilies within a genome
For each superfamily you can:
Inspect SCOP classification, functional annotation, Gene Ontology annotation, InterPro abstract and genome assignments
Explore taxonomic distribution of a superfamily across the tree of life
All annotation, models and the database dump are freely available for download to everyone.
Purpose
SUPERFAMILY classifies amino acid sequences into known structural domains especially
into SCOP superfamilies The superfamilies are groups of proteins which have structural
evidence to support a common evolutionary ancestor but may not have detectable
sequence homology
Major Features
Sequence search: Submit your protein or DNA sequence for SCOP superfamily and family level classification.
Keyword search: Search for superfamily, family or species names, plus sequence, SCOP, PDB or hidden Markov model IDs.
Domain assignments: Domain assignments, alignments and architectures for completely sequenced eukaryotic and prokaryotic organisms, plus sequence collections.
Comparative genomics tools: Browse unusual (over- and under-represented) superfamilies and families, adjacent domain pair lists and graphs, unique domain pairs, domain combinations, domain architecture co-occurrence networks and domain distribution across taxonomic kingdoms for each organism.
Genome statistics: For each genome: number of sequences, number of sequences with assignment, percentage of sequences with assignment, percentage total sequence coverage, number of domains assigned, number of superfamilies assigned, number of families assigned, average superfamily size, percentage produced by duplication, average sequence length, average length matched, number of domain pairs, and number of unique domain architectures.
Gene Ontology: Domain-centric Gene Ontology (GO), automatically annotated by Hai Fang.
Phenotype Ontology: Domain-centric phenotype/anatomy ontology including Disease Ontology, Human Phenotype, Mouse Phenotype, Worm Phenotype, Yeast Phenotype, Fly Phenotype, Fly Anatomy, Zebrafish Anatomy, Xenopus Anatomy, Arabidopsis Plant.
Superfamily annotation: InterPro abstracts for 1052 superfamilies and Gene Ontology (GO) annotation for 763 superfamilies.
Functional annotation: Functional annotation of SCOP 1.73 superfamilies by Christine Vogel.
Phylogenetic trees: Trees are generated using heuristic parsimony methods and are based on protein domain architecture data for all genomes in SUPERFAMILY. Genome combinations or specific clades can be displayed as individual trees.
Similar domain architectures: Find the 10 domain architectures which are most similar to a domain architecture of interest.
Hidden Markov models: Produce SCOP domain assignments for your sequences using the SUPERFAMILY models. HMM visualisation by Martin Madera, e.g. model 0045110.
Profile comparison: Find remote domain matches when the HMM search fails to find a significant match. Profile comparison (PRC) for aligning and scoring two profile hidden Markov models, by Martin Madera.
Web services: Distributed Annotation Server and linking to SUPERFAMILY.
Downloads: Sequences, assignments, models, MySQL database and scripts, updated weekly.
Recent news
3rd June 2013 genetrainer being launched at LeWeb conference
The SUPERFAMILY database provides protein domain assignments, at the SCOP superfamily level, for the predicted protein sequences in over 400 completed genomes. A superfamily groups together domains of different families which have a common evolutionary ancestor based on structural, functional and sequence data. SUPERFAMILY domain assignments are generated using an expert-curated set of profile hidden Markov models. All models and structural assignments are available for browsing and download from http://supfam.org. The web interface includes services such as domain architectures and alignment details for all protein assignments, searchable domain combinations, domain occurrence network visualization, detection of over- or under-represented superfamilies for a given genome by comparison with other genomes, assignment of manually submitted sequences and keyword searches. In this update we describe the SUPERFAMILY database and outline two major developments: (i) incorporation of family level assignments and (ii) a superfamily-level functional annotation. The SUPERFAMILY database can be used for general protein evolution and superfamily-specific studies, genomic annotation, and structural genomics target suggestion and assessment.
being classified in CATH (e.g. http://www.cathdb.info/cgi-bin/cath/Chain.pl?chain_id=2g3aA).
For protein chains which are closely related to chains that are already chopped in CATH, an automated protocol has been developed (ChopClose). ChopClose identifies any previously chopped chains that have sufficiently high sequence identity and overlap with the query chain. Using SSAP (6), the query is aligned against each of these chains in turn, and in each case the domain boundaries are inherited across the alignment. The process of inheriting the boundaries often requires some adjustments to be made to account for insertions, deletions or unresolved residues. If the inheritance from one of the chains meets various criteria (SSAP score ≥80, sequence identity >80%, RMSD ≤6.0 Å, longest end extension ≤10 residues, etc.), then the resulting boundaries are used to chop the chain automatically (AutoChop). For cases where ChopClose's best result does not meet all the criteria for automatic chopping, it is provided as support information for a manual domain boundary assignment.
Refer to Figure 3 for the location of AutoChop/ChopClose within the CATH update protocol.
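The AutoChop decision can be pictured as a conjunction of threshold tests. The sketch below is not the CATH code: the class and field names are hypothetical, the thresholds are the four listed in the text (read as SSAP ≥80, identity >80%, RMSD ≤6.0 Å, end extension ≤10 residues), and the further criteria covered by "etc." are not modelled.

```python
from dataclasses import dataclass


@dataclass
class InheritedChopping:
    """Quality measures for boundaries inherited from one chopped chain."""
    ssap_score: float
    seq_identity: float         # percent
    rmsd: float                 # Å
    longest_end_extension: int  # residues


def can_autochop(result):
    """True if the result passes the four thresholds named in the text."""
    return (
        result.ssap_score >= 80
        and result.seq_identity > 80
        and result.rmsd <= 6.0
        and result.longest_end_extension <= 10
    )


def route(result):
    """Chop automatically, or pass the result on as support for manual work."""
    return "AutoChop" if can_autochop(result) else "manual (ChopClose support)"


print(route(InheritedChopping(92.0, 95.0, 1.2, 3)))
print(route(InheritedChopping(92.0, 75.0, 1.2, 3)))
```

Note that the identity test is strict (greater than 80%) while the others are inclusive, following the wording of the criteria.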
NEW HOMOLOGUE RECOGNITION METHODS
We have assessed a number of HMM-based protocols for improving homologue recognition. A new protocol (Samosa), exploiting models built using multiple structure alignments to improve accuracy, gives some improvements in sensitivity (4-5%). However, a protocol exploiting an 8-fold expanded HMM library based on sequence relatives of structural domains gives an increase of nearly 10% in sensitivity (12). In addition, HMM-HMM-based approaches have been implemented using the PRC protocol of Madera and co-workers (http://supfam.org/PRC). These allow recognition of extremely remote homologues, some of which are not easily detected by the structure comparison methods (discussed further below). HMM-based database scans developed for the CATH classification protocol are collectively referred to as HMMscan below.
For some very remotely related homologues, confidence in an assignment can be improved by combining information from multiple prediction methods. We have investigated the benefits of using machine learning methods to do this automatically. A neural network was trained using a dataset of 14 000 diverse homologues (<35% sequence identity) and 14 000 non-homologous pairs, with data from different homologue comparison methods including structure comparison (CATHEDRAL, SSAP), sequence comparison (HMM-HMM) and information on functional similarity. The latter was obtained by comparing EC classification codes between close relatives of the distant homologues and using a semantic similarity scoring scheme for comparing GO terms based on a method developed by Lord et al. (13). On a separate validation set of 14 000 homologous pairs and 14 000 non-homologous pairs, 97% of the homologues can be recognised at an error rate of <4%.
NEW UPDATE PROTOCOL
Automatic methods
Previously, CATH data was generated using a group of independent programs and flat files. Over the past two years we have developed an update protocol for CATH that is driven by a suite of programs with a central library and a PostgreSQL database system. A classification pipeline has been established which links, in a completely automated fashion, the different programs that analyse the sequences and structures of both protein chains and domains. The CATH update protocol can essentially be divided into two parts: domain boundary assignment and domain homology classification (see Figure 3). The aim of the protocol is to minimise manual assignment and provide as much support as possible when manual validation is necessary.
Processing in both parts of the classification protocol is similar, requiring related meta-data and the triggering of the same automated algorithms. Methods include pairwise sequence similarity comparisons and scans by other homologue detection or fold recognition algorithms, such as HMMscan and CATHEDRAL, that provide data for either manual or automated assignment. Many of the automated steps in the protocol have been established as a web service, and the pipeline integrates automated steps together with 'holding stages' in which domains are held prior to processing and await the completion of manual validation of predictions (see below).
Web pages to support manual validation
For each manual stage in the classification (domain boundary assignment, DomChop, and homologue recognition, HomCheck) we have developed a suite of web pages bringing together all available meta-data from prediction algorithms (e.g. DBS, CATHEDRAL, HMMscan for DomChop; CATHEDRAL, HMMscan for HomCheck) and information from the literature and from other family classifications with relevant data (e.g. Pfam). For each protein or domain shown on the pages, information on the statistical significance of matches is presented. The web pages will shortly be made viewable and will provide interim data on protein chains and domains not fully classified in CATH for biologists interested in any entries pending classification.
OVERVIEW OF THE CURRENT RELEASE (VERSION 3.0)
Assigning domain boundaries and relationships between protein structures is computationally challenging. Since the last CATH release (version 2.6), the number of domains in the CATH database has increased by 20% in version 3.0 and now totals 86 151. This is a more than 10-fold increase in the number of domains classified in CATH since its creation. Improvements in automation, and also in the web-based resources used to aid manual validation, have allowed us to increase the proportion of hard-to-classify structures processed in CATH, and this is reflected in a significant increase in the proportion of new folds; the database now contains more than 1000 folds. The detailed breakdown of numbers of domains in the nine CATH levels is given in Table 1. We conducted an analysis of the domains in version 3.0 and have derived statistics for several fundamental features.
Percentage of new topologies
An analysis of the percentage of new folds arising from the early 1970s to the present is shown in Figure 1. The number of new folds has been decreasing over time with respect to the number of new structures being deposited, and it can be seen that currently approximately 2% of new structures classified in CATH are observed to be novel folds. For comparison, the number of domain structures solved over time is also graphically represented in Figure 1.
Number of domains within a protein chain
Integral to the construction of the CATH database is designating domain boundaries. We conducted an analysis of the number of chains versus the number of domains in a chain. It is interesting to note that 64% of all protein structures currently solved and classified in CATH are single-domain chains (data not shown). The next most prevalent are two-domain chains (27%), and following this we find that the number of chains containing three or more domains rapidly decreases. The average size of the single-domain chains is 159 residues.
The CATH Dictionary of Homologous Superfamilies (CATH-DHS)
The CATH-DHS has also been recently updated. Data on structural similarity and superfamily variability is presented as a significant update to the Dictionary of Homologous Superfamilies (DHS) web resource (14). The DHS also provides functional annotations of domains within each H-level (superfamily) in CATH v2.5.1.
For each superfamily, pair-wise structural similarity scores between relatives, measured by SSAP, are presented. The DHS now contains 3307 multiple structural alignments for 1459 superfamilies. For each superfamily, multiple alignments are generated for all the relatives and also for subgroups of structurally similar relatives and sequence-similar relatives. Alignments are performed using the residue-based CORA algorithm (15) and presented both as CORAPLOTS (14) and in the form of a 2DSEC diagram (16), alongside co-ordinate data of the superposed structures in PDB format. Sequence representations of the alignments are available to download in FASTA format. In the CORAPLOT images of the multiple alignment, residues in each domain are coloured according to ligand binding and residue type. EquivSEC plots are also shown that describe the variability in orientation and packing between equivalent secondary structures (16).
To identify sequence relatives for CATH superfamilies, sequences from UniProt (17) were scanned against HMMs of all CATH domains (12). Homologous sequences were identified as those hits with an E-value <0.01 and a 60% residue overlap with the CATH domain. This protocol recognised over one million domain sequences in UniProt which could be integrated in the CATH-DHS. The harvested sequences in each superfamily were compared against other relatives by BLAST (18) to determine the pair-wise sequence identity and then clustered at appropriate levels of sequence identity (35% and 95%) using multi-linkage clustering. Information and links to other functional databases, ENZYME (19), GO (Gene Ontology Consortium 2000), KEGG (20), COG (21) and SWISSPROT (22), are also included by BLASTing the sequences from each superfamily against sequences provided by these resources. Only hits with 95% sequence identity and an 80% residue overlap were used to annotate sequences.
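The two acceptance rules in this paragraph can be written as simple predicates. The sketch below is illustrative only: the function names are hypothetical, the thresholds are read from the text as an E-value below 0.01 with at least 60% overlap for homologue harvesting and at least 95% identity with at least 80% overlap for annotation transfer, and overlap is assumed to mean aligned residues divided by the domain or sequence length.

```python
def overlap_fraction(aligned_residues, reference_length):
    """Fraction of the reference covered by the alignment (assumed definition)."""
    return aligned_residues / reference_length


def is_homologous_hit(e_value, overlap, max_e_value=0.01, min_overlap=0.60):
    """HMM hit accepted as a sequence relative of a CATH domain."""
    return e_value < max_e_value and overlap >= min_overlap


def can_transfer_annotation(identity_pct, overlap, min_identity=95.0, min_overlap=0.80):
    """BLAST hit strict enough to carry functional annotation across."""
    return identity_pct >= min_identity and overlap >= min_overlap


# A hit aligning 120 residues of a 150-residue domain with E-value 1e-5.
cov = overlap_fraction(120, 150)
print(is_homologous_hit(1e-5, cov))
print(can_transfer_annotation(97.0, cov))
```

The annotation rule is much stricter than the harvesting rule, which matches the intent of the text: many remote relatives are collected, but function is copied only from near-identical sequences.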
Recent analysis of structural and functional divergence in highly populated CATH superfamilies (>5 structural relatives with <35% sequence identity) has been undertaken using data from the DHS. The 2DSEC algorithm was used to analyse multiple structural alignments of families and identify highly conserved structural cores and secondary structure embellishments or decorations to the common core. In some large superfamilies, extensive embellishments were observed outside the core, and although these secondary structure insertions were frequently discontinuous in the protein chain, they were often co-located in 3D space (16). In many cases, manual inspection revealed that the embellishment had aggregated to form a larger structural feature that was modifying the active site of the domain or creating new surfaces for domain or protein interactions. Data collected in the DHS clearly shows a relationship between structural divergence within a superfamily, sequence divergence of this superfamily amongst predicted domains in the genomes, and the number of distinct functional groups that can be identified for the superfamily (see Figure 4).
Figure 4. Relationship between sequence variability, structural variability and functional diversity in CATH superfamilies. Structural variation in a CATH superfamily, as measured by the number of diverse structural subgroups (SSAP score <80 between groups)
LATERAL LINKS ACROSS THE CATH HIERARCHY TO CAPTURE EVOLUTIONARY DIVERGENCE AND EXPLORE THE STRUCTURAL CONTINUUM
Our analysis of structural divergence in CATH superfamilies (16) has revealed families where significant changes in the structures had occurred. In some cases, 5-fold differences in the sizes of domains were identified, and sometimes it was apparent that the 'folds' of these very diverse relatives had effectively changed. Therefore, in these superfamilies more than one fold group can be identified, effectively breaking the hierarchical nature of the CATH classification, which implies that each relative within a CATH homologous superfamily should belong to the same CAT fold group.
In addition, an 'all versus all' HMM-HMM scan between all superfamily representatives revealed several cases of extremely remote homologues which had been classified into separate superfamilies and yet match with significant E-values. Structure comparison had failed to detect the relationship between these superfamilies because the structural divergence of the relatives was so extreme, sometimes constituting a change in architecture as well as fold group. In these cases, homology was only suggested by the HMM-based scans and then manually validated by considering functional information and detailed evidence from literature. In order to capture information on these distant homologies, links have been created between the superfamilies, both on our web pages and in the CATH database. The data can now be found as a link from the CATH homepage (http://www.cathdb.info).
In the near future, we also plan to provide web pages presenting cases of significant structural overlaps between superfamilies or fold groups. For these cases we are not currently able to find any additional evidence to support a distant evolutionary relationship, and these examples highlight the recurrence of large structural motifs between some folds and the existence of a structural continuum in some regions of fold space.
FEATURES
The CATH database can be accessed at http://www.cathdb.info. The web interface may be browsed or, alternatively, searched with PDB codes or CATH domain identifiers. There is also a facility for keyword searches. With the version 3.0 release we now make the raw and processed data files available, which include, for example, CATH domain PDB files, sequences and dssp files, and they can be accessed through the CATH database main page. The Gene3D resource can be accessed through the CATH database or directly at http://www.cathdb.info/Gene3D. The DHS can be accessed through the CATH database or directly at http://www.cathdb.info/bsm/dhs.
Structural similarity is assessed using an automatic method (SSAP) (3,4) which scores 100 for
identical proteins and generally returns scores above 80 for homologous proteins. More distantly
related folds generally give scores above 70 (Topology or fold level), though in the absence of any
sequence or functional similarity this may simply represent examples of convergent evolution, reinforcing the hypothesis that there exists a limited number of folds in nature (5,6).
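As a rough illustration of how these SSAP score bands might be read, the sketch below maps a score to the interpretation given in the text. The function name and the returned labels are hypothetical, and the text itself says the cut-offs hold only "generally", so this is a guide rather than a classifier.

```python
def interpret_ssap(score):
    """Map an SSAP score (0-100) to the rough interpretation given in the text."""
    if not 0 <= score <= 100:
        raise ValueError("SSAP scores lie between 0 and 100")
    if score == 100:
        return "identical"
    if score > 80:
        return "likely homologous"
    if score > 70:
        return "similar fold (topology level)"
    return "no clear structural relationship"


for s in (100, 85, 75, 50):
    print(s, interpret_ssap(s))
```

A score above 70 without supporting sequence or functional similarity may reflect convergent evolution, so the fold-level label should not be read as evidence of common ancestry.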
Abstract
We report the latest release (version 1.4) of the CATH protein domains database
(http://www.biochem.ucl.ac.uk/bsm/cath). This is a hierarchical classification of 13 359 protein domain structures into evolutionary families and structural groupings. We currently identify 827
homologous families in which the proteins have both structural similarity and sequence and/or
functional similarity. These can be further clustered into 593 fold groups and 32 distinct architectures.
Using our structural classification and associated data on protein functions stored in the database (EC
identifiers, SWISS-PROT keywords and information from the Enzyme database and literature) we have been able to analyse the correlation between the 3D structure and function. More than 96% of
folds in the PDB are associated with a single homologous family. However, within the superfolds,
three or more different functions are observed. Considering enzyme functions, more than 95% of clearly homologous families exhibit either single or closely related functions, as demonstrated by the
EC identifiers of their relatives. Our analysis supports the view that determining structures, for
example as part of a 'structural genomics' initiative, will make a major contribution to interpreting genome data.
Introduction FOR CATH
The CATH classification of protein domain structures was established in 1993 (1) as a
hierarchical clustering of protein domain structures into evolutionary families and structural
groupings, depending on sequence and structure similarity. There are four major levels,
corresponding to protein class, architecture, topology or fold, and homologous family (Fig. 1).
Since 1995, information about these structural groups and protein families has been accessible
over the Web (http://www.biochem.ucl.ac.uk/bsm/cath), together with summary information
about each individual protein structure (PDBsum) (2).
CATH consists of both phylogenetic and phenetic descriptors for protein domain relationships.
At the lowest levels in the hierarchy, proteins are grouped into evolutionary families
(homologous families) for having either significant sequence similarity (≥35% identity) or high
structural similarity and some sequence similarity (≥20% identity).
The CATH database of protein structures contains approximately 18000 domains
organized according to their (C)lass, (A)rchitecture, (T)opology and (H)omologous
superfamily. Relationships between evolutionarily related structures (homologues) within the
database have been used to test the sensitivity of various sequence search methods in order to
identify relatives in Genbank and other sequence databases. Subsequent application of the most
sensitive and efficient algorithms, gapped BLAST and the profile-based method Position Specific
Iterated Basic Local Alignment Tool (PSI-BLAST), could be used to assign structural data to
between 22% and 36% of microbial genomes in order to improve functional annotation and
enhance understanding of biological mechanism. However, on a cautionary note, an analysis of
functional conservation within fold groups and homologous superfamilies in the CATH database
revealed that whilst function was conserved in nearly 55% of enzyme families, function had
diverged considerably in some highly populated families. In these families, functional properties
should be inherited far more cautiously, and the probable effects of substitutions in key functional
residues must be carefully assessed.
Figure 1
Schematic representation of the (C)lass, (A)rchitecture and (T)opology/fold levels in the CATH
database.
Figure 2
Snapshot of a web page showing data available in the CATH dictionary of homologous superfamilies
for the subtilisin family (CATH id 3.40.50.200). Tables display the PDB codes for non-identical relatives in the family, together with EC identifier codes and information about the enzyme reactions.
The multiple structural alignment shown has been coloured according to secondary structure
assignments (red for helix, blue for strands).
The Architecture level in CATH groups proteins whose folds have similar 3D arrangements of secondary structures (e.g. barrel, sandwich or propellor) regardless of their connectivity, whilst the
top level, Class, simply reflects the proportion of α-helix or β-strand secondary structures. Three
major classes are recognised, mainly-α, mainly-β and α-β, since analysis revealed considerable overlap between the α+β and alternating α/β classes originally described by Levitt and Chothia (7).
Before classification, multidomain proteins are first separated into their constituent folds using a
consensus method which seeks agreement between three independent algorithms (8). Whilst the protocol for updating CATH is largely automatic (9), several stages require manual validation, in
particular establishing domain boundaries in proteins for which no consensus could be reached and checking the relationships of very distant homologues and proteins having borderline fold similarity.
Although there are plans to assign the more regular architectures automatically, all architecture
groupings are currently assigned manually.
A homologous family Dictionary is now available within CATH which contains functional data, where available, for each protein within a homologous family. This includes EC identifiers, SWISS-PROT keywords and information from the Enzyme database or the literature (Fig. 2). Multiple
structure-based alignments are also available, coloured according to secondary structure assignments
or residue properties, and there are schematic plots showing domain representations annotated by protein-ligand interactions (DOMPLOTS) (A.E. Todd, C.A. Orengo and J.M. Thornton, submitted
to Protein Engng). The topology of each domain is illustrated by schematic TOPS diagrams
(http://www3.ebi.ac.uk/tops; 10).
Figure 3
CATH wheel plot showing the population of homologous families in different fold groups, architectures and classes. The wheel is coloured according to protein class (red, mainly-α; green,
mainly-β; yellow, αβ; blue, few secondary structures). The size of the outer wheel represents the
number of homologous families in CATH, whilst each band in the outer wheel corresponds to a single
fold family. The size of each fold 'band' therefore reflects the number of homologous families having
that fold. It can be seen that most fold families contain a single homologous family. The superfold families are shown as paler bands containing many homologous families. The inner wheel shows the
population of homologous families in the different architectures.
We have also recently set up a Web Server (11) which enables the user to scan the CATH database with a newly determined protein structure and identify possible fold similarities or evolutionary relationships. There are also plans to incorporate sequence searches (using BLAST or PSI-BLAST)
(12) to identify a probable fold for a new sequence.
The latest release of CATH (version 1.4, April 1998) contains 9342 protein chains from the PDB (13),
which divide into 13 359 domain folds. Currently 32 different architectures are recognised. Since the
last release, three new architectures have been described, including the five-bladed α-β propellor. Grouping proteins on the basis of sequence, structure and functional similarity gives 827 evolutionary
homologous families (H-level), whilst recognising more distant structural similarity with no
accompanying sequence or function similarity gives rise to 593 different fold groups (T-level).
The population of the different levels in the CATH hierarchy is illustrated by the CATH wheel shown
in Figure 3. It can be seen that several highly populated fold families, which we describe as superfolds (6) as they support a diverse range of sequences and more than three different functions, still account
for nearly 30% of non-homologous structures.
Implications for Structural Genomics
As the sequence databases grow rapidly, the need to interpret these sequences and assign functions to
specific genes becomes increasingly important. Many techniques exist for matching protein sequences and thereby inheriting functional information. However, for very distant homologues there is often no
detectable sequence similarity, despite conservation of 3D structure and function. For these cases,
evolutionary relationships, and thereby functions, can only be assigned by comparing the structures.
Therefore a number of structural genomics initiatives are being proposed (14) which aim to identify
all the folds in nature, with the ultimate goal of being able to predict the function of a new protein from
its known or probable structure. The important questions to ask are: how many more folds do we need
to determine before we have the complete set, and how confident can we be in assigning function between proteins having similar structures?
In the current genomes, on average only 30–46% of sequences can be assigned to a structural family
by recognising sequence similarity to a protein of known structure (15,16). With only ~600 unique
structures currently in the PDB, compared with ~20 000 sequence families, it is clear that we still
need to determine many more structures if we are to understand biology at the molecular level.
However, analysis of recently deposited structural data is very revealing. Figure 4a illustrates the
distribution of 2159 new structural domains classified in the 10 months from June 1997 to March
1998. A large proportion of these (79%) were clearly homologous (≥30% identity) to proteins of known structure.
Of the remaining 443 structures (Fig. 4b), corresponding to 'new' sequences, we found only 8% were
novel folds, the remainder resembling a previously determined structure. Many of these, 199 (45%),
could be identified as clear homologues by having significant structure and sequence similarity (SSAP
≥80 and ≥20% sequence identity). A further 169 (38%) were probable homologues as, although the sequence identity was below 20%, they had functional similarity and/or gave significant scores using
sequence search methods designed to detect very distant homologues (PSI-BLAST) (12). There
remained a further 40 (9%) proteins which were analogous, i.e. they had the same fold as a previous
entry but neither the sequence nor the function gave definite evidence of a common ancestor.
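The decision rules and counts in this paragraph can be sketched as follows, reading the bare numbers in parentheses as percentages of the 443 structures. The function and argument names are hypothetical, and the single boolean for "functional similarity and/or a significant PSI-BLAST score" is a simplification of the curated judgement described in the text.

```python
def classify_new_structure(same_fold, ssap, seq_identity, other_evidence):
    """Assign one of the four categories described in the text.

    other_evidence stands for functional similarity and/or a significant
    score from a distant-homologue search such as PSI-BLAST.
    """
    if not same_fold:
        return "novel fold"
    if ssap >= 80 and seq_identity >= 20:
        return "clear homologue"
    if seq_identity < 20 and other_evidence:
        return "probable homologue"
    return "analogue"


# Counts quoted in the text for the 443 structures with new sequences.
total = 443
counts = {"clear homologue": 199, "probable homologue": 169, "analogue": 40}
counts["novel fold"] = total - sum(counts.values())  # the remainder

percentages = {name: round(100 * n / total) for name, n in counts.items()}
print(counts["novel fold"], percentages)
```

The remainder works out to 35 structures, which rounds to the 8% of novel folds, and the clear and probable homologues together give 45% + 38% = 83%.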
At the homologous superfamily level in CATH, a more detailed analysis of enzyme functions showed
that the majority of homologous enzyme families in CATH (>90%) contained proteins for which the
first three EC identifiers were the same. Considering those families where homologues have
significant sequence identity (≥20%) after structural alignment, 95% were found to have a single EC
identifier, whilst for families where proteins have more than 30% sequence similarity we observed
that 98% had a single EC code.
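The test "the first three EC identifiers were the same" can be illustrated with a small check on dotted EC numbers. This is a sketch only: the function name is hypothetical and the EC numbers below are arbitrary strings chosen to show the comparison, not data from CATH.

```python
def shares_first_three_ec_levels(ec_numbers):
    """True if every EC number in the family agrees on its first three fields."""
    prefixes = {tuple(ec.split(".")[:3]) for ec in ec_numbers}
    return len(prefixes) == 1


# Two members differing only in the fourth field (the specific substrate).
print(shares_first_three_ec_levels(["3.4.21.62", "3.4.21.14"]))
# Two members from different enzyme classes.
print(shares_first_three_ec_levels(["3.4.21.62", "1.1.1.1"]))
```

A family passing this check has members with closely related reactions, whereas a family with a single EC identifier would contain exactly one distinct four-field number.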
Although assigning function on the basis of homology is common practice, it is clear that some
caution should be exercised, particularly where there is little or no sequence similarity. There are also
some clear examples where homologues with significant sequence similarity perform different
functions. The role of 'gene recruitment' is especially clear in the eye lens proteins, which function as
enzymes in other cellular environments but which are used as structural proteins in this context (17).
The extent of such 'gene recruitment' and context-sensitive function is really not known at this time.
For enzymes it is clear that catalytic function can change and evolve, usually to act on a different but
related substrate. Similarly, within the lipocalin family (CATH id 2.40.130.10) several proteins are found with very similar structures which bind different fatty acids in the same region at the base of
the β-barrel (e.g. retinol, bilin, biotin).
Nearly half of the homologous families where two or more different EC numbers were observed
belong to the superfolds. This suggests that if a new protein is assigned to a superfold family, more
caution should be used when inheriting functional information, as there appears to be greater tolerance to changes in sequence, and ultimately function, for these families. However, it is interesting to note
that many of these were TIM barrel or Rossmann folds. These are superfolds in which the substrate or
ligand commonly binds in the same place. This is in the base of the β-barrel for the TIMs and at the
crossover of the polypeptide chain for the doubly wound Rossmann structures.
Assignment of Function Through Structure
One of the reasons for determining structures is to derive more information to facilitate the assignmentof function From our analysis of proteins in CATH we suggest that structural data can help to assign
function in several ways
i The structural data allow recognition of more distant homologues compared withsequence data mdash in our analysis 83 of structures with novel sequences could be assigned ashomologues in this way (note that such assignment of function is again subject to the caveats
imposed by gene recruitmentlsquo discussed above)
ii The structural data allows detailed inspection of the functional site mdash to suggest if and
how the function may have evolved For example if an enzyme has evolved to act on a
different substrate the binding site may reveal or at least suggest possible changes in the
substrate
iii For the superfolds similarity of structure does not necessarily mean similarity of
function However the active sitebinding sites are often conserved eg in the TIM barrel or
Rossmann fold structures the ligand always binds at the same end of the barrel or sheet
iv Some methods have already been developed and will increasingly be the focus of attention over the next few years which aim to predict function ab initio from structure For
example enzymes can often be identified by the presence of a major cleft which also locates
the active site (18) Similarly critical surface patches which are used for molecular
recognition in binding other proteins or ligands may be identified using knowledge-basedapproaches (1920)
In summary, extrapolating the data from Figure 4 to a new genome, we can expect that of the 54–70%
of sequences which currently have no obvious sequence matches in the PDB, we will find nearly 80–90% to be homologous to a known family using the structural data alone. For the singlet folds this
will almost certainly reveal some clues to the function. For the superfolds, some folds will reveal
information on the functional class (e.g. enzyme for TIM barrels) or the location of the active site, if
not the specific function. Only 10–20% will be expected to be 'novel' folds. For these, the ab
initio methods referred to above may provide some clues to guide experiments. Therefore it is clear
that determining structures, as part of a 'structural genomics' initiative for example, will make a
major contribution to interpreting genome data.
JAIL
Just Another Interface Library
Interfaces of macromolecules are a valuable basis to analyse the process of molecular recognition. JAIL classifies not only the interfaces between domain architectures but also those between protein chains and those between proteins and nucleic acids.
Gt-alpha/Gi-alpha chimera (PDB-ID 1GOT). Interfaces of 1GOT.
Interacting proteins are difficult to crystallize and rarely present within the Protein Data Bank.
Nevertheless, it is essential to analyse the interacting parts of the proteins to understand the
process of protein-protein docking.
To overcome this problem we have built up the JAIL database. Since interacting domains
exhibit structural features similar to those of interacting proteins, all known interfaces between interacting
domains of the SCOP database were extracted and classified in JAIL.
Only a part of all protein structures is included in SCOP. In particular, new PDB entries are
not yet annotated. To overcome this problem, all interfaces between protein
chains were additionally calculated and included in the database. This type of interface also comprises
the interacting parts of the assumed biological units. The last important type of interface
provided here is composed of the interacting parts between proteins and nucleic acids.
Overall, the data set consists of about 180000 interfaces.
JAIL is a convenient tool to browse through the interface library and to analyze single
interfaces. However, more general questions require large-scale analysis. For this purpose,
a detailed form enables the compiling of comprehensive non-redundant data sets for
download.
How is an interface defined?
A complete residue is part of an interface if at least one atom of the amino acid is located
within a range of 4.5 Ångström of any atom of the interacting domain or chain. One part of
an interface must consist of at least 5 C-alpha atoms in the case of protein chains. In the case of nucleic acids, the relevant atoms are calculated on the basis of the phosphorus atoms
of the RNA/DNA backbone.
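The distance criterion above can be sketched in plain Python, reading the cutoff as 4.5 Ångström. The data layout (a dictionary of residue identifiers mapped to atom coordinates) and the function name are assumptions made for illustration and are not part of JAIL; the additional requirement of at least 5 C-alpha atoms per interface side is not implemented here.

```python
import math


def interface_residues(residues, partner_atoms, cutoff=4.5):
    """Return the ids of residues with at least one atom within `cutoff` of the partner.

    residues:      dict mapping a residue id to a list of (x, y, z) atom coordinates
    partner_atoms: list of (x, y, z) coordinates of the interacting domain or chain
    """
    found = []
    for res_id, atoms in residues.items():
        # The whole residue counts as soon as any one of its atoms is close enough.
        if any(math.dist(a, p) <= cutoff for a in atoms for p in partner_atoms):
            found.append(res_id)
    return found


residues = {
    "A:10": [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],  # second atom lies 3.0 from the partner
    "A:11": [(10.0, 0.0, 0.0)],                  # 6.0 away, outside the cutoff
}
partner_atoms = [(4.0, 0.0, 0.0)]
print(interface_residues(residues, partner_atoms))
```

The all-against-all distance loop is adequate for a sketch; a real calculation over whole chains would normally use a spatial index to avoid comparing every atom pair.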
What are biounits?
The primary coordinate file deposited in the PDB generally contains one asymmetric unit.
The asymmetric unit is the smallest portion of a crystal structure to which crystallographic
symmetry can be applied to generate one cell. The biological molecule (biounit) is believed
to be the functional unit of the protein. Frequently those units can be assumed or calculated when additional information is available. The biological units of many proteins are deposited
in a separate section of the PDB database and can be used for interface calculations.
In which way are redundant interfaces excluded in the download section?
The redundancy is excluded in two different ways: by structure and by sequence. The
sequential clustering is based on the CD-HIT program. The structural clustering is defined by
the protein families and superfamilies of the SCOP classification. The database classifies
proteins by domain architecture.
Which settings in the download section are best for my own research?
The selection of the datasets depends on the type of interactions (protein-protein or protein-
nucleic acids) and the level of diversity that is desired. A maximum sequence identity of 50%
results in a higher diversity than a setting of 95%. The default settings include interfaces of
domain-domain interactions as well as interfaces between interacting chains. All interfaces of
chains that were already treated by the SCOP domain interfaces are excluded by default.
This procedure results in a high number of interfaces that are still diverse enough for
statistical analysis.
What is meant by 'show conservation' in Jmol?
The conservation of protein sequences is defined by the mutation rates at each amino acid
position. For JAIL, this information was retrieved from ConSurf. ConSurf is a derived
database merging structural and sequence information.
Search for:
PDB-Id, e.g. 1aay
SCOP-Id, e.g. d1az0a_
EC-Number, e.g. 3.1.1
Accession number, e.g. P03697
Protein name, e.g. capsid protein
Search in the following interface types:
Domain/Domain (SCOP)
Chain/Chain
Protein/Nucleic
Biounit/Biounit
None
Full-text search and SCOP text search by keyword are also available.
Searches can be restricted to interfaces having the following location: intra, inter, or don't care.
MMDB
Experimentally resolved structures of proteins, RNA and DNA, derived from the Protein Data Bank (PDB), with value-added features such as explicit chemical graphs, computationally
identified 3D domains (compact substructures) that are used to identify similar 3D structures, as well as links to literature, similar sequences, information about chemicals bound to the structures, and more. These connections make it possible, for example, to find 3D structures for homologs of a protein sequence of interest, then interactively view the sequence-structure relationships, active sites, bound chemicals, journal articles and more.
Three-dimensional structures are now known within many protein families, and it is
quite likely, in searching a sequence database, that one will encounter a homolog
with known structure. The goal of Entrez's 3D-structure database is to make this
information, and the functional annotation it can provide, easily accessible to molecular biologists. To this end, Entrez's search engine provides three powerful
features. (i) Sequence and structure neighbors: one may select all sequences similar
to one of interest, for example, and link to any known 3D structures. (ii) Links
between databases: one may search by term matching in MEDLINE, for example, and
link to 3D structures reported in these articles. (iii) Sequence and structure
visualization: identifying a homolog with known structure, one may view molecular-
graphic and alignment displays to infer approximate 3D structure. In this article we
focus on two features of Entrez's Molecular Modeling Database (MMDB) not
described previously: links from individual biopolymer chains within 3D structures to a systematic taxonomy of organisms represented in molecular databases, and
links from individual chains (and compact 3D domains within them) to structure
neighbors, other chains (and 3D domains) with similar 3D structure. MMDB may be
accessed at http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Structure
SUPERFAMILY is a database of structural and functional annotation for all
proteins and genomes.[1][2][3][4][5]
The SUPERFAMILY annotation is based on a collection of hidden Markov models which
represent structural protein domains at the SCOP superfamily level.[6] A superfamily groups
together domains which have an evolutionary relationship. The annotation is produced by
scanning protein sequences from completely sequenced genomes against the hidden Markov
models.
For each protein you can
Submit sequences for SCOP classification
View domain organisation sequence alignments and protein sequence details
For each genome you can
Examine superfamily assignments phylogenetic trees domain organisation lists and
networks
Check for over- and under-represented superfamilies within a genome
For each superfamily you can
Inspect SCOP classification, functional annotation, Gene Ontology annotation, InterPro
abstract and genome assignments
Explore taxonomic distribution of a superfamily across the tree of life
All annotation, models and the database dump are freely available for download to everyone.
Purpose
SUPERFAMILY classifies amino acid sequences into known structural domains, especially
into SCOP superfamilies. The superfamilies are groups of proteins which have structural
evidence to support a common evolutionary ancestor but may not have detectable
sequence homology.
Major Features
Sequence search: Submit your protein or DNA sequence for SCOP superfamily and family level classification.
Keyword search: Search for superfamily, family or species names, plus sequence, SCOP, PDB or hidden Markov model IDs.
Domain assignments: Domain assignments, alignments and architectures for completely sequenced eukaryotic and prokaryotic organisms, plus sequence collections.
Comparative genomics tools: Browse unusual (over- and under-represented) superfamilies and families, adjacent domain pair lists and graphs, unique domain pairs, domain combinations, domain architecture co-occurrence networks and domain distribution across taxonomic kingdoms for each organism.
Genome statistics: For each genome: number of sequences, number of sequences with assignment, percentage of sequences with assignment, percentage total sequence coverage, number of domains assigned, number of superfamilies assigned, number of families assigned, average superfamily size, percentage produced by duplication, average sequence length, average length matched, number of domain pairs and number of unique domain architectures.
Gene Ontology: Domain-centric Gene Ontology (GO), automatically annotated, by Hai Fang.
Phenotype Ontology: Domain-centric phenotype/anatomy ontology, including Disease Ontology, Human Phenotype, Mouse Phenotype, Worm Phenotype, Yeast Phenotype, Fly Phenotype, Fly Anatomy, Zebrafish Anatomy, Xenopus Anatomy and Arabidopsis Plant.
Superfamily annotation: InterPro abstracts for 1052 superfamilies and Gene Ontology (GO) annotation for 763 superfamilies.
Functional annotation: Functional annotation of SCOP 1.73 superfamilies by Christine Vogel.
Phylogenetic trees: Trees are generated using heuristic parsimony methods and are based on protein domain architecture data for all genomes in SUPERFAMILY. Genome combinations or specific clades can be displayed as individual trees.
Similar domain architectures: Find the 10 domain architectures which are most similar to a domain architecture of interest.
Hidden Markov models: Produce SCOP domain assignments for your sequences using the SUPERFAMILY models. HMM visualisation by Martin Madera, e.g. model 0045110.
Profile comparison: Find remote domain matches when the HMM search fails to find a significant match. Profile comparison (PRC) for aligning and scoring two profile hidden Markov models, by Martin Madera.
Web services: Distributed Annotation Server and linking to SUPERFAMILY.
Downloads: Sequences, assignments, models, MySQL database and scripts, updated weekly.
The SUPERFAMILY database provides protein domain assignments, at the SCOP superfamily
level, for the predicted protein sequences in over 400 completed genomes. A superfamily groups
together domains of different families which have a common evolutionary ancestor based on
structural, functional and sequence data. SUPERFAMILY domain assignments are generated
using an expert curated set of profile hidden Markov models. All models and structural
assignments are available for browsing and download from http://supfam.org. The web interface
includes services such as domain architectures and alignment details for all protein assignments,
searchable domain combinations, domain occurrence network visualization, detection of over- or
under-represented superfamilies for a given genome by comparison with other genomes,
assignment of manually submitted sequences and keyword searches. In this update we describe
the SUPERFAMILY database and outline two major developments: (i) incorporation of family
level assignments and (ii) a superfamily-level functional annotation. The SUPERFAMILY
database can be used for general protein evolution and superfamily-specific studies, genomic
annotation, and structural genomics target suggestion and assessment.
scheme for comparing GO terms based on a method developed by Lord et al. (13). On a separate validation set of 14 000 homologous pairs and 14 000 non-homologous pairs, 97% of the homologues can be recognised at an error rate of <4%.
NEW UPDATE PROTOCOL
Automatic methods
Previously, CATH data was generated using a group of independent programs and flat files. Over the past two years we have developed an update protocol for CATH that is driven by a suite of programs with a central library and a PostgreSQL database system. A classification pipeline has been established which links, in a completely automated fashion, the different programs that analyse the sequences and structures of both protein chains and domains. The CATH update protocol can essentially be divided into two parts: domain boundary assignment and domain homology classification (see Figure 3). The aim of the protocol is to minimise manual assignment and provide as much support as possible when manual validation is necessary.
Processing in both parts of the classification protocol is similar, requiring related meta-data and the triggering of the same automated algorithms. Methods include pairwise sequence similarity comparisons and scans by other homologue detection or fold recognition algorithms, such as HMMscan and CATHEDRAL, that provide data for either manual or automated assignment. Many of the automated steps in the protocol have been established as a web service, and the pipeline integrates the automated steps together with 'holding stages' in which domains are held prior to processing and await the completion of manual validation of predictions (see below).
Web pages to support manual validation
For each manual stage in the classification (domain boundary assignment, DomChop, and homologue recognition, HomCheck) we have developed a suite of web pages bringing together all available meta-data from prediction algorithms (e.g. DBS, CATHEDRAL and HMMscan for DomChop; CATHEDRAL and HMMscan for HomCheck), information from the literature, and information from other family classifications with relevant data (e.g. Pfam). For each protein or domain shown on the pages, information on the statistical significance of matches is presented. The web pages will shortly be made viewable and will provide interim data on protein chains and domains not fully classified in CATH, for biologists interested in any entries pending classification.
OVERVIEW OF THE CURRENT RELEASE (VERSION 3.0)
Assigning domain boundaries and relationships between protein structures is computationally challenging. Since the last CATH release (version 2.6), the number of domains in the CATH database has increased by 20% in version 3.0 and now totals 86 151. This is a more than 10-fold increase in the number of domains classified in CATH since its creation. Improvements in automation, and also in the web-based resources used to aid manual validation, have allowed us to increase the proportion of hard-to-classify structures processed in CATH, and this is reflected in a significant increase in the proportion of new folds in the database (now more than 1000). The detailed breakdown of numbers of domains in the nine CATH levels is given in Table 1. We conducted an analysis of the domains in version 3.0 and have derived statistics for several fundamental features.
Percentage of new topologies
An analysis of the percentage of new folds arising from the early 1970s to the present is shown in Figure 1. The number of new folds has been decreasing over time with respect to the number of new structures being deposited, and it can be seen that currently approximately 2% of new structures classified in CATH are observed to be novel folds. For comparison, the number of domain structures solved over time is also graphically represented in Figure 1.
Number of domains within a protein chain
Integral to the construction of the CATH database is designating domain boundaries. We conducted an analysis of the number of chains versus number of domains in a chain. It is interesting to note that 64% of all protein structures currently solved and classified in CATH are single domain chains (data not shown). The next most prevalent are two domain chains (27%), and following this we find that the number of chains containing three or more domains rapidly decreases. The average size of the single domain chains is 159 residues.
The CATH Dictionary of Homologous Superfamilies (CATH-DHS)
The CATH-DHS has also been recently updated. Data on structural similarity and superfamily variability is presented as a significant update to the Dictionary of Homologous Superfamilies (DHS) web-resource (14). The DHS also provides functional annotations of domains within each H-level (superfamily) in CATH v2.5.1.
For each superfamily, pair-wise structural similarity scores between relatives, measured by SSAP, are presented. The DHS now contains 3307 multiple structural alignments for 1459 superfamilies. For each superfamily, multiple alignments are generated for all the relatives and also for subgroups of structurally similar relatives and sequence similar relatives. Alignments are performed using the residue-based CORA algorithm (15) and presented both as CORAPLOTS (14) and in the form of a 2DSEC diagram (16), alongside co-ordinate data of the superposed structures in PDB format. Sequence representations of the alignments are available to download in FASTA format. In the CORAPLOT images of the multiple alignment, residues in each domain are coloured according to ligand binding and residue type. EquivSEC plots are also shown that describe the variability in orientation and packing between equivalent secondary structures (16).
To identify sequence relatives for CATH superfamilies, sequences from UniProt (17) were scanned against HMMs of all CATH domains (12). Homologous sequences were identified as those hits with an E-value < 0.01 and a 60% residue overlap with the CATH domain. This protocol recognised over one million domain sequences in UniProt which could be integrated in the CATH-DHS. The harvested sequences in each superfamily were compared against other relatives by BLAST (18) to determine the pair-wise sequence identity and then clustered at appropriate levels of sequence identity (35% and 95%) using multi-linkage clustering. Information and links to other functional databases, ENZYME (19), GO (Gene Ontology Consortium, 2000), KEGG (20), COG (21) and SWISS-PROT (22), are also included by BLASTing the sequences from each superfamily against sequences provided by these resources. Only hits with 95% sequence identity and an 80% residue overlap were used to annotate sequences.
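The hit-selection rule described above can be sketched in a few lines of Python. This is only an illustration: the `HmmHit` record, its field names and the `harvest` helper are hypothetical, and the thresholds (E-value below 0.01, at least 60% of the CATH domain covered) are the values as read from the text, not parameters taken from the CATH software.

```python
from dataclasses import dataclass


@dataclass
class HmmHit:
    """One sequence-versus-HMM match (hypothetical record layout)."""
    sequence_id: str
    superfamily: str
    e_value: float
    aligned_residues: int  # residues of the CATH domain covered by the hit
    domain_length: int     # length of the CATH domain


def is_homologue(hit, max_e_value=0.01, min_overlap=0.60):
    """Apply the two criteria from the text: E-value and residue overlap."""
    overlap = hit.aligned_residues / hit.domain_length
    return hit.e_value < max_e_value and overlap >= min_overlap


def harvest(hits):
    """Group the sequence ids of accepted hits by superfamily."""
    by_superfamily = {}
    for hit in hits:
        if is_homologue(hit):
            by_superfamily.setdefault(hit.superfamily, []).append(hit.sequence_id)
    return by_superfamily


hits = [
    HmmHit("seqA", "3.40.50.200", 1e-20, 180, 200),  # accepted
    HmmHit("seqB", "3.40.50.200", 1e-20, 100, 200),  # rejected: 50% overlap
    HmmHit("seqC", "3.40.50.200", 0.5, 190, 200),    # rejected: weak E-value
]
accepted = harvest(hits)
```

Keeping the two thresholds as keyword arguments makes it easy to rerun the same filter with the stricter annotation criteria mentioned at the end of the paragraph.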
Recent analysis of structural and functional divergence in highly populated CATH superfamilies (>5 structural relatives with <35% sequence identity) has been undertaken using data from the DHS. The 2DSEC algorithm was used to analyse multiple structural alignments of families and identify highly conserved structural cores and secondary structure embellishments or decorations to the common core. In some large superfamilies extensive embellishments were observed outside the core, and although these secondary structure insertions were frequently discontinuous in the protein chain they were often co-located in 3D space (16). In many cases manual inspection revealed that the embellishment had aggregated to form a larger structural feature that was modifying the active site of the domain or creating new surfaces for domain or protein interactions. Data collected in the DHS clearly shows a relationship between structural divergence within a superfamily, sequence divergence of this superfamily amongst predicted domains in the genomes, and the number of distinct functional groups that can be identified for the superfamily (see Figure 4).
Figure 4. Relationship between sequence variability, structural variability and functional diversity in CATH superfamilies. Structural variation in a CATH superfamily is measured by the number of diverse structural subgroups (SSAP score <80 between groups).
LATERAL LINKS ACROSS THE CATH HIERARCHY TO CAPTURE EVOLUTIONARY DIVERGENCE AND EXPLORE THE STRUCTURAL CONTINUUM
Our analysis of structural divergence in CATH superfamilies (16) has revealed families where significant changes in the structures had occurred. In some cases 5-fold differences in the sizes of domains were identified, and sometimes it was apparent that the 'folds' of these very diverse relatives had effectively changed. Therefore, in these superfamilies more than one fold group can be identified, effectively breaking the hierarchical nature of the CATH classification, which implies that each relative within a CATH homologous superfamily should belong to the same CAT fold group.
In addition, an 'all versus all' HMM–HMM scan between all superfamily representatives revealed several cases of extremely remote homologues which had been classified into separate superfamilies and yet match with significant E-values. Structure comparison had failed to detect the relationship between these superfamilies because the structural divergence of the relatives was so extreme, sometimes constituting a change in architecture as well as fold group. In these cases homology was only suggested by the HMM-based scans and then manually validated by considering functional information and detailed evidence from literature. In order to capture information on these distant homologies, links have been created between the superfamilies both on our web pages and in the CATH database. The data can now be found as a link from the CATH homepage (http://www.cathdb.info).
In the near future we also plan to provide web pages presenting cases of significant structural overlaps between superfamilies or fold groups. For these cases we are not currently able to find any additional evidence to support a distant evolutionary relationship, and these examples highlight the recurrence of large structural motifs between some folds and the existence of a structural continuum in some regions of fold space.
FEATURES
The CATH database can be accessed at http://www.cathdb.info. The web interface may be browsed or alternatively searched with PDB codes or CATH domain identifiers. There is also a facility for keyword searches. With the version 3.0 release we now make the raw and processed data files available, which include for example CATH domain PDB files, sequences and dssp files; they can be accessed through the CATH database main page. The Gene3D resource can be accessed through the CATH database or directly at http://www.cathdb.info/Gene3D. The DHS can be accessed through the CATH database or directly at http://www.cathdb.info/bsm/dhs.
Structural similarity is assessed using an automatic method (SSAP) (3,4) which scores 100 for identical proteins and generally returns scores above 80 for homologous proteins. More distantly related folds generally give scores above 70 (Topology or fold level), though in the absence of any sequence or functional similarity this may simply represent examples of convergent evolution, reinforcing the hypothesis that there exists a limited number of folds in nature (5,6).
Abstract
We report the latest release (version 1.4) of the CATH protein domains database (http://www.biochem.ucl.ac.uk/bsm/cath). This is a hierarchical classification of 13 359 protein domain structures into evolutionary families and structural groupings. We currently identify 827 homologous families in which the proteins have both structural similarity and sequence and/or functional similarity. These can be further clustered into 593 fold groups and 32 distinct architectures. Using our structural classification and associated data on protein functions stored in the database (EC identifiers, SWISS-PROT keywords and information from the Enzyme database and literature) we have been able to analyse the correlation between the 3D structure and function. More than 96% of folds in the PDB are associated with a single homologous family. However, within the superfolds three or more different functions are observed. Considering enzyme functions, more than 95% of clearly homologous families exhibit either single or closely related functions, as demonstrated by the EC identifiers of their relatives. Our analysis supports the view that determining structures, for example as part of a 'structural genomics' initiative, will make a major contribution to interpreting genome data.
Introduction FOR CATH
The CATH classification of protein domain structures was established in 1993 (1) as a hierarchical clustering of protein domain structures into evolutionary families and structural groupings, depending on sequence and structure similarity. There are four major levels, corresponding to protein class, architecture, topology or fold, and homologous family (Fig. 1). Since 1995 information about these structural groups and protein families has been accessible over the Web (http://www.biochem.ucl.ac.uk/bsm/cath), together with summary information about each individual protein structure (PDBsum) (2).
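The four-level hierarchy maps naturally onto a four-part identifier. The sketch below assumes the dotted form C.A.T.H (for example 3.40.50.200, the subtilisin family identifier quoted in these notes with its dots restored); the `CathId` type and both helper functions are hypothetical names used only for illustration.

```python
from typing import NamedTuple, Optional


class CathId(NamedTuple):
    """The four major CATH levels, most general first."""
    class_: int
    architecture: int
    topology: int
    homologous_family: int


def parse_cath_id(text: str) -> CathId:
    """Parse a dotted identifier such as '3.40.50.200'."""
    parts = text.strip().split(".")
    if len(parts) != 4:
        raise ValueError(f"expected four dot-separated numbers, got {text!r}")
    return CathId(*(int(p) for p in parts))


def deepest_shared_level(a: CathId, b: CathId) -> Optional[str]:
    """Name of the most specific level two domains share, or None."""
    names = ["class", "architecture", "topology", "homologous family"]
    deepest = None
    for name, x, y in zip(names, a, b):
        if x != y:
            break
        deepest = name
    return deepest


subtilisin = parse_cath_id("3.40.50.200")
other = parse_cath_id("3.40.50.300")  # hypothetical second family in the same fold
shared = deepest_shared_level(subtilisin, other)
```

Because the levels are nested, two domains can only share a homologous family if they already share class, architecture and topology, which is why the comparison stops at the first mismatch.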
CATH consists of both phylogenetic and phenetic descriptors for protein domain relationships. At the lowest levels in the hierarchy, proteins are grouped into evolutionary families (homologous families) for having either significant sequence similarity (≥35% identity) or high structural similarity and some sequence similarity (≥20% identity).
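The grouping rule above can be written as a small predicate. This is a sketch under one stated assumption: the text only says "high structural similarity", and the SSAP cutoff of 80 used here is borrowed from the SSAP thresholds quoted elsewhere in these notes. The function name and parameters are hypothetical.

```python
def same_homologous_family(seq_identity: float, ssap_score: float,
                           ssap_cutoff: float = 80.0) -> bool:
    """Return True if two domains meet the family criteria from the text.

    seq_identity is a percentage (0-100); ssap_score is on the 0-100 SSAP scale.
    The default ssap_cutoff of 80 is an assumption, see the note above.
    """
    # Route 1: significant sequence similarity on its own.
    if seq_identity >= 35.0:
        return True
    # Route 2: high structural similarity plus some sequence similarity.
    return ssap_score >= ssap_cutoff and seq_identity >= 20.0


examples = {
    "close relatives": same_homologous_family(42.0, 75.0),
    "structure-supported": same_homologous_family(24.0, 86.0),
    "too divergent": same_homologous_family(12.0, 86.0),
}
```

The third example shows why structure alone is not sufficient at this level: a pair with a high SSAP score but under 20% identity would be placed at the fold level unless other evidence supports homology.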
The CATH database of protein structures contains approximately 18 000 domains organized according to their (C)lass, (A)rchitecture, (T)opology and (H)omologous superfamily. Relationships between evolutionarily related structures (homologues) within the database have been used to test the sensitivity of various sequence search methods in order to identify relatives in GenBank and other sequence databases. Subsequent application of the most sensitive and efficient algorithms, gapped BLAST and the profile-based method Position Specific Iterated Basic Local Alignment Tool (PSI-BLAST), could be used to assign structural data to between 22% and 36% of microbial genomes in order to improve functional annotation and enhance understanding of biological mechanism. However, on a cautionary note, an analysis of functional conservation within fold groups and homologous superfamilies in the CATH database revealed that whilst function was conserved in nearly 55% of enzyme families, function had diverged considerably in some highly populated families. In these families functional properties should be inherited far more cautiously, and the probable effects of substitutions in key functional residues must be carefully assessed.
Figure 1
Schematic representation of the (C)lass, (A)rchitecture and (T)opology/fold levels in the CATH database.
Figure 2
Snapshot of a web page showing data available in the CATH dictionary of homologous superfamilies for the subtilisin family (CATH id 3.40.50.200). Tables display the PDB codes for non-identical relatives in the family, together with EC identifier codes and information about the enzyme reactions. The multiple structural alignment shown has been coloured according to secondary structure assignments (red for helix, blue for strands).
The Architecture level in CATH groups proteins whose folds have similar 3D arrangements of secondary structures (e.g. barrel, sandwich or propellor) regardless of their connectivity, whilst the top level, Class, simply reflects the proportion of α-helix or β-strand secondary structures. Three major classes are recognised, mainly-α, mainly-β and α-β, since analysis revealed considerable overlap between the α+β and alternating α/β classes originally described by Levitt and Chothia (7).
Before classification, multidomain proteins are first separated into their constituent folds using a consensus method which seeks agreement between three independent algorithms (8). Whilst the protocol for updating CATH is largely automatic (9), several stages require manual validation, in particular establishing domain boundaries in proteins for which no consensus could be reached and checking the relationships of very distant homologues and proteins having borderline fold similarity. Although there are plans to assign the more regular architectures automatically, all architecture groupings are currently assigned manually.
A homologous family Dictionary is now available within CATH which contains functional data, where available, for each protein within a homologous family. This includes EC identifiers, SWISS-PROT keywords and information from the Enzyme database or the literature (Fig. 2). Multiple structure-based alignments are also available, coloured according to secondary structure assignments or residue properties, and there are schematic plots showing domain representations annotated by protein-ligand interactions (DOMPLOTS) (A.E. Todd, C.A. Orengo and J.M. Thornton, submitted to Protein Engng). The topology of each domain is illustrated by schematic TOPS diagrams (http://www3.ebi.ac.uk/tops/; 10).
Figure 3
CATH wheel plot showing the population of homologous families in different fold groups, architectures and classes. The wheel is coloured according to protein class (red, mainly-α; green, mainly-β; yellow, αβ; blue, few secondary structures). The size of the outer wheel represents the number of homologous families in CATH, whilst each band in the outer wheel corresponds to a single fold family. The size of each fold 'band' therefore reflects the number of homologous families having that fold. It can be seen that most fold families contain a single homologous family. The superfold families are shown as paler bands containing many homologous families. The inner wheel shows the population of homologous families in the different architectures.
We have also recently set up a Web Server (11) which enables the user to scan the CATH database with a newly determined protein structure and identify possible fold similarities or evolutionary relationships. There are also plans to incorporate sequence searches (using BLAST or PSI-BLAST) (12) to identify a probable fold for a new sequence.
The latest release of CATH (version 1.4, April 1998) contains 9342 protein chains from the PDB (13) which divide into 13 359 domain folds. Currently 32 different architectures are recognised. Since the last release three new architectures have been described, including the five-bladed α-β propellor. Grouping proteins on the basis of sequence, structure and functional similarity gives 827 evolutionary homologous families (H-level), whilst recognising more distant structural similarity with no accompanying sequence or function similarity gives rise to 593 different fold groups (T-level).
The population of the different levels in the CATH hierarchy is illustrated by the CATH wheel shown in Figure 3. It can be seen that several highly populated fold families, which we describe as superfolds (6) as they support a diverse range of sequences and more than three different functions, still account for nearly 30% of non-homologous structures.
Implications for Structural Genomics
As the sequence databases grow rapidly, the need to interpret these sequences and assign functions to specific genes becomes increasingly important. Many techniques exist for matching protein sequences and thereby inheriting functional information. However, for very distant homologues there is often no detectable sequence similarity despite conservation of 3D structure and function. For these cases evolutionary relationships, and thereby functions, can only be assigned by comparing the structures. Therefore a number of structural genomics initiatives are being proposed (14) which aim to identify all the folds in nature, with the ultimate goal of being able to predict the function of a new protein from its known or probable structure. The important questions to ask are: how many more folds do we need to determine before we have the complete set, and how confident can we be in assigning function between proteins having similar structures?
In the current genomes, on average only 30–46% of sequences can be assigned to a structural family by recognising sequence similarity to a protein of known structure (15,16). With only ~600 unique structures currently in the PDB compared with ~20 000 sequence families, it is clear that we still need to determine many more structures if we are to understand biology at the molecular level. However, analysis of recently deposited structural data is very revealing. Figure 4a illustrates the distribution of 2159 new structural domains classified in the 10 months from June 1997 to March 1998. A large proportion of these (79%) were clearly homologous (≥30% identity) to proteins of known structure.
Of the remaining 443 structures (Fig. 4b), corresponding to 'new' sequences, we found only 8% were novel folds, the remainder resembling a previously determined structure. Many of these, 199 (45%), could be identified as clear homologues by having significant structure and sequence similarity (SSAP ≥80 and ≥20% sequence identity). A further 169 (38%) were probable homologues as, although the sequence identity was below 20%, they had functional similarity and/or gave significant scores using sequence search methods designed to detect very distant homologues (PSI-BLAST) (12). There remained a further 40 (9%) proteins which were analogous, i.e. they had the same fold as a previous entry but neither the sequence nor the function gave definite evidence of a common ancestor.
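The percentages quoted for the 443 structures can be checked with a few lines of arithmetic. The counts for clear homologues, probable homologues and analogues come from the text; the count of novel folds is not stated there and is derived here as the remainder, which is an assumption of this sketch.

```python
total = 443

# Counts given in the text.
counts = {
    "clear homologues": 199,
    "probable homologues": 169,
    "analogues": 40,
}
# Not stated in the text: derived as whatever is left over.
counts["novel folds"] = total - sum(counts.values())

# Percentages rounded to whole numbers, as quoted in the text.
percentages = {name: round(100 * n / total) for name, n in counts.items()}

# Fraction assignable as homologues once structure is used (clear + probable).
homologue_percentage = percentages["clear homologues"] + percentages["probable homologues"]
```

The remainder of 35 structures rounds to the 8% of novel folds quoted above, and the two homologue classes together give the 83% figure used later when the benefits of structural data are summarised.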
At the homologous superfamily level in CATH, a more detailed analysis of enzyme functions showed that the majority of homologous enzyme families in CATH (>90%) contained proteins for which the first three EC identifiers were the same. Considering those families where homologues have significant sequence identity (≥20%) after structural alignment, 95% were found to have a single EC identifier, whilst for families where proteins have more than 30% sequence similarity we observed that 98% had a single EC code.
Although assigning function on the basis of homology is common practice, it is clear that some caution should be exercised, particularly where there is little or no sequence similarity. There are also some clear examples where homologues with significant sequence similarity perform different functions. The role of 'gene recruitment' is especially clear in the eye lens proteins, which function as enzymes in other cellular environments but which are used as structural proteins in this context (17). The extent of such 'gene recruitment' and context-sensitive function is really not known at this time. For enzymes it is clear that catalytic function can change and evolve, usually to act on a different but related substrate. Similarly, within the lipocalin family (CATH id 2.40.130.10) several proteins are found with very similar structures which bind different fatty acids in the same region at the base of the β-barrel (e.g. retinol, bilin, biotin).
Nearly half of the homologous families where two or more different EC numbers were observed belong to the superfolds. This suggests that if a new protein is assigned to a superfold family, more caution should be used when inheriting functional information, as there appears to be greater tolerance to changes in sequence and ultimately function for these families. However, it is interesting to note that many of these were TIM barrel or Rossmann folds. These are superfolds in which the substrate or ligand commonly binds in the same place: in the base of the β-barrel for the TIMs and at the crossover of the polypeptide chain for the doubly wound Rossmann structures.
Assignment of Function Through Structure
One of the reasons for determining structures is to derive more information to facilitate the assignment of function. From our analysis of proteins in CATH we suggest that structural data can help to assign function in several ways:
i. The structural data allow recognition of more distant homologues compared with sequence data. In our analysis 83% of structures with novel sequences could be assigned as homologues in this way (note that such assignment of function is again subject to the caveats imposed by 'gene recruitment' discussed above).
ii. The structural data allow detailed inspection of the functional site, to suggest if and how the function may have evolved. For example, if an enzyme has evolved to act on a different substrate, the binding site may reveal or at least suggest possible changes in the substrate.
iii. For the superfolds, similarity of structure does not necessarily mean similarity of function. However, the active site/binding sites are often conserved, e.g. in the TIM barrel or Rossmann fold structures the ligand always binds at the same end of the barrel or sheet.
iv. Some methods have already been developed, and will increasingly be the focus of attention over the next few years, which aim to predict function ab initio from structure. For example, enzymes can often be identified by the presence of a major cleft which also locates the active site (18). Similarly, critical surface patches which are used for molecular recognition in binding other proteins or ligands may be identified using knowledge-based approaches (19,20).
In summary, extrapolating the data from Figure 4 to a new genome, we can expect that of the 54–70% of sequences which currently have no obvious sequence matches in the PDB, we will find nearly 80–90% to be homologous to a known family using the structural data alone. For the singlet folds this will almost certainly reveal some clues to the function. For the superfolds, some folds will reveal information on the functional class (e.g. enzyme for TIM barrels) or the location of the active site, if not the specific function. Only 10–20% will be expected to be 'novel' folds. For these the ab initio methods referred to above may provide some clues to guide experiments. Therefore it is clear that determining structures, as part of a 'structural genomics' initiative for example, will make a major contribution to interpreting genome data.
Jail
Just another interface library
Interfaces of macromolecules are a valuable basis to analyse the process of molecular recognition. JAIL classifies not only the interfaces between domain architectures but also those between protein chains and those between proteins and nucleic acids.
Gt-alpha/Gi-alpha chimera (PDB-ID 1GOT); interfaces of 1GOT
Interacting proteins are difficult to crystallize and rarely present within the Protein Data Bank. Nevertheless it is essential to analyse the interacting parts of the proteins to understand the process of protein-protein docking.
To overcome this problem we have built up the JAIL database. Since interacting domains exhibit structural features similar to those of interacting proteins, all known interfaces between interacting domains of the SCOP database were extracted and classified in JAIL.
Only a part of all protein structures is included in SCOP. In particular, new PDB entries are not yet annotated. To overcome this problem, all interfaces between protein chains were additionally calculated and included in the database. This type of interface also comprises the interacting parts of the assumed biological units. The last important type of interface provided here is composed of the interacting parts between proteins and nucleic acids. Overall the data set consists of about 180 000 interfaces.
JAIL is a convenient tool to browse through the interface library and to analyze single interfaces. However, more general questions require large-scale analysis. For this purpose a detailed form enables the compiling of comprehensive non-redundant data sets for download.
How is an interface defined?
A complete residue is part of an interface if at least one atom of the amino acid is located within a range of 4.5 Ångström of any atom of the interacting domain or chain. One part of an interface must consist of at least 5 C-alpha atoms in the case of protein chains. In the case of nucleic acids the relevant atoms are calculated on the basis of the phosphorus atoms of the RNA/DNA backbone.
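The distance criterion above can be sketched directly in Python. This is an illustrative brute-force version, not JAIL's implementation: the `Atom` record and both function names are hypothetical, the 4.5 Ångström cutoff is the value as read from the text, and the "at least 5 C-alpha atoms" rule is interpreted here as at least five interface residues on that side, since each amino acid has one C-alpha atom.

```python
import math
from collections import namedtuple

# Minimal atom record; real coordinates would come from a PDB file parser.
Atom = namedtuple("Atom", "chain residue name x y z")


def interface_residues(atoms_a, atoms_b, cutoff=4.5):
    """Residues of side A with at least one atom within cutoff of any atom of B."""
    found = set()
    for a in atoms_a:
        key = (a.chain, a.residue)
        if key in found:
            continue  # the whole residue already counts as interface
        for b in atoms_b:
            if math.dist((a.x, a.y, a.z), (b.x, b.y, b.z)) <= cutoff:
                found.add(key)
                break
    return found


def is_valid_protein_part(residues, min_residues=5):
    """Check the minimum-size rule for one protein side of an interface."""
    return len(residues) >= min_residues


chain_a = [Atom("A", 1, "CA", 0.0, 0.0, 0.0), Atom("A", 2, "CA", 10.0, 0.0, 0.0)]
chain_b = [Atom("B", 1, "CA", 3.0, 0.0, 0.0)]
contacts = interface_residues(chain_a, chain_b)
```

The all-against-all loop is quadratic in the number of atoms; for whole-database calculations a spatial index such as a grid or k-d tree would normally replace it, but the selection rule stays the same.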
What are biounits?
The primary coordinate file deposited in the PDB generally contains one asymmetric unit. The asymmetric unit is the smallest portion of a crystal structure to which crystallographic symmetry can be applied to generate the unit cell. The biological molecule (biounit) is believed to be the functional unit of the protein. Frequently those units can be assumed or calculated when additional information is available. The biological units of many proteins are deposited in a separate section of the PDB database and can be used for interface calculations.
In which way are redundant interfaces excluded in the download section?
The redundancy is excluded in two different ways: by structure and by sequence. The sequential clustering is based on the Cd-hit program. The structural clustering is defined by the protein families and superfamilies of the SCOP classification. The database classifies proteins by domain architecture.
Which settings in the download section are best for my own research?
The selection of the datasets depends on the type of interactions (protein-protein or protein-nucleic acid) and the level of diversity that is desired. A maximal sequence identity of 50% results in a higher diversity than the setting of 95%. The default settings include interfaces of domain-domain interactions as well as interfaces between interacting chains. All interfaces of chains that were already treated by the SCOP domain interfaces are excluded by default. This procedure results in a high number of interfaces that are still diverse enough for statistical analysis.
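The effect of the sequence identity setting can be illustrated with a toy greedy filter. This is not the Cd-hit program: the identity measure below simply counts matching positions without any alignment, and both function names are hypothetical. It only shows why a 50% cutoff leaves a smaller, more diverse set than a 95% cutoff.

```python
def toy_identity(a: str, b: str) -> float:
    """Percent of matching positions over the shorter sequence (no alignment)."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return 100.0 * matches / n


def select_representatives(sequences, max_identity):
    """Greedy filter: keep a sequence only if it is at most max_identity
    percent identical to every representative kept so far."""
    representatives = []
    # Longest first, so longer sequences become the representatives.
    for seq in sorted(sequences, key=len, reverse=True):
        if all(toy_identity(seq, rep) <= max_identity for rep in representatives):
            representatives.append(seq)
    return representatives


sequences = ["AAAAAAAAAA", "AAAAAAAAAC", "AAAAACCCCC", "CCCCCCCCCC"]
diverse = select_representatives(sequences, max_identity=50)
redundant = select_representatives(sequences, max_identity=95)
```

With the 95% setting the two near-identical sequences are both retained, whereas the 50% setting keeps only one of them, so the resulting set is smaller but its members differ more from one another.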
What is meant by 'show conservation' in Jmol?
The conservation of protein sequences is defined by the mutation rates at each amino acid position. For JAIL this information was retrieved from ConSurf. ConSurf is a derived database merging structural and sequence information.
Database scheme
Search for:
PDB-Id, e.g. 1aay
SCOP-Id, e.g. d1az0a_
EC-Number, e.g. 3.1.1
Accession number, e.g. P03697
Protein name, e.g. capsid protein
Search in the following interface types: Domain/Domain (SCOP), Chain/Chain, Protein/Nucleic, Biounit/Biounit, None
Full-text search: keyword
SCOP text search: keyword
Consider only interfaces having the following location: intra, inter, don't care
MMDB
Experimentally resolved structures of proteins, RNA and DNA, derived from the Protein Data Bank (PDB), with value-added features such as explicit chemical graphs, computationally identified 3D domains (compact substructures) that are used to identify similar 3D structures, as well as links to literature, similar sequences, information about chemicals bound to the structures, and more. These connections make it possible, for example, to find 3D structures for homologs of a protein sequence of interest, then interactively view the sequence-structure relationships, active sites, bound chemicals, journal articles, and more.
Three-dimensional structures are now known within many protein families, and it is
quite likely, in searching a sequence database, that one will encounter a homolog
with known structure. The goal of Entrez's 3D-structure database is to make this
information, and the functional annotation it can provide, easily accessible to
molecular biologists. To this end Entrez's search engine provides three powerful
features: (i) Sequence and structure neighbors: one may select all sequences similar
to one of interest, for example, and link to any known 3D structures. (ii) Links
between databases: one may search by term matching in MEDLINE, for example, and
link to 3D structures reported in these articles. (iii) Sequence and structure
visualization: identifying a homolog with known structure, one may view molecular-
graphic and alignment displays to infer approximate 3D structure. In this article we
focus on two features of Entrez's Molecular Modeling Database (MMDB) not
described previously: links from individual biopolymer chains within 3D structures
to a systematic taxonomy of organisms represented in molecular databases, and
links from individual chains (and compact 3D domains within them) to structure
neighbors, other chains (and 3D domains) with similar 3D structure. MMDB may be
accessed at http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Structure
SUPERFAMILY is a database of structural and functional annotation for all
proteins and genomes.[1][2][3][4][5]
The SUPERFAMILY annotation is based on a collection of hidden Markov models, which
represent structural protein domains at the SCOP superfamily level.[6] A superfamily groups
together domains which have an evolutionary relationship. The annotation is produced by
scanning protein sequences from completely sequenced genomes against the hidden Markov
models.
For each protein you can:
Submit sequences for SCOP classification
View domain organisation, sequence alignments and protein sequence details
For each genome you can:
Examine superfamily assignments, phylogenetic trees, domain organisation lists and
networks
Check for over- and under-represented superfamilies within a genome
For each superfamily you can:
Inspect SCOP classification, functional annotation, Gene Ontology annotation, InterPro
abstract and genome assignments
Explore taxonomic distribution of a superfamily across the tree of life
All annotation, models and the database dump are freely available for download to everyone.
Purpose
SUPERFAMILY classifies amino acid sequences into known structural domains, especially
into SCOP superfamilies. The superfamilies are groups of proteins which have structural
evidence to support a common evolutionary ancestor but may not have detectable
sequence homology.
Major Features
Sequence search: Submit your protein or DNA sequence for SCOP superfamily and family level classification.
Keyword search: Search for superfamily, family or species names, plus sequence, SCOP, PDB or hidden Markov model IDs.
Domain assignments: Domain assignments, alignments and architectures for completely sequenced eukaryotic and prokaryotic organisms, plus sequence collections.
Comparative genomics tools: Browse unusual (over- and under-represented) superfamilies and families, adjacent domain pair lists and graphs, unique domain pairs, domain combinations, domain architecture co-occurrence networks and domain distribution across taxonomic kingdoms for each organism.
Genome statistics: For each genome: number of sequences, number of sequences with assignment, percentage of sequences with assignment, percentage total sequence coverage, number of domains assigned, number of superfamilies assigned, number of families assigned, average superfamily size, percentage produced by duplication, average sequence length, average length matched, number of domain pairs and number of unique domain architectures.
Gene Ontology: Domain-centric Gene Ontology (GO), automatically annotated, by Hai Fang.
Phenotype Ontology: Domain-centric phenotype/anatomy ontology including Disease Ontology, Human Phenotype, Mouse Phenotype, Worm Phenotype, Yeast Phenotype, Fly Phenotype, Fly Anatomy, Zebrafish Anatomy, Xenopus Anatomy, Arabidopsis Plant.
Superfamily annotation: InterPro abstracts for 1052 superfamilies and Gene Ontology (GO) annotation for 763 superfamilies.
Functional annotation: Functional annotation of SCOP 1.73 superfamilies by Christine Vogel.
Phylogenetic trees: Trees are generated using heuristic parsimony methods and are based on protein domain architecture data for all genomes in SUPERFAMILY. Genome combinations or specific clades can be displayed as individual trees.
Similar domain architectures: Find the 10 domain architectures which are most similar to a domain architecture of interest.
Hidden Markov models: Produce SCOP domain assignments for your sequences using the SUPERFAMILY models. HMM visualisation by Martin Madera, e.g. model 0045110.
Profile comparison: Find remote domain matches when the HMM search fails to find a significant match. Profile comparison (PRC) for aligning and scoring two profile hidden Markov models, by Martin Madera.
Web services: Distributed Annotation Server and linking to SUPERFAMILY.
Downloads: Sequences, assignments, models, MySQL database and scripts, updated weekly.
The SUPERFAMILY database provides protein domain assignments, at the SCOP superfamily
level, for the predicted protein sequences in over 400 completed genomes. A superfamily groups
together domains of different families which have a common evolutionary ancestor based on
structural, functional and sequence data. SUPERFAMILY domain assignments are generated
using an expert curated set of profile hidden Markov models. All models and structural
assignments are available for browsing and download from http://supfam.org. The web interface
includes services such as domain architectures and alignment details for all protein assignments,
searchable domain combinations, domain occurrence network visualization, detection of over- or
under-represented superfamilies for a given genome by comparison with other genomes,
assignment of manually submitted sequences and keyword searches. In this update we describe
the SUPERFAMILY database and outline two major developments: (i) incorporation of family
level assignments and (ii) a superfamily-level functional annotation. The SUPERFAMILY
database can be used for general protein evolution and superfamily-specific studies, genomic
annotation, and structural genomics target suggestion and assessment.
OVERVIEW OF THE CURRENT RELEASE (VERSION 3.0)
Assigning domain boundaries and relationships between protein structures is
computationally challenging. Since the last CATH release (version 2.6), the number of domains in the CATH database has increased by 20% in version 3.0 and now totals 86 151. This is a more than 10-fold increase in the number of domains classified in CATH since its creation. Improvements in automation, and also in the web-based resources used to aid manual validation, have allowed us to increase the proportion of hard-to-classify structures processed in CATH, and this is reflected in a significant increase in the proportion of new folds in the database (now more than 1000). The detailed breakdown of numbers of domains in the nine CATH levels is given in Table 1. We conducted an analysis of the domains in version 3.0 and have derived statistics for several fundamental features.
Percentage of new topologies
An analysis of the percentage of new folds arising from the early 1970s to the present is shown in Figure 1. The number of new folds has been decreasing over time with respect to the number of new structures being deposited, and it can be seen that currently approximately 2% of new structures classified in CATH are observed to be novel folds. For comparison, the number of domain structures solved over time is also graphically represented in Figure 1.
Number of domains within a protein chain
Integral to the construction of the CATH database is designating domain boundaries. We conducted an analysis of the number of chains versus the number of domains in a chain. It is interesting to note that 64% of all protein structures currently solved and classified in CATH are single-domain chains (data not shown). The next most prevalent are two-domain chains (27%), and following this we find that the number of chains containing three or more domains rapidly decreases. The average size of the single-domain chains is 159 residues.
The CATH Dictionary of Homologous Superfamilies (CATH-DHS)
The CATH-DHS has also been recently updated. Data on structural similarity and superfamily variability is presented as a significant update to the Dictionary of Homologous Superfamilies (DHS) web resource (14). The DHS also provides functional annotations of domains within each H-level (superfamily) in CATH v2.5.1.
For each superfamily, pair-wise structural similarity scores between relatives, measured by SSAP, are presented. The DHS now contains 3307 multiple structural alignments for 1459 superfamilies. For each superfamily, multiple alignments are generated for all the relatives and also for subgroups of structurally similar relatives and sequence-similar relatives. Alignments are performed using the residue-based CORA algorithm (15) and presented both as CORAPLOTS (14) and in the form of a 2DSEC diagram (16), alongside co-ordinate data of the superposed structures in PDB format. Sequence representations of the alignments are available to download in FASTA format. In the CORAPLOT images of the multiple alignment, residues in each domain are coloured according to ligand binding and residue type. EquivSEC plots are also shown that describe the variability in orientation and packing between equivalent secondary structures (16).
To identify sequence relatives for CATH superfamilies, sequences from UniProt (17) were scanned against HMMs of all CATH domains (12). Homologous sequences were identified as those hits with an E-value <0.01 and a 60% residue overlap with the CATH domain. This protocol recognised over one million domain sequences in UniProt which could be integrated in the CATH-DHS. The harvested sequences in each superfamily were compared against other relatives by BLAST (18) to determine the pair-wise sequence identity and then clustered at appropriate levels of sequence identity (35% and 95%) using multi-linkage clustering. Information and links to other functional databases, ENZYME (19), GO (Gene Ontology Consortium, 2000), KEGG (20), COG (21) and SWISSPROT (22), are also included by BLASTing the sequences from each superfamily against sequences provided by these resources. Only hits with 95% sequence identity and an 80% residue overlap were used to annotate sequences.
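The hit-selection rule described above can be sketched in a few lines of Python. This is only an illustration, not the CATH pipeline: the `HmmHit` record, its field names and the reading of "residue overlap" as the fraction of the CATH domain covered by the hit are assumptions made for the example, and the thresholds (E-value below 0.01, overlap of 60%) are taken from the paragraph as best they can be read.

```python
from dataclasses import dataclass

E_VALUE_CUTOFF = 0.01   # E-value threshold as read from the text
MIN_OVERLAP = 0.60      # assumed: fraction of the CATH domain covered by the hit


@dataclass
class HmmHit:
    """Hypothetical record for one sequence-versus-HMM hit."""
    sequence_id: str
    e_value: float
    aligned_residues: int   # residues of the CATH domain covered by the hit
    domain_length: int      # length of the CATH domain


def is_homologue(hit: HmmHit) -> bool:
    """Apply both criteria: significant E-value and sufficient overlap."""
    overlap = hit.aligned_residues / hit.domain_length
    return hit.e_value < E_VALUE_CUTOFF and overlap >= MIN_OVERLAP


hits = [
    HmmHit("seqA", 1e-5, 90, 120),   # significant, 75% overlap
    HmmHit("seqB", 0.05, 110, 120),  # E-value too high
    HmmHit("seqC", 1e-8, 50, 120),   # overlap too short
]
accepted = [h.sequence_id for h in hits if is_homologue(h)]
print(accepted)
```

Requiring both conditions together guards against short, locally strong matches that cover only a fragment of the domain.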
Recent analysis of structural and functional divergence in highly populated CATH superfamilies (>5 structural relatives with <35% sequence identity) has been undertaken using data from the DHS. The 2DSEC algorithm was used to analyse multiple structural alignments of families and identify highly conserved structural cores and secondary structure embellishments, or decorations, to the common core. In some large superfamilies extensive embellishments were observed outside the core, and although these secondary structure insertions were frequently discontinuous in the protein chain, they were often co-located in 3D space (16). In many cases manual inspection revealed that the embellishment had aggregated to form a larger structural feature that was modifying the active site of the domain or creating new surfaces for domain or protein interactions. Data collected in the DHS clearly shows a relationship between structural divergence within a superfamily, sequence divergence of this superfamily amongst predicted domains in the genomes, and the number of distinct functional groups that can be identified for the superfamily (see Figure 4).
Figure 4. Relationship between sequence variability, structural variability and functional diversity in CATH superfamilies. Structural variation in a CATH superfamily, as measured by the number of diverse structural subgroups (SSAP score <80 between groups)
LATERAL LINKS ACROSS THE CATH HIERARCHY TO CAPTURE EVOLUTIONARY DIVERGENCE AND EXPLORE THE STRUCTURAL CONTINUUM
Our analysis of structural divergence in CATH superfamilies (16) has revealed families where significant changes in the structures had occurred. In some cases 5-fold differences in the sizes of domains were identified, and sometimes it was apparent that the 'folds' of these very diverse relatives had effectively changed. Therefore in these superfamilies more than one fold group can be identified, effectively breaking the hierarchical nature of the CATH classification, which implies that each relative within a CATH homologous superfamily should belong to the same CAT fold group.
In addition, an 'all versus all' HMM-HMM scan between all superfamily representatives revealed several cases of extremely remote homologues which had been classified into separate superfamilies and yet match with significant E-values. Structure comparison had failed to detect the relationship between these superfamilies because the structural divergence of the relatives was so extreme, sometimes constituting a change in architecture as well as fold group. In these cases homology was only suggested by the HMM-based scans and then manually validated by considering functional information and detailed evidence from literature. In order to capture information on these distant homologies, links have been created between the superfamilies both on our web pages and in the CATH database. The data can now be found as a link from the CATH homepage (http://www.cathdb.info).
In the near future we also plan to provide web pages presenting cases of significant structural overlaps between superfamilies or fold groups. For these cases we are not currently able to find any additional evidence to support a distant evolutionary relationship, and these examples highlight the recurrence of large structural motifs between some folds and the existence of a structural continuum in some regions of fold space.
FEATURES
The CATH database can be accessed at http://www.cathdb.info. The web interface may be browsed or alternatively searched with PDB codes or CATH domain identifiers. There is also a facility for keyword searches. With the version 3.0 release we now make the raw and processed data files available, which include, for example, CATH domain PDB files, sequences and dssp files, and they can be accessed through the CATH database main page. The Gene3D resource can be accessed through the CATH database or directly at http://www.cathdb.info/Gene3D. The DHS can be accessed through the CATH database or directly at http://www.cathdb.info/bsm/dhs.
Structural similarity is assessed using an automatic method (SSAP) (3,4) which scores 100 for
identical proteins and generally returns scores above 80 for homologous proteins. More distantly
related folds generally give scores above 70 (Topology or fold level), though in the absence of any
sequence or functional similarity this may simply represent examples of convergent evolution,
reinforcing the hypothesis that there exists a limited number of folds in nature (5,6).
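The score ranges quoted above can be turned into a small lookup. The sketch below is only a reading aid: the function name and the category labels are invented for the example, and the text itself says the thresholds hold "generally", so the output is a rough guide and not a classification rule used by CATH.

```python
def interpret_ssap(score: float) -> str:
    """Map an SSAP score (0-100) to the rough categories quoted in the text."""
    if not 0 <= score <= 100:
        raise ValueError("SSAP scores lie between 0 and 100")
    if score == 100:
        return "identical"
    if score > 80:
        return "likely homologous"
    if score > 70:
        return "similar fold (topology level)"
    return "no significant structural similarity"


for s in (100, 86.5, 74, 52):
    print(s, "->", interpret_ssap(s))
```

A score in the 70 to 80 band says only that the folds resemble each other; as the paragraph notes, without sequence or functional similarity this may reflect convergence and not common ancestry.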
Abstract
We report the latest release (version 1.4) of the CATH protein domains database
(http://www.biochem.ucl.ac.uk/bsm/cath). This is a hierarchical classification of 13 359 protein
domain structures into evolutionary families and structural groupings. We currently identify 827
homologous families in which the proteins have both structural similarity and sequence and/or
functional similarity. These can be further clustered into 593 fold groups and 32 distinct architectures.
Using our structural classification and associated data on protein functions stored in the database (EC
identifiers, SWISS-PROT keywords and information from the Enzyme database and literature) we
have been able to analyse the correlation between the 3D structure and function. More than 96% of
folds in the PDB are associated with a single homologous family. However, within the superfolds
three or more different functions are observed. Considering enzyme functions, more than 95% of
clearly homologous families exhibit either single or closely related functions, as demonstrated by the
EC identifiers of their relatives. Our analysis supports the view that determining structures, for
example as part of a 'structural genomics' initiative, will make a major contribution to interpreting
genome data.
Introduction FOR CATH
The CATH classification of protein domain structures was established in 1993 (1) as a
hierarchical clustering of protein domain structures into evolutionary families and structural
groupings, depending on sequence and structure similarity. There are four major levels,
corresponding to protein class, architecture, topology or fold, and homologous family (Fig. 1).
Since 1995 information about these structural groups and protein families has been accessible
over the Web (http://www.biochem.ucl.ac.uk/bsm/cath) together with summary information
about each individual protein structure (PDBsum) (2).
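Because the four levels are nested, a CATH identifier can be read as four numbers, one per level. The sketch below assumes the usual dotted notation (for example 3.40.50.200 for the subtilisin family mentioned later in these notes, where the dots appear to have been lost in extraction); the class and function names are hypothetical and chosen only for this illustration.

```python
from typing import NamedTuple, Optional


class CathId(NamedTuple):
    """The four major CATH levels, from most general to most specific."""
    cath_class: int
    architecture: int
    topology: int
    homologous_family: int


def parse_cath_id(text: str) -> CathId:
    """Parse a dotted identifier such as '3.40.50.200' (assumed notation)."""
    parts = text.split(".")
    if len(parts) != 4:
        raise ValueError(f"expected four dot-separated numbers, got {text!r}")
    return CathId(*(int(p) for p in parts))


def deepest_shared_level(a: CathId, b: CathId) -> Optional[str]:
    """Name the most specific level at which two domains are grouped together."""
    names = ("class", "architecture", "topology", "homologous family")
    shared = None
    for name, x, y in zip(names, a, b):
        if x != y:
            break
        shared = name
    return shared


subtilisin = parse_cath_id("3.40.50.200")
other = parse_cath_id("3.40.50.300")  # illustrative second identifier
print(deepest_shared_level(subtilisin, other))
```

Comparing level by level mirrors the hierarchy: two domains that differ at the class level cannot share an architecture, fold or family.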
CATH consists of both phylogenetic and phenetic descriptors for protein domain relationships.
At the lowest levels in the hierarchy, proteins are grouped into evolutionary families
(Homologous families) for having either significant sequence similarity (≥35% identity) or high
structural similarity and some sequence similarity (≥20% identity).
The CATH database of protein structures contains approximately 18000 domains
organized according to their (C)lass, (A)rchitecture, (T)opology and (H)omologous
superfamily. Relationships between evolutionary related structures (homologues) within the
database have been used to test the sensitivity of various sequence search methods in order to
identify relatives in Genbank and other sequence databases. Subsequent application of the most
sensitive and efficient algorithms, gapped BLAST and the profile based method Position Specific
Iterated Basic Local Alignment Tool (PSI-BLAST), could be used to assign structural data to
between 22% and 36% of microbial genomes in order to improve functional annotation and
enhance understanding of biological mechanism. However, on a cautionary note, an analysis of
functional conservation within fold groups and homologous superfamilies in the CATH database
revealed that whilst function was conserved in nearly 55% of enzyme families, function had
diverged considerably in some highly populated families. In these families functional properties
should be inherited far more cautiously, and the probable effects of substitutions in key functional
residues must be carefully assessed.
Figure 1
Schematic representation of the (C)lass, (A)rchitecture and (T)opology/fold levels in the CATH
database
Figure 2
Snapshot of a web page showing data available in the CATH dictionary of homologous superfamilies
for the subtilisin family (CATH id 3.40.50.200). Tables display the PDB codes for non-identical
relatives in the family, together with EC identifier codes and information about the enzyme reactions.
The multiple structural alignment shown has been coloured according to secondary structure
assignments (red for helix, blue for strands).
The Architecture level in CATH groups proteins whose folds have similar 3D arrangements of
secondary structures (e.g. barrel, sandwich or propeller), regardless of their connectivity, whilst the
top level, Class, simply reflects the proportion of α-helix or β-strand secondary structures. Three
major classes are recognised, mainly-α, mainly-β and α-β, since analysis revealed considerable
overlap between the α+β and alternating α/β classes originally described by Levitt and Chothia (7).
Before classification, multidomain proteins are first separated into their constituent folds using a
consensus method which seeks agreement between three independent algorithms (8). Whilst the
protocol for updating CATH is largely automatic (9), several stages require manual validation, in
particular establishing domain boundaries in proteins for which no consensus could be reached and
checking the relationships of very distant homologues and proteins having borderline fold similarity.
Although there are plans to assign the more regular architectures automatically, all architecture
groupings are currently assigned manually.
A homologous family Dictionary is now available within CATH which contains functional data,
where available, for each protein within a homologous family. This includes EC identifiers, SWISS-PROT
keywords and information from the Enzyme database or the literature (Fig. 2). Multiple
structure based alignments are also available, coloured according to secondary structure assignments
or residue properties, and there are schematic plots showing domain representations annotated by
protein-ligand interactions (DOMPLOTS) (A.E. Todd, C.A. Orengo and J.M. Thornton, submitted
to Protein Engng). The topology of each domain is illustrated by schematic TOPS diagrams
(http://www3.ebi.ac.uk/tops; 10).
Figure 3
CATH wheel plot showing the population of homologous families in different fold groups,
architectures and classes. The wheel is coloured according to protein class (red, mainly-α; green,
mainly-β; yellow, αβ; blue, few secondary structures). The size of the outer wheel represents the
number of homologous families in CATH, whilst each band in the outer wheel corresponds to a single
fold family. The size of each fold 'band' therefore reflects the number of homologous families having
that fold. It can be seen that most fold families contain a single homologous family. The superfold
families are shown as paler bands containing many homologous families. The inner wheel shows the
population of homologous families in the different architectures.
We have also recently set up a Web Server (11) which enables the user to scan the CATH database
with a newly determined protein structure and identify possible fold similarities or evolutionary
relationships. There are also plans to incorporate sequence searches (using BLAST or PSI-BLAST)
(12) to identify a probable fold for a new sequence.
The latest release of CATH (version 1.4, April 1998) contains 9342 protein chains from the PDB (13)
which divide into 13 359 domain folds. Currently 32 different architectures are recognised. Since the
last release three new architectures have been described, including the five-bladed α-β propeller.
Grouping proteins on the basis of sequence, structure and functional similarity gives 827 evolutionary
homologous families (H-level), whilst recognising more distant structural similarity with no
accompanying sequence or function similarity gives rise to 593 different fold groups (T-level).
The population of the different levels in the CATH hierarchy is illustrated by the CATH wheel shown
in Figure 3. It can be seen that several highly populated fold families, which we describe as superfolds
(6) as they support a diverse range of sequences and more than three different functions, still account
for nearly 30% of non-homologous structures.
Implications for Structural Genomics
As the sequence databases grow rapidly, the need to interpret these sequences and assign functions to
specific genes becomes increasingly important. Many techniques exist for matching protein sequences
and thereby inheriting functional information. However, for very distant homologues there is often no
detectable sequence similarity despite conservation of 3D structure and function. For these cases
evolutionary relationships, and thereby functions, can only be assigned by comparing the structures.
Therefore a number of structural genomics initiatives are being proposed (14) which aim to identify
all the folds in nature, with the ultimate goal of being able to predict the function of a new protein from
its known or probable structure. The important questions to ask are: how many more folds do we need
to determine before we have the complete set, and how confident can we be in assigning function
between proteins having similar structures?
In the current genomes, on average only 30-46% of sequences can be assigned to a structural family
by recognising sequence similarity to a protein of known structure (15,16). With only ~600 unique
structures currently in the PDB, compared with ~20 000 sequence families, it is clear that we still
need to determine many more structures if we are to understand biology at the molecular level.
However, analysis of recently deposited structural data is very revealing. Figure 4a illustrates the
distribution of 2159 new structural domains classified in the 10 months from June 1997 to March
1998. A large proportion of these (79%) were clearly homologous (≥30% identity) to proteins of
known structure.
Of the remaining 443 structures (Fig. 4b), corresponding to 'new' sequences, we found only 8% were
novel folds, the remainder resembling a previously determined structure. Many of these, 199 (45%),
could be identified as clear homologues by having significant structure and sequence similarity (SSAP
≥80 and ≥20% sequence identity). A further 169 (38%) were probable homologues as, although the
sequence identity was below 20%, they had functional similarity and/or gave significant scores using
sequence search methods designed to detect very distant homologues (PSI-BLAST) (12). There
remained a further 40 (9%) proteins which were analogous, i.e. they had the same fold as a previous
entry but neither the sequence nor the function gave definite evidence of a common ancestor.
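The four outcomes described in this paragraph can be written as a short decision procedure, and the quoted percentages can be checked against the counts. This is a sketch under stated assumptions: the function and argument names are invented for the example, the order of the tests is one plausible reading of the paragraph, and the count of novel folds is derived here by subtraction and is not stated in the text.

```python
def classify_new_structure(same_fold_as_known: bool,
                           ssap: float,
                           seq_identity: float,
                           distant_evidence: bool) -> str:
    """Sort a structure with a 'new' sequence into the categories in the text.

    distant_evidence stands for functional similarity and/or a significant
    score from a distant-homologue search such as PSI-BLAST.
    """
    if not same_fold_as_known:
        return "novel fold"
    if ssap >= 80 and seq_identity >= 20:
        return "clear homologue"
    if distant_evidence:
        return "probable homologue"
    return "analogue"


# Counts quoted in the text for the 443 structures with new sequences.
total = 443
counts = {"clear homologue": 199, "probable homologue": 169, "analogue": 40}
counts["novel fold"] = total - sum(counts.values())  # derived, not quoted

percentages = {k: round(100 * v / total) for k, v in counts.items()}
print(counts["novel fold"], percentages)
```

The derived figure of 35 novel folds rounds to the 8% quoted, and the clear plus probable homologues together give the 83% cited later in these notes.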
At the homologous superfamily level in CATH, a more detailed analysis of enzyme functions showed
that the majority of homologous enzyme families in CATH (>90%) contained proteins for which the
first three EC identifiers were the same. Considering those families where homologues have
significant sequence identity (≥20%) after structural alignment, 95% were found to have a single EC
identifier, whilst for families where proteins have more than 30% sequence similarity we observed
that 98% had a single EC code.
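The comparison of EC identifiers within a family can be illustrated by counting how many leading EC fields all members share. The function name and the example EC numbers below are chosen only for the illustration; the source does not describe how the comparison was implemented.

```python
def shared_ec_depth(ec_numbers):
    """Count the leading EC fields (0 to 4) common to every EC number given."""
    split = [ec.split(".") for ec in ec_numbers]
    depth = 0
    for fields in zip(*split):
        if len(set(fields)) != 1:
            break
        depth += 1
    return depth


family = ["3.4.21.62", "3.4.21.64"]   # illustrative EC numbers
depth = shared_ec_depth(family)
print(depth, "shared EC fields;",
      "first three conserved" if depth >= 3 else "function has diverged")
```

A depth of 4 corresponds to a family with a single EC code, while a depth of 3 corresponds to the looser criterion of sharing the first three identifiers.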
Although assigning function on the basis of homology is common practice, it is clear that some
caution should be exercised, particularly where there is little or no sequence similarity. There are also
some clear examples where homologues with significant sequence similarity perform different
functions. The role of 'gene recruitment' is especially clear in the eye lens proteins, which function as
enzymes in other cellular environments but which are used as structural proteins in this context (17).
The extent of such 'gene recruitment' and context-sensitive function is really not known at this time.
For enzymes it is clear that catalytic function can change and evolve, usually to act on a different but
related substrate. Similarly, within the lipocalin family (CATH id 2.40.130.10) several proteins are
found with very similar structures which bind different fatty acids in the same region at the base of
the β-barrel (e.g. retinol, bilin, biotin).
Nearly half of the homologous families where two or more different EC numbers were observed
belong to the superfolds. This suggests that if a new protein is assigned to a superfold family, more
caution should be used when inheriting functional information, as there appears to be greater tolerance
to changes in sequence, and ultimately function, for these families. However, it is interesting to note
that many of these were TIM barrel or Rossmann folds. These are superfolds in which the substrate or
ligand commonly binds in the same place. This is in the base of the β-barrel for the TIMs and at the
crossover of the polypeptide chain for the doubly wound Rossmann structures.
Assignment of Function Through Structure
One of the reasons for determining structures is to derive more information to facilitate the assignment
of function. From our analysis of proteins in CATH we suggest that structural data can help to assign
function in several ways:
i. The structural data allow recognition of more distant homologues compared with
sequence data: in our analysis 83% of structures with novel sequences could be assigned as
homologues in this way (note that such assignment of function is again subject to the caveats
imposed by 'gene recruitment' discussed above).
ii. The structural data allow detailed inspection of the functional site, to suggest if and
how the function may have evolved. For example, if an enzyme has evolved to act on a
different substrate, the binding site may reveal, or at least suggest, possible changes in the
substrate.
iii. For the superfolds, similarity of structure does not necessarily mean similarity of
function. However, the active site/binding sites are often conserved, e.g. in the TIM barrel or
Rossmann fold structures the ligand always binds at the same end of the barrel or sheet.
iv. Some methods have already been developed, and will increasingly be the focus of
attention over the next few years, which aim to predict function ab initio from structure. For
example, enzymes can often be identified by the presence of a major cleft which also locates
the active site (18). Similarly, critical surface patches which are used for molecular
recognition in binding other proteins or ligands may be identified using knowledge-based
approaches (19,20).
In summary, extrapolating the data from Figure 4 to a new genome, we can expect that of the 54-70%
of sequences which currently have no obvious sequence matches in the PDB, we will find nearly
80-90% to be homologous to a known family using the structural data alone. For the singlet folds this
will almost certainly reveal some clues to the function. For the superfolds, some folds will reveal
information on the functional class (e.g. enzyme for TIM barrels) or the location of the active site, if
not the specific function. Only 10-20% will be expected to be 'novel' folds. For these, the ab
initio methods referred to above may provide some clues to guide experiments. Therefore it is clear
that determining structures, as part of a 'structural genomics' initiative for example, will make a
major contribution to interpreting genome data.
JAIL
Just Another Interface Library
Interfaces of macromolecules are a valuable basis to analyse the process of molecular recognition. JAIL classifies not only the interfaces between domain architectures but also those between protein chains and those between proteins and nucleic acids.
Of course not
Gt-alpha/Gi-alpha chimera (PDB-ID 1GOT). Interfaces of 1GOT.
Interacting proteins are difficult to crystallize and rarely present within the Protein Data Bank.
Nevertheless, it is essential to analyse the interacting parts of the proteins to understand the
process of protein-protein docking.
To overcome this problem we have built up the JAIL database. Since interacting domains
exhibit structural features similar to those of proteins, all known interfaces between interacting
domains of the SCOP database were extracted and classified in JAIL.
Only a part of all protein structures are included in SCOP. New PDB entries in particular are
not yet annotated. To overcome this problem, all interfaces between protein
chains were additionally calculated and included in the database. This type of interface also comprises
the interacting parts of the assumed biological units. The last important type of interface
provided here is composed of the interacting parts between proteins and nucleic acids.
Overall the data set consists of about 180000 interfaces.
JAIL is a convenient tool to browse through the interface library and to analyze single
interfaces. However, more general questions require large-scale analysis. For this purpose
a detailed form enables the compiling of comprehensive non-redundant data sets for
download.
How is an interface defined?
A complete residue is part of an interface if at least one atom of the amino acid is located
within 4.5 Ångström of any atom of the interacting domain or chain. In the case of protein
chains, one part of an interface must consist of at least 5 C-alpha atoms. In the case of
nucleic acids, the relevant atoms are calculated on the basis of the phosphorus atoms of the
RNA/DNA backbone.
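The definition above can be sketched in a few lines of Python. This is only an illustration, not the JAIL implementation: the atom tuple layout, the function names and the brute-force distance search are all assumptions made for the example.

```python
from math import dist

# Assumed atom layout for this sketch: (residue_id, atom_name, (x, y, z)).
CUTOFF = 4.5  # Angstrom, distance cutoff taken from the definition above
MIN_CA = 5    # minimum number of C-alpha atoms per interface side (protein chains)

def interface_residues(side_a, side_b, cutoff=CUTOFF):
    """Return ids of residues in side_a with at least one atom within cutoff of side_b."""
    hits = set()
    for res_id, _name, xyz in side_a:
        if res_id in hits:
            continue  # the whole residue already counts as interface
        if any(dist(xyz, other_xyz) <= cutoff for _r, _n, other_xyz in side_b):
            hits.add(res_id)
    return hits

def is_valid_side(side, residues, min_ca=MIN_CA):
    """Check that the interface residues of one side carry at least min_ca C-alpha atoms."""
    n_ca = sum(1 for res_id, name, _xyz in side if name == "CA" and res_id in residues)
    return n_ca >= min_ca

# Tiny hypothetical example: residue 1 of chain A lies 4.0 Angstrom from chain B.
chain_a = [(1, "CA", (0.0, 0.0, 0.0)), (1, "CB", (1.0, 0.0, 0.0)), (2, "CA", (10.0, 0.0, 0.0))]
chain_b = [(7, "CA", (4.0, 0.0, 0.0))]
side_a = interface_residues(chain_a, chain_b)
```

A real calculation over whole PDB entries would use a spatial index instead of the all-against-all loop shown here.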
What are biounits?
The primary coordinate file deposited in the PDB generally contains one asymmetric unit.
The asymmetric unit is the smallest portion of a crystal structure to which crystallographic
symmetry can be applied to generate the unit cell. The biological molecule (biounit) is believed
to be the functional unit of the protein. Frequently, these units can be assumed or calculated
when additional information is available. The biological units of many proteins are deposited
in a separate section of the PDB database and can be used for interface calculations.
In which way are redundant interfaces excluded in the download section?
Redundancy is excluded in two different ways: by structure and by sequence. The sequence-based
clustering uses the CD-HIT program. The structural clustering is defined by the protein
families and superfamilies of the SCOP classification. This database classifies
proteins by domain architecture.
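The effect of a sequence identity threshold can be illustrated with a small greedy clustering sketch in the spirit of CD-HIT (longest sequence first, each sequence joins the first sufficiently similar representative). The identity measure below is a deliberately crude, ungapped stand-in, and all names are hypothetical; the real program uses proper alignment and fast word filters.

```python
def toy_identity(a, b):
    """Fraction of identical positions over the shorter sequence (ungapped toy measure)."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / n

def greedy_cluster(seqs, threshold):
    """Greedy incremental clustering: process longest first, join the first similar representative."""
    clusters = {}  # representative sequence -> list of member sequences
    for s in sorted(seqs, key=len, reverse=True):
        for rep in clusters:
            if toy_identity(s, rep) >= threshold:
                clusters[rep].append(s)
                break
        else:
            clusters[s] = [s]  # no representative was similar enough
    return clusters

# Hypothetical sequences: the second differs from the first at 1 of 10 positions,
# the third at 5 of 10 positions.
seqs = ["ACDEFGHIKL", "ACDEFGHIKV", "WWWWWGHIKL"]
strict = greedy_cluster(seqs, 0.95)  # keeps near-identical sequences apart
loose = greedy_cluster(seqs, 0.50)   # merges everything at least 50% identical
```

Keeping one representative per cluster at a low threshold yields fewer but more dissimilar entries, which is the trade-off behind the identity setting of the download form.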
Which settings in the download section are best for my own research?
The selection of the data sets depends on the type of interaction (protein-protein or
protein-nucleic acid) and on the desired level of diversity. A maximal sequence identity of
50% results in higher diversity than a setting of 95%. The default settings include interfaces
of domain-domain interactions as well as interfaces between interacting chains. All interfaces
of chains that are already covered by the SCOP domain interfaces are excluded by default.
This procedure results in a large number of interfaces that are still diverse enough for
statistical analysis.
What is meant by "show conservation" in Jmol?
The conservation of protein sequences is defined by the mutation rates at each amino acid
position. For JAIL, this information was retrieved from ConSurf, a derived database that
merges structural and sequence information.
Database scheme
Search for:
PDB ID, e.g. 1aay
SCOP ID, e.g. d1az0a_
EC number, e.g. 3.1.1
Accession number, e.g. P03697
Protein name, e.g. capsid protein
Search in the following interface types:
Domain/Domain (SCOP)
Chain/Chain
Protein/Nucleic
Biounit/Biounit
Full-text search: by keyword
SCOP text search: by keyword
Consider only interfaces having the following location: intra, inter, don't care
MMDB
MMDB contains experimentally resolved structures of proteins, RNA and DNA, derived from the
Protein Data Bank (PDB), with value-added features such as explicit chemical graphs and
computationally identified 3D domains (compact substructures) that are used to identify similar
3D structures, as well as links to literature, similar sequences, information about chemicals
bound to the structures, and more. These connections make it possible, for example, to find 3D
structures for homologs of a protein sequence of interest, then interactively view the
sequence-structure relationships, active sites, bound chemicals, journal articles and more.
Three-dimensional structures are now known within many protein families, and it is
quite likely, in searching a sequence database, that one will encounter a homolog
with known structure. The goal of Entrez's 3D-structure database is to make this
information, and the functional annotation it can provide, easily accessible to
molecular biologists. To this end, Entrez's search engine provides three powerful
features: (i) Sequence and structure neighbors: one may select all sequences similar
to one of interest, for example, and link to any known 3D structures. (ii) Links
between databases: one may search by term matching in MEDLINE, for example, and
link to 3D structures reported in these articles. (iii) Sequence and structure
visualization: identifying a homolog with known structure, one may view molecular-
graphic and alignment displays to infer approximate 3D structure. In this article we
focus on two features of Entrez's Molecular Modeling Database (MMDB) not
described previously: links from individual biopolymer chains within 3D structures
to a systematic taxonomy of organisms represented in molecular databases, and
links from individual chains (and compact 3D domains within them) to structure
neighbors, that is, other chains (and 3D domains) with similar 3D structure. MMDB may be
accessed at http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Structure
SUPERFAMILY is a database of structural and functional annotation for all
proteins and genomes.[1][2][3][4][5]
The SUPERFAMILY annotation is based on a collection of hidden Markov models which
represent structural protein domains at the SCOP superfamily level.[6] A superfamily groups
together domains which have an evolutionary relationship. The annotation is produced by
scanning protein sequences from completely sequenced genomes against the hidden Markov
models.
For each protein you can:
Submit sequences for SCOP classification
View domain organisation, sequence alignments and protein sequence details
For each genome you can:
Examine superfamily assignments, phylogenetic trees, domain organisation lists and
networks
Check for over- and under-represented superfamilies within a genome
For each superfamily you can:
Inspect SCOP classification, functional annotation, Gene Ontology annotation, InterPro
abstract and genome assignments
Explore taxonomic distribution of a superfamily across the tree of life
All annotation, models and the database dump are freely available for download to everyone.
Purpose
SUPERFAMILY classifies amino acid sequences into known structural domains, especially
into SCOP superfamilies. The superfamilies are groups of proteins which have structural
evidence to support a common evolutionary ancestor but may not have detectable
sequence homology.
Major Features
Sequence search: Submit your protein or DNA sequence for SCOP superfamily and family
level classification.
Keyword search: Search for superfamily, family or species names, plus sequence, SCOP, PDB
or hidden Markov model IDs.
Domain assignments: Domain assignments, alignments and architectures for completely
sequenced eukaryotic and prokaryotic organisms, plus sequence collections.
Comparative genomics tools: Browse unusual (over- and under-represented) superfamilies and
families, adjacent domain pair lists and graphs, unique domain pairs, domain combinations,
domain architecture co-occurrence networks and domain distribution across taxonomic kingdoms
for each organism.
Genome statistics: For each genome: number of sequences, number of sequences with
assignment, percentage of sequences with assignment, percentage total sequence coverage,
number of domains assigned, number of superfamilies assigned, number of families assigned,
average superfamily size, percentage produced by duplication, average sequence length,
average length matched, number of domain pairs and number of unique domain architectures.
Gene Ontology: Domain-centric Gene Ontology (GO), automatically annotated, by Hai Fang.
Phenotype Ontology: Domain-centric phenotype/anatomy ontology, including Disease Ontology,
Human Phenotype, Mouse Phenotype, Worm Phenotype, Yeast Phenotype, Fly Phenotype, Fly
Anatomy, Zebrafish Anatomy, Xenopus Anatomy and Arabidopsis Plant.
Superfamily annotation: InterPro abstracts for 1052 superfamilies and Gene Ontology (GO)
annotation for 763 superfamilies.
Functional annotation: Functional annotation of SCOP 1.73 superfamilies by Christine Vogel.
Phylogenetic trees: Trees are generated using heuristic parsimony methods and are based on
protein domain architecture data for all genomes in SUPERFAMILY. Genome combinations or
specific clades can be displayed as individual trees.
Similar domain architectures: Find the 10 domain architectures which are most similar to a
domain architecture of interest.
Hidden Markov models: Produce SCOP domain assignments for your sequences using the
SUPERFAMILY models. HMM visualisation by Martin Madera, e.g. model 0045110.
Profile comparison: Find remote domain matches when the HMM search fails to find a
significant match. Profile comparison (PRC) for aligning and scoring two profile hidden
Markov models, by Martin Madera.
Web services: Distributed Annotation Server and linking to SUPERFAMILY.
Downloads: Sequences, assignments, models, MySQL database and scripts, updated weekly.
The SUPERFAMILY database provides protein domain assignments at the SCOP superfamily
level for the predicted protein sequences in over 400 completed genomes. A superfamily groups
together domains of different families which have a common evolutionary ancestor based on
structural, functional and sequence data. SUPERFAMILY domain assignments are generated
using an expert-curated set of profile hidden Markov models. All models and structural
assignments are available for browsing and download from http://supfam.org. The web interface
includes services such as domain architectures and alignment details for all protein assignments,
searchable domain combinations, domain occurrence network visualization, detection of over- or
under-represented superfamilies for a given genome by comparison with other genomes,
assignment of manually submitted sequences and keyword searches. In this update we describe
the SUPERFAMILY database and outline two major developments: (i) incorporation of family
level assignments and (ii) a superfamily-level functional annotation. The SUPERFAMILY
database can be used for general protein evolution and superfamily-specific studies, genomic
annotation, and structural genomics target suggestion and assessment.
For each superfamily, pair-wise structural similarity scores between relatives, measured by
SSAP, are presented. The DHS now contains 3307 multiple structural alignments for 1459
superfamilies. For each superfamily, multiple alignments are generated for all the relatives
and also for subgroups of structurally similar relatives and sequence-similar relatives.
Alignments are performed using the residue-based CORA algorithm (15) and presented both as
CORAPLOTS (14) and in the form of a 2DSEC diagram (16), alongside co-ordinate data of the
superposed structures in PDB format. Sequence representations of the alignments are available
to download in FASTA format. In the CORAPLOT images of the multiple alignment, residues in
each domain are coloured according to ligand binding and residue type. EquivSEC plots are
also shown that describe the variability in orientation and packing between equivalent
secondary structures (16).
To identify sequence relatives for CATH superfamilies, sequences from UniProt (17) were
scanned against HMMs of all CATH domains (12). Homologous sequences were identified as those
hits with an E-value < 0.01 and a 60% residue overlap with the CATH domain. This protocol
recognised over one million domain sequences in UniProt which could be integrated in the
CATH-DHS. The harvested sequences in each superfamily were compared against other relatives
by BLAST (18) to determine the pair-wise sequence identity and then clustered at appropriate
levels of sequence identity (35% and 95%) using multi-linkage clustering. Information and
links to other functional databases, ENZYME (19), GO (Gene Ontology Consortium 2000), KEGG
(20), COG (21) and SWISSPROT (22), are also included by BLASTing the sequences from each
superfamily against sequences provided by these resources. Only hits with 95% sequence
identity and an 80% residue overlap were used to annotate sequences.
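The hit selection described above amounts to a two-condition filter. The sketch below is illustrative only: the record layout, the way overlap is computed (aligned residues divided by domain length) and the reading of the E-value cutoff as 0.01 are assumptions, so the thresholds are left as parameters.

```python
from dataclasses import dataclass

@dataclass
class Hit:
    seq_id: str            # identifier of the matched sequence (hypothetical field)
    e_value: float         # E-value reported by the HMM scan
    aligned_residues: int  # residues of the CATH domain covered by the match
    domain_length: int     # total length of the CATH domain

def is_homologue(hit, max_e=0.01, min_overlap=0.60):
    """Accept a hit only if it is both significant and covers enough of the domain."""
    overlap = hit.aligned_residues / hit.domain_length
    return hit.e_value < max_e and overlap >= min_overlap

hits = [
    Hit("seqA", 1e-5, 90, 100),  # significant and 90% overlap
    Hit("seqB", 0.5, 90, 100),   # good overlap but not significant
    Hit("seqC", 1e-5, 50, 100),  # significant but only 50% overlap
]
accepted = [h.seq_id for h in hits if is_homologue(h)]
```

The stricter annotation transfer step mentioned above could reuse the same shape of filter with an identity field and an overlap of 0.80.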
Recent analysis of structural and functional divergence in highly populated CATH
superfamilies (>5 structural relatives with <35% sequence identity) has been undertaken using
data from the DHS. The 2DSEC algorithm was used to analyse multiple structural alignments of
families and identify highly conserved structural cores and secondary structure
embellishments or decorations to the common core. In some large superfamilies, extensive
embellishments were observed outside the core and, although these secondary structure
insertions were frequently discontinuous in the protein chain, they were often co-located in
3D space (16). In many cases, manual inspection revealed that the embellishment had
aggregated to form a larger structural feature that was modifying the active site of the
domain or creating new surfaces for domain or protein interactions. Data collected in the DHS
clearly show a relationship between structural divergence within a superfamily, sequence
divergence of this superfamily amongst predicted domains in the genomes, and the number of
distinct functional groups that can be identified for the superfamily (see Figure 4).
Figure 4. Relationship between sequence variability, structural variability and functional
diversity in CATH superfamilies. Structural variation in a CATH superfamily as measured by
the number of diverse structural subgroups (SSAP score <80 between groups).
LATERAL LINKS ACROSS THE CATH HIERARCHY TO CAPTURE EVOLUTIONARY DIVERGENCE AND EXPLORE THE
STRUCTURAL CONTINUUM
Our analysis of structural divergence in CATH superfamilies (16) has revealed families where
significant changes in the structures had occurred. In some cases, 5-fold differences in the
sizes of domains were identified, and sometimes it was apparent that the 'folds' of these
very diverse relatives had effectively changed. Therefore, in these superfamilies more than
one fold group can be identified, effectively breaking the hierarchical nature of the CATH
classification, which implies that each relative within a CATH homologous superfamily should
belong to the same CAT fold group.
In addition, an 'all versus all' HMM–HMM scan between all superfamily representatives
revealed several cases of extremely remote homologues which had been classified into separate
superfamilies and yet match with significant E-values. Structure comparison had failed to
detect the relationship between these superfamilies because the structural divergence of the
relatives was so extreme, sometimes constituting a change in architecture as well as fold
group. In these cases, homology was only suggested by the HMM-based scans and then manually
validated by considering functional information and detailed evidence from literature. In
order to capture information on these distant homologies, links have been created between
the superfamilies both on our web pages and in the CATH database. The data can now be found
as a link from the CATH homepage (http://www.cathdb.info).
In the near future, we also plan to provide web pages presenting cases of significant
structural overlaps between superfamilies or fold groups. For these cases, we are not
currently able to find any additional evidence to support a distant evolutionary
relationship, and these examples highlight the recurrence of large structural motifs between
some folds and the existence of a structural continuum in some regions of fold space.
FEATURES
The CATH database can be accessed at http://www.cathdb.info. The web interface may be browsed
or alternatively searched with PDB codes or CATH domain identifiers. There is also a facility
for keyword searches. With the version 3.0 release we now make the raw and processed data
files available, which include, for example, CATH domain PDB files, sequences and dssp files;
they can be accessed through the CATH database main page. The Gene3D resource can be accessed
through the CATH database or directly at http://www.cathdb.info/Gene3D. The DHS can be
accessed through the CATH database or directly at http://www.cathdb.info/bsm/dhs.
Structural similarity is assessed using an automatic method (SSAP) (3,4) which scores 100 for
identical proteins and generally returns scores above 80 for homologous proteins. More distantly
related folds generally give scores above 70 (Topology or fold level), though in the absence of any
sequence or functional similarity this may simply represent examples of convergent evolution,
reinforcing the hypothesis that there exists a limited number of folds in nature (5,6).
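As a rough reading aid, the score ranges quoted above can be turned into a small lookup. The function name and the returned labels are hypothetical, and the cutoffs are only the "generally" observed ranges from the text, not hard classification rules.

```python
def interpret_ssap(score):
    """Map an SSAP score (0 to 100) to the rough similarity level described in the text."""
    if not 0 <= score <= 100:
        raise ValueError("SSAP scores lie between 0 and 100")
    if score == 100:
        return "identical"
    if score > 80:
        return "likely homologous"
    if score > 70:
        return "similar fold (topology level)"
    return "no clear structural similarity"

labels = [interpret_ssap(s) for s in (100, 85, 75, 60)]
```

A score in the fold-level range alone does not establish common ancestry, since it may reflect convergent evolution when no sequence or functional similarity supports it.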
Abstract
We report the latest release (version 1.4) of the CATH protein domains database
(http://www.biochem.ucl.ac.uk/bsm/cath). This is a hierarchical classification of 13 359 protein
domain structures into evolutionary families and structural groupings. We currently identify 827
homologous families in which the proteins have both structural similarity and sequence and/or
functional similarity. These can be further clustered into 593 fold groups and 32 distinct
architectures. Using our structural classification and associated data on protein functions
stored in the database (EC identifiers, SWISS-PROT keywords and information from the Enzyme
database and literature), we have been able to analyse the correlation between 3D structure and
function. More than 96% of folds in the PDB are associated with a single homologous family.
However, within the superfolds, three or more different functions are observed. Considering
enzyme functions, more than 95% of clearly homologous families exhibit either single or closely
related functions, as demonstrated by the EC identifiers of their relatives. Our analysis
supports the view that determining structures, for example as part of a 'structural genomics'
initiative, will make a major contribution to interpreting genome data.
Introduction FOR CATH
The CATH classification of protein domain structures was established in 1993 (1) as a
hierarchical clustering of protein domain structures into evolutionary families and structural
groupings, depending on sequence and structure similarity. There are four major levels,
corresponding to protein class, architecture, topology or fold, and homologous family (Fig. 1).
Since 1995, information about these structural groups and protein families has been accessible
over the Web (http://www.biochem.ucl.ac.uk/bsm/cath), together with summary information
about each individual protein structure (PDBsum) (2).
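The four-level hierarchy maps naturally onto a four-part identifier. The sketch below assumes the dotted form C.A.T.H (for example 3.40.50.200); the type and function names are hypothetical and only illustrate how two domains can be compared level by level.

```python
from typing import NamedTuple, Optional

class CathId(NamedTuple):
    class_: int             # (C)lass
    architecture: int       # (A)rchitecture
    topology: int           # (T)opology or fold
    homologous_family: int  # (H)omologous family

def parse_cath_id(text: str) -> CathId:
    """Parse a dotted identifier such as '3.40.50.200' into its four levels."""
    parts = text.split(".")
    if len(parts) != 4:
        raise ValueError(f"expected four dot-separated numbers, got {text!r}")
    return CathId(*(int(p) for p in parts))

def deepest_shared_level(a: CathId, b: CathId) -> Optional[str]:
    """Return the name of the deepest level at which two identifiers still agree."""
    names = ["class", "architecture", "topology", "homologous family"]
    shared = None
    for name, x, y in zip(names, a, b):
        if x != y:
            break
        shared = name
    return shared

first = parse_cath_id("3.40.50.200")
second = parse_cath_id("3.40.50.300")
level = deepest_shared_level(first, second)
```

Because the levels are nested, agreement at a deeper level implies agreement at every level above it, which is why the comparison stops at the first mismatch.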
CATH consists of both phylogenetic and phenetic descriptors for protein domain relationships.
At the lowest levels in the hierarchy, proteins are grouped into evolutionary families
(homologous families) for having either significant sequence similarity (≥35% identity) or high
structural similarity and some sequence similarity (≥20% identity).
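The grouping rule above can be written as a short predicate. This is a sketch under one explicit assumption: "high structural similarity" is taken to mean an SSAP score of at least 80, a cutoff that is exposed as a parameter because the sentence itself does not state it.

```python
def same_homologous_family(seq_identity, ssap_score=None, high_ssap=80.0):
    """Apply the two alternative criteria for placing two domains in one homologous family.

    seq_identity is a percentage; ssap_score may be None when no structure
    comparison is available. high_ssap is an assumed cutoff, not a quoted one.
    """
    if seq_identity >= 35:
        return True  # significant sequence similarity on its own
    structurally_similar = ssap_score is not None and ssap_score >= high_ssap
    return structurally_similar and seq_identity >= 20

decisions = [
    same_homologous_family(40),      # sequence alone suffices
    same_homologous_family(25, 85),  # structure plus some sequence similarity
    same_homologous_family(25, 70),  # structure not similar enough
    same_homologous_family(10, 95),  # too little sequence similarity
]
```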
The CATH database of protein structures contains approximately 18000 domains
organized according to their (C)lass, (A)rchitecture, (T)opology and (H)omologous
superfamily. Relationships between evolutionarily related structures (homologues) within the
database have been used to test the sensitivity of various sequence search methods in order to
identify relatives in GenBank and other sequence databases. Subsequent application of the most
sensitive and efficient algorithms, gapped BLAST and the profile-based method Position-Specific
Iterated Basic Local Alignment Search Tool (PSI-BLAST), could be used to assign structural data to
between 22% and 36% of microbial genomes in order to improve functional annotation and
enhance understanding of biological mechanism. However, on a cautionary note, an analysis of
functional conservation within fold groups and homologous superfamilies in the CATH database
revealed that, whilst function was conserved in nearly 55% of enzyme families, function had
diverged considerably in some highly populated families. In these families, functional properties
should be inherited far more cautiously, and the probable effects of substitutions in key functional
residues must be carefully assessed.
Figure 1. Schematic representation of the (C)lass, (A)rchitecture and (T)opology/fold levels in
the CATH database.
Figure 2. Snapshot of a web page showing data available in the CATH dictionary of homologous
superfamilies for the subtilisin family (CATH id 3.40.50.200). Tables display the PDB codes for
non-identical relatives in the family, together with EC identifier codes and information about
the enzyme reactions. The multiple structural alignment shown has been coloured according to
secondary structure assignments (red for helix, blue for strands).
The Architecture level in CATH groups proteins whose folds have similar 3D arrangements of
secondary structures (e.g. barrel, sandwich or propellor), regardless of their connectivity,
whilst the top level, Class, simply reflects the proportion of α-helix or β-strand secondary
structures. Three major classes are recognised, mainly-α, mainly-β and α-β, since analysis
revealed considerable overlap between the α+β and alternating α/β classes originally described
by Levitt and Chothia (7).
Before classification, multidomain proteins are first separated into their constituent folds
using a consensus method which seeks agreement between three independent algorithms (8). Whilst
the protocol for updating CATH is largely automatic (9), several stages require manual
validation, in particular establishing domain boundaries in proteins for which no consensus
could be reached and checking the relationships of very distant homologues and proteins having
borderline fold similarity. Although there are plans to assign the more regular architectures
automatically, all architecture groupings are currently assigned manually.
A homologous family Dictionary is now available within CATH, which contains functional data,
where available, for each protein within a homologous family. This includes EC identifiers,
SWISS-PROT keywords and information from the Enzyme database or the literature (Fig. 2).
Multiple structure-based alignments are also available, coloured according to secondary
structure assignments or residue properties, and there are schematic plots showing domain
representations annotated by protein-ligand interactions (DOMPLOTS) (A.E.Todd, C.A.Orengo and
J.M.Thornton, submitted to Protein Engng). The topology of each domain is illustrated by
schematic TOPS diagrams (http://www3.ebi.ac.uk/tops; 10).
Figure 3. CATH wheel plot showing the population of homologous families in different fold
groups, architectures and classes. The wheel is coloured according to protein class (red,
mainly-α; green, mainly-β; yellow, αβ; blue, few secondary structures). The size of the outer
wheel represents the number of homologous families in CATH, whilst each band in the outer wheel
corresponds to a single fold family. The size of each fold 'band' therefore reflects the number
of homologous families having that fold. It can be seen that most fold families contain a
single homologous family. The superfold families are shown as paler bands containing many
homologous families. The inner wheel shows the population of homologous families in the
different architectures.
We have also recently set up a Web Server (11) which enables the user to scan the CATH database
with a newly determined protein structure and identify possible fold similarities or
evolutionary relationships. There are also plans to incorporate sequence searches (using BLAST
or PSI-BLAST) (12) to identify a probable fold for a new sequence.
The latest release of CATH (version 1.4, April 1998) contains 9342 protein chains from the PDB
(13), which divide into 13 359 domain folds. Currently, 32 different architectures are
recognised. Since the last release, three new architectures have been described, including the
five-bladed α-β propellor. Grouping proteins on the basis of sequence, structure and functional
similarity gives 827 evolutionary homologous families (H-level), whilst recognising more
distant structural similarity with no accompanying sequence or function similarity gives rise
to 593 different fold groups (T-level).
The population of the different levels in the CATH hierarchy is illustrated by the CATH wheel
shown in Figure 3. It can be seen that several highly populated fold families, which we describe
as superfolds (6) as they support a diverse range of sequences and more than three different
functions, still account for nearly 30% of non-homologous structures.
Implications for Structural Genomics
As the sequence databases grow rapidly, the need to interpret these sequences and assign
functions to specific genes becomes increasingly important. Many techniques exist for matching
protein sequences and thereby inheriting functional information. However, for very distant
homologues there is often no detectable sequence similarity despite conservation of 3D
structure and function. For these cases, evolutionary relationships, and thereby functions, can
only be assigned by comparing the structures. Therefore, a number of structural genomics
initiatives are being proposed (14) which aim to identify all the folds in nature, with the
ultimate goal of being able to predict the function of a new protein from its known or probable
structure. The important questions to ask are: how many more folds do we need to determine
before we have the complete set, and how confident can we be in assigning function between
proteins having similar structures?
In the current genomes, on average only 30–46% of sequences can be assigned to a structural
family by recognising sequence similarity to a protein of known structure (15,16). With only
~600 unique structures currently in the PDB, compared with ~20 000 sequence families, it is
clear that we still need to determine many more structures if we are to understand biology at
the molecular level. However, analysis of recently deposited structural data is very revealing.
Figure 4a illustrates the distribution of 2159 new structural domains classified in the 10
months from June 1997 to March 1998. A large proportion of these (79%) were clearly homologous
(≥30% identity) to proteins of known structure.
Of the remaining 443 structures (Fig. 4b), corresponding to 'new' sequences, we found only 8%
were novel folds, the remainder resembling a previously determined structure. Many of these,
199 (45%), could be identified as clear homologues by having significant structure and sequence
similarity (SSAP ≥80 and ≥20% sequence identity). A further 169 (38%) were probable homologues
as, although the sequence identity was below 20%, they had functional similarity and/or gave
significant scores using sequence search methods designed to detect very distant homologues
(PSI-BLAST) (12). There remained a further 40 (9%) proteins which were analogous, i.e. they had
the same fold as a previous entry but neither the sequence nor the function gave definite
evidence of a common ancestor.
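The percentages in this breakdown can be re-derived from the quoted counts. The short check below assumes that the four categories partition the 443 structures, so the number of novel folds (not stated as a count in the text) is obtained by subtraction; the variable names are illustrative.

```python
total_new = 2159  # new structural domains, June 1997 to March 1998
remaining = 443   # domains corresponding to 'new' sequences

counts = {"clear homologues": 199, "probable homologues": 169, "analogues": 40}
# Derived by subtraction under the partition assumption: 443 - 408 = 35.
counts["novel folds"] = remaining - sum(counts.values())

def percent(part, whole):
    """Percentage rounded to the nearest whole number."""
    return round(100 * part / whole)

# Share of all new domains that were clearly homologous by sequence.
sequence_homologues = percent(total_new - remaining, total_new)

# Shares within the 443 'new' sequences.
shares = {name: percent(n, remaining) for name, n in counts.items()}

# Clear plus probable homologues: the fraction recognisable with structural data.
structure_assigned = percent(
    counts["clear homologues"] + counts["probable homologues"], remaining
)
```

The derived values agree with the rounded figures in the text: 79% homologous by sequence, then 45%, 38%, 9% and 8% within the remainder, with clear and probable homologues together making up 83%.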
At the homologous superfamily level in CATH, a more detailed analysis of enzyme functions showed
that the majority of homologous enzyme families in CATH (>90%) contained proteins for which the
first three EC identifiers were the same. Considering those families where homologues have
significant sequence identity (≥20%) after structural alignment, 95% were found to have a single EC
identifier, whilst for families where proteins have more than 30% sequence similarity we observed
that 98% had a single EC code.
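Checking whether the members of a family agree on the first three EC identifiers is a simple prefix comparison. The helper names and the example EC numbers below are illustrative assumptions, not data taken from CATH.

```python
def ec_prefix(ec_number, levels=3):
    """Return the first `levels` fields of a dotted EC number as a tuple."""
    return tuple(ec_number.split(".")[:levels])

def shares_ec_prefix(ec_numbers, levels=3):
    """True if all EC numbers in the family agree on their first `levels` fields."""
    return len({ec_prefix(ec, levels) for ec in ec_numbers}) <= 1

# Hypothetical families: the first differs only in the fourth EC field.
family_a = ["3.4.21.62", "3.4.21.66"]
family_b = ["3.4.21.62", "1.1.1.1"]

same_first_three = shares_ec_prefix(family_a)
single_ec_code = shares_ec_prefix(family_a, levels=4)
mixed_family = shares_ec_prefix(family_b)
```

Comparing at three levels tolerates differences in substrate specificity (the fourth field), whereas requiring a single EC code demands agreement on all four fields.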
Although assigning function on the basis of homology is common practice, it is clear that some
caution should be exercised, particularly where there is little or no sequence similarity. There
are also some clear examples where homologues with significant sequence similarity perform
different functions. The role of 'gene recruitment' is especially clear in the eye lens
proteins, which function as enzymes in other cellular environments but which are used as
structural proteins in this context (17). The extent of such 'gene recruitment' and
context-sensitive function is really not known at this time. For enzymes, it is clear that
catalytic function can change and evolve, usually to act on a different but related substrate.
Similarly, within the lipocalin family (CATH id 2.40.130.10) several proteins are found with
very similar structures which bind different fatty acids in the same region at the base of
the β-barrel (e.g. retinol, bilin, biotin).
Nearly half of the homologous families where two or more different EC numbers were observed
belong to the superfolds. This suggests that if a new protein is assigned to a superfold
family, more caution should be used when inheriting functional information, as there appears to
be greater tolerance to changes in sequence, and ultimately function, for these families.
However, it is interesting to note that many of these were TIM barrel or Rossmann folds. These
are superfolds in which the substrate or ligand commonly binds in the same place: in the base
of the β-barrel for the TIMs, and at the crossover of the polypeptide chain for the doubly
wound Rossmann structures.
Assignment of Function Through Structure
One of the reasons for determining structures is to derive more information to facilitate the
assignment of function. From our analysis of proteins in CATH, we suggest that structural data
can help to assign function in several ways:
i. The structural data allow recognition of more distant homologues compared with sequence
data: in our analysis, 83% of structures with novel sequences could be assigned as homologues
in this way (note that such assignment of function is again subject to the caveats imposed by
'gene recruitment' discussed above).
ii. The structural data allow detailed inspection of the functional site, to suggest if and
how the function may have evolved. For example, if an enzyme has evolved to act on a
different substrate, the binding site may reveal or at least suggest possible changes in the
substrate.
iii. For the superfolds, similarity of structure does not necessarily mean similarity of
function. However, the active site/binding sites are often conserved, e.g. in the TIM barrel or
Rossmann fold structures the ligand always binds at the same end of the barrel or sheet.
iv Some methods have already been developed and will increasingly be the focus of attention over the next few years which aim to predict function ab initio from structure For
example enzymes can often be identified by the presence of a major cleft which also locates
the active site (18) Similarly critical surface patches which are used for molecular
recognition in binding other proteins or ligands may be identified using knowledge-basedapproaches (1920)
In summary, extrapolating the data from Figure 4 to a new genome, we can expect that of the 54–70% of sequences which currently have no obvious sequence matches in the PDB, we will find nearly 80–90% to be homologous to a known family using the structural data alone. For the singlet folds this will almost certainly reveal some clues to the function. For the superfolds, some folds will reveal information on the functional class (e.g. enzyme for TIM barrels) or the location of the active site, if not the specific function. Only 10–20% will be expected to be 'novel' folds. For these, the ab initio methods referred to above may provide some clues to guide experiments. Therefore it is clear that determining structures, as part of a 'structural genomics' initiative for example, will make a major contribution to interpreting genome data.
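As a rough worked example of this extrapolation, the sketch below multiplies the quoted ranges for a hypothetical genome of 1000 predicted sequences. The genome size and the function name are illustrative assumptions, and pairing the two lower bounds (and the two upper bounds) is a simplification, not something stated in the source.

```python
def structurally_assignable(genome_size,
                            unmatched=(0.54, 0.70),
                            homologous=(0.80, 0.90)):
    """Estimate how many sequences without a PDB sequence match could still
    be placed in a known family from structural data alone.

    unmatched  -- fraction range of sequences with no obvious PDB match
    homologous -- fraction range of those expected to be homologous
                  to a known family
    """
    low = genome_size * unmatched[0] * homologous[0]
    high = genome_size * unmatched[1] * homologous[1]
    return round(low), round(high)


# Hypothetical genome of 1000 predicted protein sequences (assumption).
low, high = structurally_assignable(1000)
print(f"between {low} and {high} of 1000 sequences")
```

Under these assumptions, roughly 43% to 63% of the whole genome would gain a family assignment only once structures are available.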
JAIL
Just Another Interface Library
Interfaces of macromolecules are a valuable basis for analysing the process of molecular recognition. JAIL classifies not only the interfaces between domain architectures but also those between protein chains and those between proteins and nucleic acids.
Of course not
Figure: Gt-alpha/Gi-alpha chimera (PDB-ID 1GOT) and the interfaces of 1GOT
Interacting proteins are difficult to crystallize and are rarely present within the Protein Data Bank (PDB). Nevertheless, it is essential to analyse the interacting parts of proteins to understand the process of protein-protein docking.
To overcome this problem we have built up the JAIL database. Since interacting domains exhibit structural features similar to those of interacting proteins, all known interfaces between interacting domains of the SCOP database were extracted and classified in JAIL.
Only a part of all protein structures is included in SCOP; new PDB entries in particular are not yet annotated. To overcome this problem, all interfaces between protein chains were additionally calculated and included in the database. This type of interface also comprises the interacting parts of the assumed biological units. The last important type of interface provided here is composed of the interacting parts between proteins and nucleic acids. Overall, the data set consists of about 180,000 interfaces.
JAIL is a convenient tool for browsing the interface library and analysing single interfaces. However, more general questions require large-scale analysis. For this purpose, a detailed form enables the compilation of comprehensive non-redundant data sets for download.
How is an interface defined?
A complete residue is part of an interface if at least one atom of the amino acid is located within 4.5 Angstrom (Å) of any atom of the interacting domain or chain. In the case of protein chains, one part of an interface must consist of at least 5 C-alpha atoms. In the case of nucleic acids, the relevant atoms are calculated on the basis of the phosphorus atoms of the RNA/DNA backbone.
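The distance rule above can be sketched in a few lines. This is only an illustration of the definition, not JAIL's actual code: the data layout (a dict mapping a residue id to a list of atom coordinates) and both function names are assumptions made for the example.

```python
from math import dist

CUTOFF = 4.5        # Angstrom, the distance cutoff quoted in the text
MIN_CA_ATOMS = 5    # minimum size of one part of a protein interface


def interface_residues(chain_a, chain_b, cutoff=CUTOFF):
    """Return the residue ids of chain_a that belong to the interface.

    Each chain is a dict: residue id -> list of (x, y, z) atom coordinates.
    A residue is selected as a whole as soon as one of its atoms lies
    within `cutoff` of any atom of the partner chain.
    """
    partner_atoms = [atom for atoms in chain_b.values() for atom in atoms]
    selected = set()
    for res_id, atoms in chain_a.items():
        if any(dist(a, b) <= cutoff for a in atoms for b in partner_atoms):
            selected.add(res_id)
    return selected


def is_large_enough(residues, min_ca=MIN_CA_ATOMS):
    # Every amino acid has exactly one C-alpha atom, so counting the
    # selected residues is the same as counting C-alpha atoms.
    return len(residues) >= min_ca


# Toy coordinates (hypothetical): residue 1 is 3 A from the partner, residue 2 is 7 A away.
chain_a = {1: [(0.0, 0.0, 0.0)], 2: [(10.0, 0.0, 0.0)]}
chain_b = {1: [(3.0, 0.0, 0.0)]}
print(interface_residues(chain_a, chain_b))
```

The all-against-all loop is quadratic in the number of atoms; a spatial grid or k-d tree would be the usual choice for whole-PDB calculations.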
What are biounits?
The primary coordinate file deposited in the PDB generally contains one asymmetric unit. The asymmetric unit is the smallest portion of a crystal structure to which crystallographic symmetry can be applied to generate the unit cell. The biological molecule (biounit) is believed to be the functional unit of the protein. Frequently these units can be assumed or calculated when additional information is available. The biological units of many proteins are deposited in a separate section of the PDB and can be used for interface calculations.
In which way are redundant interfaces excluded in the download section?
Redundancy is excluded in two different ways: by structure and by sequence. The sequence clustering is based on the Cd-hit program. The structural clustering is defined by the protein families and superfamilies of the SCOP classification. The database classifies proteins by domain architecture.
Which settings in the download section are best for my own research?
The selection of the data sets depends on the type of interaction (protein-protein or protein-nucleic acid) and the level of diversity desired. A maximum sequence identity of 50% results in higher diversity than a setting of 95%. The default settings include interfaces of domain-domain interactions as well as interfaces between interacting chains. All chain interfaces already covered by the SCOP domain interfaces are excluded by default. This procedure results in a high number of interfaces that are still diverse enough for statistical analysis.
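To see why a 50% identity cutoff yields a smaller but more diverse set than 95%, the sketch below mimics the greedy, longest-first clustering idea used by programs such as Cd-hit. It is a toy: the ungapped identity measure, the function names and the example sequences are assumptions for illustration, and the real program uses alignment-based identity with fast word filtering.

```python
def identity(a, b):
    """Toy ungapped identity: matching positions over the shorter length."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / n


def greedy_representatives(sequences, threshold):
    """Keep a sequence only if it is below `threshold` identity to every
    representative chosen so far (longest sequences are considered first)."""
    representatives = []
    for seq in sorted(sequences, key=len, reverse=True):
        if not any(identity(seq, rep) >= threshold for rep in representatives):
            representatives.append(seq)
    return representatives


# Hypothetical sequences: the first three share a common start.
seqs = ["ACDEFGHIKL", "ACDEFGHIKV", "ACDEFGAAAA", "WWWWWWWWWW"]
print(len(greedy_representatives(seqs, 0.95)))  # stricter merge rule, more representatives
print(len(greedy_representatives(seqs, 0.50)))  # looser merge rule, fewer representatives
```

With the 50% setting, near-identical and moderately similar entries collapse onto one representative, so whatever remains is mutually dissimilar.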
What is meant by 'show conservation' in Jmol?
The conservation of protein sequences is defined by the mutation rates at each amino acid position. For JAIL this information was retrieved from ConSurf. ConSurf is a derived database merging structural and sequence information.
Database scheme
Search for:
  PDB-Id, e.g. 1aay
  SCOP-Id, e.g. d1az0a_
  EC-Number, e.g. 311
  Accession number, e.g. P03697
  Protein name, e.g. capsid protein
Search in the following interface types: Domain/Domain (SCOP), Chain/Chain, Protein/Nucleic, Biounit/Biounit, None
Full-text search: by keyword
SCOP text search: by keyword
Consider only interfaces having the following location: intra, inter, don't care
MMDB
MMDB contains experimentally resolved structures of proteins, RNA and DNA, derived from the Protein Data Bank (PDB), with value-added features such as explicit chemical graphs, computationally identified 3D domains (compact substructures) that are used to identify similar 3D structures, as well as links to literature, similar sequences, information about chemicals bound to the structures, and more. These connections make it possible, for example, to find 3D structures for homologs of a protein sequence of interest, then interactively view the sequence-structure relationships, active sites, bound chemicals, journal articles, and more.
Three-dimensional structures are now known within many protein families, and it is quite likely, in searching a sequence database, that one will encounter a homolog with known structure. The goal of Entrez's 3D-structure database is to make this information, and the functional annotation it can provide, easily accessible to molecular biologists. To this end, Entrez's search engine provides three powerful features: (i) Sequence and structure neighbors: one may select all sequences similar to one of interest, for example, and link to any known 3D structures. (ii) Links between databases: one may search by term matching in MEDLINE, for example, and link to 3D structures reported in these articles. (iii) Sequence and structure visualization: identifying a homolog with known structure, one may view molecular-graphic and alignment displays to infer approximate 3D structure. In this article we focus on two features of Entrez's Molecular Modeling Database (MMDB) not described previously: links from individual biopolymer chains within 3D structures to a systematic taxonomy of organisms represented in molecular databases, and links from individual chains (and compact 3D domains within them) to structure neighbors, that is, other chains (and 3D domains) with similar 3D structure. MMDB may be accessed at http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Structure
SUPERFAMILY is a database of structural and functional annotation for all proteins and genomes.[1][2][3][4][5]
The SUPERFAMILY annotation is based on a collection of hidden Markov models, which represent structural protein domains at the SCOP superfamily level.[6] A superfamily groups together domains which have an evolutionary relationship. The annotation is produced by scanning protein sequences from completely sequenced genomes against the hidden Markov models.
For each protein you can:
  Submit sequences for SCOP classification
  View domain organisation, sequence alignments and protein sequence details
For each genome you can:
  Examine superfamily assignments, phylogenetic trees, domain organisation lists and networks
  Check for over- and under-represented superfamilies within a genome
For each superfamily you can:
  Inspect SCOP classification, functional annotation, Gene Ontology annotation, InterPro abstract and genome assignments
  Explore taxonomic distribution of a superfamily across the tree of life
All annotation, models and the database dump are freely available for download to everyone.
Purpose
SUPERFAMILY classifies amino acid sequences into known structural domains, especially into SCOP superfamilies. The superfamilies are groups of proteins which have structural evidence to support a common evolutionary ancestor but may not have detectable sequence homology.
Major Features
Sequence search: Submit your protein or DNA sequence for SCOP superfamily and family level classification.
Keyword search: Search for superfamily, family or species names, plus sequence, SCOP, PDB or hidden Markov model IDs.
Domain assignments: Domain assignments, alignments and architectures for completely sequenced eukaryotic and prokaryotic organisms, plus sequence collections.
Comparative genomics tools: Browse unusual (over- and under-represented) superfamilies and families, adjacent domain pair lists and graphs, unique domain pairs, domain combinations, domain architecture co-occurrence networks and domain distribution across taxonomic kingdoms for each organism.
Genome statistics: For each genome: number of sequences, number of sequences with assignment, percentage of sequences with assignment, percentage total sequence coverage, number of domains assigned, number of superfamilies assigned, number of families assigned, average superfamily size, percentage produced by duplication, average sequence length, average length matched, number of domain pairs and number of unique domain architectures.
Gene Ontology: Domain-centric Gene Ontology (GO), automatically annotated, by Hai Fang.
Phenotype Ontology: Domain-centric phenotype/anatomy ontology, including Disease Ontology, Human Phenotype, Mouse Phenotype, Worm Phenotype, Yeast Phenotype, Fly Phenotype, Fly Anatomy, Zebrafish Anatomy, Xenopus Anatomy and Arabidopsis Plant.
Superfamily annotation: InterPro abstracts for 1052 superfamilies and Gene Ontology (GO) annotation for 763 superfamilies.
Functional annotation: Functional annotation of SCOP 1.73 superfamilies by Christine Vogel.
Phylogenetic trees: Trees are generated using heuristic parsimony methods and are based on protein domain architecture data for all genomes in SUPERFAMILY. Genome combinations or specific clades can be displayed as individual trees.
Similar domain architectures: Find the 10 domain architectures which are most similar to a domain architecture of interest.
Hidden Markov models: Produce SCOP domain assignments for your sequences using the SUPERFAMILY models. HMM visualisation by Martin Madera, e.g. model 0045110.
Profile comparison: Find remote domain matches when the HMM search fails to find a significant match. Profile comparison (PRC) for aligning and scoring two profile hidden Markov models, by Martin Madera.
Web services: Distributed Annotation Server and linking to SUPERFAMILY.
Downloads: Sequences, assignments, models, MySQL database and scripts, updated weekly.
The SUPERFAMILY database provides protein domain assignments, at the SCOP superfamily level, for the predicted protein sequences in over 400 completed genomes. A superfamily groups together domains of different families which have a common evolutionary ancestor based on structural, functional and sequence data. SUPERFAMILY domain assignments are generated using an expert-curated set of profile hidden Markov models. All models and structural assignments are available for browsing and download from http://supfam.org. The web interface includes services such as domain architectures and alignment details for all protein assignments, searchable domain combinations, domain occurrence network visualization, detection of over- or under-represented superfamilies for a given genome by comparison with other genomes, assignment of manually submitted sequences, and keyword searches. In this update we describe the SUPERFAMILY database and outline two major developments: (i) incorporation of family level assignments and (ii) a superfamily-level functional annotation. The SUPERFAMILY database can be used for general protein evolution and superfamily-specific studies, genomic annotation, and structural genomics target suggestion and assessment.
Figure 4. Relationship between sequence variability, structural variability and functional diversity in CATH superfamilies. Structural variation in a CATH superfamily as measured by the number of diverse structural subgroups (SSAP score <80 between groups).
LATERAL LINKS ACROSS THE CATH HIERARCHY TO CAPTURE EVOLUTIONARY DIVERGENCE AND EXPLORE THE STRUCTURAL CONTINUUM
Our analysis of structural divergence in CATH superfamilies (16) has revealed families where significant changes in the structures had occurred. In some cases 5-fold differences in the sizes of domains were identified, and sometimes it was apparent that the 'folds' of these very diverse relatives had effectively changed. Therefore, in these superfamilies more than one fold group can be identified, effectively breaking the hierarchical nature of the CATH classification, which implies that each relative within a CATH homologous superfamily should belong to the same CAT fold group.
In addition, an 'all versus all' HMM–HMM scan between all superfamily representatives revealed several cases of extremely remote homologues which had been classified into separate superfamilies and yet match with significant E-values. Structure comparison had failed to detect the relationship between these superfamilies because the structural divergence of the relatives was so extreme, sometimes constituting a change in architecture as well as fold group. In these cases homology was only suggested by the HMM-based scans and then manually validated by considering functional information and detailed evidence from the literature. In order to capture information on these distant homologies, links have been created between the superfamilies, both on our web pages and in the CATH database. The data can now be found as a link from the CATH homepage (http://www.cathdb.info).
In the near future we also plan to provide web pages presenting cases of significant structural overlaps between superfamilies or fold groups. For these cases we are not currently able to find any additional evidence to support a distant evolutionary relationship, and these examples highlight the recurrence of large structural motifs between some folds and the existence of a structural continuum in some regions of fold space.
FEATURES
The CATH database can be accessed at http://www.cathdb.info. The web interface may be browsed or, alternatively, searched with PDB codes or CATH domain identifiers. There is also a facility for keyword searches. With the version 3.0 release we now make the raw and processed data files available, which include, for example, CATH domain PDB files, sequences and dssp files; they can be accessed through the CATH database main page. The Gene3D resource can be accessed through the CATH database or directly at http://www.cathdb.info/Gene3D. The DHS can be accessed through the CATH database or directly at http://www.cathdb.info/bsm/dhs.
Structural similarity is assessed using an automatic method (SSAP) (3,4), which scores 100 for identical proteins and generally returns scores above 80 for homologous proteins. More distantly related folds generally give scores above 70 (Topology or fold level), though in the absence of any sequence or functional similarity this may simply represent examples of convergent evolution, reinforcing the hypothesis that there exists a limited number of folds in nature (5,6).
Abstract
We report the latest release (version 1.4) of the CATH protein domains database (http://www.biochem.ucl.ac.uk/bsm/cath). This is a hierarchical classification of 13 359 protein domain structures into evolutionary families and structural groupings. We currently identify 827 homologous families in which the proteins have both structural similarity and sequence and/or functional similarity. These can be further clustered into 593 fold groups and 32 distinct architectures. Using our structural classification and associated data on protein functions stored in the database (EC identifiers, SWISS-PROT keywords and information from the Enzyme database and literature), we have been able to analyse the correlation between the 3D structure and function. More than 96% of folds in the PDB are associated with a single homologous family. However, within the superfolds, three or more different functions are observed. Considering enzyme functions, more than 95% of clearly homologous families exhibit either single or closely related functions, as demonstrated by the EC identifiers of their relatives. Our analysis supports the view that determining structures, for example as part of a 'structural genomics' initiative, will make a major contribution to interpreting genome data.
Introduction FOR CATH
The CATH classification of protein domain structures was established in 1993 (1) as a hierarchical clustering of protein domain structures into evolutionary families and structural groupings, depending on sequence and structure similarity. There are four major levels, corresponding to protein class, architecture, topology or fold, and homologous family (Fig. 1). Since 1995 information about these structural groups and protein families has been accessible over the Web (http://www.biochem.ucl.ac.uk/bsm/cath), together with summary information about each individual protein structure (PDBsum) (2).
CATH consists of both phylogenetic and phenetic descriptors for protein domain relationships. At the lowest levels in the hierarchy, proteins are grouped into evolutionary families (Homologous families) for having either significant sequence similarity (≥35% identity) or high structural similarity and some sequence similarity (≥20% identity).
The CATH database of protein structures contains approximately 18 000 domains organized according to their (C)lass, (A)rchitecture, (T)opology and (H)omologous superfamily. Relationships between evolutionarily related structures (homologues) within the database have been used to test the sensitivity of various sequence search methods in order to identify relatives in GenBank and other sequence databases. Subsequent application of the most sensitive and efficient algorithms, gapped BLAST and the profile-based method Position-Specific Iterated Basic Local Alignment Search Tool (PSI-BLAST), could be used to assign structural data to between 22% and 36% of microbial genomes in order to improve functional annotation and enhance understanding of biological mechanism. However, on a cautionary note, an analysis of functional conservation within fold groups and homologous superfamilies in the CATH database revealed that whilst function was conserved in nearly 55% of enzyme families, function had diverged considerably in some highly populated families. In these families, functional properties should be inherited far more cautiously, and the probable effects of substitutions in key functional residues must be carefully assessed.
Figure 1
Schematic representation of the (C)lass, (A)rchitecture and (T)opology/fold levels in the CATH database.
Figure 2
Snapshot of a web page showing data available in the CATH dictionary of homologous superfamilies for the subtilisin family (CATH id 3.40.50.200). Tables display the PDB codes for non-identical relatives in the family, together with EC identifier codes and information about the enzyme reactions. The multiple structural alignment shown has been coloured according to secondary structure assignments (red for helix, blue for strands).
The Architecture level in CATH groups proteins whose folds have similar 3D arrangements of secondary structures (e.g. barrel, sandwich or propellor), regardless of their connectivity, whilst the top level, Class, simply reflects the proportion of α-helix or β-strand secondary structures. Three major classes are recognised, mainly-α, mainly-β and α–β, since analysis revealed considerable overlap between the α+β and alternating α/β classes originally described by Levitt and Chothia (7).
Before classification, multidomain proteins are first separated into their constituent folds using a consensus method which seeks agreement between three independent algorithms (8). Whilst the protocol for updating CATH is largely automatic (9), several stages require manual validation, in particular establishing domain boundaries in proteins for which no consensus could be reached and checking the relationships of very distant homologues and proteins having borderline fold similarity. Although there are plans to assign the more regular architectures automatically, all architecture groupings are currently assigned manually.
A homologous family Dictionary is now available within CATH, which contains functional data, where available, for each protein within a homologous family. This includes EC identifiers, SWISS-PROT keywords and information from the Enzyme database or the literature (Fig. 2). Multiple structure-based alignments are also available, coloured according to secondary structure assignments or residue properties, and there are schematic plots showing domain representations annotated by protein-ligand interactions (DOMPLOTS) (A.E. Todd, C.A. Orengo and J.M. Thornton, submitted to Protein Engng). The topology of each domain is illustrated by schematic TOPS diagrams (http://www3.ebi.ac.uk/tops; 10).
Figure 3
CATH wheel plot showing the population of homologous families in different fold groups, architectures and classes. The wheel is coloured according to protein class (red, mainly-α; green, mainly-β; yellow, α–β; blue, few secondary structures). The size of the outer wheel represents the number of homologous families in CATH, whilst each band in the outer wheel corresponds to a single fold family. The size of each fold 'band' therefore reflects the number of homologous families having that fold. It can be seen that most fold families contain a single homologous family. The superfold families are shown as paler bands containing many homologous families. The inner wheel shows the population of homologous families in the different architectures.
We have also recently set up a Web Server (11) which enables the user to scan the CATH database with a newly determined protein structure and identify possible fold similarities or evolutionary relationships. There are also plans to incorporate sequence searches (using BLAST or PSI-BLAST) (12) to identify a probable fold for a new sequence.
The latest release of CATH (version 1.4, April 1998) contains 9342 protein chains from the PDB (13), which divide into 13 359 domain folds. Currently 32 different architectures are recognised. Since the last release three new architectures have been described, including the five-bladed α-β propellor. Grouping proteins on the basis of sequence, structure and functional similarity gives 827 evolutionary homologous families (H-level), whilst recognising more distant structural similarity with no accompanying sequence or function similarity gives rise to 593 different fold groups (T-level).
The population of the different levels in the CATH hierarchy is illustrated by the CATH wheel shown in Figure 3. It can be seen that several highly populated fold families, which we describe as superfolds (6) as they support a diverse range of sequences and more than three different functions, still account for nearly 30% of non-homologous structures.
Implications for Structural Genomics
As the sequence databases grow rapidly, the need to interpret these sequences and assign functions to specific genes becomes increasingly important. Many techniques exist for matching protein sequences and thereby inheriting functional information. However, for very distant homologues there is often no detectable sequence similarity despite conservation of 3D structure and function. For these cases, evolutionary relationships, and thereby functions, can only be assigned by comparing the structures. Therefore a number of structural genomics initiatives are being proposed (14) which aim to identify all the folds in nature, with the ultimate goal of being able to predict the function of a new protein from its known or probable structure. The important questions to ask are how many more folds we need to determine before we have the complete set, and how confident we can be in assigning function between proteins having similar structures.
In the current genomes, on average only 30–46% of sequences can be assigned to a structural family by recognising sequence similarity to a protein of known structure (15,16). With only ∼600 unique structures currently in the PDB, compared with ∼20 000 sequence families, it is clear that we still need to determine many more structures if we are to understand biology at the molecular level. However, analysis of recently deposited structural data is very revealing. Figure 4a illustrates the distribution of 2159 new structural domains classified in the 10 months from June 1997 to March 1998. A large proportion of these (79%) were clearly homologous (≥30% identity) to proteins of known structure.
Of the remaining 443 structures (Fig. 4b), corresponding to 'new' sequences, we found only 8% were novel folds, the remainder resembling a previously determined structure. Many of these, 199 (45%), could be identified as clear homologues by having significant structure and sequence similarity (SSAP ≥80 and ≥20% sequence identity). A further 169 (38%) were probable homologues as, although the sequence identity was below 20%, they had functional similarity and/or gave significant scores using sequence search methods designed to detect very distant homologues (PSI-BLAST) (12). There remained a further 40 (9%) proteins which were analogous, i.e. they had the same fold as a previous entry but neither the sequence nor the function gave definite evidence of a common ancestor.
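The categories used in this breakdown can be expressed as a small decision function. This is a hedged sketch, not the authors' protocol: the function and parameter names are hypothetical, the order in which the rules are tested is an assumption, and using an SSAP score of 70 as the 'same fold' cutoff is borrowed from the fold-level score quoted elsewhere in these notes rather than from this paragraph.

```python
def classify_relative(ssap, seq_identity,
                      functional_similarity=False,
                      distant_sequence_hit=False,
                      same_fold_ssap=70.0):
    """Assign a new structure to one of the categories described in the text.

    ssap                  -- best SSAP score against known structures (0-100)
    seq_identity          -- percent sequence identity to that best match
    functional_similarity -- True if the two proteins share a function
    distant_sequence_hit  -- True if a PSI-BLAST-style search scored significantly
    same_fold_ssap        -- assumed SSAP cutoff for 'same fold' (assumption)
    """
    if seq_identity >= 30.0:
        return "clear homologue"          # sequence alone is sufficient
    if ssap >= 80.0 and seq_identity >= 20.0:
        return "clear homologue"          # structure plus sequence
    if ssap >= same_fold_ssap:
        if functional_similarity or distant_sequence_hit:
            return "probable homologue"   # low identity, but other evidence
        return "analogue"                 # same fold, no evidence of ancestry
    return "novel fold"


print(classify_relative(85, 25))
print(classify_relative(85, 12, functional_similarity=True))
print(classify_relative(75, 10))
print(classify_relative(40, 5))
```

The counts in the text are consistent with these four bins: 199, 169 and 40 of the 443 structures leave 35, which is the 8% reported as novel folds.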
At the homologous superfamily level in CATH, a more detailed analysis of enzyme functions showed that the majority of homologous enzyme families in CATH (>90%) contained proteins for which the first three EC identifiers were the same. Considering those families where homologues have significant sequence identity (≥20%) after structural alignment, 95% were found to have a single EC identifier, whilst for families where proteins have more than 30% sequence similarity we observed that 98% had a single EC code.
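Comparing 'the first three EC identifiers' amounts to truncating each four-part EC number and checking that all family members agree. The sketch below illustrates that check; the function names and the example EC numbers are chosen for illustration and are not taken from the source.

```python
def ec_prefix(ec_number, levels=3):
    """Return the first `levels` fields of an EC number such as '3.4.21.62'."""
    return tuple(ec_number.split(".")[:levels])


def family_is_conserved(ec_numbers, levels=3):
    """True if every member of the family agrees on the first `levels` EC fields."""
    return len({ec_prefix(ec, levels) for ec in ec_numbers}) == 1


# Illustrative family: same class, subclass and sub-subclass, different serial number.
family = ["3.4.21.62", "3.4.21.14"]
print(family_is_conserved(family))             # agreement on three levels
print(family_is_conserved(family, levels=4))   # full four-level agreement
```

Setting levels=4 corresponds to the stricter 'single EC code' criterion quoted for the families with higher sequence identity.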
Although assigning function on the basis of homology is common practice, it is clear that some caution should be exercised, particularly where there is little or no sequence similarity. There are also some clear examples where homologues with significant sequence similarity perform different functions. The role of 'gene recruitment' is especially clear in the eye lens proteins, which function as enzymes in other cellular environments but which are used as structural proteins in this context (17). The extent of such 'gene recruitment' and context-sensitive function is really not known at this time.
For enzymes it is clear that catalytic function can change and evolve, usually to act on a different but related substrate. Similarly, within the lipocalin family (CATH id 2.40.130.10), several proteins are found with very similar structures which bind different fatty acids in the same region at the base of the β-barrel (e.g. retinol, bilin, biotin).
Nearly half of the homologous families where two or more different EC numbers were observed belong to the superfolds. This suggests that if a new protein is assigned to a superfold family, more caution should be used when inheriting functional information, as there appears to be greater tolerance to changes in sequence, and ultimately function, for these families. However, it is interesting to note that many of these were TIM barrel or Rossmann folds. These are superfolds in which the substrate or ligand commonly binds in the same place: in the base of the β-barrel for the TIMs, and at the crossover of the polypeptide chain for the doubly wound Rossmann structures.
Previous SectionNext Section
Assignment of Function Through Structure
One of the reasons for determining structures is to derive more information to facilitate the
assignment of function. From our analysis of proteins in CATH, we suggest that structural data can
help to assign function in several ways:
i. The structural data allow recognition of more distant homologues compared with sequence data; in
our analysis, 83% of structures with novel sequences could be assigned as homologues in this way
(note that such assignment of function is again subject to the caveats imposed by 'gene
recruitment', discussed above).
ii. The structural data allow detailed inspection of the functional site, to suggest if and how the
function may have evolved. For example, if an enzyme has evolved to act on a different substrate,
the binding site may reveal, or at least suggest, possible changes in the substrate.
iii. For the superfolds, similarity of structure does not necessarily mean similarity of function.
However, the active site/binding sites are often conserved, e.g. in the TIM barrel or Rossmann fold
structures the ligand always binds at the same end of the barrel or sheet.
iv. Some methods have already been developed, and will increasingly be the focus of attention over
the next few years, which aim to predict function ab initio from structure. For example, enzymes can
often be identified by the presence of a major cleft, which also locates the active site (18).
Similarly, critical surface patches which are used for molecular recognition in binding other
proteins or ligands may be identified using knowledge-based approaches (19,20).
In summary, extrapolating the data from Figure 4 to a new genome, we can expect that of the 54–70%
of sequences which currently have no obvious sequence matches in the PDB, we will find nearly 80–90%
to be homologous to a known family using the structural data alone. For the singlet folds this will
almost certainly reveal some clues to the function. For the superfolds, some folds will reveal
information on the functional class (e.g. enzyme for TIM barrels) or the location of the active
site, if not the specific function. Only 10–20% will be expected to be 'novel' folds. For these, the
ab initio methods referred to above may provide some clues to guide experiments. Therefore it is
clear that determining structures, as part of a 'structural genomics' initiative for example, will
make a major contribution to interpreting genome data.
JAIL
Just Another Interface Library
Interfaces of macromolecules are a valuable basis for analysing the process of molecular
recognition. JAIL classifies not only the interfaces between domain architectures but also those
between protein chains and those between proteins and nucleic acids.
Of course not
[Figure: Gt-alpha/Gi-alpha chimera (PDB-ID 1GOT) and the interfaces of 1GOT]
Interacting proteins are difficult to crystallize and are rarely present in the Protein Data Bank.
Nevertheless, it is essential to analyse the interacting parts of the proteins to understand the
process of protein-protein docking.
To overcome this problem we have built up the JAIL database. Since interacting domains exhibit
structural features similar to those of interacting proteins, all known interfaces between
interacting domains of the SCOP database were extracted and classified in JAIL.
Only a part of all protein structures is included in SCOP. In particular, new PDB entries are not
yet annotated. To overcome this problem, all interfaces between protein chains were additionally
calculated and included in the database. This type of interface also comprises the interacting
parts of the assumed biological units. The last important type of interface provided here is
composed of the interacting parts between proteins and nucleic acids. Overall, the data set
consists of about 180,000 interfaces.
JAIL is a convenient tool for browsing the interface library and analysing single interfaces.
However, more general questions require large-scale analysis. For this purpose, a detailed form
enables the compiling of comprehensive non-redundant data sets for download.
How is an interface defined?
A complete residue is part of an interface if at least one atom of the amino acid is located
within 4.5 Ångström of any atom of the interacting domain or chain. In the case of protein chains,
each part of an interface must consist of at least 5 C-alpha atoms. In the case of nucleic acids,
the relevant atoms are calculated on the basis of the phosphorus atoms of the RNA/DNA backbone.
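The distance criterion above can be sketched in a few lines of Python. This is only an illustration, not JAIL's implementation: it reads the cutoff quoted above as 4.5 Å, treats "at least 5 C-alpha atoms" as "at least 5 residues on each side", and uses a hypothetical atom record of the form (residue_id, (x, y, z)).

```python
import math

CUTOFF = 4.5       # Angstrom; assumed reading of the cutoff quoted in the text
MIN_RESIDUES = 5   # one C-alpha per residue, so 5 C-alpha atoms = 5 residues

def interface_residues(chain_a, chain_b, cutoff=CUTOFF):
    """Return the residue ids of each chain that lie in the interface.

    Each chain is a list of (residue_id, (x, y, z)) atom records. A whole
    residue counts as soon as one of its atoms is within the cutoff of any
    atom of the partner chain.
    """
    in_a, in_b = set(), set()
    for res_a, xyz_a in chain_a:
        for res_b, xyz_b in chain_b:
            if math.dist(xyz_a, xyz_b) <= cutoff:
                in_a.add(res_a)
                in_b.add(res_b)
    return in_a, in_b

def is_interface(chain_a, chain_b):
    """Apply the minimum-size rule to both sides of the contact."""
    in_a, in_b = interface_residues(chain_a, chain_b)
    return len(in_a) >= MIN_RESIDUES and len(in_b) >= MIN_RESIDUES

# Toy data: two parallel strands of six atoms, 4.0 Angstrom apart.
chain_a = [(i, (3.8 * i, 0.0, 0.0)) for i in range(1, 7)]
chain_b = [(i, (3.8 * i, 4.0, 0.0)) for i in range(1, 7)]
far_chain = [(i, (3.8 * i, 10.0, 0.0)) for i in range(1, 7)]
```

The all-against-all loop is quadratic in the number of atoms; for whole PDB entries a spatial index would be the usual choice, but the rule being applied stays the same.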
What are biounits?
The primary coordinate file deposited in the PDB generally contains one asymmetric unit. The
asymmetric unit is the smallest portion of a crystal structure to which crystallographic symmetry
can be applied to generate one unit cell. The biological molecule (biounit) is believed to be the
functional unit of the protein. Frequently, those units can be assumed or calculated when
additional information is available. The biological units of many proteins are deposited in a
separate section of the PDB database and can be used for interface calculations.
How are redundant interfaces excluded in the download section?
Redundancy is removed in two different ways: by structure and by sequence. The sequence clustering
is based on the CD-HIT program. The structural clustering is defined by the protein families and
superfamilies of the SCOP classification, which classifies proteins by domain architecture.
Which settings in the download section are best for my own research?
The selection of the datasets depends on the type of interaction (protein-protein or
protein-nucleic acid) and the level of diversity that is desired. A maximum sequence identity of
50% results in a more diverse set than a setting of 95%. The default settings include interfaces of
domain-domain interactions as well as interfaces between interacting chains. All chain interfaces
that are already covered by the SCOP domain interfaces are excluded by default. This procedure
results in a high number of interfaces that are still diverse enough for statistical analysis.
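The effect of the identity threshold can be illustrated with a small greedy clustering sketch. It is not CD-HIT itself: the similarity function below is a stand-in based on Python's difflib rather than an alignment-based sequence identity, and the sequences are invented. Only the general idea is kept, namely that a sequence joins the first representative it resembles closely enough, and otherwise becomes a new representative.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Stand-in similarity in [0, 1]; NOT a true alignment-based identity."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()

def greedy_cluster(seqs, threshold):
    """Longest-first greedy clustering.

    Each cluster is a list whose first element is its representative. A
    sequence is added to the first cluster whose representative it matches
    at >= threshold; if none matches, it starts a new cluster.
    """
    clusters = []
    for seq in sorted(seqs, key=len, reverse=True):
        for cluster in clusters:
            if similarity(cluster[0], seq) >= threshold:
                cluster.append(seq)
                break
        else:
            clusters.append([seq])
    return clusters

# Invented toy sequences: the first two differ at a single position.
toy = ["AAAAAAAAAA", "AAAAAAAAAC", "CCCCCCCCCC"]
strict = greedy_cluster(toy, 0.95)   # near-duplicates kept apart
loose = greedy_cluster(toy, 0.50)    # near-duplicates merged
```

With the looser threshold the two near-identical sequences collapse into one cluster, so the surviving representatives are fewer but more different from one another, which is the sense in which a 50% setting gives a more diverse data set than 95%.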
What is meant by "show conservation" in Jmol?
The conservation of protein sequences is defined by the mutation rates at each amino acid
position. For JAIL this information was retrieved from ConSurf. ConSurf is a derived database
merging structural and sequence information.
Database scheme
Search for:
PDB-Id, e.g. 1aay
SCOP-Id, e.g. d1az0a_
EC-Number, e.g. 3.1.1
Accession number, e.g. P03697
Protein name, e.g. capsid protein
Search in the following interface types: Domain/Domain (SCOP), Chain/Chain, Protein/Nucleic,
Biounit/Biounit, None
Full-text search and SCOP text search: by keyword
Consider only interfaces having the following location: intra, inter, don't care
MMDB
Experimentally resolved structures of proteins, RNA and DNA, derived from the Protein Data Bank
(PDB), with value-added features such as explicit chemical graphs, computationally identified 3D
domains (compact substructures) that are used to identify similar 3D structures, as well as links
to literature, similar sequences, information about chemicals bound to the structures, and more.
These connections make it possible, for example, to find 3D structures for homologs of a protein
sequence of interest, then interactively view the sequence-structure relationships, active sites,
bound chemicals, journal articles, and more.
Three-dimensional structures are now known within many protein families, and it is quite likely,
in searching a sequence database, that one will encounter a homolog with known structure. The goal
of Entrez's 3D-structure database is to make this information, and the functional annotation it
can provide, easily accessible to molecular biologists. To this end, Entrez's search engine
provides three powerful features:
(i) Sequence and structure neighbors: one may select all sequences similar to one of interest, for
example, and link to any known 3D structures.
(ii) Links between databases: one may search by term matching in MEDLINE, for example, and link to
3D structures reported in these articles.
(iii) Sequence and structure visualization: having identified a homolog with known structure, one
may view molecular-graphic and alignment displays to infer approximate 3D structure.
In this article we focus on two features of Entrez's Molecular Modeling Database (MMDB) not
described previously: links from individual biopolymer chains within 3D structures to a systematic
taxonomy of organisms represented in molecular databases, and links from individual chains (and
compact 3D domains within them) to structure neighbors, i.e. other chains (and 3D domains) with
similar 3D structure. MMDB may be accessed at
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Structure
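For scripted rather than interactive access, a term search against the structure database can be expressed as a URL. The sketch below uses NCBI's E-utilities ESearch endpoint, which is not mentioned in the text above and is swapped in here as an assumption about how one would query MMDB programmatically today; the search term is just an example. The code only builds the URL and does not contact the server.

```python
from urllib.parse import urlencode

# E-utilities ESearch endpoint (an assumption of this sketch; the text above
# only gives the interactive Entrez address).
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def structure_search_url(term, retmax=20):
    """Build an ESearch URL that searches the Entrez 'structure' database."""
    query = urlencode({"db": "structure", "term": term, "retmax": retmax})
    return f"{EUTILS_ESEARCH}?{query}"

example_url = structure_search_url("subtilisin", retmax=5)
```

Fetching the resulting URL returns a list of matching record identifiers; following links from sequences to structures would use a separate linking call, which is left out of this sketch.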
SUPERFAMILY is a database of structural and functional annotation for all proteins and genomes
[1][2][3][4][5].
The SUPERFAMILY annotation is based on a collection of hidden Markov models, which represent
structural protein domains at the SCOP superfamily level [6]. A superfamily groups together domains
which have an evolutionary relationship. The annotation is produced by scanning protein sequences
from completely sequenced genomes against the hidden Markov models.
For each protein you can:
Submit sequences for SCOP classification
View domain organisation, sequence alignments and protein sequence details
For each genome you can:
Examine superfamily assignments, phylogenetic trees, domain organisation lists and networks
Check for over- and under-represented superfamilies within a genome
For each superfamily you can:
Inspect SCOP classification, functional annotation, Gene Ontology annotation, InterPro abstract and
genome assignments
Explore taxonomic distribution of a superfamily across the tree of life
All annotation, models and the database dump are freely available for download to everyone.
Purpose
SUPERFAMILY classifies amino acid sequences into known structural domains, especially into SCOP
superfamilies. The superfamilies are groups of proteins which have structural evidence to support
a common evolutionary ancestor but may not have detectable sequence homology.
Major Features
Sequence search: Submit your protein or DNA sequence for SCOP superfamily and family level
classification.
Keyword search: Search for superfamily, family or species names, plus sequence, SCOP, PDB or hidden
Markov model IDs.
Domain assignments: Domain assignments, alignments and architectures for completely sequenced
eukaryotic and prokaryotic organisms, plus sequence collections.
Comparative genomics tools: Browse unusual (over- and under-represented) superfamilies and
families, adjacent domain pair lists and graphs, unique domain pairs, domain combinations, domain
architecture co-occurrence networks and domain distribution across taxonomic kingdoms for each
organism.
Genome statistics: For each genome: number of sequences, number of sequences with assignment,
percentage of sequences with assignment, percentage total sequence coverage, number of domains
assigned, number of superfamilies assigned, number of families assigned, average superfamily size,
percentage produced by duplication, average sequence length, average length matched, number of
domain pairs and number of unique domain architectures.
Gene Ontology: Domain-centric Gene Ontology (GO), automatically annotated, by Hai Fang.
Phenotype Ontology: Domain-centric phenotype/anatomy ontology including Disease Ontology, Human
Phenotype, Mouse Phenotype, Worm Phenotype, Yeast Phenotype, Fly Phenotype, Fly Anatomy, Zebrafish
Anatomy, Xenopus Anatomy, Arabidopsis Plant.
Superfamily annotation: InterPro abstracts for 1052 superfamilies and Gene Ontology (GO) annotation
for 763 superfamilies.
Functional annotation: Functional annotation of SCOP 1.73 superfamilies, by Christine Vogel.
Phylogenetic trees: Trees are generated using heuristic parsimony methods and are based on protein
domain architecture data for all genomes in SUPERFAMILY. Genome combinations or specific clades can
be displayed as individual trees.
Similar domain architectures: Find the 10 domain architectures which are most similar to a domain
architecture of interest.
Hidden Markov models: Produce SCOP domain assignments for your sequences using the SUPERFAMILY
models. HMM visualisation by Martin Madera, e.g. model 0045110.
Profile comparison: Find remote domain matches when the HMM search fails to find a significant
match. Profile comparison (PRC), for aligning and scoring two profile hidden Markov models, by
Martin Madera.
Web services: Distributed Annotation Server and linking to SUPERFAMILY.
Downloads: Sequences, assignments, models, MySQL database and scripts, updated weekly.
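Several of the genome statistics listed above are simple ratios over the domain assignments. The sketch below shows how a few of them could be computed; it is not SUPERFAMILY's own code, and the input layout (sequence id mapped to a length and a list of non-overlapping (start, end, superfamily) assignments with 1-based inclusive coordinates), the superfamily labels and the toy genome are all hypothetical.

```python
def genome_statistics(proteins):
    """Compute a few per-genome summary statistics.

    proteins maps a sequence id to (length, domains), where domains is a
    list of (start, end, superfamily) tuples with 1-based inclusive
    coordinates. Domains are assumed not to overlap.
    """
    n_sequences = len(proteins)
    n_assigned = sum(1 for _, domains in proteins.values() if domains)
    total_length = sum(length for length, _ in proteins.values())
    all_domains = [d for _, domains in proteins.values() for d in domains]
    covered = sum(end - start + 1 for start, end, _ in all_domains)
    return {
        "sequences": n_sequences,
        "sequences_with_assignment": n_assigned,
        "percent_sequences_with_assignment": 100 * n_assigned / n_sequences,
        "percent_sequence_coverage": 100 * covered / total_length,
        "domains_assigned": len(all_domains),
        "superfamilies_assigned": len({sf for _, _, sf in all_domains}),
        "average_sequence_length": total_length / n_sequences,
    }

# Hypothetical three-protein genome; "sfA" and "sfB" are made-up labels.
toy_genome = {
    "p1": (100, [(1, 50, "sfA")]),
    "p2": (200, [(1, 100, "sfA"), (101, 150, "sfB")]),
    "p3": (100, []),
}
stats = genome_statistics(toy_genome)
```

In this toy genome two of three sequences carry an assignment, and 200 of 400 residues are covered by a domain, so the residue-level coverage is lower than the sequence-level percentage; the two numbers answer different questions and are reported separately.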
The SUPERFAMILY database provides protein domain assignments, at the SCOP superfamily level, for
the predicted protein sequences in over 400 completed genomes. A superfamily groups together
domains of different families which have a common evolutionary ancestor, based on structural,
functional and sequence data. SUPERFAMILY domain assignments are generated using an expert curated
set of profile hidden Markov models. All models and structural assignments are available for
browsing and download from http://supfam.org. The web interface includes services such as domain
architectures and alignment details for all protein assignments, searchable domain combinations,
domain occurrence network visualization, detection of over- or under-represented superfamilies for
a given genome by comparison with other genomes, assignment of manually submitted sequences and
keyword searches. In this update we describe the SUPERFAMILY database and outline two major
developments: (i) incorporation of family level assignments and (ii) a superfamily-level
functional annotation. The SUPERFAMILY database can be used for general protein evolution and
superfamily-specific studies, genomic annotation, and structural genomics target suggestion and
assessment.
FEATURES
The CATH database can be accessed at http://www.cathdb.info. The web interface may be browsed or,
alternatively, searched with PDB codes or CATH domain identifiers. There is also a facility for
keyword searches. With the version 3.0 release we now make the raw and processed data files
available, which include, for example, CATH domain PDB files, sequences and dssp files; they can be
accessed through the CATH database main page. The Gene3D resource can be accessed through the CATH
database or directly at http://www.cathdb.info/Gene3D. The DHS can be accessed through the CATH
database or directly at http://www.cathdb.info/bsm/dhs.
Structural similarity is assessed using an automatic method (SSAP) (3,4), which scores 100 for
identical proteins and generally returns scores above 80 for homologous proteins. More distantly
related folds generally give scores above 70 (Topology or fold level), though in the absence of any
sequence or functional similarity this may simply represent examples of convergent evolution,
reinforcing the hypothesis that there exists a limited number of folds in nature (5,6).
Abstract
We report the latest release (version 1.4) of the CATH protein domains database
(http://www.biochem.ucl.ac.uk/bsm/cath). This is a hierarchical classification of 13 359 protein
domain structures into evolutionary families and structural groupings. We currently identify 827
homologous families in which the proteins have both structural similarity and sequence and/or
functional similarity. These can be further clustered into 593 fold groups and 32 distinct
architectures. Using our structural classification and associated data on protein functions stored
in the database (EC identifiers, SWISS-PROT keywords and information from the Enzyme database and
literature) we have been able to analyse the correlation between the 3D structure and function.
More than 96% of folds in the PDB are associated with a single homologous family. However, within
the superfolds, three or more different functions are observed. Considering enzyme functions, more
than 95% of clearly homologous families exhibit either single or closely related functions, as
demonstrated by the EC identifiers of their relatives. Our analysis supports the view that
determining structures, for example as part of a 'structural genomics' initiative, will make a
major contribution to interpreting genome data.
Introduction for CATH
The CATH classification of protein domain structures was established in 1993 (1) as a hierarchical
clustering of protein domain structures into evolutionary families and structural groupings,
depending on sequence and structure similarity. There are four major levels, corresponding to
protein class, architecture, topology or fold, and homologous family (Fig. 1). Since 1995,
information about these structural groups and protein families has been accessible over the Web
(http://www.biochem.ucl.ac.uk/bsm/cath), together with summary information about each individual
protein structure (PDBsum) (2).
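The four levels map naturally onto a four-part identifier written in dotted form, such as 3.40.50.200. The sketch below parses such an identifier and reports the deepest level two domains share. The function and type names are invented for illustration, and the class numbering (1 mainly-alpha, 2 mainly-beta, 3 alpha-beta, 4 few secondary structures) is stated here as background knowledge about CATH rather than something spelled out in this passage.

```python
from typing import NamedTuple, Optional

CLASS_NAMES = {
    1: "mainly-alpha",
    2: "mainly-beta",
    3: "alpha-beta",
    4: "few secondary structures",
}
LEVELS = ("class", "architecture", "topology", "homologous family")

class CathId(NamedTuple):
    class_: int
    architecture: int
    topology: int
    homologous_family: int

def parse_cath_id(text: str) -> CathId:
    """Parse a dotted C.A.T.H identifier such as '3.40.50.200'."""
    parts = [int(p) for p in text.split(".")]
    if len(parts) != 4:
        raise ValueError(f"expected four dot-separated numbers, got {text!r}")
    return CathId(*parts)

def deepest_shared_level(a: CathId, b: CathId) -> Optional[str]:
    """Name the deepest hierarchy level at which a and b still agree."""
    shared = None
    for name, x, y in zip(LEVELS, a, b):
        if x != y:
            break
        shared = name
    return shared

first = parse_cath_id("3.40.50.200")
second = parse_cath_id("3.40.50.300")
```

Because the hierarchy is nested, agreement is checked from the class downwards and stops at the first mismatch: two domains with the same topology number but different architectures do not share a fold.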
CATH consists of both phylogenetic and phenetic descriptors for protein domain relationships. At
the lowest levels in the hierarchy, proteins are grouped into evolutionary families (Homologous
families) for having either significant sequence similarity (≥35% identity) or high structural
similarity and some sequence similarity (≥20% identity).
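As a rough illustration, the grouping rule can be written as a small decision function. This is a simplification under stated assumptions: "high structural similarity" is taken to mean an SSAP score of at least 80 and the fold level a score of at least 70, the figures this document associates with homologous proteins and with more distantly related folds. The real classification also draws on functional data and manual validation, which the sketch ignores.

```python
def classify_pair(ssap_score, seq_identity):
    """Suggest a CATH-style relationship for two domains.

    ssap_score: structural similarity on a 0-100 scale.
    seq_identity: percentage sequence identity (0-100).
    The thresholds are illustrative assumptions, not CATH's full protocol.
    """
    if seq_identity >= 35:
        return "homologous family"      # sequence similarity alone suffices
    if ssap_score >= 80 and seq_identity >= 20:
        return "homologous family"      # structure plus some sequence signal
    if ssap_score >= 70:
        return "same fold"              # may be distant homology or analogy
    return "no clear relationship"
```

For example, a pair scoring 85 on structure with 25% identity falls into the same homologous family, while a pair scoring 75 with 10% identity is only grouped at the fold level.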
The CATH database of protein structures contains approximately 18 000 domains, organized according
to their (C)lass, (A)rchitecture, (T)opology and (H)omologous superfamily. Relationships between
evolutionarily related structures (homologues) within the database have been used to test the
sensitivity of various sequence search methods, in order to identify relatives in GenBank and other
sequence databases. Subsequent application of the most sensitive and efficient algorithms, gapped
BLAST and the profile-based method Position-Specific Iterated Basic Local Alignment Search Tool
(PSI-BLAST), could be used to assign structural data to between 22% and 36% of microbial genomes,
in order to improve functional annotation and enhance understanding of biological mechanism.
However, on a cautionary note, an analysis of functional conservation within fold groups and
homologous superfamilies in the CATH database revealed that, whilst function was conserved in
nearly 55% of enzyme families, function had diverged considerably in some highly populated
families. In these families, functional properties should be inherited far more cautiously, and the
probable effects of substitutions in key functional residues must be carefully assessed.
Figure 1
Schematic representation of the (C)lass, (A)rchitecture and (T)opology/fold levels in the CATH
database.
Figure 2
Snapshot of a web page showing data available in the CATH dictionary of homologous superfamilies
for the subtilisin family (CATH id 3.40.50.200). Tables display the PDB codes for non-identical
relatives in the family, together with EC identifier codes and information about the enzyme
reactions. The multiple structural alignment shown has been coloured according to secondary
structure assignments (red for helix, blue for strands).
The Architecture level in CATH groups proteins whose folds have similar 3D arrangements of
secondary structures (e.g. barrel, sandwich or propeller), regardless of their connectivity,
whilst the top level, Class, simply reflects the proportion of α-helix or β-strand secondary
structures. Three major classes are recognised: mainly-α, mainly-β and α-β, since analysis revealed
considerable overlap between the α+β and alternating α/β classes originally described by Levitt and
Chothia (7).
Before classification, multidomain proteins are first separated into their constituent folds using
a consensus method which seeks agreement between three independent algorithms (8). Whilst the
protocol for updating CATH is largely automatic (9), several stages require manual validation, in
particular establishing domain boundaries in proteins for which no consensus could be reached, and
checking the relationships of very distant homologues and proteins having borderline fold
similarity. Although there are plans to assign the more regular architectures automatically, all
architecture groupings are currently assigned manually.
A homologous family Dictionary is now available within CATH, which contains functional data, where
available, for each protein within a homologous family. This includes EC identifiers, SWISS-PROT
keywords and information from the Enzyme database or the literature (Fig. 2). Multiple
structure-based alignments are also available, coloured according to secondary structure
assignments or residue properties, and there are schematic plots showing domain representations
annotated by protein-ligand interactions (DOMPLOTS) (A.E. Todd, C.A. Orengo and J.M. Thornton,
submitted to Protein Engng). The topology of each domain is illustrated by schematic TOPS diagrams
(http://www3.ebi.ac.uk/tops; 10).
Figure 3
CATH wheel plot showing the population of homologous families in different fold groups,
architectures and classes. The wheel is coloured according to protein class (red, mainly-α; green,
mainly-β; yellow, α-β; blue, few secondary structures). The size of the outer wheel represents the
number of homologous families in CATH, whilst each band in the outer wheel corresponds to a single
fold family. The size of each fold 'band' therefore reflects the number of homologous families
having that fold. It can be seen that most fold families contain a single homologous family. The
superfold families are shown as paler bands containing many homologous families. The inner wheel
shows the population of homologous families in the different architectures.
We have also recently set up a Web Server (11) which enables the user to scan the CATH database
with a newly determined protein structure and identify possible fold similarities or evolutionary
relationships. There are also plans to incorporate sequence searches (using BLAST or PSI-BLAST)
(12) to identify a probable fold for a new sequence.
The latest release of CATH (version 1.4, April 1998) contains 9342 protein chains from the PDB
(13), which divide into 13 359 domain folds. Currently 32 different architectures are recognised.
Since the last release, three new architectures have been described, including the five-bladed α-β
propeller. Grouping proteins on the basis of sequence, structure and functional similarity gives
827 evolutionary homologous families (H-level), whilst recognising more distant structural
similarity, with no accompanying sequence or function similarity, gives rise to 593 different fold
groups (T-level).
The population of the different levels in the CATH hierarchy is illustrated by the CATH wheel shown
in Figure 3. It can be seen that several highly populated fold families, which we describe as
superfolds (6) as they support a diverse range of sequences and more than three different
functions, still account for nearly 30% of non-homologous structures.
Implications for Structural Genomics
As the sequence databases grow rapidly, the need to interpret these sequences and assign functions
to specific genes becomes increasingly important. Many techniques exist for matching protein
sequences and thereby inheriting functional information. However, for very distant homologues there
is often no detectable sequence similarity despite conservation of 3D structure and function. For
these cases, evolutionary relationships, and thereby functions, can only be assigned by comparing
the structures. Therefore a number of structural genomics initiatives are being proposed (14) which
aim to identify all the folds in nature, with the ultimate goal of being able to predict the
function of a new protein from its known or probable structure. The important questions to ask are:
how many more folds do we need to determine before we have the complete set, and how confident can
we be in assigning function between proteins having similar structures?
In the current genomes, on average only 30–46% of sequences can be assigned to a structural family
by recognising sequence similarity to a protein of known structure (15,16). With only ~600 unique
structures currently in the PDB, compared with ~20 000 sequence families, it is clear that we still
need to determine many more structures if we are to understand biology at the molecular level.
However, analysis of recently deposited structural data is very revealing. Figure 4a illustrates
the distribution of 2159 new structural domains classified in the 10 months from June 1997 to March
1998. A large proportion of these (79%) were clearly homologous (≥30% identity) to proteins of
known structure.
Of the remaining 443 structures (Fig. 4b), corresponding to 'new' sequences, we found only 8% were
novel folds, the remainder resembling a previously determined structure. Many of these, 199 (45%),
could be identified as clear homologues by having significant structure and sequence similarity
(SSAP ≥80 and ≥20% sequence identity). A further 169 (38%) were probable homologues: although the
sequence identity was below 20%, they had functional similarity and/or gave significant scores
using sequence search methods designed to detect very distant homologues (PSI-BLAST) (12). There
remained a further 40 (9%) proteins which were analogous, i.e. they had the same fold as a previous
entry but neither the sequence nor the function gave definite evidence of a common ancestor.
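The breakdown of the 443 structures with new sequences can be checked with a few lines of arithmetic. The counts 443, 199, 169 and 40 are taken from the text; the number of novel folds is not stated as a count, so it is derived here as the remainder, and the variable names are ours.

```python
new_sequences = 443
clear_homologues = 199
probable_homologues = 169
analogues = 40

# The novel folds are whatever is left after the three stated groups.
novel_folds = new_sequences - (clear_homologues + probable_homologues + analogues)

def percent(part, whole=new_sequences):
    """Percentage of the 443 new-sequence structures, rounded to an integer."""
    return round(100 * part / whole)

breakdown = {
    "clear homologues": percent(clear_homologues),
    "probable homologues": percent(probable_homologues),
    "analogues": percent(analogues),
    "novel folds": percent(novel_folds),
}

# Share recognisable as homologues once structure is taken into account.
homologues_by_structure = percent(clear_homologues + probable_homologues)
```

The remainder comes to 35 structures, and the rounded shares add up to 100. The combined share of clear and probable homologues is the proportion of novel sequences that structural data allow to be assigned to a known family.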
At the homologous superfamily level in CATH, a more detailed analysis of enzyme functions showed
that the majority of homologous enzyme families in CATH (>90%) contained proteins for which the
first three EC identifiers were the same. Considering those families where homologues have
significant sequence identity (≥20%) after structural alignment, 95% were found to have a single EC
identifier, whilst for families where proteins have more than 30% sequence similarity we observed
that 98% had a single EC code.
Although assigning function on the basis of homology is common practice, it is clear that some
caution should be exercised, particularly where there is little or no sequence similarity. There
are also some clear examples where homologues with significant sequence similarity perform
different functions. The role of 'gene recruitment' is especially clear in the eye lens proteins,
which function as enzymes in other cellular environments but which are used as structural proteins
in this context (17). The extent of such 'gene recruitment' and context-sensitive function is
really not known at this time.
For enzymes, it is clear that catalytic function can change and evolve, usually to act on a
different but related substrate. Similarly, within the lipocalin family (CATH id 2.40.130.10),
several proteins are found with very similar structures which bind different fatty acids in the
same region at the base of the β-barrel (e.g. retinol, bilin, biotin).
Nearly half of the homologous families where two or more different EC numbers were observed belong
to the superfolds. This suggests that if a new protein is assigned to a superfold family, more
caution should be used when inheriting functional information, as there appears to be greater
tolerance to changes in sequence, and ultimately function, for these families. However, it is
interesting to note that many of these were TIM barrel or Rossmann folds. These are superfolds in
which the substrate or ligand commonly binds in the same place: in the base of the β-barrel for the
TIMs, and at the crossover of the polypeptide chain for the doubly wound Rossmann structures.
Previous SectionNext Section
Assignment of Function Through Structure
One of the reasons for determining structures is to derive more information to facilitate the assignmentof function From our analysis of proteins in CATH we suggest that structural data can help to assign
function in several ways
i The structural data allow recognition of more distant homologues compared withsequence data mdash in our analysis 83 of structures with novel sequences could be assigned ashomologues in this way (note that such assignment of function is again subject to the caveats
imposed by gene recruitmentlsquo discussed above)
ii The structural data allows detailed inspection of the functional site mdash to suggest if and
how the function may have evolved For example if an enzyme has evolved to act on a
different substrate the binding site may reveal or at least suggest possible changes in the
substrate
iii For the superfolds similarity of structure does not necessarily mean similarity of
function However the active sitebinding sites are often conserved eg in the TIM barrel or
Rossmann fold structures the ligand always binds at the same end of the barrel or sheet
iv Some methods have already been developed and will increasingly be the focus of attention over the next few years which aim to predict function ab initio from structure For
example enzymes can often be identified by the presence of a major cleft which also locates
the active site (18) Similarly critical surface patches which are used for molecular
recognition in binding other proteins or ligands may be identified using knowledge-basedapproaches (1920)
In summary extrapolating the data from Figure 4 to a new genome we can expect that of the 54 ndash 70
of sequences which currently have no obvious sequence matches in the PDB we will find nearly 80 ndash 90 to be homologous to a known family using the structural data alone For the singlet folds this
7272019 Essential Info Notes-1
httpslidepdfcomreaderfullessential-info-notes-1 5157
will almost certainly reveal some clues to the function For the superfolds some folds will reveal
information on the functional class (eg enzyme for TIM barrels) or the location of the active site if
not the specific function Only 10 ndash20 will be expected to be novellsquo folds For these the ab
initio methods referred to above may provide some clues to guide experiments Therefore it is clear
that determining structures as part of a structural genomicslsquo initiative for example will make a
major contribution to interpreting genome data
Jail
Just another interface library
Interfaces of macromolecules are a valuable basis to analyse the processof molecular recognition JAIL classifies not only the interfaces betweendomain architectures but also those between protein chains and thosebetween proteins and nucleic acids
Of course not
Gt-alphaGi-alpha chimera (PDB-ID 1GOT) Interfaces of 1GOT
Interacting proteins are difficult to crystallize and rarely present within the Protein Data Base
Nevertheless it is essential to analyse the interacting parts of the proteins to understand the
process of protein-protein docking
To overcome this problem we have built up the JAIL database Since interacting domains
exhibit similiar structural features than proteins all known interfaces between interacting
domains of the SCOP database were extracted and classified in JAIL
Only a part of all protein structures are included in SCOP Particularly new PDB entries are
not yet annotated To overcome this problem additionally all interfaces between protein
chains were calculated and included in the database This type of interface also comprises
the interacting parts of the assumed biological units The last important type of interfaces
provided here is composed of the interacting parts between proteins and nucleic acids
Overall the data set consists of about 180000 interfaces
JAIL is a comfortable tool to browse through the interface library and to analyze single
interfaces However more general questions require large-scale analysis For this purpose
a detailed form enables the compiling of comprehensive non redundant data sets for
download
7272019 Essential Info Notes-1
httpslidepdfcomreaderfullessential-info-notes-1 5257
How is an interface defined
A complete residue is part of an interface if at least one atom of the aminoacid is located
within a range of 45 Angstroem of any atom of the interacting domain or chain One part of
an interface must consist of at least 5 C-alpha atoms in the case of protein chains In thecase of nucleic acids the relevant atoms are calculated on the basis of the phosphor atoms
of the RNADNA-backbone
What are biounits
The primary coordinate file deposited in the PDB generally contains one asymmetric unit
The asymmetric unit is the smallest portion of a crystal structure to which crystallographic
symmetry can be applied to generate one cell The biological molecule (biounit) is believed
to be the functional unit of the protein Frequently those units can be assumed or calculatedwhen additional information is available The biological units of many proteins are deposited
in a separate section of the PDB database and can be used for interface calculations More
information about biounits
In which way are redundant interfaces excluded in the download section?
Redundancy is excluded in two different ways: by structure and by sequence. The
sequence clustering is based on the CD-HIT program. The structural clustering is defined by
the protein families and superfamilies of the SCOP classification, a database that classifies
proteins by domain architecture.
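To make the idea of sequence-based redundancy removal concrete, here is a minimal sketch of greedy, longest-first clustering in the spirit of CD-HIT. It is not the CD-HIT implementation: the identity function is a crude ungapped stand-in for a real alignment, and the function names and toy sequences are assumptions made for illustration.

```python
def identity(a, b):
    """Crude stand-in for alignment-based identity: fraction of matching
    positions over the shorter sequence, compared without gaps."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    return sum(1 for x, y in zip(a, b) if x == y) / n


def greedy_cluster(seqs, threshold):
    """Longest-first greedy clustering: each sequence joins the first representative
    it matches at >= threshold, otherwise it founds a new cluster."""
    clusters = {}  # representative sequence -> list of member sequences
    for s in sorted(seqs, key=len, reverse=True):
        for rep in clusters:
            if identity(s, rep) >= threshold:
                clusters[rep].append(s)
                break
        else:
            clusters[s] = [s]
    return clusters


seqs = ["MKTAYIAKQR", "MKTAYIAKQK", "MKTGGGAKQR", "GGGGGGGGGG"]
print(len(greedy_cluster(seqs, 0.50)))  # 2 representatives
print(len(greedy_cluster(seqs, 0.95)))  # 4 representatives
```

A lower identity threshold merges more sequences into each cluster, so fewer representatives survive and those that remain are less similar to one another.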
Which settings in the download section are best for my own research?
The selection of the data sets depends on the type of interaction (protein-protein or
protein-nucleic acid) and the level of diversity that is desired. A maximum sequence identity
of 50% results in a higher diversity than a setting of 95%. The default settings include
interfaces of domain-domain interactions as well as interfaces between interacting chains.
All interfaces of chains that are already covered by the SCOP domain interfaces are excluded
by default. This procedure results in a high number of interfaces that are still diverse
enough for statistical analysis.
What is meant by "show conservation" in Jmol?
The conservation of protein sequences is defined by the mutation rates at each amino acid
position. For JAIL this information was retrieved from ConSurf, a derived
database merging structural and sequence information.
Database scheme
Search for:
PDB-Id, e.g. 1aay
SCOP-Id, e.g. d1az0a_
EC number, e.g. 3.1.1
Accession number, e.g. P03697
Protein name, e.g. capsid protein
Search in the following interface types:
Domain/Domain (SCOP)
Chain/Chain
Protein/Nucleic
Biounit/Biounit
None
Full-text search: keyword
SCOP text search: keyword
Consider only interfaces having the following location: intra, inter, don't care
MMDB
MMDB contains experimentally resolved structures of proteins, RNA and DNA derived from the
Protein Data Bank (PDB), with value-added features such as explicit chemical graphs and
computationally identified 3D domains (compact substructures) that are used to identify
similar 3D structures, as well as links to literature, similar sequences, information about
chemicals bound to the structures, and more. These connections make it possible, for example,
to find 3D structures for homologs of a protein sequence of interest and then interactively
view the sequence-structure relationships, active sites, bound chemicals, journal articles
and more.
Three-dimensional structures are now known within many protein families, and it is
quite likely, in searching a sequence database, that one will encounter a homolog
with known structure. The goal of Entrez's 3D-structure database is to make this
information, and the functional annotation it can provide, easily accessible to
molecular biologists. To this end, Entrez's search engine provides three powerful
features: (i) Sequence and structure neighbors: one may select all sequences similar
to one of interest, for example, and link to any known 3D structures. (ii) Links
between databases: one may search by term matching in MEDLINE, for example, and
link to 3D structures reported in these articles. (iii) Sequence and structure
visualization: identifying a homolog with known structure, one may view molecular-graphic
and alignment displays to infer approximate 3D structure. In this article we
focus on two features of Entrez's Molecular Modeling Database (MMDB) not
described previously: links from individual biopolymer chains within 3D structures
to a systematic taxonomy of organisms represented in molecular databases, and
links from individual chains (and compact 3D domains within them) to structure
neighbors, other chains (and 3D domains) with similar 3D structure. MMDB may be
accessed at http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Structure
SUPERFAMILY is a database of structural and functional annotation for all
proteins and genomes.[1][2][3][4][5]
The SUPERFAMILY annotation is based on a collection of hidden Markov models which
represent structural protein domains at the SCOP superfamily level.[6] A superfamily groups
together domains which have an evolutionary relationship. The annotation is produced by
scanning protein sequences from completely sequenced genomes against the hidden Markov
models.
For each protein you can:
Submit sequences for SCOP classification.
View domain organisation, sequence alignments and protein sequence details.
For each genome you can:
Examine superfamily assignments, phylogenetic trees, domain organisation lists and networks.
Check for over- and under-represented superfamilies within a genome.
For each superfamily you can:
Inspect SCOP classification, functional annotation, Gene Ontology annotation, InterPro
abstract and genome assignments.
Explore the taxonomic distribution of a superfamily across the tree of life.
All annotation, models and the database dump are freely available for download to everyone.
Purpose
SUPERFAMILY classifies amino acid sequences into known structural domains, especially
into SCOP superfamilies. The superfamilies are groups of proteins which have structural
evidence to support a common evolutionary ancestor but may not have detectable
sequence homology.
Major Features
Sequence search: Submit your protein or DNA sequence for SCOP superfamily and family
  level classification.
Keyword search: Search for superfamily, family or species names, plus sequence, SCOP, PDB
  or hidden Markov model IDs.
Domain assignments: Domain assignments, alignments and architectures for completely
  sequenced eukaryotic and prokaryotic organisms, plus sequence collections.
Comparative genomics tools: Browse unusual (over- and under-represented) superfamilies and
  families, adjacent domain pair lists and graphs, unique domain pairs, domain combinations,
  domain architecture co-occurrence networks and domain distribution across taxonomic
  kingdoms for each organism.
Genome statistics: For each genome: number of sequences, number of sequences with
  assignment, percentage of sequences with assignment, percentage total sequence coverage,
  number of domains assigned, number of superfamilies assigned, number of families assigned,
  average superfamily size, percentage produced by duplication, average sequence length,
  average length matched, number of domain pairs and number of unique domain architectures.
Gene Ontology: Domain-centric Gene Ontology (GO), automatically annotated, by Hai Fang.
Phenotype Ontology: Domain-centric phenotype/anatomy ontology including Disease Ontology,
  Human Phenotype, Mouse Phenotype, Worm Phenotype, Yeast Phenotype, Fly Phenotype,
  Fly Anatomy, Zebrafish Anatomy, Xenopus Anatomy, Arabidopsis Plant.
Superfamily annotation: InterPro abstracts for 1052 superfamilies and Gene Ontology (GO)
  annotation for 763 superfamilies.
Functional annotation: Functional annotation of SCOP 1.73 superfamilies, by Christine Vogel.
Phylogenetic trees: Trees are generated using heuristic parsimony methods and are based on
  protein domain architecture data for all genomes in SUPERFAMILY. Genome combinations or
  specific clades can be displayed as individual trees.
Similar domain architectures: Find the 10 domain architectures which are most similar to a
  domain architecture of interest.
Hidden Markov models: Produce SCOP domain assignments for your sequences using the
  SUPERFAMILY models. HMM visualisation by Martin Madera, e.g. model 0045110.
Profile comparison: Find remote domain matches when the HMM search fails to find a
  significant match. Profile comparison (PRC) for aligning and scoring two profile hidden
  Markov models, by Martin Madera.
Web services: Distributed Annotation Server and linking to SUPERFAMILY.
Downloads: Sequences, assignments, models, MySQL database and scripts, updated weekly.
The SUPERFAMILY database provides protein domain assignments, at the SCOP superfamily
level, for the predicted protein sequences in over 400 completed genomes. A superfamily groups
together domains of different families which have a common evolutionary ancestor based on
structural, functional and sequence data. SUPERFAMILY domain assignments are generated
using an expert-curated set of profile hidden Markov models. All models and structural
assignments are available for browsing and download from http://supfam.org. The web interface
includes services such as domain architectures and alignment details for all protein assignments,
searchable domain combinations, domain occurrence network visualization, detection of over- or
under-represented superfamilies for a given genome by comparison with other genomes,
assignment of manually submitted sequences and keyword searches. In this update we describe
the SUPERFAMILY database and outline two major developments: (i) incorporation of family
level assignments and (ii) a superfamily-level functional annotation. The SUPERFAMILY
database can be used for general protein evolution and superfamily-specific studies, genomic
annotation, and structural genomics target suggestion and assessment.
(Homologous families) for having either significant sequence similarity (≥35% identity) or high
structural similarity and some sequence similarity (≥20% identity).
The CATH database of protein structures contains approximately 18 000 domains
organized according to their (C)lass, (A)rchitecture, (T)opology and (H)omologous
superfamily. Relationships between evolutionarily related structures (homologues) within the
database have been used to test the sensitivity of various sequence search methods in order to
identify relatives in GenBank and other sequence databases. Subsequent application of the most
sensitive and efficient algorithms, gapped BLAST and the profile-based method Position Specific
Iterated Basic Local Alignment Tool (PSI-BLAST), could be used to assign structural data to
between 22% and 36% of microbial genomes in order to improve functional annotation and
enhance understanding of biological mechanism. However, on a cautionary note, an analysis of
functional conservation within fold groups and homologous superfamilies in the CATH database
revealed that whilst function was conserved in nearly 55% of enzyme families, function had
diverged considerably in some highly populated families. In these families, functional properties
should be inherited far more cautiously, and the probable effects of substitutions in key functional
residues must be carefully assessed.
Figure 1
Schematic representation of the (C)lass, (A)rchitecture and (T)opology/fold levels in the CATH
database.
Figure 2
Snapshot of a web page showing data available in the CATH dictionary of homologous superfamilies
for the subtilisin family (CATH id 3.40.50.200). Tables display the PDB codes for non-identical
relatives in the family, together with EC identifier codes and information about the enzyme reactions.
The multiple structural alignment shown has been coloured according to secondary structure
assignments (red for helix, blue for strands).
The Architecture level in CATH groups proteins whose folds have similar 3D arrangements of
secondary structures (e.g. barrel, sandwich or propeller), regardless of their connectivity, whilst the
top level, Class, simply reflects the proportion of α-helix or β-strand secondary structures. Three
major classes are recognised, mainly-α, mainly-β and α−β, since analysis revealed considerable
overlap between the α+β and alternating α/β classes originally described by Levitt and Chothia (7).
Before classification, multidomain proteins are first separated into their constituent folds using a
consensus method which seeks agreement between three independent algorithms (8). Whilst the
protocol for updating CATH is largely automatic (9), several stages require manual validation, in
particular establishing domain boundaries in proteins for which no consensus could be reached and
checking the relationships of very distant homologues and proteins having borderline fold similarity.
Although there are plans to assign the more regular architectures automatically, all architecture
groupings are currently assigned manually.
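The hierarchy can be made concrete with a small sketch that splits a CATH identifier into its four levels. The dotted form C.A.T.H (for example 3.40.50.200) is assumed here, since this text shows identifiers without separators; the class numbering 1 to 4 follows the CATH convention, and the names CathId and parse_cath_id are hypothetical.

```python
from typing import NamedTuple

# CATH class numbering; the labels are spelled out in ASCII.
CLASS_NAMES = {
    1: "mainly-alpha",
    2: "mainly-beta",
    3: "alpha-beta",
    4: "few secondary structures",
}


class CathId(NamedTuple):
    c: int  # Class
    a: int  # Architecture
    t: int  # Topology (fold group)
    h: int  # Homologous superfamily

    @property
    def class_name(self):
        return CLASS_NAMES.get(self.c, "unknown")

    def prefix(self, level):
        """Dotted identifier truncated at a level: 'C', 'A', 'T' or 'H'."""
        depth = "CATH".index(level) + 1
        return ".".join(str(n) for n in self[:depth])


def parse_cath_id(text):
    """Parse a dotted four-part identifier such as '3.40.50.200'."""
    parts = [int(p) for p in text.split(".")]
    if len(parts) != 4:
        raise ValueError("expected four dot-separated numbers, got %r" % text)
    return CathId(*parts)


cid = parse_cath_id("3.40.50.200")
print(cid.class_name)   # alpha-beta
print(cid.prefix("A"))  # 3.40
print(cid.prefix("T"))  # 3.40.50
```

Two domains share a fold group when their T-level prefixes match, and they are placed in the same homologous family only when the full four-part identifiers are equal.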
A homologous family Dictionary is now available within CATH which contains functional data,
where available, for each protein within a homologous family. This includes EC identifiers,
SWISS-PROT keywords and information from the Enzyme database or the literature (Fig. 2). Multiple
structure-based alignments are also available, coloured according to secondary structure assignments
or residue properties, and there are schematic plots showing domain representations annotated by
protein-ligand interactions (DOMPLOTS) (A.E. Todd, C.A. Orengo and J.M. Thornton, submitted
to Protein Engng.). The topology of each domain is illustrated by schematic TOPS diagrams
(http://www3.ebi.ac.uk/tops/; 10).
Figure 3
CATH wheel plot showing the population of homologous families in different fold groups,
architectures and classes. The wheel is coloured according to protein class (red, mainly-α; green,
mainly-β; yellow, α−β; blue, few secondary structures). The size of the outer wheel represents the
number of homologous families in CATH, whilst each band in the outer wheel corresponds to a single
fold family. The size of each fold 'band' therefore reflects the number of homologous families having
that fold. It can be seen that most fold families contain a single homologous family. The superfold
families are shown as paler bands containing many homologous families. The inner wheel shows the
population of homologous families in the different architectures.
We have also recently set up a Web Server (11) which enables the user to scan the CATH database
with a newly determined protein structure and identify possible fold similarities or evolutionary
relationships. There are also plans to incorporate sequence searches (using BLAST or PSI-BLAST)
(12) to identify a probable fold for a new sequence.
The latest release of CATH (version 1.4, April 1998) contains 9342 protein chains from the PDB (13),
which divide into 13 359 domain folds. Currently 32 different architectures are recognised. Since the
last release, three new architectures have been described, including the five-bladed α-β propeller.
Grouping proteins on the basis of sequence, structure and functional similarity gives 827 evolutionary
homologous families (H-level), whilst recognising more distant structural similarity with no
accompanying sequence or function similarity gives rise to 593 different fold groups (T-level).
The population of the different levels in the CATH hierarchy is illustrated by the CATH wheel shown
in Figure 3. It can be seen that several highly populated fold families, which we describe as superfolds
(6) as they support a diverse range of sequences and more than three different functions, still account
for nearly 30% of non-homologous structures.
Implications for Structural Genomics
As the sequence databases grow rapidly, the need to interpret these sequences and assign functions to
specific genes becomes increasingly important. Many techniques exist for matching protein sequences
and thereby inheriting functional information. However, for very distant homologues there is often no
detectable sequence similarity despite conservation of 3D structure and function. For these cases,
evolutionary relationships, and thereby functions, can only be assigned by comparing the structures.
Therefore a number of structural genomics initiatives are being proposed (14) which aim to identify
all the folds in nature, with the ultimate goal of being able to predict the function of a new protein from
its known or probable structure. The important questions to ask are: how many more folds do we need
to determine before we have the complete set, and how confident can we be in assigning function
between proteins having similar structures?
In the current genomes, on average only 30–46% of sequences can be assigned to a structural family
by recognising sequence similarity to a protein of known structure (15,16). With only ~600 unique
structures currently in the PDB, compared with ~20 000 sequence families, it is clear that we still
need to determine many more structures if we are to understand biology at the molecular level.
However, analysis of recently deposited structural data is very revealing. Figure 4a illustrates the
distribution of 2159 new structural domains classified in the 10 months from June 1997 to March
1998. A large proportion of these (79%) were clearly homologous (≥30% identity) to proteins of
known structure.
Of the remaining 443 structures (Fig. 4b), corresponding to 'new' sequences, we found only 8% were
novel folds, the remainder resembling a previously determined structure. Many of these, 199 (45%),
could be identified as clear homologues by having significant structure and sequence similarity (SSAP
≥80 and ≥20% sequence identity). A further 169 (38%) were probable homologues as, although the
sequence identity was below 20%, they had functional similarity and/or gave significant scores using
sequence search methods designed to detect very distant homologues (PSI-BLAST) (12). There
remained a further 40 (9%) proteins which were analogous, i.e. they had the same fold as a previous
entry but neither the sequence nor the function gave definite evidence of a common ancestor.
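As a quick consistency check, the counts quoted above can be turned back into percentages. The sketch below makes two assumptions that the text does not state outright: every domain outside the 443 is counted as clearly homologous, and the novel-fold count is whatever remains after the three named categories.

```python
# Counts quoted in the text for domains classified June 1997 to March 1998.
total_new_domains = 2159
new_sequence_structures = 443  # structures corresponding to 'new' sequences

# Assumption: every domain outside the 443 counts as clearly homologous.
clearly_homologous = total_new_domains - new_sequence_structures

breakdown = {
    "clear homologues": 199,
    "probable homologues": 169,
    "analogues": 40,
}
# Derived, not quoted in the text: the remainder is taken as novel folds.
breakdown["novel folds"] = new_sequence_structures - sum(breakdown.values())


def percent(part, whole):
    """Percentage rounded to the nearest integer."""
    return round(100 * part / whole)


print(percent(clearly_homologous, total_new_domains))  # 79
for label, count in breakdown.items():
    print(label, count, percent(count, new_sequence_structures))

homologues = breakdown["clear homologues"] + breakdown["probable homologues"]
print(percent(homologues, new_sequence_structures))  # 83
```

Under these assumptions the remainder is 35 structures, which rounds to the stated 8% of novel folds, and clear plus probable homologues together account for 83% of the 443.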
At the homologous superfamily level in CATH, a more detailed analysis of enzyme functions showed
that the majority of homologous enzyme families in CATH (>90%) contained proteins for which the
first three EC identifiers were the same. Considering those families where homologues have
significant sequence identity (≥20%) after structural alignment, 95% were found to have a single EC
identifier, whilst for families where proteins have more than 30% sequence similarity we observed
that 98% had a single EC code.
Although assigning function on the basis of homology is common practice, it is clear that some
caution should be exercised, particularly where there is little or no sequence similarity. There are also
some clear examples where homologues with significant sequence similarity perform different
functions. The role of 'gene recruitment' is especially clear in the eye lens proteins, which function as
enzymes in other cellular environments but which are used as structural proteins in this context (17).
The extent of such 'gene recruitment' and context-sensitive function is really not known at this time.
For enzymes, it is clear that catalytic function can change and evolve, usually to act on a different but
related substrate. Similarly, within the lipocalin family (CATH id 2.40.130.10) several proteins are
found with very similar structures which bind different fatty acids in the same region at the base of
the β-barrel (e.g. retinol, bilin, biotin).
Nearly half of the homologous families where two or more different EC numbers were observed
belong to the superfolds. This suggests that if a new protein is assigned to a superfold family, more
caution should be used when inheriting functional information, as there appears to be greater tolerance
to changes in sequence, and ultimately function, for these families. However, it is interesting to note
that many of these were TIM barrel or Rossmann folds. These are superfolds in which the substrate or
ligand commonly binds in the same place: in the base of the β-barrel for the TIMs and at the
crossover of the polypeptide chain for the doubly wound Rossmann structures.
Assignment of Function Through Structure
One of the reasons for determining structures is to derive more information to facilitate the assignment
of function. From our analysis of proteins in CATH we suggest that structural data can help to assign
function in several ways:
i. The structural data allow recognition of more distant homologues compared with
sequence data. In our analysis, 83% of structures with novel sequences could be assigned as
homologues in this way (note that such assignment of function is again subject to the caveats
imposed by 'gene recruitment' discussed above).
ii. The structural data allow detailed inspection of the functional site, to suggest if and
how the function may have evolved. For example, if an enzyme has evolved to act on a
different substrate, the binding site may reveal, or at least suggest, possible changes in the
substrate.
iii. For the superfolds, similarity of structure does not necessarily mean similarity of
function. However, the active site/binding sites are often conserved, e.g. in the TIM barrel or
Rossmann fold structures the ligand always binds at the same end of the barrel or sheet.
iv. Some methods have already been developed, and will increasingly be the focus of
attention over the next few years, which aim to predict function ab initio from structure. For
example, enzymes can often be identified by the presence of a major cleft, which also locates
the active site (18). Similarly, critical surface patches which are used for molecular
recognition in binding other proteins or ligands may be identified using knowledge-based
approaches (19,20).
In summary, extrapolating the data from Figure 4 to a new genome, we can expect that of the 54–70%
of sequences which currently have no obvious sequence matches in the PDB, we will find nearly
80–90% to be homologous to a known family using the structural data alone. For the singlet folds this
will almost certainly reveal some clues to the function. For the superfolds, some folds will reveal
information on the functional class (e.g. enzyme for TIM barrels) or the location of the active site, if
not the specific function. Only 10–20% will be expected to be 'novel' folds. For these, the ab
initio methods referred to above may provide some clues to guide experiments. Therefore it is clear
that determining structures, as part of a 'structural genomics' initiative for example, will make a
major contribution to interpreting genome data.
Jail
Just another interface library
Interfaces of macromolecules are a valuable basis to analyse the processof molecular recognition JAIL classifies not only the interfaces betweendomain architectures but also those between protein chains and thosebetween proteins and nucleic acids
Of course not
Gt-alphaGi-alpha chimera (PDB-ID 1GOT) Interfaces of 1GOT
Interacting proteins are difficult to crystallize and rarely present within the Protein Data Base
Nevertheless it is essential to analyse the interacting parts of the proteins to understand the
process of protein-protein docking
To overcome this problem we have built up the JAIL database Since interacting domains
exhibit similiar structural features than proteins all known interfaces between interacting
domains of the SCOP database were extracted and classified in JAIL
Only a part of all protein structures are included in SCOP Particularly new PDB entries are
not yet annotated To overcome this problem additionally all interfaces between protein
chains were calculated and included in the database This type of interface also comprises
the interacting parts of the assumed biological units The last important type of interfaces
provided here is composed of the interacting parts between proteins and nucleic acids
Overall the data set consists of about 180000 interfaces
JAIL is a comfortable tool to browse through the interface library and to analyze single
interfaces However more general questions require large-scale analysis For this purpose
a detailed form enables the compiling of comprehensive non redundant data sets for
download
7272019 Essential Info Notes-1
httpslidepdfcomreaderfullessential-info-notes-1 5257
How is an interface defined
A complete residue is part of an interface if at least one atom of the aminoacid is located
within a range of 45 Angstroem of any atom of the interacting domain or chain One part of
an interface must consist of at least 5 C-alpha atoms in the case of protein chains In thecase of nucleic acids the relevant atoms are calculated on the basis of the phosphor atoms
of the RNADNA-backbone
What are biounits
The primary coordinate file deposited in the PDB generally contains one asymmetric unit
The asymmetric unit is the smallest portion of a crystal structure to which crystallographic
symmetry can be applied to generate one cell The biological molecule (biounit) is believed
to be the functional unit of the protein Frequently those units can be assumed or calculatedwhen additional information is available The biological units of many proteins are deposited
in a separate section of the PDB database and can be used for interface calculations More
information about biounits
In which way are redundant interfaces excluded in the download section
The redundancy is excluded in two different ways by structure and by sequence The
sequential clustering is based on the Cd-hit program The structural clustering is defined by
the protein families and superfamilies of the SCOP classification The database classifies
proteins by domain architecture
Which settings in the download section are best for my own research
The selection of the datasets depends on the type of interactions (protein-protein or protein-
nucleic acids) and the level of diversity that is desired Sequence identity of maximal 50
results in a higher diversity than the setting to 95 The default settings include interfaces of
domain-domain interactions as well as interfaces between interacting chains All interfaces of
chains that were already treated by the SCOP domain interfaces are excluded by default
This procedure results in a high number of interfaces that are still diverse enough for
statistical analysis
What is meant by show conservation in Jmol
The conservation of protein sequences is defined by the mutation rates at each amino acid
7272019 Essential Info Notes-1
httpslidepdfcomreaderfullessential-info-notes-1 5357
position For JAIL this information was retrieved from ConSurf ConSurf is a derived
database merging structural and sequence information
Database scheme
Search for
PDB-Id eg 1aay
SCOP-Id eg d1az0a_
EC-Number eg 311
Accession number eg P03697
Protein name eg capsid protein
Search in the following interface types
DomainDomain (SCOP)
ChainChain
ProteinNucleic
BiounitBiounit
None
Fulltext search
Keyword
Search
Clear
SCOP text search
Keyword
Search
7272019 Essential Info Notes-1
httpslidepdfcomreaderfullessential-info-notes-1 5457
Consider only interfaces having the following location
intra inter dont care
Search
Clear
MMDB
Experimentally resolved structures of proteins RNA and DNA derived from the Protein DataBank (PDB) with value-addedfeatures such as explicit chemical graphs computationally
identified 3D domains (compact substructures) that are used to identify similar 3D structures as well as links to literature similar sequences information about chemicals bound to thestructures and more These connections make it possible for example tofind 3D structuresfor homologs of a protein sequence of interest then interactively view the sequence-structurerelationshipsactive sites bound chemicals journal articles and more
Three-dimensional structures are now known within many protein families and it is
quite likely in searching a sequence database that one will encounter a homolog
with known structure The goal of Entrezrsquos 3D-structure database is to make this
information and the functional annotation it can provide easily accessible tomolecular biologists To this end Entrezrsquos search engine provides three powerful
features (i) Sequence and structure neighbors one may select all sequences similar
to one of interest for example and link to any known 3D structures (ii) Links
between databases one may search by term matching in MEDLINE for example and
link to 3D structures reported in these articles (iii) Sequence and structure
visualization identifying a homolog with known structure one may view molecular-
graphic and alignment displays to infer approximate 3D structure In this article we
focus on two features of Entrezrsquos Molecular Modeling Database (MMDB) not
described previously links from individual biopolymer chains within 3D structuresto a systematic taxonomy of organisms represented in molecular databases and
links from individual chains (and compact 3D domains within them) to structure
neighbors other chains (and 3D domains) with similar 3D structure MMDB may be
accessed athttpwwwncbinlmnihgoventrezqueryfcgidb=Structure
7272019 Essential Info Notes-1
httpslidepdfcomreaderfullessential-info-notes-1 5557
SUPERFAMILY is a database of structural and functional annotation for all
proteins and genomes[1][2][3][4][5]
The SUPERFAMILY annotation is based on a collection of hidden Markov models which
represent structural protein domains at the SCOP superfamilylevel[6]
A superfamily groups
together domains which have an evolutionaryr elationship The annotation is produced by
scanning protein sequences from completely sequenced genomes against the hidden Markov
models
For each protein you can
Submit sequences for SCOP classification
View domain organisation sequence alignments and protein sequence details
For each genome you can
Examine superfamily assignments phylogenetic trees domain organisation lists and
networks
Check for over- and under-represented superfamilies within a genome
For each superfamily you can
Inspect SCOP classification functional annotation Gene Ontologyannotation InterPro
abstract and genome assignments
Explore taxonomic distribution of a superfamily across the tree of life
All annotation models and the database dump are freely available for download to everyone
Contents
[hide]
1 Purpose
2 See also
3 References
4 External links
Purpose[edit source]
SUPERFAMILY classifies amino acid sequences into known structural domains especially
into SCOP superfamilies The superfamilies are groups of proteins which have structural
evidence to support a common evolutionary ancestor but may not have detectable
sequence homology
Major Features
Sequence search: Submit your protein or DNA sequence for SCOP superfamily and family level classification.
Keyword search: Search for superfamily, family or species names, plus sequence, SCOP, PDB or hidden Markov model IDs.
Domain assignments: Domain assignments, alignments and architectures for completely sequenced eukaryotic and prokaryotic organisms, plus sequence collections.
Comparative genomics tools: Browse unusual (over- and under-represented) superfamilies and families, adjacent domain pair lists and graphs, unique domain pairs, domain combinations, domain architecture co-occurrence networks and domain distribution across taxonomic kingdoms for each organism.
Genome statistics: For each genome: number of sequences, number of sequences with assignment, percentage of sequences with assignment, percentage total sequence coverage, number of domains assigned, number of superfamilies assigned, number of families assigned, average superfamily size, percentage produced by duplication, average sequence length, average length matched, number of domain pairs and number of unique domain architectures.
Gene Ontology: Domain-centric Gene Ontology (GO) automatically annotated by Hai Fang.
Phenotype Ontology: Domain-centric phenotype/anatomy ontology including Disease Ontology, Human Phenotype, Mouse Phenotype, Worm Phenotype, Yeast Phenotype, Fly Phenotype, Fly Anatomy, Zebrafish Anatomy, Xenopus Anatomy and Arabidopsis Plant.
Superfamily annotation: InterPro abstracts for 1052 superfamilies and Gene Ontology (GO) annotation for 763 superfamilies.
Functional annotation: Functional annotation of SCOP 1.73 superfamilies by Christine Vogel.
Phylogenetic trees: Trees are generated using heuristic parsimony methods and are based on protein domain architecture data for all genomes in SUPERFAMILY. Genome combinations or specific clades can be displayed as individual trees.
Similar domain architectures: Find the 10 domain architectures which are most similar to a domain architecture of interest.
Hidden Markov models: Produce SCOP domain assignments for your sequences using the SUPERFAMILY models. HMM visualisation by Martin Madera, e.g. model 0045110.
Profile comparison: Find remote domain matches when the HMM search fails to find a significant match. Profile comparison (PRC) for aligning and scoring two profile hidden Markov models, by Martin Madera.
Web services: Distributed Annotation Server and linking to SUPERFAMILY.
Downloads: Sequences, assignments, models, MySQL database and scripts, updated weekly.
Recent news
3rd June 2013 genetrainer being launched at LeWeb conference
The SUPERFAMILY database provides protein domain assignments, at the SCOP superfamily
level, for the predicted protein sequences in over 400 completed genomes. A superfamily groups
together domains of different families which have a common evolutionary ancestor based on
structural, functional and sequence data. SUPERFAMILY domain assignments are generated
using an expert-curated set of profile hidden Markov models. All models and structural
assignments are available for browsing and download from http://supfam.org. The web interface
includes services such as domain architectures and alignment details for all protein assignments,
searchable domain combinations, domain occurrence network visualization, detection of over- or
under-represented superfamilies for a given genome by comparison with other genomes,
assignment of manually submitted sequences and keyword searches. In this update we describe
the SUPERFAMILY database and outline two major developments: (i) incorporation of family
level assignments and (ii) a superfamily-level functional annotation. The SUPERFAMILY
database can be used for general protein evolution and superfamily-specific studies, genomic
annotation, and structural genomics target suggestion and assessment.
Figure 1. Schematic representation of the (C)lass, (A)rchitecture and (T)opology/fold levels in the CATH database.
Figure 2. Snapshot of a web page showing data available in the CATH dictionary of homologous superfamilies for the subtilisin family (CATH id 3.40.50.200). Tables display the PDB codes for non-identical relatives in the family, together with EC identifier codes and information about the enzyme reactions. The multiple structural alignment shown has been coloured according to secondary structure assignments (red for helix, blue for strands).
The Architecture level in CATH groups proteins whose folds have similar 3D arrangements of secondary structures (e.g. barrel, sandwich or propellor), regardless of their connectivity, whilst the top level, Class, simply reflects the proportion of α-helix or β-strand secondary structures. Three major classes are recognised, mainly-α, mainly-β and α-β, since analysis revealed considerable overlap between the α+β and alternating α/β classes originally described by Levitt and Chothia (7).
Before classification, multidomain proteins are first separated into their constituent folds using a consensus method which seeks agreement between three independent algorithms (8). Whilst the protocol for updating CATH is largely automatic (9), several stages require manual validation, in particular establishing domain boundaries in proteins for which no consensus could be reached and checking the relationships of very distant homologues and proteins having borderline fold similarity. Although there are plans to assign the more regular architectures automatically, all architecture groupings are currently assigned manually.
A homologous family Dictionary is now available within CATH, which contains functional data, where available, for each protein within a homologous family. This includes EC identifiers, SWISS-PROT keywords and information from the Enzyme database or the literature (Fig. 2). Multiple structure-based alignments are also available, coloured according to secondary structure assignments or residue properties, and there are schematic plots showing domain representations annotated by protein-ligand interactions (DOMPLOTS) (A.E. Todd, C.A. Orengo and J.M. Thornton, submitted to Protein Engng). The topology of each domain is illustrated by schematic TOPS diagrams (http://www3.ebi.ac.uk/tops; 10).
Figure 3. CATH wheel plot showing the population of homologous families in different fold groups, architectures and classes. The wheel is coloured according to protein class (red, mainly-α; green, mainly-β; yellow, α-β; blue, few secondary structures). The size of the outer wheel represents the number of homologous families in CATH, whilst each band in the outer wheel corresponds to a single fold family. The size of each fold 'band' therefore reflects the number of homologous families having that fold. It can be seen that most fold families contain a single homologous family. The superfold families are shown as paler bands containing many homologous families. The inner wheel shows the population of homologous families in the different architectures.
We have also recently set up a Web Server (11) which enables the user to scan the CATH database with a newly determined protein structure and identify possible fold similarities or evolutionary relationships. There are also plans to incorporate sequence searches (using BLAST or PSI-BLAST) (12) to identify a probable fold for a new sequence.
The latest release of CATH (version 1.4, April 1998) contains 9342 protein chains from the PDB (13), which divide into 13 359 domain folds. Currently 32 different architectures are recognised. Since the last release, three new architectures have been described, including the five-bladed α-β propellor. Grouping proteins on the basis of sequence, structure and functional similarity gives 827 evolutionary homologous families (H-level), whilst recognising more distant structural similarity with no accompanying sequence or function similarity gives rise to 593 different fold groups (T-level).
The population of the different levels in the CATH hierarchy is illustrated by the CATH wheel shown in Figure 3. It can be seen that several highly populated fold families, which we describe as superfolds (6) as they support a diverse range of sequences and more than three different functions, still account for nearly 30% of non-homologous structures.
Implications for Structural Genomics
As the sequence databases grow rapidly, the need to interpret these sequences and assign functions to specific genes becomes increasingly important. Many techniques exist for matching protein sequences and thereby inheriting functional information. However, for very distant homologues there is often no detectable sequence similarity, despite conservation of 3D structure and function. For these cases, evolutionary relationships, and thereby functions, can only be assigned by comparing the structures.
Therefore a number of structural genomics initiatives are being proposed (14) which aim to identify all the folds in nature, with the ultimate goal of being able to predict the function of a new protein from its known or probable structure. The important questions to ask are: how many more folds do we need to determine before we have the complete set, and how confident can we be in assigning function between proteins having similar structures?
In the current genomes, on average only 30-46% of sequences can be assigned to a structural family by recognising sequence similarity to a protein of known structure (15,16). With only ~600 unique structures currently in the PDB, compared with ~20 000 sequence families, it is clear that we still need to determine many more structures if we are to understand biology at the molecular level. However, analysis of recently deposited structural data is very revealing. Figure 4a illustrates the distribution of 2159 new structural domains classified in the 10 months from June 1997 to March 1998. A large proportion of these (79%) were clearly homologous (≥30% identity) to proteins of known structure.
Of the remaining 443 structures (Fig. 4b), corresponding to 'new' sequences, we found only 8% were novel folds, the remainder resembling a previously determined structure. Many of these, 199 (45%), could be identified as clear homologues by having significant structure and sequence similarity (SSAP ≥80 and ≥20% sequence identity). A further 169 (38%) were probable homologues as, although the sequence identity was below 20%, they had functional similarity and/or gave significant scores using sequence search methods designed to detect very distant homologues (PSI-BLAST) (12). There remained a further 40 (9%) proteins which were analogous, i.e. they had the same fold as a previous entry but neither the sequence nor the function gave definite evidence of a common ancestor.
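The breakdown of the 443 structures can be checked with a few lines of arithmetic. The sketch below only re-derives the shares from the counts quoted in the text; the number of novel folds is not stated explicitly, so it is inferred here as the remainder, which is an assumption.

```python
# Counts quoted in the text for the 443 structures with 'new' sequences.
total = 443
counts = {
    "clear homologues": 199,
    "probable homologues": 169,
    "analogues": 40,
}
# Assumption: novel folds are whatever remains once the three groups are removed.
counts["novel folds"] = total - sum(counts.values())


def percentages(counts, total):
    """Return each category's share of the total, rounded to whole percent."""
    return {name: round(100 * n / total) for name, n in counts.items()}


shares = percentages(counts, total)
for name, pct in shares.items():
    print(f"{name}: {counts[name]} ({pct}%)")
```

Under that assumption the remainder is 35 structures, which rounds to the 8 quoted for novel folds, and the two homologue groups together give the 83 used later in the text.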
At the homologous superfamily level in CATH, a more detailed analysis of enzyme functions showed that the majority of homologous enzyme families in CATH (>90%) contained proteins for which the first three EC identifiers were the same. Considering those families where homologues have significant sequence identity (≥20%) after structural alignment, 95% were found to have a single EC identifier, whilst for families where proteins have more than 30% sequence similarity we observed that 98% had a single EC code.
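The test described above, whether all members of a family agree on the first three EC identifiers, is easy to state in code. The following is a minimal sketch; the family names and EC numbers are hypothetical examples chosen for illustration and are not taken from CATH.

```python
def ec_prefix(ec_number, levels=3):
    """Return the first `levels` fields of an EC number such as '3.4.21.62'."""
    return tuple(ec_number.split(".")[:levels])


def shares_ec_prefix(ec_numbers, levels=3):
    """True if every EC number in the family agrees on its first `levels` fields."""
    return len({ec_prefix(ec, levels) for ec in ec_numbers}) == 1


# Hypothetical families, for illustration only (not CATH data).
families = {
    "family_A": ["3.4.21.62", "3.4.21.64", "3.4.21.14"],
    "family_B": ["1.1.1.1", "1.1.1.27"],
    "family_C": ["3.2.1.1", "2.4.1.19"],
}

conserved = {name: shares_ec_prefix(ecs) for name, ecs in families.items()}
fraction = sum(conserved.values()) / len(conserved)
print(conserved, f"{fraction:.0%} of families share the first three EC fields")
```

Setting `levels=4` would instead test for a single complete EC code per family, the stricter criterion used for the 95 and 98 figures.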
Although assigning function on the basis of homology is common practice, it is clear that some caution should be exercised, particularly where there is little or no sequence similarity. There are also some clear examples where homologues with significant sequence similarity perform different functions. The role of 'gene recruitment' is especially clear in the eye lens proteins, which function as enzymes in other cellular environments but which are used as structural proteins in this context (17). The extent of such 'gene recruitment' and context-sensitive function is really not known at this time. For enzymes, it is clear that catalytic function can change and evolve, usually to act on a different but related substrate. Similarly, within the lipocalin family (CATH id 2.40.130.10), several proteins are found with very similar structures which bind different fatty acids in the same region at the base of the β-barrel (e.g. retinol, bilin, biotin).
Nearly half of the homologous families where two or more different EC numbers were observed belong to the superfolds. This suggests that if a new protein is assigned to a superfold family, more caution should be used when inheriting functional information, as there appears to be greater tolerance to changes in sequence, and ultimately function, for these families. However, it is interesting to note that many of these were TIM barrel or Rossmann folds. These are superfolds in which the substrate or ligand commonly binds in the same place: in the base of the β-barrel for the TIMs and at the crossover of the polypeptide chain for the doubly wound Rossmann structures.
Assignment of Function Through Structure
One of the reasons for determining structures is to derive more information to facilitate the assignment of function. From our analysis of proteins in CATH, we suggest that structural data can help to assign function in several ways:
i. The structural data allow recognition of more distant homologues compared with sequence data; in our analysis, 83% of structures with novel sequences could be assigned as homologues in this way (note that such assignment of function is again subject to the caveats imposed by 'gene recruitment', discussed above).
ii. The structural data allow detailed inspection of the functional site, to suggest if and how the function may have evolved. For example, if an enzyme has evolved to act on a different substrate, the binding site may reveal, or at least suggest, possible changes in the substrate.
iii. For the superfolds, similarity of structure does not necessarily mean similarity of function. However, the active site/binding sites are often conserved, e.g. in the TIM barrel or Rossmann fold structures the ligand always binds at the same end of the barrel or sheet.
iv. Some methods have already been developed, and will increasingly be the focus of attention over the next few years, which aim to predict function ab initio from structure. For example, enzymes can often be identified by the presence of a major cleft, which also locates the active site (18). Similarly, critical surface patches which are used for molecular recognition in binding other proteins or ligands may be identified using knowledge-based approaches (19,20).
In summary, extrapolating the data from Figure 4 to a new genome, we can expect that, of the 54-70% of sequences which currently have no obvious sequence matches in the PDB, we will find nearly 80-90% to be homologous to a known family using the structural data alone. For the singlet folds, this will almost certainly reveal some clues to the function. For the superfolds, some folds will reveal information on the functional class (e.g. enzyme for TIM barrels) or the location of the active site, if not the specific function. Only 10-20% will be expected to be 'novel' folds. For these, the ab initio methods referred to above may provide some clues to guide experiments. Therefore it is clear that determining structures, as part of a 'structural genomics' initiative for example, will make a major contribution to interpreting genome data.
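As a rough illustration of this extrapolation, the quoted ranges can be multiplied through to express them as shares of a whole genome. Pairing the lower bounds with each other and the upper bounds with each other is an assumption of this sketch, not something the text states, and the numbers are read as percentages.

```python
def scale(range_pct, share_pct):
    """Multiply two percentage ranges end to end, giving percentages of the whole genome."""
    (lo, hi), (s_lo, s_hi) = range_pct, share_pct
    return (round(lo * s_lo / 100, 1), round(hi * s_hi / 100, 1))


unmatched = (54, 70)   # % of sequences with no obvious sequence match in the PDB
homologous = (80, 90)  # % of those expected to be homologous to a known family
novel = (10, 20)       # % of those expected to be novel folds

homologous_of_genome = scale(unmatched, homologous)
novel_of_genome = scale(unmatched, novel)
print("assignable via structure:", homologous_of_genome)
print("novel folds:", novel_of_genome)
```

On these assumptions, structural data would add assignments for roughly 43-63% of a genome's sequences, leaving about 5-14% as novel folds.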
JAIL
Just Another Interface Library
Interfaces of macromolecules are a valuable basis for analysing the process of molecular recognition. JAIL classifies not only the interfaces between domain architectures but also those between protein chains and those between proteins and nucleic acids.
Figure: Gt-alpha/Gi-alpha chimera (PDB-ID 1GOT) and the interfaces of 1GOT.
Interacting proteins are difficult to crystallize and are rarely present in the Protein Data Bank. Nevertheless, it is essential to analyse the interacting parts of the proteins to understand the process of protein-protein docking.
To overcome this problem we have built the JAIL database. Since interacting domains exhibit structural features similar to those of interacting proteins, all known interfaces between interacting domains of the SCOP database were extracted and classified in JAIL.
Only a part of all protein structures is included in SCOP. In particular, new PDB entries are not yet annotated. To overcome this problem, all interfaces between protein chains were additionally calculated and included in the database. This type of interface also comprises the interacting parts of the assumed biological units. The last important type of interface provided here is composed of the interacting parts between proteins and nucleic acids. Overall, the data set consists of about 180,000 interfaces.
JAIL is a convenient tool for browsing the interface library and analysing single interfaces. However, more general questions require large-scale analysis. For this purpose, a detailed form enables the compilation of comprehensive non-redundant data sets for download.
How is an interface defined?
A complete residue is part of an interface if at least one atom of the amino acid is located within 4.5 Ångström of any atom of the interacting domain or chain. One part of an interface must consist of at least 5 C-alpha atoms in the case of protein chains. In the case of nucleic acids, the relevant atoms are calculated on the basis of the phosphorus atoms of the RNA/DNA backbone.
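The distance criterion can be sketched in a few lines. This is a minimal illustration, not JAIL's actual implementation: the atom layout (a residue identifier plus an x, y, z tuple), the function names and the toy coordinates are all assumptions, the cutoff is taken as 4.5 Å, and the five C-alpha rule is read as "at least five residues on one side".

```python
import math

CUTOFF = 4.5  # Angstrom; distance threshold, passed as a parameter below


def interface_residues(atoms_a, atoms_b, cutoff=CUTOFF):
    """Residues of partner A with at least one atom within `cutoff` of any atom of B.

    Each atom is a (residue_id, (x, y, z)) tuple; this layout is an assumption.
    """
    hits = set()
    for res_id, xyz in atoms_a:
        if res_id in hits:
            continue  # one close atom is enough for the whole residue
        if any(math.dist(xyz, other) <= cutoff for _, other in atoms_b):
            hits.add(res_id)
    return hits


def is_valid_side(residues, min_residues=5):
    """One C-alpha per residue, so five C-alpha atoms means five residues."""
    return len(residues) >= min_residues


# Toy coordinates, for illustration only.
chain_a = [(1, (0.0, 0.0, 0.0)), (1, (1.0, 0.0, 0.0)),
           (2, (10.0, 0.0, 0.0)), (3, (20.0, 0.0, 0.0))]
chain_b = [(101, (0.0, 0.0, 4.0)), (102, (10.0, 3.0, 3.0))]

side_a = interface_residues(chain_a, chain_b)
print(sorted(side_a), is_valid_side(side_a))
```

The nested loop is quadratic in the number of atoms; for whole structures a spatial index such as a grid or k-d tree would normally replace it.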
What are biounits?
The primary coordinate file deposited in the PDB generally contains one asymmetric unit. The asymmetric unit is the smallest portion of a crystal structure to which crystallographic symmetry can be applied to generate one unit cell. The biological molecule (biounit) is believed to be the functional unit of the protein. Frequently those units can be assumed or calculated when additional information is available. The biological units of many proteins are deposited in a separate section of the PDB database and can be used for interface calculations.
In which way are redundant interfaces excluded in the download section?
Redundancy is excluded in two different ways: by structure and by sequence. The sequence clustering is based on the CD-HIT program. The structural clustering is defined by the protein families and superfamilies of the SCOP classification. The database classifies proteins by domain architecture.
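To give a feel for sequence-based redundancy removal, the sketch below performs a greedy incremental clustering: sequences are processed longest first, and each one either joins the first cluster whose representative it resembles closely enough or founds a new cluster. It is only in the spirit of CD-HIT; the program's own word filtering and alignment are not reproduced, and the similarity measure used here (`difflib`'s ratio) is a crude stand-in for alignment-based sequence identity.

```python
from difflib import SequenceMatcher


def identity(a, b):
    """Crude similarity stand-in: difflib's ratio, not a true alignment identity."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def greedy_cluster(sequences, threshold=0.5):
    """Greedy incremental clustering, longest sequences first.

    Each sequence joins the first representative it matches at or above
    `threshold`; otherwise it becomes the representative of a new cluster.
    """
    clusters = []  # list of (representative, members)
    for seq in sorted(sequences, key=len, reverse=True):
        for rep, members in clusters:
            if identity(rep, seq) >= threshold:
                members.append(seq)
                break
        else:
            clusters.append((seq, [seq]))
    return clusters


# Toy sequences, for illustration only.
seqs = ["MKTAYIAKQR", "MKTAYIAKQK", "GGGGSSSSPP"]
clusters = greedy_cluster(seqs, threshold=0.5)
representatives = [rep for rep, _ in clusters]
print(representatives)
```

Keeping only the representatives yields a non-redundant set; a lower threshold merges more sequences into each cluster and so gives a smaller, more diverse set.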
Which settings in the download section are best for my own research?
The selection of the data sets depends on the type of interaction (protein-protein or protein-nucleic acid) and the level of diversity that is desired. A maximum sequence identity of 50% results in higher diversity than a setting of 95%. The default settings include interfaces of domain-domain interactions as well as interfaces between interacting chains. All interfaces of chains that are already covered by the SCOP domain interfaces are excluded by default. This procedure results in a high number of interfaces that are still diverse enough for statistical analysis.
What is meant by 'show conservation' in Jmol?
The conservation of protein sequences is defined by the mutation rates at each amino acid position. For JAIL, this information was retrieved from ConSurf. ConSurf is a derived database merging structural and sequence information.
Search for:
PDB-Id, e.g. 1aay
SCOP-Id, e.g. d1az0a_
EC-Number, e.g. 3.1.1
Accession number, e.g. P03697
Protein name, e.g. capsid protein
Search in the following interface types: Domain/Domain (SCOP), Chain/Chain, Protein/Nucleic, Biounit/Biounit, None
Full-text search and SCOP text search: by keyword
Consider only interfaces having the following location: intra, inter, don't care
MMDB
Experimentally resolved structures of proteins, RNA and DNA, derived from the Protein Data Bank (PDB), with value-added features such as explicit chemical graphs, computationally identified 3D domains (compact substructures) that are used to identify similar 3D structures, as well as links to literature, similar sequences, information about chemicals bound to the structures, and more. These connections make it possible, for example, to find 3D structures for homologs of a protein sequence of interest, then interactively view the sequence-structure relationships, active sites, bound chemicals, journal articles, and more.
Three-dimensional structures are now known within many protein families and it is quite likely, in searching a sequence database, that one will encounter a homolog with known structure. The goal of Entrez's 3D-structure database is to make this information, and the functional annotation it can provide, easily accessible to molecular biologists. To this end, Entrez's search engine provides three powerful features: (i) Sequence and structure neighbors: one may select all sequences similar to one of interest, for example, and link to any known 3D structures. (ii) Links between databases: one may search by term matching in MEDLINE, for example, and link to 3D structures reported in these articles. (iii) Sequence and structure visualization: identifying a homolog with known structure, one may view molecular-graphic and alignment displays to infer approximate 3D structure. In this article we focus on two features of Entrez's Molecular Modeling Database (MMDB) not described previously: links from individual biopolymer chains within 3D structures to a systematic taxonomy of organisms represented in molecular databases, and links from individual chains (and compact 3D domains within them) to structure neighbors, i.e. other chains (and 3D domains) with similar 3D structure. MMDB may be accessed at http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Structure
SUPERFAMILY is a database of structural and functional annotation for all proteins and genomes.[1][2][3][4][5]
The SUPERFAMILY annotation is based on a collection of hidden Markov models, which represent structural protein domains at the SCOP superfamily level.[6] A superfamily groups together domains which have an evolutionary relationship. The annotation is produced by scanning protein sequences from completely sequenced genomes against the hidden Markov models.
For each protein you can:
Submit sequences for SCOP classification
View domain organisation, sequence alignments and protein sequence details
For each genome you can:
Examine superfamily assignments, phylogenetic trees, domain organisation lists and networks
Check for over- and under-represented superfamilies within a genome
For each superfamily you can:
Inspect SCOP classification, functional annotation, Gene Ontology annotation, InterPro abstract and genome assignments
Explore taxonomic distribution of a superfamily across the tree of life
7272019 Essential Info Notes-1
httpslidepdfcomreaderfullessential-info-notes-1 4657
Figure 2
Snapshot of a web page showing data available in the CATH dictionary of homologous superfamilies
for the subtilisin family (CATH id 34050200) Tables display the PDB codes for non-identicalrelatives in the family together with EC identifier codes and information about the enzyme reactions
The multiple structural alignment shown has been coloured according to secondary structure
assignments (red for helix blue for strands)
The Architecture level in CATH groups proteins whose folds have similar 3D arrangements of secondary structures (eg barrel sandwich or propellor) regardless of their connectivity whilst the
top level Class simply reflects the proportion of α-helix or β-strand secondary structures Three
major classes are recognised mainly-α mainly-β and αminusβ since analysis revealed considerableoverlap between the α+β and alternating αβ classes originally described by Levitt and Chothia (7)
Before classification multidomain proteins are first separated into their constituent folds using a
consensus method which seeks agreement between three independent algorithms (8) Whilst the protocol for updating CATH is largely automatic (9) several stages require manual validation in
particular establishing domain boundaries in proteins for which no consensus could be reached and inchecking the relationships of very distant homologues and proteins having borderline fold similarity
Although there are plans to assign the more regular architectures automatically all architecture
groupings are currently assigned manually
A homologous family Dictionary is now available within CATH which contains functional datawhere available for each protein within a homologous family This includes EC identifiers SWISS-PROT keywords and information from the Enzyme database or the literature (Fig 2) Multiple
structure based alignments are also available coloured according to secondary structure assignments
or residue properties and there are schematic plots showing domain representations annotated by protein-ligand interactions (DOMPLOTS) (AETodd CAOrengo and JMThornton submitted
to Protein Engng ) The topology of each domain is illustrated by schematic TOPS diagrams
(httpwww3ebiacuktops 10)
7272019 Essential Info Notes-1
httpslidepdfcomreaderfullessential-info-notes-1 4757
Figure 3
CATH wheel plot showing the population of homologous families in different fold groupsarchitectures and classes The wheel is coloured according to protein class (red mainly-α green
mainly-β yellow αβ blue few secondary structures) The size of the outer wheel represents the
number of homologous families in CATH whilst each band in the outer wheel corresponds to a single
fold family The size of each fold bandlsquo therefore reflects the number of homologous families having
that fold It can be seen that most fold families contain a single homologous family The superfoldfamilies are shown as paler bands containing many homologous families The inner wheel shows the
population of homologous families in the different architectures
We have also recently set up a Web Server (11) which enables the user to scan the CATH databasewith a newly determined protein structure and identify possible fold similarities or evolutionaryrelationships There are also plans to incorporate sequence searches (using BLAST or PSI-BLAST)
(12) to identify a probable fold for a new sequenceThe latest release of CATH (version 14 April 1998) contains 9342 protein chains from the PDB (13)
which divide into 13 359 domain folds Currently 32 different architectures are recognised Since the
last release three new architectures have been described including the five- bladed α-β propellorGrouping proteins on the basis of sequence structure and functional similarity gives 827 evolutionary
homologous families (H-level) Whilst recognising more distant structural similarity with no
accompanying sequence or function similarity gives rise to 593 different fold groups (T-level)
The population of the different levels in the CATH hierarchy is illustrated by the CATH wheel shown
in Figure 3 It can be seen that several highly populated fold families which we describe as superfolds(6) as they support a diverse range of sequences and more than three different functions still account
for nearly 30 of non-homologous structuresPrevious SectionNext Section
Implications for Structural Genomics
As the sequence databases grow rapidly the need to interpret these sequences and assign functions to
specific genes becomes increasingly important Many techniques exist for matching protein sequencesand thereby inheriting functional information However for very distant homologues there is often no
7272019 Essential Info Notes-1
httpslidepdfcomreaderfullessential-info-notes-1 4857
detectable sequence similarity despite conservation of 3D structure and function For these cases
evolutionary relationships and thereby functions can only be assigned by comparing the structures
Therefore a number of structural genomics initiatives are being proposed (14) which aim to identify
all the folds in nature with the ultimate goal of being able to predict the function of a new protein from
its known or probable structure The important questions to ask are how many more folds do we need
to determine before we have the complete set and how confident can we be in assigning function between proteins having similar structures
In the current genomes, on average only 30–46% of sequences can be assigned to a structural family by recognising sequence similarity to a protein of known structure (15,16). With only ~600 unique structures currently in the PDB compared with ~20 000 sequence families, it is clear that we still need to determine many more structures if we are to understand biology at the molecular level. However, analysis of recently deposited structural data is very revealing.
Figure 4a illustrates the distribution of 2159 new structural domains classified in the 10 months from June 1997 to March 1998. A large proportion of these (79%) were clearly homologous (≥30% identity) to proteins of known structure.
Of the remaining 443 structures (Fig. 4b), corresponding to 'new' sequences, we found only 8% were novel folds, the remainder resembling a previously determined structure. Many of these, 199 (45%), could be identified as clear homologues by having significant structure and sequence similarity (SSAP ≥80 and ≥20% sequence identity). A further 169 (38%) were probable homologues as, although the sequence identity was below 20%, they had functional similarity and/or gave significant scores using sequence search methods designed to detect very distant homologues (PSI-BLAST) (12). There remained a further 40 (9%) proteins which were analogous, i.e. they had the same fold as a previous entry but neither the sequence nor the function gave definite evidence of a common ancestor.
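The four categories described above can be written down as a small decision rule. The sketch below is only an illustration: the function name, its arguments and the treatment of cases the text does not spell out (for example a same-fold match with SSAP below 80 but identity above 20%) are assumptions, not the authors' procedure. It also recomputes the quoted percentages from the reported counts.

```python
def classify_domain(same_fold, ssap_score, seq_identity, distant_evidence):
    """Label a newly classified domain using the thresholds quoted in the text.

    same_fold        -- True if the domain resembles a previously determined fold
    ssap_score       -- structural similarity score (SSAP, 0-100)
    seq_identity     -- percent sequence identity to the matched structure
    distant_evidence -- True if functional similarity and/or a significant
                        PSI-BLAST score supports homology (assumed flag)
    """
    if not same_fold:
        return "novel fold"
    if ssap_score >= 80 and seq_identity >= 20:
        return "clear homologue"
    if distant_evidence:
        return "probable homologue"
    return "analogue"


# Counts reported in the text for the 443 'new' sequences.
total = 443
reported = {"clear homologue": 199, "probable homologue": 169, "analogue": 40}
reported["novel fold"] = total - sum(reported.values())  # the remainder

percentages = {label: round(100 * n / total) for label, n in reported.items()}
for label, pct in percentages.items():
    print(f"{label}: {reported[label]} ({pct}%)")
```

The clear and probable homologues together make up the 83% quoted later in the section as the share of novel sequences that structure can assign to a known family.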
At the homologous superfamily level in CATH, a more detailed analysis of enzyme functions showed that the majority of homologous enzyme families in CATH (>90%) contained proteins for which the first three EC identifiers were the same. Considering those families where homologues have significant sequence identity (≥20%) after structural alignment, 95% were found to have a single EC identifier, whilst for families where proteins have more than 30% sequence similarity we observed that 98% had a single EC code.
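To make "the first three EC identifiers were the same" concrete, here is a minimal sketch of the comparison. The helper names are hypothetical, and the two EC numbers in the usage lines are illustrative inputs rather than data from the analysis.

```python
def ec_prefix(ec_number, levels=3):
    """Return the first `levels` fields of an EC number, e.g. '3.4.21.4' -> ('3', '4', '21')."""
    return tuple(ec_number.split(".")[:levels])


def family_shares_ec(ec_numbers, levels=3):
    """True if every EC number in the family agrees on its first `levels` fields."""
    return len({ec_prefix(ec, levels) for ec in ec_numbers}) == 1


family = ["3.4.21.4", "3.4.21.5"]            # illustrative inputs only
same_three = family_shares_ec(family)         # compares 3.4.21 with 3.4.21
same_four = family_shares_ec(family, levels=4)  # compares the full codes
```

Comparing at three levels tolerates a change of substrate specificity (the fourth field) while still requiring the same general reaction class.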
Although assigning function on the basis of homology is common practice, it is clear that some caution should be exercised, particularly where there is little or no sequence similarity. There are also some clear examples where homologues with significant sequence similarity perform different functions. The role of 'gene recruitment' is especially clear in the eye lens proteins, which function as enzymes in other cellular environments but which are used as structural proteins in this context (17). The extent of such 'gene recruitment' and context-sensitive function is not known at this time. For enzymes, it is clear that catalytic function can change and evolve, usually to act on a different but related substrate. Similarly, within the lipocalin family (CATH id 2.40.130.10) several proteins are found with very similar structures which bind different fatty acids in the same region at the base of the β-barrel (e.g. retinol, bilin, biotin).
Nearly half of the homologous families where two or more different EC numbers were observed belong to the superfolds. This suggests that if a new protein is assigned to a superfold family, more caution should be used when inheriting functional information, as there appears to be greater tolerance to changes in sequence, and ultimately function, for these families. However, it is interesting to note that many of these were TIM barrel or Rossmann folds. These are superfolds in which the substrate or ligand commonly binds in the same place: in the base of the β-barrel for the TIMs, and at the crossover of the polypeptide chain for the doubly wound Rossmann structures.
Assignment of Function Through Structure
One of the reasons for determining structures is to derive more information to facilitate the assignment of function. From our analysis of proteins in CATH we suggest that structural data can help to assign function in several ways:
i. The structural data allow recognition of more distant homologues compared with sequence data: in our analysis 83% of structures with novel sequences could be assigned as homologues in this way (note that such assignment of function is again subject to the caveats imposed by 'gene recruitment' discussed above).
ii. The structural data allow detailed inspection of the functional site, to suggest if and how the function may have evolved. For example, if an enzyme has evolved to act on a different substrate, the binding site may reveal, or at least suggest, possible changes in the substrate.
iii. For the superfolds, similarity of structure does not necessarily mean similarity of function. However, the active site/binding sites are often conserved, e.g. in the TIM barrel or Rossmann fold structures the ligand always binds at the same end of the barrel or sheet.
iv. Some methods have already been developed, and will increasingly be the focus of attention over the next few years, which aim to predict function ab initio from structure. For example, enzymes can often be identified by the presence of a major cleft, which also locates the active site (18). Similarly, critical surface patches which are used for molecular recognition in binding other proteins or ligands may be identified using knowledge-based approaches (19,20).
In summary, extrapolating the data from Figure 4 to a new genome, we can expect that of the 54–70% of sequences which currently have no obvious sequence matches in the PDB, we will find nearly 80–90% to be homologous to a known family using the structural data alone. For the singlet folds this will almost certainly reveal some clues to the function. For the superfolds, some folds will reveal information on the functional class (e.g. enzyme for TIM barrels) or the location of the active site, if not the specific function. Only 10–20% will be expected to be 'novel' folds. For these, the ab initio methods referred to above may provide some clues to guide experiments. Therefore it is clear that determining structures, as part of a 'structural genomics' initiative for example, will make a major contribution to interpreting genome data.
JAIL
Just Another Interface Library
Interfaces of macromolecules are a valuable basis to analyse the process of molecular recognition. JAIL classifies not only the interfaces between domain architectures but also those between protein chains and those between proteins and nucleic acids.
Figure: Gt-alpha/Gi-alpha chimera (PDB-ID 1GOT); interfaces of 1GOT
Interacting proteins are difficult to crystallize and are rarely present within the Protein Data Bank. Nevertheless, it is essential to analyse the interacting parts of the proteins to understand the process of protein-protein docking.
To overcome this problem we have built up the JAIL database. Since interacting domains exhibit structural features similar to those of proteins, all known interfaces between interacting domains of the SCOP database were extracted and classified in JAIL.
Only a part of all protein structures is included in SCOP. In particular, new PDB entries are not yet annotated. To overcome this problem, all interfaces between protein chains were additionally calculated and included in the database. This type of interface also comprises the interacting parts of the assumed biological units. The last important type of interface provided here is composed of the interacting parts between proteins and nucleic acids. Overall, the data set consists of about 180 000 interfaces.
JAIL is a convenient tool to browse through the interface library and to analyze single interfaces. However, more general questions require large-scale analysis. For this purpose, a detailed form enables the compiling of comprehensive non-redundant data sets for download.
How is an interface defined?
A complete residue is part of an interface if at least one atom of the amino acid is located within a range of 4.5 Ångström of any atom of the interacting domain or chain. One part of an interface must consist of at least 5 C-alpha atoms in the case of protein chains. In the case of nucleic acids, the relevant atoms are calculated on the basis of the phosphorus atoms of the RNA/DNA backbone.
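The distance criterion above can be sketched in a few lines. This is a minimal illustration, not JAIL's actual code: the data layout (a dict from residue id to a list of atoms), the function names and the toy coordinates are all assumptions, and a brute-force loop is used where a real implementation would use a spatial index.

```python
import math

CUTOFF = 4.5      # Angstrom, from the definition above
MIN_CA_ATOMS = 5  # minimum C-alpha atoms per interface side (protein chains)


def interface_residues(chain_a, chain_b, cutoff=CUTOFF):
    """Return ids of residues in chain_a with any atom within `cutoff` of any atom of chain_b.

    Each chain is a dict: residue id -> list of (atom_name, x, y, z) tuples.
    """
    coords_b = [atom[1:] for atoms in chain_b.values() for atom in atoms]
    hits = []
    for res_id, atoms in chain_a.items():
        # The whole residue counts as soon as one of its atoms is close enough.
        if any(math.dist(atom[1:], b) <= cutoff for atom in atoms for b in coords_b):
            hits.append(res_id)
    return hits


def has_enough_ca(chain, residue_ids, min_ca=MIN_CA_ATOMS):
    """Check that one side of the interface contains at least `min_ca` C-alpha atoms."""
    n_ca = sum(1 for r in residue_ids for atom in chain[r] if atom[0] == "CA")
    return n_ca >= min_ca


# Toy coordinates (hypothetical): residue 1 is 4.0 A from chain B, residue 2 is far away.
chain_a = {1: [("CA", 0.0, 0.0, 0.0)], 2: [("CA", 10.0, 0.0, 0.0)]}
chain_b = {1: [("CA", 0.0, 0.0, 4.0)]}
side_a = interface_residues(chain_a, chain_b)
```

In this toy case only residue 1 qualifies, and the side would be rejected by the five C-alpha minimum.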
What are biounits?
The primary coordinate file deposited in the PDB generally contains one asymmetric unit. The asymmetric unit is the smallest portion of a crystal structure to which crystallographic symmetry can be applied to generate one unit cell. The biological molecule (biounit) is believed to be the functional unit of the protein. Frequently those units can be assumed or calculated when additional information is available. The biological units of many proteins are deposited in a separate section of the PDB database and can be used for interface calculations.
In which way are redundant interfaces excluded in the download section?
Redundancy is excluded in two different ways: by structure and by sequence. The sequence clustering is based on the CD-HIT program. The structural clustering is defined by the protein families and superfamilies of the SCOP classification. The database classifies proteins by domain architecture.
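The idea behind sequence-based redundancy removal can be illustrated with a greedy incremental clustering, the general strategy that CD-HIT is known for: sequences are processed longest first, and each one either joins the first representative it resembles closely enough or becomes a new representative. The sketch below is an assumption-laden stand-in, not CD-HIT itself: the similarity measure uses Python's `difflib` rather than an alignment-based identity, and the example sequences are made up.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Stand-in similarity in [0, 1]; NOT the alignment identity that CD-HIT computes."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def greedy_cluster(sequences, threshold=0.5):
    """Greedy incremental clustering; returns {representative: [members]}."""
    clusters = {}
    for seq in sorted(sequences, key=len, reverse=True):  # longest first
        for rep in clusters:
            if similarity(seq, rep) >= threshold:
                clusters[rep].append(seq)
                break
        else:
            clusters[seq] = [seq]  # no representative was close enough
    return clusters


# Made-up sequences: the first two differ in one position, the third is unrelated.
seqs = ["MKTAYIAKQR", "MKTAYIAKQK", "GGGGSSSSPP"]
loose = greedy_cluster(seqs, threshold=0.5)
strict = greedy_cluster(seqs, threshold=0.95)
```

Keeping one representative per cluster gives the non-redundant set; a lower threshold merges more sequences, so fewer and more mutually diverse representatives remain.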
Which settings in the download section are best for my own research?
The selection of the datasets depends on the type of interactions (protein-protein or protein-nucleic acid) and the level of diversity that is desired. A maximum sequence identity of 50% results in a higher diversity than a setting of 95%. The default settings include interfaces of domain-domain interactions as well as interfaces between interacting chains. All interfaces of chains that are already covered by the SCOP domain interfaces are excluded by default. This procedure results in a high number of interfaces that are still diverse enough for statistical analysis.
What is meant by 'show conservation' in Jmol?
The conservation of protein sequences is defined by the mutation rates at each amino acid position. For JAIL this information was retrieved from ConSurf. ConSurf is a derived database merging structural and sequence information.
Database search
JAIL can be searched by:
PDB-Id (e.g. 1aay)
SCOP-Id (e.g. d1az0a_)
EC-Number (e.g. 311)
Accession number (e.g. P03697)
Protein name (e.g. capsid protein)
The search can be restricted to the following interface types: Domain/Domain (SCOP), Chain/Chain, Protein/Nucleic, Biounit/Biounit, or none. A full-text keyword search and a SCOP text search are also available, and results can be limited to interfaces having a given location (intra, inter, or don't care).
MMDB
MMDB contains experimentally resolved structures of proteins, RNA and DNA, derived from the Protein Data Bank (PDB), with value-added features such as explicit chemical graphs and computationally identified 3D domains (compact substructures) that are used to identify similar 3D structures, as well as links to literature, similar sequences, information about chemicals bound to the structures, and more. These connections make it possible, for example, to find 3D structures for homologs of a protein sequence of interest, then interactively view the sequence-structure relationships, active sites, bound chemicals, journal articles, and more.
Three-dimensional structures are now known within many protein families, and it is quite likely, in searching a sequence database, that one will encounter a homolog with known structure. The goal of Entrez's 3D-structure database is to make this information, and the functional annotation it can provide, easily accessible to molecular biologists. To this end, Entrez's search engine provides three powerful features: (i) Sequence and structure neighbors: one may select all sequences similar to one of interest, for example, and link to any known 3D structures. (ii) Links between databases: one may search by term matching in MEDLINE, for example, and link to 3D structures reported in these articles. (iii) Sequence and structure visualization: identifying a homolog with known structure, one may view molecular-graphic and alignment displays to infer approximate 3D structure. In this article we focus on two features of Entrez's Molecular Modeling Database (MMDB) not described previously: links from individual biopolymer chains within 3D structures to a systematic taxonomy of organisms represented in molecular databases, and links from individual chains (and compact 3D domains within them) to structure neighbors, that is, other chains (and 3D domains) with similar 3D structure. MMDB may be accessed at http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Structure
SUPERFAMILY is a database of structural and functional annotation for all proteins and genomes.[1][2][3][4][5] The SUPERFAMILY annotation is based on a collection of hidden Markov models which represent structural protein domains at the SCOP superfamily level.[6] A superfamily groups together domains which have an evolutionary relationship. The annotation is produced by scanning protein sequences from completely sequenced genomes against the hidden Markov models.
For each protein you can:
Submit sequences for SCOP classification
View domain organisation, sequence alignments and protein sequence details
For each genome you can:
Examine superfamily assignments, phylogenetic trees, domain organisation lists and networks
Check for over- and under-represented superfamilies within a genome
For each superfamily you can:
Inspect SCOP classification, functional annotation, Gene Ontology annotation, InterPro abstract and genome assignments
Explore taxonomic distribution of a superfamily across the tree of life
All annotation, models and the database dump are freely available for download to everyone.
Purpose
SUPERFAMILY classifies amino acid sequences into known structural domains, especially into SCOP superfamilies. The superfamilies are groups of proteins which have structural evidence to support a common evolutionary ancestor but may not have detectable sequence homology.
Major Features
Sequence search: Submit your protein or DNA sequence for SCOP superfamily and family level classification.
Keyword search: Search for superfamily, family or species names, plus sequence, SCOP, PDB or hidden Markov model IDs.
Domain assignments: Domain assignments, alignments and architectures for completely sequenced eukaryotic and prokaryotic organisms, plus sequence collections.
Comparative genomics tools: Browse unusual (over- and under-represented) superfamilies and families, adjacent domain pair lists and graphs, unique domain pairs, domain combinations, domain architecture co-occurrence networks and domain distribution across taxonomic kingdoms for each organism.
Genome statistics: For each genome: number of sequences, number of sequences with assignment, percentage of sequences with assignment, percentage total sequence coverage, number of domains assigned, number of superfamilies assigned, number of families assigned, average superfamily size, percentage produced by duplication, average sequence length, average length matched, number of domain pairs and number of unique domain architectures.
Gene Ontology: Domain-centric Gene Ontology (GO), automatically annotated, by Hai Fang.
Phenotype Ontology: Domain-centric phenotype/anatomy ontology, including Disease Ontology, Human Phenotype, Mouse Phenotype, Worm Phenotype, Yeast Phenotype, Fly Phenotype, Fly Anatomy, Zebrafish Anatomy, Xenopus Anatomy and Arabidopsis Plant.
Superfamily annotation: InterPro abstracts for 1052 superfamilies and Gene Ontology (GO) annotation for 763 superfamilies.
Functional annotation: Functional annotation of SCOP 1.73 superfamilies by Christine Vogel.
Phylogenetic trees: Trees are generated using heuristic parsimony methods and are based on protein domain architecture data for all genomes in SUPERFAMILY. Genome combinations or specific clades can be displayed as individual trees.
Similar domain architectures: Find the 10 domain architectures which are most similar to a domain architecture of interest.
Hidden Markov models: Produce SCOP domain assignments for your sequences using the SUPERFAMILY models. HMM visualisation by Martin Madera, e.g. model 0045110.
Profile comparison: Find remote domain matches when the HMM search fails to find a significant match. Profile comparison (PRC) for aligning and scoring two profile hidden Markov models, by Martin Madera.
Web services: Distributed Annotation Server and linking to SUPERFAMILY.
Downloads: Sequences, assignments, models, MySQL database and scripts, updated weekly.
The SUPERFAMILY database provides protein domain assignments, at the SCOP superfamily level, for the predicted protein sequences in over 400 completed genomes. A superfamily groups together domains of different families which have a common evolutionary ancestor based on structural, functional and sequence data. SUPERFAMILY domain assignments are generated using an expert curated set of profile hidden Markov models. All models and structural assignments are available for browsing and download from http://supfam.org. The web interface includes services such as domain architectures and alignment details for all protein assignments, searchable domain combinations, domain occurrence network visualization, detection of over- or under-represented superfamilies for a given genome by comparison with other genomes, assignment of manually submitted sequences, and keyword searches. In this update we describe the SUPERFAMILY database and outline two major developments: (i) incorporation of family level assignments and (ii) a superfamily-level functional annotation. The SUPERFAMILY database can be used for general protein evolution and superfamily-specific studies, genomic annotation, and structural genomics target suggestion and assessment.
Figure 3
CATH wheel plot showing the population of homologous families in different fold groups, architectures and classes. The wheel is coloured according to protein class (red, mainly-α; green, mainly-β; yellow, αβ; blue, few secondary structures). The size of the outer wheel represents the number of homologous families in CATH, whilst each band in the outer wheel corresponds to a single fold family. The size of each fold 'band' therefore reflects the number of homologous families having that fold. It can be seen that most fold families contain a single homologous family. The superfold families are shown as paler bands containing many homologous families. The inner wheel shows the population of homologous families in the different architectures.
We have also recently set up a Web Server (11) which enables the user to scan the CATH database with a newly determined protein structure and identify possible fold similarities or evolutionary relationships. There are also plans to incorporate sequence searches (using BLAST or PSI-BLAST) (12) to identify a probable fold for a new sequence.
The latest release of CATH (version 1.4, April 1998) contains 9342 protein chains from the PDB (13)
which divide into 13 359 domain folds Currently 32 different architectures are recognised Since the
last release three new architectures have been described including the five- bladed α-β propellorGrouping proteins on the basis of sequence structure and functional similarity gives 827 evolutionary
homologous families (H-level) Whilst recognising more distant structural similarity with no
accompanying sequence or function similarity gives rise to 593 different fold groups (T-level)
The population of the different levels in the CATH hierarchy is illustrated by the CATH wheel shown
in Figure 3 It can be seen that several highly populated fold families which we describe as superfolds(6) as they support a diverse range of sequences and more than three different functions still account
for nearly 30 of non-homologous structuresPrevious SectionNext Section
Implications for Structural Genomics
As the sequence databases grow rapidly the need to interpret these sequences and assign functions to
specific genes becomes increasingly important Many techniques exist for matching protein sequencesand thereby inheriting functional information However for very distant homologues there is often no
7272019 Essential Info Notes-1
httpslidepdfcomreaderfullessential-info-notes-1 4857
detectable sequence similarity despite conservation of 3D structure and function For these cases
evolutionary relationships and thereby functions can only be assigned by comparing the structures
Therefore a number of structural genomics initiatives are being proposed (14) which aim to identify
all the folds in nature with the ultimate goal of being able to predict the function of a new protein from
its known or probable structure The important questions to ask are how many more folds do we need
to determine before we have the complete set and how confident can we be in assigning function between proteins having similar structures
In the current genomes on average only 30 ndash 46 of sequences can be assigned to a structural family
by recognising sequence similarity to a protein of known structure (1516) With only sim600 unique
structures currently in the PDB compared with sim20 000 sequence families it is clear that we still
need to determine many more structures if we are to understand biology at the molecular level
However analysis of recently deposited structural data is very revealingFigure 4a illustrates the
distribution of 2159 new structural domains classified in the 10 months from June 1997 to March
1998 A large proportion of these (79) were clearly homologous (ge30 identity) to proteins of known structure
Of the remaining 443 structures (Fig 4b) corresponding to newlsquo sequences we found only 8 were
novel folds the remainder resembling a previously determined structure Many of these 199 (45)
could be identified as clear homologues by having significant structure and sequence similarity (SSAP
ge80 and ge20 sequence identity) A further 169 (38) were probable homologues as although thesequence identity was below 20 they had functional similarity andor gave significant scores using
sequence search methods designed to detect very distant homologues (PSIBLAST) (12) There
remained a further 40 (9) proteins which were analogous mdash ie they had the same fold as a previous
entry but neither the sequence nor the function gave definite evidence of a common ancestor
7272019 Essential Info Notes-1
httpslidepdfcomreaderfullessential-info-notes-1 4957
7272019 Essential Info Notes-1
httpslidepdfcomreaderfullessential-info-notes-1 5057
At the homologous superfamily level in CATH a more detailed analysis of enzyme functions showed
that the majority of homologous enzyme families in CATH (gt90) contained proteins for which the
first three EC identifiers were the same Considering those families where homologues have
significant sequence identity (ge20) after structural alignment 95 were found to have a single EC
identifier whilst for families where proteins have more than 30 sequence similarity we observed
that 98 had a single EC code
Although assigning function on the basis of homology is common practice it is clear that some
caution should be exercised particularly where there is little or no sequence similarity There are also
some clear examples where homologues with significant sequence similarity perform different
functions The role of gene recruitmentlsquo is especially clear in the eye lens proteins which function as
enzymes in other cellular environments but which are used as structural proteins in this context (17)
The extent of such gene recruitmentlsquo and context-sensitive function is really not known at this time
For enzymes it is clear that catalytic function can change and evolve usually to act on a different but
related substrate Similarly within the lipocalin family (CATH id 24013010) several proteins arefound with very similar structures which bind different fatty acids in the same region at the base of
the β-barrel (eg retinol bilin biotin)
Nearly half of the homologous families where two or more different EC numbers were observed
belong to the superfolds This suggests that if a new protein is assigned to a superfold family more
caution should be used when inheriting functional information as there appears to be greater toleranceto changes in sequence and ultimately function for these families However it is interesting to note
that many of these were TIM barrel or Rossmann folds These are superfolds in which the substrate or
ligand commonly binds in the same place This is in the base of the β-barrel for the TIMs and at the
crossover of the polypeptide chain for the doubly wound Rossmann structures
Previous SectionNext Section
Assignment of Function Through Structure
One of the reasons for determining structures is to derive more information to facilitate the assignmentof function From our analysis of proteins in CATH we suggest that structural data can help to assign
function in several ways
i The structural data allow recognition of more distant homologues compared withsequence data mdash in our analysis 83 of structures with novel sequences could be assigned ashomologues in this way (note that such assignment of function is again subject to the caveats
imposed by gene recruitmentlsquo discussed above)
ii The structural data allows detailed inspection of the functional site mdash to suggest if and
how the function may have evolved For example if an enzyme has evolved to act on a
different substrate the binding site may reveal or at least suggest possible changes in the
substrate
iii For the superfolds similarity of structure does not necessarily mean similarity of
function However the active sitebinding sites are often conserved eg in the TIM barrel or
Rossmann fold structures the ligand always binds at the same end of the barrel or sheet
iv Some methods have already been developed and will increasingly be the focus of attention over the next few years which aim to predict function ab initio from structure For
example enzymes can often be identified by the presence of a major cleft which also locates
the active site (18) Similarly critical surface patches which are used for molecular
recognition in binding other proteins or ligands may be identified using knowledge-basedapproaches (1920)
In summary extrapolating the data from Figure 4 to a new genome we can expect that of the 54 ndash 70
of sequences which currently have no obvious sequence matches in the PDB we will find nearly 80 ndash 90 to be homologous to a known family using the structural data alone For the singlet folds this
7272019 Essential Info Notes-1
httpslidepdfcomreaderfullessential-info-notes-1 5157
will almost certainly reveal some clues to the function For the superfolds some folds will reveal
information on the functional class (eg enzyme for TIM barrels) or the location of the active site if
not the specific function Only 10 ndash20 will be expected to be novellsquo folds For these the ab
initio methods referred to above may provide some clues to guide experiments Therefore it is clear
that determining structures as part of a structural genomicslsquo initiative for example will make a
major contribution to interpreting genome data
Jail
Just another interface library
Interfaces of macromolecules are a valuable basis to analyse the processof molecular recognition JAIL classifies not only the interfaces betweendomain architectures but also those between protein chains and thosebetween proteins and nucleic acids
Of course not
Gt-alphaGi-alpha chimera (PDB-ID 1GOT) Interfaces of 1GOT
Interacting proteins are difficult to crystallize and rarely present within the Protein Data Base
Nevertheless it is essential to analyse the interacting parts of the proteins to understand the
process of protein-protein docking
To overcome this problem we have built up the JAIL database Since interacting domains
exhibit similiar structural features than proteins all known interfaces between interacting
domains of the SCOP database were extracted and classified in JAIL
Only a part of all protein structures are included in SCOP Particularly new PDB entries are
not yet annotated To overcome this problem additionally all interfaces between protein
chains were calculated and included in the database This type of interface also comprises
the interacting parts of the assumed biological units The last important type of interfaces
provided here is composed of the interacting parts between proteins and nucleic acids
Overall the data set consists of about 180000 interfaces
JAIL is a comfortable tool to browse through the interface library and to analyze single
interfaces However more general questions require large-scale analysis For this purpose
a detailed form enables the compiling of comprehensive non redundant data sets for
download
7272019 Essential Info Notes-1
httpslidepdfcomreaderfullessential-info-notes-1 5257
How is an interface defined
A complete residue is part of an interface if at least one atom of the aminoacid is located
within a range of 45 Angstroem of any atom of the interacting domain or chain One part of
an interface must consist of at least 5 C-alpha atoms in the case of protein chains In thecase of nucleic acids the relevant atoms are calculated on the basis of the phosphor atoms
of the RNADNA-backbone
What are biounits
The primary coordinate file deposited in the PDB generally contains one asymmetric unit
The asymmetric unit is the smallest portion of a crystal structure to which crystallographic
symmetry can be applied to generate one cell The biological molecule (biounit) is believed
to be the functional unit of the protein Frequently those units can be assumed or calculatedwhen additional information is available The biological units of many proteins are deposited
in a separate section of the PDB database and can be used for interface calculations More
information about biounits
In which way are redundant interfaces excluded in the download section
The redundancy is excluded in two different ways by structure and by sequence The
sequential clustering is based on the Cd-hit program The structural clustering is defined by
the protein families and superfamilies of the SCOP classification The database classifies
proteins by domain architecture
Which settings in the download section are best for my own research
The selection of the datasets depends on the type of interactions (protein-protein or protein-
nucleic acids) and the level of diversity that is desired Sequence identity of maximal 50
results in a higher diversity than the setting to 95 The default settings include interfaces of
domain-domain interactions as well as interfaces between interacting chains All interfaces of
chains that were already treated by the SCOP domain interfaces are excluded by default
This procedure results in a high number of interfaces that are still diverse enough for
statistical analysis
What is meant by show conservation in Jmol
The conservation of protein sequences is defined by the mutation rates at each amino acid
7272019 Essential Info Notes-1
httpslidepdfcomreaderfullessential-info-notes-1 5357
position For JAIL this information was retrieved from ConSurf ConSurf is a derived
database merging structural and sequence information
Database scheme
The JAIL search form offers the following options:
Search for: PDB-Id (e.g. 1aay), SCOP-Id (e.g. d1az0a_), EC-Number (e.g. 3.1.1), Accession number
    (e.g. P03697) or Protein name (e.g. capsid protein)
Search in the following interface types: Domain/Domain (SCOP), Chain/Chain, Protein/Nucleic,
    Biounit/Biounit or None
Full-text search: by keyword
SCOP text search: by keyword
Consider only interfaces having the following location: intra, inter or don't care
MMDB
Experimentally resolved structures of proteins, RNA and DNA, derived from the Protein Data
Bank (PDB), with value-added features such as explicit chemical graphs and computationally
identified 3D domains (compact substructures) that are used to identify similar 3D structures,
as well as links to literature, similar sequences, information about chemicals bound to the
structures, and more. These connections make it possible, for example, to find 3D structures
for homologs of a protein sequence of interest, then interactively view the sequence-structure
relationships, active sites, bound chemicals, journal articles and more.
Three-dimensional structures are now known within many protein families, and it is
quite likely, in searching a sequence database, that one will encounter a homolog
with known structure. The goal of Entrez's 3D-structure database is to make this
information, and the functional annotation it can provide, easily accessible to
molecular biologists. To this end, Entrez's search engine provides three powerful
features: (i) Sequence and structure neighbors; one may select all sequences similar
to one of interest, for example, and link to any known 3D structures. (ii) Links
between databases; one may search by term matching in MEDLINE, for example, and
link to 3D structures reported in these articles. (iii) Sequence and structure
visualization; identifying a homolog with known structure, one may view molecular-graphic
and alignment displays to infer approximate 3D structure. In this article we
focus on two features of Entrez's Molecular Modeling Database (MMDB) not
described previously: links from individual biopolymer chains within 3D structures
to a systematic taxonomy of organisms represented in molecular databases, and
links from individual chains (and compact 3D domains within them) to structure
neighbors, other chains (and 3D domains) with similar 3D structure. MMDB may be
accessed at http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Structure
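The text above gives only a web address. For scripted access, one option is NCBI's E-utilities, which is an assumption on our part and not something the source describes. The sketch below only builds an ESearch query URL against the structure database and does not send it; the helper name and the search term are hypothetical.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities ESearch service.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def structure_search_url(term: str, retmax: int = 20) -> str:
    """Build (but do not fetch) an ESearch URL for the Entrez structure database."""
    params = {"db": "structure", "term": term, "retmax": retmax}
    return f"{EUTILS_ESEARCH}?{urlencode(params)}"


url = structure_search_url("capsid protein")
print(url)
```

Fetching the URL (for example with urllib.request) would return a list of matching structure identifiers; that step is left out so the sketch runs without network access.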
SUPERFAMILY is a database of structural and functional annotation for all
proteins and genomes.[1][2][3][4][5]
The SUPERFAMILY annotation is based on a collection of hidden Markov models which
represent structural protein domains at the SCOP superfamily level.[6] A superfamily groups
together domains which have an evolutionary relationship. The annotation is produced by
scanning protein sequences from completely sequenced genomes against the hidden Markov
models.
For each protein you can:
    Submit sequences for SCOP classification
    View domain organisation, sequence alignments and protein sequence details
For each genome you can:
    Examine superfamily assignments, phylogenetic trees, domain organisation lists and
    networks
    Check for over- and under-represented superfamilies within a genome
For each superfamily you can:
    Inspect SCOP classification, functional annotation, Gene Ontology annotation, InterPro
    abstract and genome assignments
    Explore taxonomic distribution of a superfamily across the tree of life
All annotation, models and the database dump are freely available for download to everyone.
Purpose
SUPERFAMILY classifies amino acid sequences into known structural domains, especially
into SCOP superfamilies. The superfamilies are groups of proteins which have structural
evidence to support a common evolutionary ancestor but may not have detectable
sequence homology.
Major Features
Sequence search: Submit your protein or DNA sequence for SCOP superfamily and family
    level classification.
Keyword search: Search for superfamily, family or species names, plus sequence, SCOP, PDB
    or hidden Markov model IDs.
Domain assignments: Domain assignments, alignments and architectures for completely
    sequenced eukaryotic and prokaryotic organisms, plus sequence collections.
Comparative genomics tools: Browse unusual (over- and under-represented) superfamilies and
    families, adjacent domain pair lists and graphs, unique domain pairs, domain
    combinations, domain architecture co-occurrence networks and domain distribution
    across taxonomic kingdoms for each organism.
Genome statistics: For each genome: number of sequences, number of sequences with
    assignment, percentage of sequences with assignment, percentage total sequence
    coverage, number of domains assigned, number of superfamilies assigned, number of
    families assigned, average superfamily size, percentage produced by duplication,
    average sequence length, average length matched, number of domain pairs and number
    of unique domain architectures.
Gene Ontology: Domain-centric Gene Ontology (GO), automatically annotated, by Hai Fang.
Phenotype Ontology: Domain-centric phenotype/anatomy ontology including Disease
    Ontology, Human Phenotype, Mouse Phenotype, Worm Phenotype, Yeast Phenotype,
    Fly Phenotype, Fly Anatomy, Zebrafish Anatomy, Xenopus Anatomy and Arabidopsis Plant.
Superfamily annotation: InterPro abstracts for 1052 superfamilies and Gene Ontology (GO)
    annotation for 763 superfamilies.
Functional annotation: Functional annotation of SCOP 1.73 superfamilies by Christine Vogel.
Phylogenetic trees: Trees are generated using heuristic parsimony methods and are based on
    protein domain architecture data for all genomes in SUPERFAMILY. Genome combinations
    or specific clades can be displayed as individual trees.
Similar domain architectures: Find the 10 domain architectures which are most similar to a
    domain architecture of interest.
Hidden Markov models: Produce SCOP domain assignments for your sequences using the
    SUPERFAMILY models. HMM visualisation by Martin Madera, e.g. model 0045110.
Profile comparison: Find remote domain matches when the HMM search fails to find a
    significant match. Profile comparison (PRC), for aligning and scoring two profile
    hidden Markov models, by Martin Madera.
Web services: Distributed Annotation Server and linking to SUPERFAMILY.
Downloads: Sequences, assignments, models, MySQL database and scripts, updated weekly.
The SUPERFAMILY database provides protein domain assignments, at the SCOP superfamily
level, for the predicted protein sequences in over 400 completed genomes. A superfamily groups
together domains of different families which have a common evolutionary ancestor based on
structural, functional and sequence data. SUPERFAMILY domain assignments are generated
using an expert curated set of profile hidden Markov models. All models and structural
assignments are available for browsing and download from http://supfam.org. The web interface
includes services such as domain architectures and alignment details for all protein assignments,
searchable domain combinations, domain occurrence network visualization, detection of over- or
under-represented superfamilies for a given genome by comparison with other genomes,
assignment of manually submitted sequences and keyword searches. In this update we describe
the SUPERFAMILY database and outline two major developments: (i) incorporation of family
level assignments and (ii) a superfamily-level functional annotation. The SUPERFAMILY
database can be used for general protein evolution and superfamily-specific studies, genomic
annotation, and structural genomics target suggestion and assessment.
detectable sequence similarity despite conservation of 3D structure and function. For these cases,
evolutionary relationships, and thereby functions, can only be assigned by comparing the structures.
Therefore, a number of structural genomics initiatives are being proposed (14) which aim to identify
all the folds in nature, with the ultimate goal of being able to predict the function of a new protein from
its known or probable structure. The important questions to ask are: how many more folds do we need
to determine before we have the complete set, and how confident can we be in assigning function
between proteins having similar structures?
In the current genomes, on average only 30-46% of sequences can be assigned to a structural family
by recognising sequence similarity to a protein of known structure (15,16). With only ~600 unique
structures currently in the PDB, compared with ~20 000 sequence families, it is clear that we still
need to determine many more structures if we are to understand biology at the molecular level.
However, analysis of recently deposited structural data is very revealing. Figure 4a illustrates the
distribution of 2159 new structural domains classified in the 10 months from June 1997 to March
1998. A large proportion of these (79%) were clearly homologous (≥30% identity) to proteins of
known structure.
Of the remaining 443 structures (Fig. 4b), corresponding to "new" sequences, we found only 8% were
novel folds, the remainder resembling a previously determined structure. Many of these, 199 (45%),
could be identified as clear homologues by having significant structure and sequence similarity (SSAP
≥80 and ≥20% sequence identity). A further 169 (38%) were probable homologues as, although the
sequence identity was below 20%, they had functional similarity and/or gave significant scores using
sequence search methods designed to detect very distant homologues (PSI-BLAST) (12). There
remained a further 40 (9%) proteins which were analogous, i.e. they had the same fold as a previous
entry but neither the sequence nor the function gave definite evidence of a common ancestor.
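The breakdown above can be checked with a few lines of arithmetic. The sketch below uses only the counts quoted in the text; the number of novel folds is derived as the remainder of the 443 structures and is not a figure the source states as a count. Variable names are our own.

```python
# Counts quoted in the text for domains classified June 1997 to March 1998.
total_domains = 2159
novel_sequences = 443        # structures corresponding to "new" sequences
clear_homologues = 199
probable_homologues = 169
analogues = 40

# Novel folds are whatever remains of the 443 (derived here, not quoted as a count).
novel_folds = novel_sequences - clear_homologues - probable_homologues - analogues


def pct(part: int, whole: int) -> int:
    """Percentage rounded to the nearest integer."""
    return round(100 * part / whole)


summary = {
    "homologous to a known structure": pct(total_domains - novel_sequences, total_domains),
    "clear homologues": pct(clear_homologues, novel_sequences),
    "probable homologues": pct(probable_homologues, novel_sequences),
    "analogues": pct(analogues, novel_sequences),
    "novel folds": pct(novel_folds, novel_sequences),
}
for label, value in summary.items():
    print(f"{label}: {value}%")
```

The four categories of the 443 structures sum to 100%, which confirms that the bare numbers in parentheses are percentages.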
At the homologous superfamily level in CATH, a more detailed analysis of enzyme functions showed
that the majority of homologous enzyme families in CATH (>90%) contained proteins for which the
first three EC identifiers were the same. Considering those families where homologues have
significant sequence identity (≥20%) after structural alignment, 95% were found to have a single EC
identifier, whilst for families where proteins have more than 30% sequence similarity we observed
that 98% had a single EC code.
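A minimal sketch of the comparison described above: an EC number has four dot-separated levels, and a family "shares the first three EC identifiers" when every member agrees on levels one to three. The helper names and the two example groupings are hypothetical and are not taken from CATH.

```python
def ec_prefix(ec: str, levels: int = 3) -> tuple:
    """First `levels` fields of an EC number, e.g. '3.1.1.7' -> ('3', '1', '1')."""
    return tuple(ec.split(".")[:levels])


def shares_ec_prefix(ec_numbers, levels: int = 3) -> bool:
    """True if all EC numbers agree on their first `levels` fields."""
    return len({ec_prefix(ec, levels) for ec in ec_numbers}) == 1


# Illustrative groupings only; not claimed to be real homologous families.
group_a = ["3.1.1.1", "3.1.1.7"]   # differ only at the fourth level
group_b = ["3.1.1.1", "2.7.1.1"]   # differ at the first level

print(shares_ec_prefix(group_a))             # same first three identifiers
print(shares_ec_prefix(group_b))             # different enzyme class
print(shares_ec_prefix(group_a, levels=4))   # not a single full EC code
```

Setting `levels=4` corresponds to the stricter "single EC code" criterion used for the 95% and 98% figures.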
Although assigning function on the basis of homology is common practice, it is clear that some
caution should be exercised, particularly where there is little or no sequence similarity. There are also
some clear examples where homologues with significant sequence similarity perform different
functions. The role of "gene recruitment" is especially clear in the eye lens proteins, which function as
enzymes in other cellular environments but which are used as structural proteins in this context (17).
The extent of such "gene recruitment" and context-sensitive function is really not known at this time.
For enzymes, it is clear that catalytic function can change and evolve, usually to act on a different but
related substrate. Similarly, within the lipocalin family (CATH id 2.40.130.10) several proteins are
found with very similar structures which bind different fatty acids in the same region at the base of
the β-barrel (e.g. retinol, bilin, biotin).
Nearly half of the homologous families where two or more different EC numbers were observed
belong to the superfolds. This suggests that if a new protein is assigned to a superfold family, more
caution should be used when inheriting functional information, as there appears to be greater tolerance
to changes in sequence, and ultimately function, for these families. However, it is interesting to note
that many of these were TIM barrel or Rossmann folds. These are superfolds in which the substrate or
ligand commonly binds in the same place. This is in the base of the β-barrel for the TIMs and at the
crossover of the polypeptide chain for the doubly wound Rossmann structures.
Assignment of Function Through Structure
One of the reasons for determining structures is to derive more information to facilitate the assignment
of function. From our analysis of proteins in CATH, we suggest that structural data can help to assign
function in several ways:
i. The structural data allow recognition of more distant homologues compared with
sequence data; in our analysis, 83% of structures with novel sequences could be assigned as
homologues in this way (note that such assignment of function is again subject to the caveats
imposed by "gene recruitment" discussed above).
ii. The structural data allow detailed inspection of the functional site, to suggest if and
how the function may have evolved. For example, if an enzyme has evolved to act on a
different substrate, the binding site may reveal, or at least suggest, possible changes in the
substrate.
iii. For the superfolds, similarity of structure does not necessarily mean similarity of
function. However, the active site/binding sites are often conserved, e.g. in the TIM barrel or
Rossmann fold structures the ligand always binds at the same end of the barrel or sheet.
iv. Some methods have already been developed, and will increasingly be the focus of
attention over the next few years, which aim to predict function ab initio from structure. For
example, enzymes can often be identified by the presence of a major cleft, which also locates
the active site (18). Similarly, critical surface patches which are used for molecular
recognition in binding other proteins or ligands may be identified using knowledge-based
approaches (19,20).
In summary, extrapolating the data from Figure 4 to a new genome, we can expect that, of the 54-70%
of sequences which currently have no obvious sequence matches in the PDB, we will find nearly
80-90% to be homologous to a known family using the structural data alone. For the singlet folds this
will almost certainly reveal some clues to the function. For the superfolds, some folds will reveal
information on the functional class (e.g. enzyme for TIM barrels) or the location of the active site, if
not the specific function. Only 10-20% will be expected to be "novel" folds. For these, the ab
initio methods referred to above may provide some clues to guide experiments. Therefore, it is clear
that determining structures, as part of a "structural genomics" initiative for example, will make a
major contribution to interpreting genome data.
JAIL
Just another interface library
Interfaces of macromolecules are a valuable basis for analysing the process of molecular
recognition. JAIL classifies not only the interfaces between domain architectures but also
those between protein chains and those between proteins and nucleic acids.
Of course not
[Figure: Gt-alpha/Gi-alpha chimera (PDB-ID 1GOT) and the interfaces of 1GOT]
Interacting proteins are difficult to crystallize and are rarely present in the Protein Data Bank.
Nevertheless, it is essential to analyse the interacting parts of proteins to understand the
process of protein-protein docking.
To overcome this problem, we have built the JAIL database. Since interacting domains
exhibit structural features similar to those of interacting proteins, all known interfaces between
interacting domains of the SCOP database were extracted and classified in JAIL.
Only a part of all protein structures is included in SCOP. New PDB entries in particular are
not yet annotated. To overcome this problem, all interfaces between protein
chains were additionally calculated and included in the database. This type of interface also
comprises the interacting parts of the assumed biological units. The last important type of
interface provided here is composed of the interacting parts between proteins and nucleic acids.
Overall, the data set consists of about 180,000 interfaces.
JAIL is a convenient tool for browsing the interface library and analysing single
interfaces. However, more general questions require large-scale analysis. For this purpose,
a detailed form enables the compilation of comprehensive non-redundant data sets for
download.
How is an interface defined?
A complete residue is part of an interface if at least one atom of the amino acid is located
within 4.5 Å of any atom of the interacting domain or chain. One part of
an interface must consist of at least 5 C-alpha atoms in the case of protein chains. In the
case of nucleic acids, the relevant atoms are calculated on the basis of the phosphorus atoms
of the RNA/DNA backbone.
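The definition above can be sketched in a few lines, taking the cutoff as 4.5 Å (an assumed reading of the value given in the source) and representing each atom as a hypothetical tuple of (residue id, atom name, x, y, z). This is a brute-force illustration for small inputs, not the JAIL implementation.

```python
from math import dist

CUTOFF = 4.5   # Angstrom; assumed reading of the cutoff in the text
MIN_CA = 5     # minimum number of C-alpha atoms per interface side


def interface_residues(side, partner, cutoff=CUTOFF):
    """Ids of residues in `side` with at least one atom within `cutoff` of any partner atom."""
    hits = set()
    for res_id, _name, *xyz in side:
        if res_id in hits:
            continue  # the whole residue already counts as interface
        if any(dist(xyz, p[2:]) <= cutoff for p in partner):
            hits.add(res_id)
    return hits


def has_enough_ca(side, residues, minimum=MIN_CA):
    """True if the interface residues of `side` contain at least `minimum` C-alpha atoms."""
    n_ca = sum(1 for res_id, name, *_ in side if name == "CA" and res_id in residues)
    return n_ca >= minimum


# Toy coordinates: two parallel strands 4.0 A apart; chain A has one extra residue.
chain_a = [(i, "CA", 3.8 * i, 0.0, 0.0) for i in range(1, 7)]
chain_b = [(100 + i, "CA", 3.8 * i, 4.0, 0.0) for i in range(1, 6)]

side_a = interface_residues(chain_a, chain_b)
side_b = interface_residues(chain_b, chain_a)
print(sorted(side_a), has_enough_ca(chain_a, side_a))
print(sorted(side_b), has_enough_ca(chain_b, side_b))
```

For real structures a spatial index (for example a k-d tree) would replace the nested loop, and nucleic-acid partners would be handled through their backbone phosphorus atoms as the text describes.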
What are biounits
The primary coordinate file deposited in the PDB generally contains one asymmetric unit
The asymmetric unit is the smallest portion of a crystal structure to which crystallographic
symmetry can be applied to generate one cell The biological molecule (biounit) is believed
to be the functional unit of the protein Frequently those units can be assumed or calculatedwhen additional information is available The biological units of many proteins are deposited
in a separate section of the PDB database and can be used for interface calculations More
information about biounits
In which way are redundant interfaces excluded in the download section
The redundancy is excluded in two different ways by structure and by sequence The
sequential clustering is based on the Cd-hit program The structural clustering is defined by
the protein families and superfamilies of the SCOP classification The database classifies
proteins by domain architecture
Which settings in the download section are best for my own research
The selection of the datasets depends on the type of interactions (protein-protein or protein-
nucleic acids) and the level of diversity that is desired Sequence identity of maximal 50
results in a higher diversity than the setting to 95 The default settings include interfaces of
domain-domain interactions as well as interfaces between interacting chains All interfaces of
chains that were already treated by the SCOP domain interfaces are excluded by default
This procedure results in a high number of interfaces that are still diverse enough for
statistical analysis
What is meant by show conservation in Jmol
The conservation of protein sequences is defined by the mutation rates at each amino acid
7272019 Essential Info Notes-1
httpslidepdfcomreaderfullessential-info-notes-1 5357
position For JAIL this information was retrieved from ConSurf ConSurf is a derived
database merging structural and sequence information
Database scheme
Search for
PDB-Id eg 1aay
SCOP-Id eg d1az0a_
EC-Number eg 311
Accession number eg P03697
Protein name eg capsid protein
Search in the following interface types
DomainDomain (SCOP)
ChainChain
ProteinNucleic
BiounitBiounit
None
Fulltext search
Keyword
Search
Clear
SCOP text search
Keyword
Search
7272019 Essential Info Notes-1
httpslidepdfcomreaderfullessential-info-notes-1 5457
Consider only interfaces having the following location
intra inter dont care
Search
Clear
MMDB
Experimentally resolved structures of proteins RNA and DNA derived from the Protein DataBank (PDB) with value-addedfeatures such as explicit chemical graphs computationally
identified 3D domains (compact substructures) that are used to identify similar 3D structures as well as links to literature similar sequences information about chemicals bound to thestructures and more These connections make it possible for example tofind 3D structuresfor homologs of a protein sequence of interest then interactively view the sequence-structurerelationshipsactive sites bound chemicals journal articles and more
Three-dimensional structures are now known within many protein families and it is
quite likely in searching a sequence database that one will encounter a homolog
with known structure The goal of Entrezrsquos 3D-structure database is to make this
information and the functional annotation it can provide easily accessible tomolecular biologists To this end Entrezrsquos search engine provides three powerful
features (i) Sequence and structure neighbors one may select all sequences similar
to one of interest for example and link to any known 3D structures (ii) Links
between databases one may search by term matching in MEDLINE for example and
link to 3D structures reported in these articles (iii) Sequence and structure
visualization identifying a homolog with known structure one may view molecular-
graphic and alignment displays to infer approximate 3D structure In this article we
focus on two features of Entrezrsquos Molecular Modeling Database (MMDB) not
described previously links from individual biopolymer chains within 3D structuresto a systematic taxonomy of organisms represented in molecular databases and
links from individual chains (and compact 3D domains within them) to structure
neighbors other chains (and 3D domains) with similar 3D structure MMDB may be
accessed athttpwwwncbinlmnihgoventrezqueryfcgidb=Structure
7272019 Essential Info Notes-1
httpslidepdfcomreaderfullessential-info-notes-1 5557
SUPERFAMILY is a database of structural and functional annotation for all
proteins and genomes[1][2][3][4][5]
The SUPERFAMILY annotation is based on a collection of hidden Markov models which
represent structural protein domains at the SCOP superfamilylevel[6]
A superfamily groups
together domains which have an evolutionaryr elationship The annotation is produced by
scanning protein sequences from completely sequenced genomes against the hidden Markov
models
For each protein you can
Submit sequences for SCOP classification
View domain organisation sequence alignments and protein sequence details
For each genome you can
Examine superfamily assignments phylogenetic trees domain organisation lists and
networks
Check for over- and under-represented superfamilies within a genome
For each superfamily you can
Inspect SCOP classification functional annotation Gene Ontologyannotation InterPro
abstract and genome assignments
Explore taxonomic distribution of a superfamily across the tree of life
All annotation models and the database dump are freely available for download to everyone
Contents
[hide]
1 Purpose
2 See also
3 References
4 External links
Purpose[edit source]
SUPERFAMILY classifies amino acid sequences into known structural domains especially
into SCOP superfamilies The superfamilies are groups of proteins which have structural
evidence to support a common evolutionary ancestor but may not have detectable
sequence homology
7272019 Essential Info Notes-1
httpslidepdfcomreaderfullessential-info-notes-1 5657
Major Features
Sequence
search
Submit your protein or DNA sequence for SCOP superfamily and family
level classification
Keyword
search
Search for superfamily family or species names plus
sequence SCOP PDB or hidden Markov model IDs
Domain
assignments
Domain assignments alignments and architectures for completely
sequencedeukaryotic and prokaryotic organisms plus sequence
collections
Comparative
genomics
tools
Browse unusual (over- and under-represented) superfamilies and familiesadjacent domain pair lists and graphs unique domain pairs domain
combinations domain architecture co-occurrence networks and domain
distribution across taxonomic kingdoms for each organism
Genomestatistics
For each genome number of sequences number of sequences withassignment percentage of sequences with assignment percentage total
sequence coverage number of domains assigned number of superfamiliesassigned number of families assigned average superfamily size
percentage produced by duplication average sequence length averagelength matched number of domain pairs and number of unique domain
architectures
GeneOntology
Domain-centric Gene Ontology (GO) automatically annotated by HaiFang
Phenptype
Ontology
Domain-centric phenotypeanatomy ontology including Disease
Ontology Human Phenotype Mouse Phenotype Worm Phenotype Yeast
Phenotype Fly PhenotypeFly Anatomy Zebrafish Anatomy XenopusAnatomy Arabidopsis Plant
Superfamily
annotation
InterPro abstracts for 1052 superfamilies and Gene Ontology (GO)
annotation for 763 superfamilies
Functionalannotation
Functional annotation of SCOP 173 superfamilies by Christine Vogel
Phylogenetic
trees
Trees are generated using heuristic parsimony methods and are based on
protein domain architecture data for all genomes in SUPERFAMILY
Genome combinations or specific clades can be displayed as individual
trees
Similar
domain
architectures
Find the 10 domain architectures which are most similar to a domainarchitecture of interest
HiddenMarkov
models
Produce SCOP domain assignments for your sequences using theSUPERFAMILY models HMM visualisation by Martin Madera
eg model 0045110
Profile
comparison
Find remote domain matches when the HMM search fails to find asignificant match Profile comparison (PRC) for aligning and scoring two
profile hidden Markov models by Martin Madera
7272019 Essential Info Notes-1
httpslidepdfcomreaderfullessential-info-notes-1 5757
Web services Distributed Annotation Server and linking to SUPERFAMILY
Downloads Sequences assignments models MySQL database and scripts - updatedweekly
Jump to [ SUPERFAMILY description middot Major features middot Top of page ]
Recent news
3rd June 2013 genetrainer being launched at LeWeb conference
The SUPERFAMILY database provides protein domain assignments at the SCOP superfamily
level for the predicted protein sequences in over 400 completed genomes A superfamily groups
together domains of different families which have a common evolutionary ancestor based on
structural functional and sequence data SUPERFAMILY domain assignments are generated
using an expert curated set of profile hidden Markov models All models and structural
assignments are available for browsing and download from httpsupfamorg The web interface
includes services such as domain architectures and alignment details for all protein assignments
searchable domain combinations domain occurrence network visualization detection of over- or
under-represented superfamilies for a given genome by comparison with other genomes
assignment of manually submitted sequences and keyword searches In this update we describe
the SUPERFAMILY database and outline two major developments (i) incorporation of family
level assignments and (ii) a superfamily-level functional annotation The SUPERFAMILY
database can be used for general protein evolution and superfamily-specific studies genomic
annotation and structural genomics target suggestion and assessment
7272019 Essential Info Notes-1
httpslidepdfcomreaderfullessential-info-notes-1 4957
7272019 Essential Info Notes-1
httpslidepdfcomreaderfullessential-info-notes-1 5057
At the homologous superfamily level in CATH a more detailed analysis of enzyme functions showed
that the majority of homologous enzyme families in CATH (gt90) contained proteins for which the
first three EC identifiers were the same Considering those families where homologues have
significant sequence identity (ge20) after structural alignment 95 were found to have a single EC
identifier whilst for families where proteins have more than 30 sequence similarity we observed
that 98 had a single EC code
Although assigning function on the basis of homology is common practice it is clear that some
caution should be exercised particularly where there is little or no sequence similarity There are also
some clear examples where homologues with significant sequence similarity perform different
functions The role of gene recruitmentlsquo is especially clear in the eye lens proteins which function as
enzymes in other cellular environments but which are used as structural proteins in this context (17)
The extent of such gene recruitmentlsquo and context-sensitive function is really not known at this time
For enzymes it is clear that catalytic function can change and evolve usually to act on a different but
related substrate Similarly within the lipocalin family (CATH id 24013010) several proteins arefound with very similar structures which bind different fatty acids in the same region at the base of
the β-barrel (eg retinol bilin biotin)
Nearly half of the homologous families where two or more different EC numbers were observed
belong to the superfolds This suggests that if a new protein is assigned to a superfold family more
caution should be used when inheriting functional information as there appears to be greater toleranceto changes in sequence and ultimately function for these families However it is interesting to note
that many of these were TIM barrel or Rossmann folds These are superfolds in which the substrate or
ligand commonly binds in the same place This is in the base of the β-barrel for the TIMs and at the
crossover of the polypeptide chain for the doubly wound Rossmann structures
Previous SectionNext Section
Assignment of Function Through Structure
One of the reasons for determining structures is to derive more information to facilitate the assignmentof function From our analysis of proteins in CATH we suggest that structural data can help to assign
function in several ways
i The structural data allow recognition of more distant homologues compared withsequence data mdash in our analysis 83 of structures with novel sequences could be assigned ashomologues in this way (note that such assignment of function is again subject to the caveats
imposed by gene recruitmentlsquo discussed above)
ii The structural data allows detailed inspection of the functional site mdash to suggest if and
how the function may have evolved For example if an enzyme has evolved to act on a
different substrate the binding site may reveal or at least suggest possible changes in the
substrate
iii For the superfolds similarity of structure does not necessarily mean similarity of
function However the active sitebinding sites are often conserved eg in the TIM barrel or
Rossmann fold structures the ligand always binds at the same end of the barrel or sheet
iv Some methods have already been developed and will increasingly be the focus of attention over the next few years which aim to predict function ab initio from structure For
example enzymes can often be identified by the presence of a major cleft which also locates
the active site (18) Similarly critical surface patches which are used for molecular
recognition in binding other proteins or ligands may be identified using knowledge-basedapproaches (1920)
In summary extrapolating the data from Figure 4 to a new genome we can expect that of the 54 ndash 70
of sequences which currently have no obvious sequence matches in the PDB we will find nearly 80 ndash 90 to be homologous to a known family using the structural data alone For the singlet folds this
7272019 Essential Info Notes-1
httpslidepdfcomreaderfullessential-info-notes-1 5157
will almost certainly reveal some clues to the function For the superfolds some folds will reveal
information on the functional class (eg enzyme for TIM barrels) or the location of the active site if
not the specific function Only 10 ndash20 will be expected to be novellsquo folds For these the ab
initio methods referred to above may provide some clues to guide experiments Therefore it is clear
that determining structures as part of a structural genomicslsquo initiative for example will make a
major contribution to interpreting genome data
Jail
Just another interface library
Interfaces of macromolecules are a valuable basis to analyse the processof molecular recognition JAIL classifies not only the interfaces betweendomain architectures but also those between protein chains and thosebetween proteins and nucleic acids
Of course not
Gt-alphaGi-alpha chimera (PDB-ID 1GOT) Interfaces of 1GOT
Interacting proteins are difficult to crystallize and are rarely present in the Protein Data Bank (PDB).
Nevertheless, it is essential to analyse the interacting parts of proteins in order to understand the
process of protein-protein docking.
To overcome this problem we have built the JAIL database. Since interacting domains
exhibit structural features similar to those of interacting proteins, all known interfaces between
interacting domains of the SCOP database were extracted and classified in JAIL.
Only a part of all protein structures is included in SCOP; in particular, new PDB entries are
not yet annotated. To overcome this, all interfaces between protein chains were additionally
calculated and included in the database. This type of interface also comprises
the interacting parts of the assumed biological units. The last important type of interface
provided here is composed of the interacting parts between proteins and nucleic acids.
Overall, the data set consists of about 180,000 interfaces.
JAIL is a convenient tool for browsing the interface library and analysing single
interfaces. However, more general questions require large-scale analysis. For this purpose,
a detailed form enables the compilation of comprehensive non-redundant data sets for
download.
How is an interface defined?
A complete residue is part of an interface if at least one atom of the amino acid is located
within 4.5 Ångström of any atom of the interacting domain or chain. One part of
an interface must consist of at least 5 C-alpha atoms in the case of protein chains. In the
case of nucleic acids, the relevant atoms are calculated on the basis of the phosphorus atoms
of the RNA/DNA backbone.
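The distance rule above can be sketched in a few lines of Python. This is only an illustration of the definition, not JAIL's own code: the function names, the dictionary layout (residue id mapped to a list of atom coordinates) and the reading of "at least 5 C-alpha atoms" as "at least 5 residues on each side" are assumptions made for the example.

```python
import math

CUTOFF = 4.5       # Ångström, the distance threshold quoted in the text
MIN_RESIDUES = 5   # assumed reading: one C-alpha per residue, 5 per interface part


def interface_residues(side_a, side_b, cutoff=CUTOFF):
    """Residue ids of side_a with at least one atom within `cutoff` of any atom of side_b.

    Each side is a dict: residue id -> list of (x, y, z) atom coordinates.
    """
    atoms_b = [atom for atoms in side_b.values() for atom in atoms]
    hits = set()
    for res_id, atoms in side_a.items():
        # The whole residue counts as soon as a single atom is close enough.
        if any(math.dist(a, b) <= cutoff for a in atoms for b in atoms_b):
            hits.add(res_id)
    return hits


def find_interface(side_a, side_b, cutoff=CUTOFF, min_residues=MIN_RESIDUES):
    """Return (part_a, part_b), or None if either part is smaller than min_residues."""
    part_a = interface_residues(side_a, side_b, cutoff)
    part_b = interface_residues(side_b, side_a, cutoff)
    if len(part_a) < min_residues or len(part_b) < min_residues:
        return None
    return part_a, part_b
```

The all-against-all distance loop is quadratic in the number of atoms, which is acceptable for a single pair of chains; a spatial grid or k-d tree would be the usual choice for a database-scale run.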
What are biounits?
The primary coordinate file deposited in the PDB generally contains one asymmetric unit.
The asymmetric unit is the smallest portion of a crystal structure to which crystallographic
symmetry can be applied to generate one unit cell. The biological molecule (biounit) is believed
to be the functional unit of the protein. Frequently, those units can be assumed or calculated
when additional information is available. The biological units of many proteins are deposited
in a separate section of the PDB database and can be used for interface calculations.
In which way are redundant interfaces excluded in the download section?
Redundancy is excluded in two different ways: by structure and by sequence. The
sequence clustering is based on the CD-HIT program. The structural clustering is defined by
the protein families and superfamilies of the SCOP classification. The database classifies
proteins by domain architecture.
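To make the idea of sequence-based redundancy removal concrete, here is a minimal sketch of greedy incremental clustering, the general strategy CD-HIT follows (longest sequence first, each later sequence joins the first representative it is similar enough to). It is not CD-HIT itself: the identity measure below uses Python's difflib as a crude stand-in for a real alignment, and all names are hypothetical.

```python
from difflib import SequenceMatcher


def identity(a, b):
    """Crude stand-in for alignment identity: matched characters / shorter length."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / min(len(a), len(b))


def greedy_cluster(seqs, threshold):
    """Cluster sequences greedily.

    seqs: dict name -> sequence. Returns a list of clusters; each cluster is a
    list of names whose first element is the representative.
    """
    clusters = []
    # Longest sequences are processed first and become representatives.
    for name in sorted(seqs, key=lambda n: len(seqs[n]), reverse=True):
        for cluster in clusters:
            if identity(seqs[name], seqs[cluster[0]]) >= threshold:
                cluster.append(name)
                break
        else:
            clusters.append([name])  # no representative is similar enough
    return clusters


def representatives(clusters):
    """One name per cluster: the non-redundant set."""
    return [cluster[0] for cluster in clusters]
```

With a threshold of 0.95 only near-identical sequences collapse into one cluster; lowering it to 0.5 merges more distant relatives, so fewer but more diverse representatives remain.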
Which settings in the download section are best for my own research?
The selection of the datasets depends on the type of interaction (protein-protein or
protein-nucleic acid) and the level of diversity that is desired. A maximum sequence identity of 50%
results in a higher diversity than a setting of 95%. The default settings include interfaces of
domain-domain interactions as well as interfaces between interacting chains. All interfaces of
chains that are already covered by the SCOP domain interfaces are excluded by default.
This procedure results in a high number of interfaces that are still diverse enough for
statistical analysis.
What is meant by "show conservation" in Jmol?
The conservation of protein sequences is defined by the mutation rates at each amino acid
position. For JAIL this information was retrieved from ConSurf. ConSurf is a derived
database merging structural and sequence information.
Database scheme
Search for:
PDB-Id, e.g. 1aay
SCOP-Id, e.g. d1az0a_
EC-Number, e.g. 3.1.1
Accession number, e.g. P03697
Protein name, e.g. capsid protein
Search in the following interface types:
Domain/Domain (SCOP)
Chain/Chain
Protein/Nucleic
Biounit/Biounit
None
Fulltext search: Keyword
SCOP text search: Keyword
Consider only interfaces having the following location: intra, inter, don't care
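The identifier types accepted by the search form differ enough in shape that a query string can usually be sorted into one of them by pattern alone. The sketch below shows one way to do that. The regular expressions are simplified assumptions derived from the examples in the form (and from the common shapes of PDB, SCOP, EC and UniProt-style identifiers); they are not taken from JAIL, and `classify_query` is a hypothetical helper.

```python
import re

# Simplified, illustrative patterns; checked in order, first full match wins.
PATTERNS = [
    # SCOP domain id: 'd' + 4-character PDB id + chain + domain, e.g. d1az0a_
    ("SCOP-Id", re.compile(r"d[0-9][a-z0-9]{3}[a-z0-9_.][a-z0-9_]", re.IGNORECASE)),
    # PDB id: one digit followed by three alphanumerics, e.g. 1aay
    ("PDB-Id", re.compile(r"[0-9][A-Za-z0-9]{3}")),
    # EC number, possibly partial: 3.1.1 or 3.1.1.1 or 3.1.-
    ("EC-Number", re.compile(r"[1-9][0-9]*(\.([0-9]+|-)){1,3}")),
    # Six-character accession of the classic O/P/Q form, e.g. P03697
    ("Accession number", re.compile(r"[OPQ][0-9][A-Z0-9]{3}[0-9]")),
]


def classify_query(text):
    """Guess which search field a query string belongs to."""
    query = text.strip()
    for label, pattern in PATTERNS:
        if pattern.fullmatch(query):
            return label
    # Anything that fits no identifier pattern is treated as free text.
    return "Protein name"
```

Applied to the examples in the form, `classify_query("1aay")` gives "PDB-Id" and `classify_query("capsid protein")` falls through to "Protein name".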
MMDB
MMDB contains experimentally resolved structures of proteins, RNA and DNA, derived from the
Protein Data Bank (PDB), with value-added features such as explicit chemical graphs and
computationally identified 3D domains (compact substructures) that are used to identify similar
3D structures, as well as links to literature, similar sequences, information about chemicals
bound to the structures, and more. These connections make it possible, for example, to find 3D
structures for homologs of a protein sequence of interest, then interactively view the
sequence-structure relationships, active sites, bound chemicals, journal articles and more.
Three-dimensional structures are now known within many protein families, and it is
quite likely, in searching a sequence database, that one will encounter a homolog
with known structure. The goal of Entrez's 3D-structure database is to make this
information, and the functional annotation it can provide, easily accessible to
molecular biologists. To this end, Entrez's search engine provides three powerful
features: (i) Sequence and structure neighbors: one may select all sequences similar
to one of interest, for example, and link to any known 3D structures. (ii) Links
between databases: one may search by term matching in MEDLINE, for example, and
link to 3D structures reported in these articles. (iii) Sequence and structure
visualization: identifying a homolog with known structure, one may view
molecular-graphic and alignment displays to infer approximate 3D structure. In this article we
focus on two features of Entrez's Molecular Modeling Database (MMDB) not
described previously: links from individual biopolymer chains within 3D structures
to a systematic taxonomy of organisms represented in molecular databases, and
links from individual chains (and compact 3D domains within them) to structure
neighbors, other chains (and 3D domains) with similar 3D structure. MMDB may be
accessed at http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Structure
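For scripted access, a term search against the structure database can be expressed as a URL. The sketch below builds such a URL for NCBI's E-utilities ESearch service rather than the web address quoted above; E-utilities is not mentioned in these notes, so treat the endpoint choice and the helper name `structure_search_url` as assumptions of this example. The code only constructs the URL and does not contact the server.

```python
from urllib.parse import urlencode

# E-utilities ESearch endpoint (assumption: used here instead of the web form).
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def structure_search_url(term, retmax=20):
    """Build an ESearch URL that looks up `term` in the Entrez structure database."""
    params = {
        "db": "structure",   # Entrez database holding MMDB records
        "term": term,        # free-text or fielded query
        "retmax": retmax,    # maximum number of ids to return
        "retmode": "json",   # ask for a JSON response
    }
    return f"{EUTILS_ESEARCH}?{urlencode(params)}"
```

Fetching the resulting URL (for example with urllib.request) would return a list of structure record ids, which can then be followed to neighbours and linked databases as described above.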
SUPERFAMILY is a database of structural and functional annotation for all
proteins and genomes.[1][2][3][4][5]
The SUPERFAMILY annotation is based on a collection of hidden Markov models, which
represent structural protein domains at the SCOP superfamily level.[6] A superfamily groups
together domains which have an evolutionary relationship. The annotation is produced by
scanning protein sequences from completely sequenced genomes against the hidden Markov
models.
For each protein you can:
Submit sequences for SCOP classification
View domain organisation, sequence alignments and protein sequence details
For each genome you can:
Examine superfamily assignments, phylogenetic trees, domain organisation lists and networks
Check for over- and under-represented superfamilies within a genome
For each superfamily you can:
Inspect SCOP classification, functional annotation, Gene Ontology annotation, InterPro
abstract and genome assignments
Explore taxonomic distribution of a superfamily across the tree of life
All annotation, models and the database dump are freely available for download to everyone.
Purpose
SUPERFAMILY classifies amino acid sequences into known structural domains, especially
into SCOP superfamilies. The superfamilies are groups of proteins which have structural
evidence to support a common evolutionary ancestor but may not have detectable
sequence homology.
Major Features
Sequence search: Submit your protein or DNA sequence for SCOP superfamily and family level classification.
Keyword search: Search for superfamily, family or species names, plus sequence, SCOP, PDB or hidden Markov model IDs.
Domain assignments: Domain assignments, alignments and architectures for completely sequenced eukaryotic and prokaryotic organisms, plus sequence collections.
Comparative genomics tools: Browse unusual (over- and under-represented) superfamilies and families, adjacent domain pair lists and graphs, unique domain pairs, domain combinations, domain architecture co-occurrence networks and domain distribution across taxonomic kingdoms for each organism.
Genome statistics: For each genome: number of sequences, number of sequences with assignment, percentage of sequences with assignment, percentage total sequence coverage, number of domains assigned, number of superfamilies assigned, number of families assigned, average superfamily size, percentage produced by duplication, average sequence length, average length matched, number of domain pairs and number of unique domain architectures.
Gene Ontology: Domain-centric Gene Ontology (GO), automatically annotated, by Hai Fang.
Phenotype Ontology: Domain-centric phenotype/anatomy ontology, including Disease Ontology, Human Phenotype, Mouse Phenotype, Worm Phenotype, Yeast Phenotype, Fly Phenotype, Fly Anatomy, Zebrafish Anatomy, Xenopus Anatomy, Arabidopsis Plant.
Superfamily annotation: InterPro abstracts for 1052 superfamilies and Gene Ontology (GO) annotation for 763 superfamilies.
Functional annotation: Functional annotation of SCOP 1.73 superfamilies by Christine Vogel.
Phylogenetic trees: Trees are generated using heuristic parsimony methods and are based on protein domain architecture data for all genomes in SUPERFAMILY. Genome combinations or specific clades can be displayed as individual trees.
Similar domain architectures: Find the 10 domain architectures which are most similar to a domain architecture of interest.
Hidden Markov models: Produce SCOP domain assignments for your sequences using the SUPERFAMILY models. HMM visualisation by Martin Madera, e.g. model 0045110.
Profile comparison: Find remote domain matches when the HMM search fails to find a significant match. Profile comparison (PRC), for aligning and scoring two profile hidden Markov models, by Martin Madera.
Web services: Distributed Annotation Server and linking to SUPERFAMILY.
Downloads: Sequences, assignments, models, MySQL database and scripts, updated weekly.
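Several of the genome statistics listed above are simple ratios over the domain assignments. The sketch below computes a handful of them from a toy input. The input layout, the function name and the exact definitions (for instance, coverage taken as residues inside any assigned domain divided by all residues) are assumptions for illustration and may differ from how SUPERFAMILY computes its figures.

```python
def genome_statistics(sequences, assignments):
    """Compute a few per-genome summary figures.

    sequences:   dict seq_id -> sequence length
    assignments: list of (seq_id, superfamily, start, end), residues 1-based inclusive
    """
    if not sequences:
        raise ValueError("at least one sequence is required")

    assigned_ids = {seq_id for seq_id, _, _, _ in assignments}

    # Collect covered positions per sequence so overlapping domains count once.
    covered = {}
    for seq_id, _, start, end in assignments:
        covered.setdefault(seq_id, set()).update(range(start, end + 1))

    total_residues = sum(sequences.values())
    covered_residues = sum(len(positions) for positions in covered.values())
    n = len(sequences)

    return {
        "number_of_sequences": n,
        "sequences_with_assignment": len(assigned_ids),
        "percent_sequences_with_assignment": 100.0 * len(assigned_ids) / n,
        "percent_total_sequence_coverage": 100.0 * covered_residues / total_residues,
        "domains_assigned": len(assignments),
        "superfamilies_assigned": len({sf for _, sf, _, _ in assignments}),
        "average_sequence_length": total_residues / n,
    }
```

For a genome of three proteins (100, 200 and 100 residues) with three assigned domains covering 250 residues on two of the proteins, this gives two thirds of sequences with an assignment and 62.5% total sequence coverage.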
The SUPERFAMILY database provides protein domain assignments, at the SCOP superfamily
level, for the predicted protein sequences in over 400 completed genomes. A superfamily groups
together domains of different families which have a common evolutionary ancestor, based on
structural, functional and sequence data. SUPERFAMILY domain assignments are generated
using an expert-curated set of profile hidden Markov models. All models and structural
assignments are available for browsing and download from http://supfam.org. The web interface
includes services such as domain architectures and alignment details for all protein assignments,
searchable domain combinations, domain occurrence network visualization, detection of over- or
under-represented superfamilies for a given genome by comparison with other genomes,
assignment of manually submitted sequences and keyword searches. In this update we describe
the SUPERFAMILY database and outline two major developments: (i) incorporation of family
level assignments and (ii) a superfamily-level functional annotation. The SUPERFAMILY
database can be used for general protein evolution and superfamily-specific studies, genomic
annotation, and structural genomics target suggestion and assessment.
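The notes do not say which statistic SUPERFAMILY uses to detect over- or under-represented superfamilies. As a rough illustration of "comparison with other genomes", the sketch below uses a plain z-score: how many standard deviations the superfamily's share of domains in one genome lies from its mean share in the comparison genomes. The z-score, the cutoff of 2.0 and both function names are assumptions of this example, not the database's method.

```python
from statistics import mean, pstdev


def representation_z_scores(target, others):
    """z-score per superfamily for one genome against a set of comparison genomes.

    target: dict superfamily -> fraction of the genome's domains in that superfamily
    others: list of such dicts, one per comparison genome (at least two)
    """
    if len(others) < 2:
        raise ValueError("need at least two comparison genomes")
    scores = {}
    for sf in set(target).union(*others):
        background = [genome.get(sf, 0.0) for genome in others]
        spread = pstdev(background)
        if spread == 0:
            continue  # no variation among comparison genomes: z-score undefined
        scores[sf] = (target.get(sf, 0.0) - mean(background)) / spread
    return scores


def flag_unusual(scores, cutoff=2.0):
    """Split superfamilies into over- and under-represented lists by |z| >= cutoff."""
    over = sorted(sf for sf, z in scores.items() if z >= cutoff)
    under = sorted(sf for sf, z in scores.items() if z <= -cutoff)
    return over, under
```

Using fractions rather than raw counts keeps genomes of different sizes comparable; superfamilies with identical shares in all comparison genomes are skipped because the score is undefined for them.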
iii For the superfolds similarity of structure does not necessarily mean similarity of
function However the active sitebinding sites are often conserved eg in the TIM barrel or
Rossmann fold structures the ligand always binds at the same end of the barrel or sheet
iv Some methods have already been developed and will increasingly be the focus of attention over the next few years which aim to predict function ab initio from structure For
example enzymes can often be identified by the presence of a major cleft which also locates
the active site (18) Similarly critical surface patches which are used for molecular
recognition in binding other proteins or ligands may be identified using knowledge-basedapproaches (1920)
In summary extrapolating the data from Figure 4 to a new genome we can expect that of the 54 ndash 70
of sequences which currently have no obvious sequence matches in the PDB we will find nearly 80 ndash 90 to be homologous to a known family using the structural data alone For the singlet folds this
7272019 Essential Info Notes-1
httpslidepdfcomreaderfullessential-info-notes-1 5157
will almost certainly reveal some clues to the function For the superfolds some folds will reveal
information on the functional class (eg enzyme for TIM barrels) or the location of the active site if
not the specific function Only 10 ndash20 will be expected to be novellsquo folds For these the ab
initio methods referred to above may provide some clues to guide experiments Therefore it is clear
that determining structures as part of a structural genomicslsquo initiative for example will make a
major contribution to interpreting genome data
Jail
Just another interface library
Interfaces of macromolecules are a valuable basis to analyse the processof molecular recognition JAIL classifies not only the interfaces betweendomain architectures but also those between protein chains and thosebetween proteins and nucleic acids
Of course not
Gt-alphaGi-alpha chimera (PDB-ID 1GOT) Interfaces of 1GOT
Interacting proteins are difficult to crystallize and rarely present within the Protein Data Base
Nevertheless it is essential to analyse the interacting parts of the proteins to understand the
process of protein-protein docking
To overcome this problem we have built up the JAIL database Since interacting domains
exhibit similiar structural features than proteins all known interfaces between interacting
domains of the SCOP database were extracted and classified in JAIL
Only a part of all protein structures are included in SCOP Particularly new PDB entries are
not yet annotated To overcome this problem additionally all interfaces between protein
chains were calculated and included in the database This type of interface also comprises
the interacting parts of the assumed biological units The last important type of interfaces
provided here is composed of the interacting parts between proteins and nucleic acids
Overall the data set consists of about 180000 interfaces
JAIL is a comfortable tool to browse through the interface library and to analyze single
interfaces However more general questions require large-scale analysis For this purpose
a detailed form enables the compiling of comprehensive non redundant data sets for
download
7272019 Essential Info Notes-1
httpslidepdfcomreaderfullessential-info-notes-1 5257
How is an interface defined
A complete residue is part of an interface if at least one atom of the aminoacid is located
within a range of 45 Angstroem of any atom of the interacting domain or chain One part of
an interface must consist of at least 5 C-alpha atoms in the case of protein chains In thecase of nucleic acids the relevant atoms are calculated on the basis of the phosphor atoms
of the RNADNA-backbone
What are biounits
The primary coordinate file deposited in the PDB generally contains one asymmetric unit
The asymmetric unit is the smallest portion of a crystal structure to which crystallographic
symmetry can be applied to generate one cell The biological molecule (biounit) is believed
to be the functional unit of the protein Frequently those units can be assumed or calculatedwhen additional information is available The biological units of many proteins are deposited
in a separate section of the PDB database and can be used for interface calculations More
information about biounits
In which way are redundant interfaces excluded in the download section
The redundancy is excluded in two different ways by structure and by sequence The
sequential clustering is based on the Cd-hit program The structural clustering is defined by
the protein families and superfamilies of the SCOP classification The database classifies
proteins by domain architecture
Which settings in the download section are best for my own research
The selection of the datasets depends on the type of interactions (protein-protein or protein-
nucleic acids) and the level of diversity that is desired Sequence identity of maximal 50
results in a higher diversity than the setting to 95 The default settings include interfaces of
domain-domain interactions as well as interfaces between interacting chains All interfaces of
chains that were already treated by the SCOP domain interfaces are excluded by default
This procedure results in a high number of interfaces that are still diverse enough for
statistical analysis
What is meant by show conservation in Jmol
The conservation of protein sequences is defined by the mutation rates at each amino acid
7272019 Essential Info Notes-1
httpslidepdfcomreaderfullessential-info-notes-1 5357
position For JAIL this information was retrieved from ConSurf ConSurf is a derived
database merging structural and sequence information
Database scheme
Search for
PDB-Id eg 1aay
SCOP-Id eg d1az0a_
EC-Number eg 311
Accession number eg P03697
Protein name eg capsid protein
Search in the following interface types
DomainDomain (SCOP)
ChainChain
ProteinNucleic
BiounitBiounit
None
Fulltext search
Keyword
Search
Clear
SCOP text search
Keyword
Search
7272019 Essential Info Notes-1
httpslidepdfcomreaderfullessential-info-notes-1 5457
Consider only interfaces having the following location
intra inter dont care
Search
Clear
MMDB
Experimentally resolved structures of proteins RNA and DNA derived from the Protein DataBank (PDB) with value-addedfeatures such as explicit chemical graphs computationally
identified 3D domains (compact substructures) that are used to identify similar 3D structures as well as links to literature similar sequences information about chemicals bound to thestructures and more These connections make it possible for example tofind 3D structuresfor homologs of a protein sequence of interest then interactively view the sequence-structurerelationshipsactive sites bound chemicals journal articles and more
Three-dimensional structures are now known within many protein families and it is
quite likely in searching a sequence database that one will encounter a homolog
with known structure The goal of Entrezrsquos 3D-structure database is to make this
information and the functional annotation it can provide easily accessible tomolecular biologists To this end Entrezrsquos search engine provides three powerful
features (i) Sequence and structure neighbors one may select all sequences similar
to one of interest for example and link to any known 3D structures (ii) Links
between databases one may search by term matching in MEDLINE for example and
link to 3D structures reported in these articles (iii) Sequence and structure
visualization identifying a homolog with known structure one may view molecular-
graphic and alignment displays to infer approximate 3D structure In this article we
focus on two features of Entrezrsquos Molecular Modeling Database (MMDB) not
described previously links from individual biopolymer chains within 3D structuresto a systematic taxonomy of organisms represented in molecular databases and
links from individual chains (and compact 3D domains within them) to structure
neighbors other chains (and 3D domains) with similar 3D structure MMDB may be
accessed athttpwwwncbinlmnihgoventrezqueryfcgidb=Structure
7272019 Essential Info Notes-1
httpslidepdfcomreaderfullessential-info-notes-1 5557
SUPERFAMILY is a database of structural and functional annotation for all
proteins and genomes[1][2][3][4][5]
The SUPERFAMILY annotation is based on a collection of hidden Markov models which
represent structural protein domains at the SCOP superfamilylevel[6]
A superfamily groups
together domains which have an evolutionaryr elationship The annotation is produced by
scanning protein sequences from completely sequenced genomes against the hidden Markov
models
For each protein you can
Submit sequences for SCOP classification
View domain organisation sequence alignments and protein sequence details
For each genome you can
Examine superfamily assignments phylogenetic trees domain organisation lists and
networks
Check for over- and under-represented superfamilies within a genome
For each superfamily you can
Inspect SCOP classification functional annotation Gene Ontologyannotation InterPro
abstract and genome assignments
Explore taxonomic distribution of a superfamily across the tree of life
All annotation models and the database dump are freely available for download to everyone
Contents
[hide]
1 Purpose
2 See also
3 References
4 External links
Purpose[edit source]
SUPERFAMILY classifies amino acid sequences into known structural domains especially
into SCOP superfamilies The superfamilies are groups of proteins which have structural
evidence to support a common evolutionary ancestor but may not have detectable
sequence homology
7272019 Essential Info Notes-1
httpslidepdfcomreaderfullessential-info-notes-1 5657
Major Features
Sequence
search
Submit your protein or DNA sequence for SCOP superfamily and family
level classification
Keyword
search
Search for superfamily family or species names plus
sequence SCOP PDB or hidden Markov model IDs
Domain
assignments
Domain assignments alignments and architectures for completely
sequencedeukaryotic and prokaryotic organisms plus sequence
collections
Comparative
genomics
tools
Browse unusual (over- and under-represented) superfamilies and familiesadjacent domain pair lists and graphs unique domain pairs domain
combinations domain architecture co-occurrence networks and domain
distribution across taxonomic kingdoms for each organism
Genomestatistics
For each genome number of sequences number of sequences withassignment percentage of sequences with assignment percentage total
sequence coverage number of domains assigned number of superfamiliesassigned number of families assigned average superfamily size
percentage produced by duplication average sequence length averagelength matched number of domain pairs and number of unique domain
architectures
GeneOntology
Domain-centric Gene Ontology (GO) automatically annotated by HaiFang
Phenptype
Ontology
Domain-centric phenotypeanatomy ontology including Disease
Ontology Human Phenotype Mouse Phenotype Worm Phenotype Yeast
Phenotype Fly PhenotypeFly Anatomy Zebrafish Anatomy XenopusAnatomy Arabidopsis Plant
Superfamily
annotation
InterPro abstracts for 1052 superfamilies and Gene Ontology (GO)
annotation for 763 superfamilies
Functionalannotation
Functional annotation of SCOP 173 superfamilies by Christine Vogel
Phylogenetic
trees
Trees are generated using heuristic parsimony methods and are based on
protein domain architecture data for all genomes in SUPERFAMILY
Genome combinations or specific clades can be displayed as individual
trees
Similar
domain
architectures
Find the 10 domain architectures which are most similar to a domainarchitecture of interest
HiddenMarkov
models
Produce SCOP domain assignments for your sequences using theSUPERFAMILY models HMM visualisation by Martin Madera
eg model 0045110
Profile
comparison
Find remote domain matches when the HMM search fails to find asignificant match Profile comparison (PRC) for aligning and scoring two
profile hidden Markov models by Martin Madera
7272019 Essential Info Notes-1
httpslidepdfcomreaderfullessential-info-notes-1 5757
Web services Distributed Annotation Server and linking to SUPERFAMILY
Downloads Sequences assignments models MySQL database and scripts - updatedweekly
Jump to [ SUPERFAMILY description middot Major features middot Top of page ]
Recent news
3rd June 2013 genetrainer being launched at LeWeb conference
The SUPERFAMILY database provides protein domain assignments at the SCOP superfamily
level for the predicted protein sequences in over 400 completed genomes A superfamily groups
together domains of different families which have a common evolutionary ancestor based on
structural functional and sequence data SUPERFAMILY domain assignments are generated
using an expert curated set of profile hidden Markov models All models and structural
assignments are available for browsing and download from httpsupfamorg The web interface
includes services such as domain architectures and alignment details for all protein assignments
searchable domain combinations domain occurrence network visualization detection of over- or
under-represented superfamilies for a given genome by comparison with other genomes
assignment of manually submitted sequences and keyword searches In this update we describe
the SUPERFAMILY database and outline two major developments (i) incorporation of family
level assignments and (ii) a superfamily-level functional annotation The SUPERFAMILY
database can be used for general protein evolution and superfamily-specific studies genomic
annotation and structural genomics target suggestion and assessment
7272019 Essential Info Notes-1
httpslidepdfcomreaderfullessential-info-notes-1 5157
will almost certainly reveal some clues to the function For the superfolds some folds will reveal
information on the functional class (eg enzyme for TIM barrels) or the location of the active site if
not the specific function Only 10 ndash20 will be expected to be novellsquo folds For these the ab
initio methods referred to above may provide some clues to guide experiments Therefore it is clear
that determining structures as part of a structural genomicslsquo initiative for example will make a
major contribution to interpreting genome data
Jail
Just another interface library
Interfaces of macromolecules are a valuable basis to analyse the processof molecular recognition JAIL classifies not only the interfaces betweendomain architectures but also those between protein chains and thosebetween proteins and nucleic acids
Of course not
Gt-alphaGi-alpha chimera (PDB-ID 1GOT) Interfaces of 1GOT
Interacting proteins are difficult to crystallize and rarely present within the Protein Data Base
Nevertheless it is essential to analyse the interacting parts of the proteins to understand the
process of protein-protein docking
To overcome this problem we have built up the JAIL database Since interacting domains
exhibit similiar structural features than proteins all known interfaces between interacting
domains of the SCOP database were extracted and classified in JAIL
Only a part of all protein structures are included in SCOP Particularly new PDB entries are
not yet annotated To overcome this problem additionally all interfaces between protein
chains were calculated and included in the database This type of interface also comprises
the interacting parts of the assumed biological units The last important type of interfaces
provided here is composed of the interacting parts between proteins and nucleic acids
Overall the data set consists of about 180000 interfaces
JAIL is a comfortable tool to browse through the interface library and to analyze single
interfaces However more general questions require large-scale analysis For this purpose
a detailed form enables the compiling of comprehensive non redundant data sets for
download
7272019 Essential Info Notes-1
httpslidepdfcomreaderfullessential-info-notes-1 5257
How is an interface defined
A complete residue is part of an interface if at least one atom of the aminoacid is located
within a range of 45 Angstroem of any atom of the interacting domain or chain One part of
an interface must consist of at least 5 C-alpha atoms in the case of protein chains In thecase of nucleic acids the relevant atoms are calculated on the basis of the phosphor atoms
of the RNADNA-backbone
What are biounits
The primary coordinate file deposited in the PDB generally contains one asymmetric unit
The asymmetric unit is the smallest portion of a crystal structure to which crystallographic
symmetry can be applied to generate one cell The biological molecule (biounit) is believed
to be the functional unit of the protein Frequently those units can be assumed or calculatedwhen additional information is available The biological units of many proteins are deposited
in a separate section of the PDB database and can be used for interface calculations More
information about biounits
In which way are redundant interfaces excluded in the download section?
Redundancy is excluded in two different ways: by structure and by sequence. The
sequence clustering is based on the CD-HIT program. The structural clustering is defined by
the protein families and superfamilies of the SCOP classification, a database that classifies
proteins by domain architecture.
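To make the idea of sequence-based redundancy removal concrete, here is a minimal Python sketch of greedy clustering at an identity threshold. It is a deliberately simplified stand-in and not the CD-HIT algorithm: identity is computed position by position without any alignment, and the function names and the toy sequences are assumptions for illustration.

```python
def ungapped_identity(a, b):
    """Fraction of identical positions over the shorter sequence.

    Simplification: sequences are compared left-anchored with no gaps,
    which a real tool would replace with a proper alignment.
    """
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / n

def greedy_representatives(sequences, threshold):
    """Keep a sequence only if it is below `threshold` identity
    to every representative chosen so far (longest sequences first)."""
    reps = []
    for seq in sorted(sequences, key=len, reverse=True):
        if not any(ungapped_identity(seq, rep) >= threshold for rep in reps):
            reps.append(seq)
    return reps

# Toy sequences (invented for the example).
seqs = ["AAAAAAAAAA", "AAAAAAAAAC", "AAAAACCCCC", "CCCCCCCCCC"]
print(greedy_representatives(seqs, 0.95))  # strict threshold: near-duplicates survive
print(greedy_representatives(seqs, 0.50))  # loose threshold: fewer, more diverse representatives
```

A lower identity threshold merges more sequences into each cluster, so the remaining representatives are fewer and more different from one another.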
Which settings in the download section are best for my own research?
The selection of datasets depends on the type of interaction (protein-protein or
protein-nucleic acid) and the level of diversity desired. A maximum sequence identity of 50%
results in higher diversity than a setting of 95%. The default settings include interfaces of
domain-domain interactions as well as interfaces between interacting chains. All interfaces of
chains that were already covered by the SCOP domain interfaces are excluded by default.
This procedure results in a large number of interfaces that are still diverse enough for
statistical analysis.
What is meant by "show conservation" in Jmol?
The conservation of protein sequences is defined by the mutation rates at each amino acid
position. For JAIL this information was retrieved from ConSurf. ConSurf is a derived
database merging structural and sequence information.
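As a rough illustration of per-position conservation, the Python sketch below scores each column of a multiple sequence alignment by the share of sequences carrying the most common residue. This is a much simpler measure than the evolutionary rate estimates ConSurf provides, and it is not how JAIL or ConSurf compute their values; the function name and the toy alignment are assumptions.

```python
from collections import Counter

def column_conservation(alignment):
    """For each alignment column, return the fraction of sequences that
    carry the most common residue. Gaps ('-') never count as a residue."""
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    scores = []
    for column in zip(*alignment):
        residues = [r for r in column if r != "-"]
        if not residues:
            scores.append(0.0)
            continue
        top_count = Counter(residues).most_common(1)[0][1]
        scores.append(top_count / len(column))
    return scores

# Toy alignment (invented): the first column is fully conserved.
alignment = ["ACDE", "ACDF", "AGD-", "ACHE"]
print(column_conservation(alignment))
```

Positions with a score close to 1 rarely change across the aligned sequences, which is the intuition behind colouring a structure by conservation.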
[Figure: Database scheme]
Search
The JAIL search form accepts any of the following:
PDB-Id, e.g. 1aay
SCOP-Id, e.g. d1az0a_
EC-Number, e.g. 311
Accession number, e.g. P03697
Protein name, e.g. capsid protein
The search can be restricted to the following interface types: Domain/Domain (SCOP),
Chain/Chain, Protein/Nucleic, Biounit/Biounit, or None. A full-text keyword search and a
SCOP text search are also offered, and results can be limited to interfaces having a given
location (intra, inter, or don't care).
MMDB
MMDB contains experimentally resolved structures of proteins, RNA, and DNA derived from the
Protein Data Bank (PDB), with value-added features such as explicit chemical graphs and
computationally identified 3D domains (compact substructures) that are used to identify
similar 3D structures, as well as links to literature, similar sequences, information about
chemicals bound to the structures, and more. These connections make it possible, for example,
to find 3D structures for homologs of a protein sequence of interest, then interactively view
the sequence-structure relationships, active sites, bound chemicals, journal articles, and more.
Three-dimensional structures are now known within many protein families, and it is
quite likely, in searching a sequence database, that one will encounter a homolog
with known structure. The goal of Entrez's 3D-structure database is to make this
information, and the functional annotation it can provide, easily accessible to
molecular biologists. To this end, Entrez's search engine provides three powerful
features: (i) Sequence and structure neighbors: one may select all sequences similar
to one of interest, for example, and link to any known 3D structures. (ii) Links
between databases: one may search by term matching in MEDLINE, for example, and
link to 3D structures reported in these articles. (iii) Sequence and structure
visualization: identifying a homolog with known structure, one may view molecular-graphic
and alignment displays to infer approximate 3D structure. In this article we
focus on two features of Entrez's Molecular Modeling Database (MMDB) not
described previously: links from individual biopolymer chains within 3D structures
to a systematic taxonomy of organisms represented in molecular databases, and
links from individual chains (and compact 3D domains within them) to structure
neighbors, that is, other chains (and 3D domains) with similar 3D structure. MMDB may be
accessed at http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Structure
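Besides the web page, Entrez databases can be queried programmatically through the NCBI E-utilities. The Python sketch below only builds an ESearch request URL for the structure database and performs no network access. The helper name `structure_search_url` and the example search term are assumptions for illustration; check the E-utilities documentation for current usage rules before sending requests.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities ESearch service.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def structure_search_url(term, max_results=5):
    """Build an ESearch URL that looks up `term` in the Entrez structure database."""
    params = {
        "db": "structure",      # Entrez database holding MMDB records
        "term": term,           # free-text query
        "retmode": "json",      # ask for a JSON response
        "retmax": max_results,  # cap the number of returned ids
    }
    return EUTILS_ESEARCH + "?" + urlencode(params)

print(structure_search_url("capsid protein"))
```

The returned URL can be opened in a browser or fetched with any HTTP client; the response lists the identifiers of matching structure records.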
SUPERFAMILY is a database of structural and functional annotation for all
proteins and genomes.[1][2][3][4][5]
The SUPERFAMILY annotation is based on a collection of hidden Markov models which
represent structural protein domains at the SCOP superfamily level.[6] A superfamily groups
together domains which have an evolutionary relationship. The annotation is produced by
scanning protein sequences from completely sequenced genomes against the hidden Markov
models.
For each protein you can:
Submit sequences for SCOP classification
View domain organisation, sequence alignments and protein sequence details
For each genome you can:
Examine superfamily assignments, phylogenetic trees, domain organisation lists and
networks
Check for over- and under-represented superfamilies within a genome
For each superfamily you can:
Inspect SCOP classification, functional annotation, Gene Ontology annotation, InterPro
abstract and genome assignments
Explore taxonomic distribution of a superfamily across the tree of life
All annotation, models and the database dump are freely available for download to everyone.
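One simple way to think about over- and under-represented superfamilies is to compare a superfamily's share of domain assignments in one genome with its share in other genomes. The Python sketch below does this with a z-score. It is an illustrative assumption and not necessarily the statistic SUPERFAMILY uses; the function name, the input layout (superfamily name mapped to a domain count), and the counts are invented for the example.

```python
from statistics import mean, stdev

def representation_z_scores(genome_counts, other_genomes):
    """Z-score of each superfamily's share of domains in one genome,
    relative to its share in a list of background genomes.

    Returns None for a superfamily when the background shows no spread
    (fewer than two genomes, or identical shares everywhere).
    """
    def fractions(counts):
        total = sum(counts.values())
        return {sf: n / total for sf, n in counts.items()} if total else {}

    target = fractions(genome_counts)
    others = [fractions(c) for c in other_genomes]
    names = set(target).union(*others)

    scores = {}
    for sf in sorted(names):
        background = [o.get(sf, 0.0) for o in others]
        spread = stdev(background) if len(background) > 1 else 0.0
        if spread == 0:
            scores[sf] = None
        else:
            scores[sf] = (target.get(sf, 0.0) - mean(background)) / spread
    return scores

# Invented counts: the genome of interest is rich in "kinase" domains.
genome = {"kinase": 30, "globin": 10}
background = [{"kinase": 10, "globin": 10}, {"kinase": 10, "globin": 30}]
print(representation_z_scores(genome, background))
```

A clearly positive score suggests over-representation and a clearly negative score under-representation; with only a handful of background genomes the scores are very noisy and should be read as a rough guide.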
Purpose
SUPERFAMILY classifies amino acid sequences into known structural domains, especially
into SCOP superfamilies. The superfamilies are groups of proteins which have structural
evidence to support a common evolutionary ancestor but may not have detectable
sequence homology.
Major Features
Sequence search: Submit your protein or DNA sequence for SCOP superfamily and family level classification.
Keyword search: Search for superfamily, family, or species names, plus sequence, SCOP, PDB or hidden Markov model IDs.
Domain assignments: Domain assignments, alignments and architectures for completely sequenced eukaryotic and prokaryotic organisms, plus sequence collections.
Comparative genomics tools: Browse unusual (over- and under-represented) superfamilies and families, adjacent domain pair lists and graphs, unique domain pairs, domain combinations, domain architecture co-occurrence networks and domain distribution across taxonomic kingdoms for each organism.
Genome statistics: For each genome: number of sequences, number of sequences with assignment, percentage of sequences with assignment, percentage total sequence coverage, number of domains assigned, number of superfamilies assigned, number of families assigned, average superfamily size, percentage produced by duplication, average sequence length, average length matched, number of domain pairs and number of unique domain architectures.
Gene Ontology: Domain-centric Gene Ontology (GO), automatically annotated, by Hai Fang.
Phenotype Ontology: Domain-centric phenotype/anatomy ontology including Disease Ontology, Human Phenotype, Mouse Phenotype, Worm Phenotype, Yeast Phenotype, Fly Phenotype, Fly Anatomy, Zebrafish Anatomy, Xenopus Anatomy, Arabidopsis Plant.
Superfamily annotation: InterPro abstracts for 1052 superfamilies and Gene Ontology (GO) annotation for 763 superfamilies.
Functional annotation: Functional annotation of SCOP 1.73 superfamilies, by Christine Vogel.
Phylogenetic trees: Trees are generated using heuristic parsimony methods and are based on protein domain architecture data for all genomes in SUPERFAMILY. Genome combinations or specific clades can be displayed as individual trees.
Similar domain architectures: Find the 10 domain architectures which are most similar to a domain architecture of interest.
Hidden Markov models: Produce SCOP domain assignments for your sequences using the SUPERFAMILY models. HMM visualisation by Martin Madera, e.g. model 0045110.
Profile comparison: Find remote domain matches when the HMM search fails to find a significant match. Profile comparison (PRC) for aligning and scoring two profile hidden Markov models, by Martin Madera.
Web services: Distributed Annotation Server and linking to SUPERFAMILY.
Downloads: Sequences, assignments, models, MySQL database and scripts, updated weekly.
The SUPERFAMILY database provides protein domain assignments, at the SCOP superfamily
level, for the predicted protein sequences in over 400 completed genomes. A superfamily groups
together domains of different families which have a common evolutionary ancestor based on
structural, functional and sequence data. SUPERFAMILY domain assignments are generated
using an expert curated set of profile hidden Markov models. All models and structural
assignments are available for browsing and download from http://supfam.org. The web interface
includes services such as domain architectures and alignment details for all protein assignments,
searchable domain combinations, domain occurrence network visualization, detection of over- or
under-represented superfamilies for a given genome by comparison with other genomes,
assignment of manually submitted sequences, and keyword searches. In this update we describe
the SUPERFAMILY database and outline two major developments: (i) incorporation of family
level assignments and (ii) a superfamily-level functional annotation. The SUPERFAMILY
database can be used for general protein evolution and superfamily-specific studies, genomic
annotation, and structural genomics target suggestion and assessment.
7272019 Essential Info Notes-1
httpslidepdfcomreaderfullessential-info-notes-1 5457
Consider only interfaces having the following location
intra inter dont care
Search
Clear
MMDB
Experimentally resolved structures of proteins RNA and DNA derived from the Protein DataBank (PDB) with value-addedfeatures such as explicit chemical graphs computationally
identified 3D domains (compact substructures) that are used to identify similar 3D structures as well as links to literature similar sequences information about chemicals bound to thestructures and more These connections make it possible for example tofind 3D structuresfor homologs of a protein sequence of interest then interactively view the sequence-structurerelationshipsactive sites bound chemicals journal articles and more
Three-dimensional structures are now known within many protein families and it is
quite likely in searching a sequence database that one will encounter a homolog
with known structure The goal of Entrezrsquos 3D-structure database is to make this
information and the functional annotation it can provide easily accessible tomolecular biologists To this end Entrezrsquos search engine provides three powerful
features (i) Sequence and structure neighbors one may select all sequences similar
to one of interest for example and link to any known 3D structures (ii) Links
between databases one may search by term matching in MEDLINE for example and
link to 3D structures reported in these articles (iii) Sequence and structure
visualization identifying a homolog with known structure one may view molecular-
graphic and alignment displays to infer approximate 3D structure In this article we
focus on two features of Entrezrsquos Molecular Modeling Database (MMDB) not
described previously links from individual biopolymer chains within 3D structuresto a systematic taxonomy of organisms represented in molecular databases and
links from individual chains (and compact 3D domains within them) to structure
neighbors other chains (and 3D domains) with similar 3D structure MMDB may be
accessed athttpwwwncbinlmnihgoventrezqueryfcgidb=Structure
7272019 Essential Info Notes-1
httpslidepdfcomreaderfullessential-info-notes-1 5557
SUPERFAMILY is a database of structural and functional annotation for all
proteins and genomes[1][2][3][4][5]
The SUPERFAMILY annotation is based on a collection of hidden Markov models which
represent structural protein domains at the SCOP superfamilylevel[6]
A superfamily groups
together domains which have an evolutionaryr elationship The annotation is produced by
scanning protein sequences from completely sequenced genomes against the hidden Markov
models
For each protein you can
Submit sequences for SCOP classification
View domain organisation sequence alignments and protein sequence details
For each genome you can
Examine superfamily assignments phylogenetic trees domain organisation lists and
networks
Check for over- and under-represented superfamilies within a genome
For each superfamily you can
Inspect SCOP classification functional annotation Gene Ontologyannotation InterPro
abstract and genome assignments
Explore taxonomic distribution of a superfamily across the tree of life
All annotation models and the database dump are freely available for download to everyone
Contents
[hide]
1 Purpose
2 See also
3 References
4 External links
Purpose[edit source]
SUPERFAMILY classifies amino acid sequences into known structural domains especially
into SCOP superfamilies The superfamilies are groups of proteins which have structural
evidence to support a common evolutionary ancestor but may not have detectable
sequence homology
7272019 Essential Info Notes-1
httpslidepdfcomreaderfullessential-info-notes-1 5657
Major Features
Sequence search: Submit your protein or DNA sequence for SCOP superfamily and family level classification.
Keyword search: Search for superfamily, family or species names, plus sequence, SCOP, PDB or hidden Markov model IDs.
Domain assignments: Domain assignments, alignments and architectures for completely sequenced eukaryotic and prokaryotic organisms, plus sequence collections.
Comparative genomics tools: Browse unusual (over- and under-represented) superfamilies and families, adjacent domain pair lists and graphs, unique domain pairs, domain combinations, domain architecture co-occurrence networks and domain distribution across taxonomic kingdoms for each organism.
Genome statistics: For each genome: number of sequences, number of sequences with assignment, percentage of sequences with assignment, percentage total sequence coverage, number of domains assigned, number of superfamilies assigned, number of families assigned, average superfamily size, percentage produced by duplication, average sequence length, average length matched, number of domain pairs and number of unique domain architectures.
Gene Ontology: Domain-centric Gene Ontology (GO), automatically annotated by Hai Fang.
Phenotype Ontology: Domain-centric phenotype/anatomy ontology, including Disease Ontology, Human Phenotype, Mouse Phenotype, Worm Phenotype, Yeast Phenotype, Fly Phenotype, Fly Anatomy, Zebrafish Anatomy, Xenopus Anatomy and Arabidopsis Plant.
Superfamily annotation: InterPro abstracts for 1052 superfamilies and Gene Ontology (GO) annotation for 763 superfamilies.
Functional annotation: Functional annotation of SCOP 1.73 superfamilies by Christine Vogel.
Phylogenetic trees: Trees are generated using heuristic parsimony methods and are based on protein domain architecture data for all genomes in SUPERFAMILY. Genome combinations or specific clades can be displayed as individual trees.
Similar domain architectures: Find the 10 domain architectures which are most similar to a domain architecture of interest.
Hidden Markov models: Produce SCOP domain assignments for your sequences using the SUPERFAMILY models. HMM visualisation by Martin Madera, e.g. model 0045110.
Profile comparison: Find remote domain matches when the HMM search fails to find a significant match. Profile comparison (PRC) for aligning and scoring two profile hidden Markov models, by Martin Madera.
Web services: Distributed Annotation Server and linking to SUPERFAMILY.
Downloads: Sequences, assignments, models, MySQL database and scripts (updated weekly).
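Several of the genome statistics listed under Major Features are simple aggregates over the domain assignments, and a short sketch may make their definitions concrete. The input layout (a dictionary of proteins with lengths and domain coordinates) and all names are assumptions for illustration, not the format of the SUPERFAMILY downloads; domains are assumed not to overlap.

```python
def genome_statistics(proteins):
    """Summarise domain assignments for one genome.

    proteins maps a protein ID to a dict with keys
      "length":  sequence length
      "domains": list of (superfamily, start, end), 1-based inclusive
    Domains within a protein are assumed not to overlap.
    """
    n_sequences = len(proteins)
    assigned = [p for p in proteins.values() if p["domains"]]
    total_length = sum(p["length"] for p in proteins.values())
    matched = sum(
        end - start + 1 for p in assigned for _, start, end in p["domains"]
    )

    architectures, pairs, superfamilies = set(), set(), set()
    n_domains = 0
    for p in assigned:
        # Order domains along the sequence to obtain the architecture.
        ordered = [sf for sf, _, _ in sorted(p["domains"], key=lambda d: d[1])]
        architectures.add(tuple(ordered))
        pairs.update(zip(ordered, ordered[1:]))  # adjacent domain pairs
        superfamilies.update(ordered)
        n_domains += len(ordered)

    return {
        "sequences": n_sequences,
        "sequences_with_assignment": len(assigned),
        "percent_with_assignment":
            100 * len(assigned) / n_sequences if n_sequences else 0.0,
        "percent_sequence_coverage":
            100 * matched / total_length if total_length else 0.0,
        "domains_assigned": n_domains,
        "superfamilies_assigned": len(superfamilies),
        "unique_domain_pairs": len(pairs),
        "unique_architectures": len(architectures),
    }


# Hypothetical genome with four proteins.
toy_genome = {
    "p1": {"length": 100, "domains": [("sf1", 1, 40), ("sf2", 51, 90)]},
    "p2": {"length": 100, "domains": [("sf1", 11, 60)]},
    "p3": {"length": 50, "domains": []},
    "p4": {"length": 150, "domains": [("sf1", 1, 40), ("sf2", 51, 90)]},
}
stats = genome_statistics(toy_genome)
```

In the toy genome three of four proteins carry an assignment (75%), 210 of 400 residues are covered (52.5%), and the two architectures are sf1 alone and sf1 followed by sf2.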
The SUPERFAMILY database provides protein domain assignments, at the SCOP superfamily
level, for the predicted protein sequences in over 400 completed genomes. A superfamily groups
together domains of different families which have a common evolutionary ancestor based on
structural, functional and sequence data. SUPERFAMILY domain assignments are generated
using an expert curated set of profile hidden Markov models. All models and structural
assignments are available for browsing and download from http://supfam.org. The web interface
includes services such as domain architectures and alignment details for all protein assignments,
searchable domain combinations, domain occurrence network visualization, detection of over- or
under-represented superfamilies for a given genome by comparison with other genomes,
assignment of manually submitted sequences and keyword searches. In this update we describe
the SUPERFAMILY database and outline two major developments: (i) incorporation of family
level assignments and (ii) a superfamily-level functional annotation. The SUPERFAMILY
database can be used for general protein evolution and superfamily-specific studies, genomic
annotation, and structural genomics target suggestion and assessment.
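Domain architectures, as used by the web interface, can be treated as ordered lists of superfamily identifiers, and searching for architectures similar to one of interest then becomes a ranking problem. The source does not state which similarity measure SUPERFAMILY applies, so the sketch below assumes the sequence-matching ratio from Python's standard difflib module purely for illustration. The architecture labels are hypothetical.

```python
from difflib import SequenceMatcher


def most_similar_architectures(query, candidates, n=10):
    """Return up to n candidate architectures ranked by similarity to query.

    Each architecture is a tuple of superfamily identifiers in N- to
    C-terminal order. Similarity is difflib's ratio, 2*M/T, where M is the
    number of matched domains and T the combined length (an assumption,
    not SUPERFAMILY's documented measure).
    """
    def similarity(architecture):
        matcher = SequenceMatcher(None, query, architecture, autojunk=False)
        return matcher.ratio()

    return sorted(candidates, key=similarity, reverse=True)[:n]


# Hypothetical architectures built from superfamilies "a", "b", "c", ...
query = ("a", "b", "c")
candidates = [("x", "y"), ("a",), ("a", "b"), ("a", "b", "c", "d")]
top_two = most_similar_architectures(query, candidates, n=2)
```

For this query the architecture ("a", "b", "c", "d") ranks first and ("a", "b") second, because they share the most domains in the same order relative to their combined length.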