
InterPro ontology mapped into Chado schema

cvterm_relationship table

Protocols for Representation of Protein Domain Annotations in Clade-Oriented Databases: a Case Study at the Legume Information System using Chado/Tripal

Pooja E. Umale, Andrew D. Farmer

National Center for Genome Resources (NCGR), Santa Fe, NM 87505, USA

Introduction Methods Results

InterPro Consortium Databases

PROSITE

HAMAP

PFAM

PRINTS

ProDom

SMART

TIGRFAMS

PIRSF

SUPERFAMILY

CATH-Gene3D

PANTHER

Input FASTA amino acid sequences

Score BLAST hits

Tokenize blast hits

Score the tokens (lexical analysis)

Gene Ontology annotation

Assign best scoring description
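The workflow steps above (score hits, tokenize, score tokens, pick the best description) can be illustrated with a deliberately simplified sketch. This is not AHRD's actual scoring formula: the hit list, the word blacklist, and the averaging rule are all assumptions made for illustration.

```python
import re
from collections import defaultdict

# Hypothetical BLAST hits for one query protein: (description, bit score).
hits = [
    ("Phenylalanine ammonia-lyase", 950.0),
    ("Phenylalanine ammonia-lyase 2", 900.0),
    ("Uncharacterized protein", 400.0),
]

# Assumed list of uninformative words; not AHRD's actual blacklist.
UNINFORMATIVE = {"protein", "uncharacterized", "putative", "hypothetical"}


def tokenize(description):
    """Split a description into lowercase alphanumeric tokens."""
    return [t for t in re.split(r"[^A-Za-z0-9]+", description.lower()) if t]


def informative_tokens(description):
    """Unique tokens of a description, minus the uninformative ones."""
    return {t for t in tokenize(description) if t not in UNINFORMATIVE}


def best_description(hits):
    """Return the description whose informative tokens score highest on average."""
    total = sum(score for _, score in hits)
    token_score = defaultdict(float)
    # A token's score is the share of total bit score of the hits containing it.
    for desc, score in hits:
        for tok in informative_tokens(desc):
            token_score[tok] += score / total

    def description_score(desc):
        toks = informative_tokens(desc)
        if not toks:
            return 0.0
        return sum(token_score[t] for t in toks) / len(toks)

    return max(hits, key=lambda h: description_score(h[0]))[0]


chosen = best_description(hits)
```

In this toy input the description supported by the most high-scoring hits wins, while the "Uncharacterized protein" hit scores zero because none of its tokens are informative.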

InterPro is a searchable database that we use to elucidate protein function and annotation for our project. The InterProScan tool is used to scan query sequences against the InterPro protein signature databases. We employed AHRD (https://github.com/groupschoof/AHRD) to assign human-readable descriptions to predicted proteins. For a better user experience, we also incorporated visualization of protein domain annotations into the MSA view provided by Jalview. The Chado database, Drupal (an open source content management system) and GMOD’s Tripal are the software tools that were used for data storage and module/website development.

Acknowledgements

Web-based presentation of protein domain data and its annotations is available at http://www.legumeinfo.org/search/protein_domains. We developed a shareable Tripal extension module for this purpose, enabling search by domains and interlinking our domain-oriented representation with other modules that showcase genes and gene families of legumes.

Gene family set sharing common domain

Chado schema representation of InterProScan results

Example: Jalview display of protein domain annotations on consensus sequence of a gene family

AHRD tool workflow

feature table (match$1_26_518), protein_hmm_match domain feature: feature_id, organism_id, uniquename, type_id

featureloc table (for source feature-1): featureloc_id, feature_id, srcfeature_id, fmin, fmax

featureloc table (for source feature-2): featureloc_id, feature_id, srcfeature_id, fmin, fmax

organism table: organism_id, genus, species

cvterm table: cvterm_id, cv_id, name

feature table (PF00221), HMM representation of domain: feature_id

feature table (glyma.Glyma.10G209800.1), polypeptide feature: feature_id
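The schema layout above can be sketched as a runnable toy model: one match feature with two featureloc rows, one locating it on the polypeptide and one on the HMM domain feature. This is a minimal sketch using SQLite rather than a real Chado instance. Assumptions: the tables are cut down to a few columns, `type` is stored as text instead of a `type_id` pointing to cvterm, the `hmm_domain` type name is a placeholder, the `rank` column (part of Chado's featureloc but not shown on the poster) is assigned arbitrarily, and all coordinates are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE feature (
    feature_id INTEGER PRIMARY KEY,
    uniquename TEXT NOT NULL,
    type       TEXT NOT NULL   -- simplification of type_id -> cvterm
);
CREATE TABLE featureloc (
    featureloc_id INTEGER PRIMARY KEY,
    feature_id    INTEGER NOT NULL REFERENCES feature(feature_id),
    srcfeature_id INTEGER NOT NULL REFERENCES feature(feature_id),
    fmin          INTEGER,
    fmax          INTEGER,
    rank          INTEGER NOT NULL
);
""")

conn.executemany("INSERT INTO feature VALUES (?, ?, ?)", [
    (1, "glyma.Glyma.10G209800.1", "polypeptide"),
    (2, "PF00221", "hmm_domain"),              # placeholder type name
    (3, "match$1_26_518", "protein_hmm_match"),
])

# Two locations for the same match feature (coordinates are illustrative).
conn.executemany("INSERT INTO featureloc VALUES (?, ?, ?, ?, ?, ?)", [
    (1, 3, 1, 25, 518, 0),   # match located on the polypeptide
    (2, 3, 2, 0, 470, 1),    # match located on the HMM domain feature
])


def polypeptides_with_domain(conn, domain_name):
    """List (polypeptide, fmin, fmax) for every match against the given domain."""
    sql = """
        SELECT p.uniquename, pl.fmin, pl.fmax
        FROM feature d
        JOIN featureloc dl ON dl.srcfeature_id = d.feature_id
        JOIN featureloc pl ON pl.feature_id = dl.feature_id
                          AND pl.featureloc_id != dl.featureloc_id
        JOIN feature p     ON p.feature_id = pl.srcfeature_id
        WHERE d.uniquename = ? AND p.type = 'polypeptide'
        ORDER BY p.uniquename
    """
    return conn.execute(sql, (domain_name,)).fetchall()


rows = polypeptides_with_domain(conn, "PF00221")
```

Because the domain is itself a feature, finding every gene product that carries it becomes a join through the match feature's two locations, which is the kind of query a domain-oriented search page needs.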

Display of set of genes that have common domain

Protein domains can be conceptualized from a number of perspectives, from their role in defining an individual protein’s structure and function to their evolutionary role in creating novel molecular functions through duplication and recombination into unique multi-domain protein architectures. Although many species- and clade-oriented databases use standard protein domain analyses to characterize the putative functions and cellular localizations of the gene products represented in the genomes and transcriptomes of their species of interest, this is often limited to treating the matched domains as properties of the genes that are simply an aid to their classification and retrieval. While this gene-centric perspective is clearly of great importance, elevating domains to a prominent position in such databases has the potential to provide insights into many interesting biological questions, from the role of domains in constraining and shaping intra-species diversity patterns (including SNPs, splice isoforms, and gene fusions) to their role in defining gene family groupings of orthologous and paralogous genes and in revealing the evolutionary dynamics of those families. We have utilized and extended a set of widely used open source tools for analysis, storage and web-based presentation of protein domain data to populate the Chado database underlying the Legume Information System (http://legumeinfo.org) and to make this data available through a shareable Tripal extension module that enables search by domains, exploits the ontological structure of InterPro, and interlinks our domain-oriented representation with other modules for presentation of genes and gene families.

Protein domain search page

dbxref table (IPR001106)

cvterm table (Aromatic amino acid lyase)

cvterm_id

dbxref_id

References/Publications

The InterPro protein families database: the classification resource after 15 years. Nucleic Acids Research, Jan 2015; doi:10.1093/nar/gku1243

InterProScan 5: genome-scale protein function classification. Bioinformatics, Jan 2014; doi:10.1093/bioinformatics/btu031

Waterhouse AM, Procter JB, Martin DMA, Clamp M, Barton GJ (2009) Jalview Version 2 - a multiple sequence alignment editor and analysis workbench. Bioinformatics 25:1189-1191. doi:10.1093/bioinformatics/btp033

Ficklin S.P., Sanderson L.A., Cheng C.H., Staton M.E., Lee T., Cho I.H., Jung S., Bett K.E., Main D. Tripal: a construction toolkit for online genome databases. Database. 2011:bar044.

Example

GFF file storing iprscan results
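As a rough illustration of how a GFF record might be read before loading, the sketch below parses one line in the generic nine-column GFF3 layout. The sample line is hand-written from identifiers that appear on this poster; its source column, coordinates (read off the match identifier), and attribute set are assumptions, and real iprscan output may carry different or additional fields. The conversion to `fmin`/`fmax` assumes GFF3's 1-based inclusive coordinates and Chado's interbase coordinates.

```python
from urllib.parse import unquote


def parse_gff3_line(line):
    """Parse one GFF3 feature line into a dict; attribute values become lists."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError("expected 9 tab-separated columns, got %d" % len(cols))
    seqid, source, ftype, start, end, score, strand, phase, attrs = cols
    attributes = {}
    for pair in attrs.split(";"):
        if not pair:
            continue
        key, _, value = pair.partition("=")
        # Multiple values are comma-separated; values may be quoted or URL-escaped.
        attributes[key] = [unquote(v.strip('"')) for v in value.split(",")]
    return {
        "seqid": seqid,
        "source": source,
        "type": ftype,
        "start": int(start),
        "end": int(end),
        "attributes": attributes,
    }


def to_interbase(start, end):
    """Convert 1-based inclusive GFF3 coordinates to interbase (fmin, fmax)."""
    return start - 1, end


# Hand-written sample line (an assumption, not copied from real output).
sample = "\t".join([
    "glyma.Glyma.10G209800.1", "Pfam", "protein_hmm_match",
    "26", "518", ".", "+", ".",
    "ID=match$1_26_518;Name=PF00221;Dbxref=InterPro:IPR001106",
])

record = parse_gff3_line(sample)
fmin, fmax = to_interbase(record["start"], record["end"])
```

Each parsed record then supplies the pieces the schema needs: the sequence ID names the polypeptide feature, `Name` names the domain feature, `ID` names the match feature, and the converted coordinates fill `fmin` and `fmax`.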

Methods

Introduction

Results

Future Directions

•  Use of the ontology structure of InterPro to enhance searching

•  Display of intraspecific variation in the context of the domain architecture (similar to how we are now displaying interspecific variation in the MSAs)
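One way the first direction could work is to expand a search term to all of its descendants before looking up genes. The sketch below assumes parent/child links are stored as (subject, type, object) rows in the style of the cvterm_relationship table, with the subject being the child. The term names, the `is_a` relationship type, and the gene lists are made-up placeholders, not real InterPro entries or LIS data.

```python
from collections import defaultdict, deque

# Placeholder relationship rows: (subject, type, object), child is_a parent.
relationships = [
    ("IPR_child_A", "is_a", "IPR_parent"),
    ("IPR_child_B", "is_a", "IPR_parent"),
    ("IPR_grandchild", "is_a", "IPR_child_A"),
]

# Placeholder mapping from domain term to genes annotated with it.
genes_by_term = {
    "IPR_parent": ["gene1"],
    "IPR_child_A": ["gene2"],
    "IPR_grandchild": ["gene3"],
}


def descendants(term, relationships, rel_type="is_a"):
    """Breadth-first list of all terms below `term` in the hierarchy."""
    children = defaultdict(list)
    for subject, rel, obj in relationships:
        if rel == rel_type:
            children[obj].append(subject)
    found, queue = [], deque([term])
    while queue:
        current = queue.popleft()
        for child in children.get(current, []):
            if child not in found:
                found.append(child)
                queue.append(child)
    return found


def search_with_ontology(term):
    """Genes annotated with the term itself or any of its descendants."""
    genes = []
    for t in [term] + descendants(term, relationships):
        genes.extend(genes_by_term.get(t, []))
    return genes


expanded = search_with_ontology("IPR_parent")
```

A search on the parent term then also returns genes annotated only with more specific child entries, which a plain exact-match lookup would miss.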
