


InterPro ontology mapped into the Chado schema cvterm_relationship table

Protocols for Representation of Protein Domain Annotations in Clade-Oriented Databases: a Case Study at the Legume Information System using Chado/Tripal

Pooja E. Umale, Andrew D. Farmer

National Center for Genome Resources (NCGR), Santa Fe, NM 87505, USA

Introduction   Methods   Results

InterPro Consortium Databases

• PROSITE
• HAMAP
• PFAM
• PRINTS
• ProDom
• SMART
• TIGRFAMS
• PIRSF
• SUPERFAMILY
• CATH-Gene3D
• PANTHER

• Input FASTA amino acid sequences
• Score BLAST hits
• Tokenize BLAST hits
• Score the tokens (lexical analysis)
• Gene Ontology annotation
• Assign best scoring description
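The token-scoring idea in the AHRD workflow steps above can be illustrated with a deliberately simplified sketch. This is not AHRD's actual scoring scheme, which the poster does not spell out; the stop-word set, the function names, and the example hits are all hypothetical, and the score used here (summed bit score per token, averaged per description) is only an assumption chosen for clarity.

```python
import re
from collections import defaultdict

# Hypothetical stop words that carry no functional meaning.
UNINFORMATIVE = {"protein", "putative", "predicted", "uncharacterized"}

def tokenize(description):
    """Split a BLAST hit description into lowercase, informative tokens."""
    words = re.findall(r"[a-z0-9]+", description.lower())
    return [w for w in words if w not in UNINFORMATIVE]

def best_description(hits):
    """Return the description whose tokens are best supported across all hits.

    hits: list of (description, bit_score) tuples.
    """
    # Score each token by the summed bit score of the hits that contain it.
    token_score = defaultdict(float)
    for description, bit_score in hits:
        for token in set(tokenize(description)):
            token_score[token] += bit_score

    def description_score(description):
        tokens = set(tokenize(description))
        if not tokens:
            return 0.0
        # Use the mean so that long descriptions are not favoured automatically.
        return sum(token_score[t] for t in tokens) / len(tokens)

    # max() keeps the first description when scores tie.
    return max((d for d, _ in hits), key=description_score)

# Made-up hits for illustration only.
hits = [
    ("Phenylalanine ammonia-lyase", 950.0),
    ("Putative phenylalanine ammonia-lyase protein", 900.0),
    ("Uncharacterized protein", 400.0),
]
print(best_description(hits))
```

Filtering uninformative words before scoring is what keeps a high-scoring but empty description such as "Uncharacterized protein" from being selected.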

InterPro is a searchable database that we use to elucidate protein function and annotation in our project. The InterProScan tool scans query sequences against the InterPro protein signature databases. We employed AHRD (https://github.com/groupschoof/AHRD) to assign human-readable descriptions to predicted proteins. For a better user experience and visualization, we also incorporated the protein domain annotations into the MSA view provided by Jalview. The Chado database, Drupal (an open source content management system) and GMOD's Tripal are the software tools used for data storage and module/website development.

Acknowledgements  

Web-based presentation of protein domain data and its annotations is available at http://www.legumeinfo.org/search/protein_domains. We developed a shareable Tripal extension module for this purpose, enabling search by domains and interlinking our domain-oriented representation with other modules that showcase genes and gene families of legumes.

Gene family set sharing common domain

Chado schema representation of InterProScan results

Example: Jalview display of protein domain annotations on consensus sequence of a gene family

AHRD tool workflow

• feature table (match$1_26_518), protein_hmm_match domain feature: feature_id, organism_id, uniquename, type_id
• featureloc table (for source feature 1): featureloc_id, feature_id, srcfeature_id, fmin, fmax
• featureloc table (for source feature 2): featureloc_id, feature_id, srcfeature_id, fmin, fmax
• organism table: organism_id, genus, species
• cvterm table: cvterm_id, cv_id, name
• feature table (PF00221), HMM representation of domain: feature_id
• feature table (glyma.Glyma.10G209800.1), polypeptide feature: feature_id
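The feature and featureloc layout above can be sketched with an in-memory SQLite database standing in for Chado, which normally runs on PostgreSQL. Only the columns named on the poster are modelled, plus featureloc's `rank` column. The numeric ids, the `type_id` constants, the coordinates, and the choice of which location gets rank 0 are assumptions made for illustration, not values from the Legume Information System.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE feature (
    feature_id  INTEGER PRIMARY KEY,
    organism_id INTEGER,
    uniquename  TEXT NOT NULL,
    type_id     INTEGER               -- would reference cvterm.cvterm_id
);
CREATE TABLE featureloc (
    featureloc_id INTEGER PRIMARY KEY,
    feature_id    INTEGER NOT NULL REFERENCES feature(feature_id),
    srcfeature_id INTEGER NOT NULL REFERENCES feature(feature_id),
    fmin          INTEGER,
    fmax          INTEGER,
    rank          INTEGER NOT NULL DEFAULT 0
);
""")

# Hypothetical type_id values standing in for cvterm rows.
POLYPEPTIDE, HMM_DOMAIN, PROTEIN_HMM_MATCH = 1, 2, 3

conn.executemany(
    "INSERT INTO feature (feature_id, organism_id, uniquename, type_id) "
    "VALUES (?, ?, ?, ?)",
    [
        (1, 1, "glyma.Glyma.10G209800.1", POLYPEPTIDE),
        (2, 1, "PF00221", HMM_DOMAIN),
        (3, 1, "match$1_26_518", PROTEIN_HMM_MATCH),
    ],
)

# One match feature, two locations: one on the polypeptide, one on the HMM.
# Coordinates are illustrative only.
conn.executemany(
    "INSERT INTO featureloc (feature_id, srcfeature_id, fmin, fmax, rank) "
    "VALUES (?, ?, ?, ?, ?)",
    [
        (3, 1, 25, 518, 0),   # source feature 1: the polypeptide
        (3, 2, 0, 470, 1),    # source feature 2: the domain HMM
    ],
)

def polypeptides_with_domain(conn, domain_name):
    """List (polypeptide, fmin, fmax) for every match against a domain."""
    return conn.execute("""
        SELECT p.uniquename, pl.fmin, pl.fmax
        FROM feature d
        JOIN featureloc dl ON dl.srcfeature_id = d.feature_id
        JOIN featureloc pl ON pl.feature_id = dl.feature_id
                          AND pl.featureloc_id <> dl.featureloc_id
        JOIN feature p ON p.feature_id = pl.srcfeature_id
        WHERE d.uniquename = ?
        ORDER BY p.uniquename
    """, (domain_name,)).fetchall()

print(polypeptides_with_domain(conn, "PF00221"))
```

Because the match is its own feature with a location on each source, the same join answers both "which domains does this protein have" and "which proteins share this domain".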

Display of the set of genes that have a common domain

Protein domains can be conceptualized from a number of perspectives, from their role in defining an individual protein's structure and function to their evolutionary role in creating novel molecular functions through duplication and recombination into unique multi-domain protein architectures. Although many species- and clade-oriented databases use standard protein domain analyses to characterize the putative functions and cellular localizations of the gene products represented in the genomes and transcriptomes of their species of interest, this is often limited to treating the matched domains as properties of the genes that simply aid their classification and retrieval. While this gene-centric perspective is clearly of great importance, elevating domains to a prominent position in such databases has the potential to provide insights into many interesting biological questions, from the role of domains in constraining and shaping intra-species diversity patterns (including SNPs, splice isoforms, and gene fusions) to their role in defining gene family groupings of orthologous and paralogous genes and in revealing the evolutionary dynamics of those families.

We have utilized and extended a set of widely used open source tools for analysis, storage and web-based presentation of protein domain data to populate the Chado database underlying the Legume Information System (http://legumeinfo.org) and to make this data available through a shareable Tripal extension module that enables search by domains, exploits the ontological structure of InterPro, and interlinks our domain-oriented representation with other modules for presentation of genes and gene families.

Protein domain search page

• dbxref table (IPR001106): dbxref_id
• cvterm table (Aromatic amino acid lyase): cvterm_id, dbxref_id
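One way the InterPro hierarchy stored in cvterm_relationship could support searching is a recursive query that expands an entry to all of its descendants. The sketch below again uses in-memory SQLite as a stand-in for Chado and keeps only a few columns; Chado's cvterm_relationship also carries a `type_id`, omitted here. The child and grandchild entries are invented placeholders, not real InterPro accessions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dbxref (
    dbxref_id INTEGER PRIMARY KEY,
    accession TEXT NOT NULL
);
CREATE TABLE cvterm (
    cvterm_id INTEGER PRIMARY KEY,
    name      TEXT NOT NULL,
    dbxref_id INTEGER REFERENCES dbxref(dbxref_id)
);
CREATE TABLE cvterm_relationship (
    cvterm_relationship_id INTEGER PRIMARY KEY,
    subject_id INTEGER NOT NULL REFERENCES cvterm(cvterm_id),  -- child term
    object_id  INTEGER NOT NULL REFERENCES cvterm(cvterm_id)   -- parent term
);
""")

conn.executemany(
    "INSERT INTO dbxref (dbxref_id, accession) VALUES (?, ?)",
    [(1, "IPR001106"), (2, "IPR_CHILD_A"), (3, "IPR_GRANDCHILD_B")],
)
conn.executemany(
    "INSERT INTO cvterm (cvterm_id, name, dbxref_id) VALUES (?, ?, ?)",
    [
        (1, "Aromatic amino acid lyase", 1),
        (2, "hypothetical child entry", 2),
        (3, "hypothetical grandchild entry", 3),
    ],
)
conn.executemany(
    "INSERT INTO cvterm_relationship (subject_id, object_id) VALUES (?, ?)",
    [(2, 1), (3, 2)],
)

def entry_and_descendants(conn, accession):
    """Return the accession itself plus every entry below it in the hierarchy."""
    rows = conn.execute("""
        WITH RECURSIVE subtree(cvterm_id) AS (
            SELECT c.cvterm_id
            FROM cvterm c
            JOIN dbxref x ON x.dbxref_id = c.dbxref_id
            WHERE x.accession = ?
            UNION
            SELECT r.subject_id
            FROM cvterm_relationship r
            JOIN subtree s ON r.object_id = s.cvterm_id
        )
        SELECT x.accession
        FROM subtree s
        JOIN cvterm c ON c.cvterm_id = s.cvterm_id
        JOIN dbxref x ON x.dbxref_id = c.dbxref_id
        ORDER BY x.accession
    """, (accession,)).fetchall()
    return [row[0] for row in rows]

print(entry_and_descendants(conn, "IPR001106"))
```

A domain search could feed the expanded accession list into its match query, so that searching for a parent entry also returns proteins annotated only with a more specific child entry.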

References/Publications

• The InterPro protein families database: the classification resource after 15 years. Nucleic Acids Research, Jan 2015; doi:10.1093/nar/gku1243
• InterProScan 5: genome-scale protein function classification. Bioinformatics, Jan 2014; doi:10.1093/bioinformatics/btu031
• Waterhouse AM, Procter JB, Martin DMA, Clamp M, Barton GJ (2009) Jalview Version 2-a multiple sequence alignment editor and analysis workbench. Bioinformatics 25: 1189-1191. doi:10.1093/bioinformatics/btp033
• Ficklin S.P., Sanderson L.A., Cheng C.H., Staton M.E., Lee T., Cho I.H., Jung S., Bett K.E., Main D. Tripal: a construction toolkit for online genome databases. Database. 2011:bar044.

Example: GFF file storing iprscan results
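As a rough illustration of how one record of such a GFF file might be read before loading, the sketch below parses a single GFF3 line into the identifiers used elsewhere on this poster. The sample line is hand-made for illustration and is not taken from the actual iprscan output; the function name and the returned keys are hypothetical. The conversion from GFF3's 1-based inclusive coordinates to Chado's 0-based interbase fmin/fmax is the one step that is easy to get wrong.

```python
from urllib.parse import unquote

def parse_gff3_line(line):
    """Parse one GFF3 feature line into a dict with Chado-style coordinates."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(cols)}")
    seqid, source, ftype, start, end, _score, _strand, _phase, attrs = cols

    # Column 9 holds key=value pairs separated by semicolons.
    attributes = {}
    for pair in attrs.split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attributes[key] = unquote(value.strip('"'))

    return {
        "polypeptide": seqid,
        "source": source,
        "type": ftype,
        "fmin": int(start) - 1,   # GFF3 start is 1-based; fmin is 0-based
        "fmax": int(end),         # inclusive end equals the interbase fmax
        "attributes": attributes,
    }

# Hand-made example line (assumption, not real pipeline output).
SAMPLE = "\t".join([
    "glyma.Glyma.10G209800.1", "Pfam", "protein_match", "26", "518",
    ".", "+", ".",
    'ID=match$1_26_518;Name=PF00221;Dbxref="InterPro:IPR001106"',
])

record = parse_gff3_line(SAMPLE)
print(record["attributes"]["ID"], record["fmin"], record["fmax"])
```

The `ID`, `Name`, and `Dbxref` attributes map naturally onto the match feature, the domain feature, and the InterPro dbxref shown in the schema diagram.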


Future Directions

• Use of the ontology structure of InterPro to enhance searching
• Display of intraspecific variation in the context of the domain architecture (similar to how we are now displaying interspecific variation in the MSAs)