• Rice Proteins
• Data acquisition
• Curation
• Resources
• Development and integration of controlled vocabulary
• Gene Ontology
• Trait Ontology
• Plant Ontology
www.gramene.org
Rice Protein and Ontology Database
Objectives
– Annotation of rice proteins using Gene Ontology (GO) concepts of Molecular Function, Biological Process and Cellular Localization
• 4,000 rice genes annotated during project
• Leading to presentation of Rice Protein Database (RPD) (http://www.gramene.org/perl/protein_search)
– Ontology
• Contribute GO terms for monocot plants
• Develop and curate vocabulary for
• plant anatomy and developmental stages (PO-Plant Ontology)
• phenotypes or traits (TO-Trait Ontology)
Gene mining using the Controlled vocabulary
[Diagram: controlled vocabularies applied along the chain Gene, Transcript, Protein, Molecule. GO: Molecular Function (Enzyme, others), Biological Process (Pathways, Reactions, Other roles), Localization (Cell components). PO: Anatomy or Morphology (Organ: Root, Shoot, Seed; Tissue: Meristematic, Vascular, Ground; Histology; Cell type; Cell; Sub-Cellular) and Development. TO: Traits (Agronomic). Internal CVO: Molecule (Organic, Inorganic; fats/carbohydrates/proteins/mutagens/others). Branches end in "Sub components".]
Gene Ontology
Molecular function
Biological process
Cellular localization
Published report
-PubMed
-BIOSIS
-Others
Experimental evidence
• Direct enzyme assay
• Expression
• Mutant/phenotype
• Physical interaction
• Complementation
• Genetic interaction
• Localization
• Electronic prediction
• Citation
• Sequence similarity
Electronic Curation information
• Sequence similarity
• Clustal / BLAST
• Traceable author statement
• Predictions/identification
• Gene Ontology mapping
• Gramene & Interpro (EBI)
• Pfam
• PROSITE
• PROTOMAP
• Transmembrane helices
• Cellular localization
• Predictions based on HMM
• Physicochemical properties
• ProDom
• 3D-structural alignments
• DBXref / References
• GenBank
• SWISS-PROT
• EMBL/DDBJ
• Other databases
[Diagram: Gramene Modules. Labels: Sequence entry; Rice Protein database (RPD); EnsEMBL Genome Browser (sequence, BLAT, features on peptide map); IEA and ISS codes; non-IEA code; Plant Ontology (anatomy & growth stages, non-IEA code); DBXrefs; Germplasm bank; link back.]
• Name(s): Shows all the different names by which the molecule is represented in various databases and in scientific literature.
• E.C. Number(s): Shows the designated Enzyme Commission (E.C.) number. The EC numbers link to GenomeNet, Japan, from which further links to biochemical pathways and ligands are accessible.
• Gene name(s): Lists all the gene names by which the molecule is called, as designated by the Commission on Plant Gene Nomenclature. If none is available, consider using the systematic name given to the ORF/gene.
GenBank/SWISSPROT ENTRY
Get information on
Courtesy KEGG database
Protein page
Accession number: The SWISS-PROT accession number, equivalent to the "AC" field of a SWALL (EMBL) record and the "ACCESSION" field of a GenBank record for the same protein entry. It links the protein entry to other databases, namely the GenBank protein database, SWALL from EMBL, and SWISS-PROT.
GenBank/SWISSPROT ENTRY
Get information on
Organism: Represents the taxonomic information on the organism from which the protein sequence was derived.
• Species: Shows the species of the genus Oryza (presently represents 23 of 25 species).
• Subspecies: The subspecies, indica or japonica, of the rice species Oryza sativa.
• Cultivar: The variety/cultivar name from which the sequence was derived; it will link to a germplasm bank (GRIN/IRIS) for further information.
Protein page
GenBank/SWISSPROT ENTRY
Perform a “Blat” alignment of the rice protein sequences from SWISSPROT against the translated peptides from the Ensembl rice genome sequence database at Gramene. The cut-off score used is 99% identity, and the curator should validate each match. Add the features to the protein structure, a map showing protein domains (e.g. Pfam) and protein features (trans-membrane, low-complexity and coil regions), on the Ensembl peptide report page.
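The 99% identity cut-off can be illustrated with a minimal sketch that filters alignment rows in BLAT's tab-separated PSL output. The identity formula below is a simplification (matching positions over aligned, ungapped positions) and is an assumption, not necessarily the score Gramene used; the sequence names and the `demo_row` helper are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PslHit:
    query: str        # protein sequence name (PSL column 10)
    target: str       # translated peptide name (PSL column 14)
    matches: int      # PSL column 1
    mismatches: int   # PSL column 2
    rep_matches: int  # PSL column 3


def parse_psl_line(line):
    """Return a PslHit for one alignment row, or None for header lines."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 21 or not fields[0].isdigit():
        return None
    return PslHit(fields[9], fields[13],
                  int(fields[0]), int(fields[1]), int(fields[2]))


def percent_identity(hit):
    """Simplified identity: matching positions over aligned, ungapped positions."""
    aligned = hit.matches + hit.mismatches + hit.rep_matches
    return 100.0 * (hit.matches + hit.rep_matches) / aligned if aligned else 0.0


def hits_above_cutoff(lines, cutoff=99.0):
    """Keep only alignments at or above the identity cut-off."""
    hits = (parse_psl_line(line) for line in lines)
    return [h for h in hits if h is not None and percent_identity(h) >= cutoff]


def demo_row(matches, mismatches, query, target):
    """Build a 21-column PSL row with hypothetical values for illustration."""
    row = ["0"] * 21
    row[0], row[1] = str(matches), str(mismatches)
    row[8], row[9], row[13] = "+", query, target
    return "\t".join(row)


demo = [
    "psLayout version 3",                      # header line, skipped
    demo_row(260, 0, "PROT_A", "PEPTIDE_1"),   # 100% identity, kept
    demo_row(250, 10, "PROT_B", "PEPTIDE_2"),  # below the cut-off, dropped
]
kept = hits_above_cutoff(demo)
print([(h.query, h.target) for h in kept])
```

Hits that pass the filter would still go to the curator for validation, as the slide states; the filter only narrows the candidate list.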
Sequence
Use it for performing analyses to identify features such as Pfam / Prosite domains, and to generate predictions for trans-membrane helices, coiled-coil regions and cellular component localization
Validation
Based on available CDS features and gene indices/ESTs
Map with features
Protein page
Various tools used by Gramene in annotation of rice gene products
ftp://www.gramene.org/pub/gramene/protein/feature/Oryza_TMHMM_result.txt
Pfam members in RPD
Prosite members in RPD
• Annotate rice gene function using the Gene Ontology (GO) system
• Provide literature citations as evidence for assertion and classify them using the evidence codes
Rice Functional Information
Gene Ontology is a controlled vocabulary to define the following concepts for a gene product:
Molecular function: GO term(s) defining the molecular function of gene product
Biological process: GO term(s) defining the biological process
Cellular component: GO term(s) identifying the localization of the protein in a cell
After identifying a number of features, the curator proceeds to annotate the gene product(s) in the Rice Protein Database.
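As a rough illustration of what one curated annotation carries, the sketch below models a GO association as a record with a protein, one of the three GO aspects, a term, an evidence code and a reference. The field names, the protein name and the placeholder term and reference strings are all hypothetical; this is not the RPD schema.

```python
from dataclasses import dataclass

# The three GO concepts named on the slide.
ASPECTS = ("molecular_function", "biological_process", "cellular_component")


@dataclass(frozen=True)
class GoAssociation:
    protein: str        # protein entry name (hypothetical in the demo)
    aspect: str         # one of ASPECTS
    go_term: str        # GO term identifier (placeholder in the demo)
    evidence_code: str  # e.g. IDA, IEA, TAS
    reference: str      # literature citation supporting the assertion

    def __post_init__(self):
        # Reject annotations filed under an unknown aspect.
        if self.aspect not in ASPECTS:
            raise ValueError(f"unknown GO aspect: {self.aspect}")


def group_by_aspect(associations):
    """Group associations under the three GO aspects, keeping empty ones."""
    grouped = {aspect: [] for aspect in ASPECTS}
    for assoc in associations:
        grouped[assoc.aspect].append(assoc)
    return grouped


records = [
    GoAssociation("EXAMPLE_ORYSA", "molecular_function",
                  "GO:placeholder-1", "IDA", "reference placeholder 1"),
    GoAssociation("EXAMPLE_ORYSA", "cellular_component",
                  "GO:placeholder-2", "IEA", "reference placeholder 2"),
]
grouped = group_by_aspect(records)
print({aspect: len(items) for aspect, items in grouped.items()})
```

Keeping the evidence code and the reference on every record is what lets each assertion be traced back to its source.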
Gene Ontology (GO) Associations
IDA inferred from direct assay
Enzyme assays / in vitro reconstitution / immunofluorescence / cell fractionation / binding assay
IEA inferred from electronic annotation
Feature search / Interpro / Pfam / Prosite / annotations from database records
IEP inferred from expression pattern
Northerns / microarray data / western blots
IMP inferred from mutant phenotype
Gene mutation / deletion or disruption / over expression / ectopic expression / anti-sense experiments / RNAi experiments / specific protein inhibitors
NR not recorded
Very old annotation
IGI inferred from genetic interaction
Suppressor screens / synthetic lethal / functional complementation / rescue experiments
IPI inferred from physical interaction
2-hybrid interactions / 3-hybrid interactions / co-purification / co-immunoprecipitation / affinity interaction
ISS inferred from sequence or structural similarity
Sequence similarity / recognized domains / structural similarity / Southern blotting
NAS non-traceable author statement
No citation / non-traceable by curator
TAS traceable author statement
Review article / text book / dictionary / website / database
A complete list is available at http://www.gramene.org/plant_ontology/evidence_codes.html
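The evidence codes listed above can be held in a small lookup table. The sketch below uses only the codes and names from this slide and splits a set of annotations into IEA and non-IEA, the same division used in the RPD statistics; the function names and the sample list of codes are hypothetical.

```python
from collections import Counter

# Codes and names as listed on the slide.
EVIDENCE_CODES = {
    "IDA": "inferred from direct assay",
    "IEA": "inferred from electronic annotation",
    "IEP": "inferred from expression pattern",
    "IMP": "inferred from mutant phenotype",
    "NR": "not recorded",
    "IGI": "inferred from genetic interaction",
    "IPI": "inferred from physical interaction",
    "ISS": "inferred from sequence or structural similarity",
    "NAS": "non-traceable author statement",
    "TAS": "traceable author statement",
}


def describe(code):
    """Return the full name of an evidence code; KeyError if unknown."""
    return EVIDENCE_CODES[code.upper()]


def split_iea(codes):
    """Count electronic (IEA) versus all other annotations."""
    counts = Counter({"IEA": 0, "non-IEA": 0})
    for code in codes:
        describe(code)  # validates the code
        counts["IEA" if code.upper() == "IEA" else "non-IEA"] += 1
    return dict(counts)


sample = ["IEA", "IEA", "IDA", "TAS", "IEA", "IMP"]  # hypothetical annotations
print(describe("ISS"))
print(split_iea(sample))
```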
EVIDENCE CODES APPLIED IN RICE PROTEIN DATABASE
The association of protein 1433_ORYSA with the GO term
Gene Ontology (GO) Associations
Protein page
Gramene Ontology Database
The association of protein 1433_ORYSA with literature citation (EVIDENCE for molecular function)
Gene Ontology (GO) Associations
Gramene Literature Database
Protein page
The association of protein 1433_ORYSA with the Literature citation and EVIDENCE CODES
Gene Ontology (GO) Associations
Protein page
Total number of associations: 9866 (3321 gene products associated with 781 GO terms)
• Biological Process: 242 terms, 2881 associations
• Molecular Function: 449 terms, 5599 associations
• Cellular Component: 90 terms, 1386 associations
Total number of proteins: 8985
• Number of proteins from SWISSPROT: 397
• Number of proteins from TrEMBL: 8588
Total number of evidences: 21170
• Total number of IEA evidences: 20593
• Total number of non-IEA evidences: 577
• Total number of references as evidences: 74
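A short check, using only the figures on this slide, confirms that the per-aspect counts add up to the stated totals and shows how heavily the evidence leans on electronic annotation. The variable names are illustrative.

```python
# (terms, associations) per GO aspect, as given on the slide.
aspects = {
    "Biological Process": (242, 2881),
    "Molecular Function": (449, 5599),
    "Cellular Component": (90, 1386),
}

total_terms = sum(terms for terms, _ in aspects.values())
total_associations = sum(assocs for _, assocs in aspects.values())

proteins = {"SWISSPROT": 397, "TrEMBL": 8588}
total_proteins = sum(proteins.values())

iea, non_iea = 20593, 577
total_evidences = iea + non_iea
iea_share = 100.0 * iea / total_evidences

print(total_terms, total_associations)  # 781 9866
print(total_proteins, total_evidences)  # 8985 21170
print(f"IEA share of evidences: {iea_share:.1f}%")  # 97.3%
```

Roughly 97% of the evidence is electronic, so the 577 non-IEA evidences are the manually curated core of the database.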
[Pie chart: Biological process. Categories: electron transport, coenzyme metabolism, energy pathway, nucleic acid metabolism, phosphate metabolism, protein metabolism, carbohydrate metabolism, amino acid metabolism, catabolism, biosynthesis, stress related, transport, cell organization and biogenesis, cell cycle, oxygen and radical metabolism, cell communication.]
[Pie chart: Molecular function. Categories: signal transduction, enzyme regulator, carrier proteins, transporters, transcription regulator, storage protein, structural protein, defense/immunity, enzymes, nucleic acid binding.]
Rice Protein Database (RPD) statistics-1
GO mappings are based on Interpro-EBI and Gramene curation
Total number of proteins in RPD: 8985
Number of proteins from SWISS-PROT: 397
Number of proteins from TrEMBL: 8588
Total number of correspondences between proteins and translations: 7960 (6912 proteins correspond to 7957 translations)
Proteins with only one corresponding translation: 5911
Proteins with two corresponding translations: 959
Proteins with three corresponding translations: 37
Proteins with four corresponding translations: 5
Gene products associated with 781 GO terms: 3321 (refer to previous slide)
Number of Pfam entries: 874
Total number of proteins that have mappings to Pfam: 3663
Number of Prosite entries: 556
Total number of proteins that have mappings to Prosite: 3201
Total number of proteins that have mappings to trans-membrane features: 1583
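The counts on this slide can be cross-checked and turned into coverage figures with a few lines of arithmetic. The sketch below uses only the numbers reported here; the variable names are illustrative.

```python
total_proteins = 8985

# Number of proteins having 1, 2, 3 or 4 corresponding translations.
translations_per_protein = {1: 5911, 2: 959, 3: 37, 4: 5}

proteins_with_translation = sum(translations_per_protein.values())
correspondences = sum(n * count for n, count in translations_per_protein.items())

# Proteins carrying each kind of annotation or feature.
mapped = {
    "GO terms": 3321,
    "Pfam": 3663,
    "Prosite": 3201,
    "trans-membrane": 1583,
}
coverage = {name: 100.0 * count / total_proteins for name, count in mapped.items()}

print(proteins_with_translation, correspondences)  # 6912 7960
for name, share in coverage.items():
    print(f"{name}: {share:.1f}% of RPD proteins")
```

The breakdown by number of translations reproduces both the 6912 proteins and the 7960 correspondences stated above.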
Rice Protein Database (RPD) statistics-2
Trait Ontology (TO) to describe mutants/phenotypes in rice
www.plantontology.org
PLANT ONTOLOGY resources will be available soon
Future plans
• Continue annotation of rice proteins
• Identify the resources and tools needed to improve annotation of rice proteins, using HMMs, structure predictions and other tools.
• Develop tools to simplify the process of gene mining using Gramene and other databases by building combination search tools using controlled vocabulary and feature tables.
• Start building a resource for a protein interaction map of the complete rice genome, based on association in a biochemical pathway, assembly in a functional complex / interacting partners, proximity on the genome, and common regulation mechanisms (a possible collaboration).
• Contribute / share the controlled vocabulary for monocots with other databases
• Develop the necessary tools and host the resource pages for the Plant Ontology Consortium
• Collaborate with the Gene Ontology Consortium on various aspects of ontology development and curation
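The planned combination search over controlled vocabulary and feature tables could, in its simplest form, be an intersection of protein sets. The sketch below is one possible shape for such a tool, not a description of Gramene's implementation; the index layout, the function name and every identifier in the demo are hypothetical.

```python
def combination_search(index, required):
    """Return proteins annotated with every (vocabulary, value) pair in `required`."""
    sets = [index.get(key, set()) for key in required]
    if not sets:
        return set()
    return set.intersection(*sets)


# Hypothetical index: (vocabulary, value) -> set of protein identifiers.
index = {
    ("GO", "example function term"): {"PROT_A", "PROT_B", "PROT_C"},
    ("Pfam", "example domain"): {"PROT_B", "PROT_C"},
    ("feature", "trans-membrane"): {"PROT_C", "PROT_D"},
}

query = [
    ("GO", "example function term"),
    ("Pfam", "example domain"),
    ("feature", "trans-membrane"),
]
print(combination_search(index, query))
```

A criterion with no annotated proteins yields an empty result, which matches the intent of a search where all conditions must hold.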