Database talk for Bits & Bites meeting

Jill Wegrzyn, Department of Plant Sciences, University of California at Davis

Upload: keith-bradnam

Post on 24-Jan-2015


Page 1: Database talk for Bits & Bites meeting

Database talk for Bits & Bites meeting

Jill Wegrzyn

Department of Plant Sciences

University of California at Davis

Page 2: Database talk for Bits & Bites meeting

Forest Genomics (Conifers)

• Phylogenetic Representation
– None currently exists. The conifers (gymnosperms) are the oldest of the major plant clades, arising some 300 million years ago. They are key to our understanding of the origins of genetic diversity in higher plants.

• Ecological Representation
– Conifers are of immense ecological importance, comprising the dominant life forms in most of the temperate and boreal ecosystems in the Northern Hemisphere.

• Fundamental Genetic Information
– Reference sequences are the fundamental data necessary to understand conifer biology and aid in guiding management of genetic resources.

• Development of Genomic Technologies
– The analytical and computational challenge of building a reference sequence for such large genomes will drive development of tools, strategies, and human resources throughout the genomics community.

Page 3: Database talk for Bits & Bites meeting

Existing and Planned Angiosperm Tree Genome Sequences

Species                   Common name        Genome size (1)  Number of genes (2)  Status (3)

In Progress With Draft Assemblies
Populus trichocarpa       Black Cottonwood   500 Mbp          ~40,000              2.0 / 2.2
Eucalyptus grandis        Rose Gum           691 Mbp          ~36,000              1.0 / 1.1
Malus domestica           Apple              881 Mbp          ~26,000              1.0 / 1.0
Prunus persica            Peach              227 Mbp          ~28,000              1.0 / 1.0
Citrus sinensis           Sweet Orange       319 Mbp          ~25,000              1.0 / 1.0
Carica papaya             Papaya             372 Mbp          -
Amborella trichopoda      Amborella          870 Mbp          -

In Progress Or Planned – No Published Assemblies
Castanea mollissima       Chinese Chestnut   800 Mbp          -
Salix purpurea            Purple Willow      327 Mbp          -
Quercus robur             Pedunculate Oak    740 Mbp          -
Populus spp and ecotypes  Various            various          -
Azadirachta indica        Neem               384 Mbp          -

1) Genome size: Approximate total size, not completely assembled.
2) Number of Genes: Approximate number of loci containing protein coding sequence.
3) Status: Assembly / Annotation versions; http://www.phytozome.net/ ; http://asgpb.mhpcc.hawaii.edu/papaya/ ; http://www.amborella.org ; purple willow: http://www.poplar.ca/pdf/edomonton11smart.pdf ; neem: http://www.strandls.com/viewnews.php?param=5&param1=68

Page 4: Database talk for Bits & Bites meeting

Plant Genome Size Comparisons

[Figure: 1C DNA content (Mb) for Arabidopsis, Oryza, Populus, Sorghum, Glycine, Zea, Pseudotsuga menziesii, Taxodium distichum, Picea abies, Picea glauca, Pinus taeda, and Pinus pinaster. Axis scales: 0 to 40,000 and 0 to 3,000. Additional labels: Pinus lambertiana, P. menziesii.]

Page 5: Database talk for Bits & Bites meeting

What can be discovered about a gene by a database search?

• Best to have specific informational goals:
– Evolutionary information: homologous genes, taxonomic distributions, allele frequencies, synteny, etc.
– Genomic information: chromosomal location, introns, UTRs, regulatory regions, shared domains, etc.
– Structural information: associated protein structures, fold types, structural domains
– Expression information: expression specific to particular tissues, developmental stages, phenotypes, diseases, etc.
– Functional information: enzymatic/molecular function, pathway/cellular role, localization, role in diseases

Page 6: Database talk for Bits & Bites meeting

Using a database

• How to get information out of a database:
– Summaries: how many entries, average or extreme values, rates of change, most recent entries, etc.
– Browsing: getting a sense of the kind and quality of information available, e.g. checking familiar records
– Search: looking for specific, predefined information

• “Key” to searching a database:
– Must somehow identify the element(s) of the database that are of interest:
  • Gene name, symbol, location, or other identifying information
  • Sequences of genes, mRNAs, proteins, etc.
  • A cross-reference from another database, or a database-generated ID

Page 7: Database talk for Bits & Bites meeting

NCBI and Entrez

• One of the most useful and comprehensive database collections is the NCBI, part of the National Library of Medicine.
– Home to GenBank, PubMed & many other familiar DBs.

• NCBI provides interesting summaries, browsers, and search tools

• Entrez is their database search interface: http://www.ncbi.nlm.nih.gov/Entrez

• Can search on gene names, chromosomal location, diseases, articles, keywords...
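Entrez searches can also be scripted. The sketch below only builds a search URL for NCBI's E-utilities `esearch` endpoint and does not contact the network. The slides mention the Entrez web interface only, so the choice of E-utilities, the helper name `build_esearch_url`, and the example query are assumptions made for illustration.

```python
from urllib.parse import urlencode, urlparse, parse_qs

# Base URL of the E-utilities search endpoint (not mentioned on the slide).
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(db, term, retmax=20):
    """Return an esearch URL for an Entrez database and a query term."""
    params = {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    return EUTILS_ESEARCH + "?" + urlencode(params)


# Illustrative query: a keyword restricted to one organism.
url = build_esearch_url("gene", "cellulose synthase AND Populus trichocarpa[Organism]")
query = parse_qs(urlparse(url).query)
print(url)
```

Building the URL separately from fetching it keeps the query easy to inspect before any request is sent.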

Page 8: Database talk for Bits & Bites meeting

Types of Databases

• Primary Databases
– Original submissions by experimentalists
– Content controlled by the submitter
– Examples: GenBank (nr and nt), SNP, GEO

• Derivative Databases
– Built from primary data
– Content controlled by third party (NCBI)
– Examples: RefSeq, Plant Protein, RefSNP, UniGene, NCBI Protein, Structure, Conserved Domain

Page 9: Database talk for Bits & Bites meeting
Page 10: Database talk for Bits & Bites meeting

NCBI is not all there is...

• Links to non-NCBI databases (see also “Link Out”)
– Reactome for pathways (also KEGG)
– HGNC for nomenclature
– HPRD protein information
– Regulatory / binding site DBs (e.g. CREB; some not linked)
– iHOP (information hyperlinked over proteins)

• Other important gene/protein resources:
– UniProt (most carefully annotated)
– PDB (main macromolecular structure repository)
– UCSC (best genome viewer & many useful ‘tracks’)
– DIP / MINT (protein-protein interactions)
– More: InterPro, MetaCyc, Enzyme, etc.
– Species Databases: TAIR, Gramene, MGI, WormBase, FlyBase, GDR, TreeGenes

• Alternatives
– SRA versus DNANexus

Page 11: Database talk for Bits & Bites meeting

Flat Files

Characteristics:
• Data is stored as records in regular files
• Records usually have a simple structure and fixed number of fields
• For fast access may support indexing of fields in the records
• No mechanisms for relating data between files
• One needs special programs in order to access and manipulate the data
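A minimal sketch of the "special program" a flat file needs: fixed-field, tab-delimited records plus a hand-built index from accession to byte offset. The three-field layout and the function names are assumptions; the two records reuse values from the Organism table later in the talk.

```python
import io

# In-memory stand-in for a flat file: accession, organism name, genome size.
FLAT_FILE = (
    b"NC_000913\tEscherichia coli K12\t4640000\n"
    b"NC_003098\tStreptococcus pneumoniae R6\t2040000\n"
)


def build_index(handle):
    """Scan the file once and map each accession to its record's byte offset."""
    index = {}
    while True:
        offset = handle.tell()
        line = handle.readline()
        if not line:
            break
        accession = line.split(b"\t", 1)[0].decode()
        index[accession] = offset
    return index


def fetch(handle, index, accession):
    """Jump straight to one record using the index and parse its fields."""
    handle.seek(index[accession])
    fields = handle.readline().decode().rstrip("\n").split("\t")
    return {"accession": fields[0], "name": fields[1], "size": int(fields[2])}


handle = io.BytesIO(FLAT_FILE)
index = build_index(handle)
record = fetch(handle, index, "NC_003098")
print(record)
```

Nothing here relates one file to another; any such link would have to be coded by hand, which is the gap a relational database fills.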

Page 12: Database talk for Bits & Bites meeting

Limitations of Flat Files

• Most applications require that specific information can be quickly and efficiently retrieved

• Often critical that performance does not degrade as more entities are added

• Flat text files don’t always fulfill these requirements, especially when there are many entities and/or relationships

Page 13: Database talk for Bits & Bites meeting

Relational Database

Characteristics:
• Data is organized into tables: rows & columns
• Each row represents an instance of an entity
• Each column represents an attribute of an entity
• Metadata describes each table column
• Relationships between entities are represented by values stored in the columns of the corresponding tables (keys)
• Accessible through Structured Query Language (SQL)

Page 14: Database talk for Bits & Bites meeting

Metadata & Data Table

Metadata

Name       Type          Max Length  Description
Name       Alphanumeric  100         Organism name
Size       Integer       10          Genome length (bases)
Gc         Float         5           Percent GC
Accession  Alphanumeric  10          Accession number
Release    Date          8           Release date
Center     Alphanumeric  100         Genome center name
Sequence   Alphanumeric  Variable    Sequence

Data table: Organism

Name                         Size       Gc  Accession  Release     Center                 Sequence
Escherichia coli K12         4,640,000  50  NC_000913  09/05/1997  Univ. Wisconsin        AGCTTTTCATT…
Streptococcus pneumoniae R6  2,040,000  40  NC_003098  09/07/2001  Eli Lilly and Company  TTGAAAGAAAA…
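The metadata and data tables above could be expressed roughly as follows. This sketch uses Python's built-in `sqlite3` module rather than any DBMS named in the talk. The SQLite column types are an assumed mapping from the slide's types, the slide's maximum lengths are not enforced by SQLite, the release dates are stored as the slide prints them, and the sequences stay truncated as on the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Column definitions play the role of the metadata table.
conn.execute(
    """
    CREATE TABLE Organism (
        Name      TEXT NOT NULL,     -- Organism name
        Size      INTEGER,           -- Genome length (bases)
        Gc        REAL,              -- Percent GC
        Accession TEXT PRIMARY KEY,  -- Accession number
        Release   TEXT,              -- Release date, kept as printed on the slide
        Center    TEXT,              -- Genome center name
        Sequence  TEXT               -- Sequence (truncated on the slide)
    )
    """
)

rows = [
    ("Escherichia coli K12", 4640000, 50.0, "NC_000913",
     "09/05/1997", "Univ. Wisconsin", "AGCTTTTCATT…"),
    ("Streptococcus pneumoniae R6", 2040000, 40.0, "NC_003098",
     "09/07/2001", "Eli Lilly and Company", "TTGAAAGAAAA…"),
]
conn.executemany("INSERT INTO Organism VALUES (?, ?, ?, ?, ?, ?, ?)", rows)

# Read the metadata back: PRAGMA table_info yields (cid, name, type, ...).
columns = [(name, ctype) for _cid, name, ctype, *_rest in
           conn.execute("PRAGMA table_info(Organism)")]
count = conn.execute("SELECT COUNT(*) FROM Organism").fetchone()[0]
print(columns, count)
```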

Page 15: Database talk for Bits & Bites meeting

Relationships

• Used to connect tables
• Field(s) that have the same value in the related tables
• Organism.Accession = Gene.OAccession
• Organism.Accession
– Unique
– Primary key
• Gene.OAccession
– Not unique
– Foreign key (secondary key)
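A possible illustration of the Organism.Accession = Gene.OAccession relationship, again using `sqlite3`. The Gene table's `GeneName` column, the placeholder gene names, and the accession `NC_999999` (deliberately absent from Organism) are hypothetical; only the two key columns come from the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked
conn.executescript(
    """
    CREATE TABLE Organism (
        Accession TEXT PRIMARY KEY,   -- unique: one row per organism
        Name      TEXT NOT NULL
    );
    CREATE TABLE Gene (
        GeneName   TEXT NOT NULL,     -- hypothetical column
        OAccession TEXT NOT NULL REFERENCES Organism(Accession)  -- not unique
    );
    INSERT INTO Organism VALUES
        ('NC_000913', 'Escherichia coli K12'),
        ('NC_003098', 'Streptococcus pneumoniae R6');
    INSERT INTO Gene VALUES
        ('gene_1', 'NC_000913'),
        ('gene_2', 'NC_000913'),
        ('gene_3', 'NC_003098');
    """
)

# Join the tables on the shared key value and count genes per organism.
genes_per_organism = conn.execute(
    """
    SELECT o.Name, COUNT(*)
    FROM Organism AS o
    JOIN Gene AS g ON o.Accession = g.OAccession
    GROUP BY o.Accession, o.Name
    ORDER BY o.Accession
    """
).fetchall()

# A gene that points at an unknown organism is rejected by the key constraint.
try:
    conn.execute("INSERT INTO Gene VALUES ('gene_4', 'NC_999999')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

print(genes_per_organism, rejected)
```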

Page 16: Database talk for Bits & Bites meeting

Schema: Representation of Table Organization

Page 17: Database talk for Bits & Bites meeting

SQL

• ANSI (American National Standards Institute) standard computer language for accessing and manipulating database systems.

• SQL statements are used to retrieve and update data in a database.

• Includes:
– Data Manipulation Language (DML)
– Data Definition Language (DDL)
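To make the DDL/DML split concrete, here is a short sketch with `sqlite3`: DDL statements change the structure of the database, DML statements change or read the rows. The table is a cut-down version of the Organism example, and the particular statements were picked for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# DDL: define and alter the structure.
conn.execute("CREATE TABLE Organism (Accession TEXT PRIMARY KEY, Name TEXT, Size INTEGER)")
conn.execute("ALTER TABLE Organism ADD COLUMN Gc REAL")

# DML: insert, update, query, and delete rows.
conn.execute(
    "INSERT INTO Organism (Accession, Name, Size) VALUES (?, ?, ?)",
    ("NC_000913", "Escherichia coli K12", 4640000),
)
conn.execute("UPDATE Organism SET Gc = ? WHERE Accession = ?", (50.0, "NC_000913"))
row = conn.execute(
    "SELECT Name, Size, Gc FROM Organism WHERE Accession = ?", ("NC_000913",)
).fetchone()
conn.execute("DELETE FROM Organism WHERE Accession = ?", ("NC_000913",))
remaining = conn.execute("SELECT COUNT(*) FROM Organism").fetchone()[0]

print(row, remaining)
```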

Page 18: Database talk for Bits & Bites meeting

DBMS Advantages

• Program-data independence
• Minimal data redundancy
• Improved data consistency & quality
– Access control
– Transaction control
• Improved accessibility & data sharing
• Increased productivity of application development
• Enforced standards

Page 19: Database talk for Bits & Bites meeting

DBMS

• Software package for defining and managing a database.

• Examples:
– Proprietary: MS Access, MS SQL Server, DB2, Oracle, Sybase
– Open source: MySQL, PostgreSQL

Page 20: Database talk for Bits & Bites meeting

http://dendrome.ucdavis.edu

Page 21: Database talk for Bits & Bites meeting

TreeGenes Database

Encompasses Dendrome Resources, DendromePlone, TreeGenes Database & DiversiTree

• Nine modules to store and interrelate data for query and analysis in PostgreSQL
• Direct resource for nearly 2,500 forest geneticists representing 800 organizations worldwide. Over 6,000 unique visitors in December 2011.
• Forest Geneticists Colleague module
• Literature module
• Transcriptome annotation pipeline and module
• Comparative map module
• Species module
• Sequencing module
• Primers module
• Genotype/EST module
• Phenotype/Expression module
• Sample tracking module

Page 22: Database talk for Bits & Bites meeting

Genomic Resources

678 Species Representing 77 Genera

Page 23: Database talk for Bits & Bites meeting

Generic Model Organism Database

Page 24: Database talk for Bits & Bites meeting

CMAP: Obtaining TreeGenes (TG) Accession Number

Add literature data and (first) map file

(optional) Add additional map files

Obtain TG Accession number!

Page 25: Database talk for Bits & Bites meeting

Individual features and their locations on map

List of features on map

Page 26: Database talk for Bits & Bites meeting

GMOD Genome Browser

Tracks can be reordered or hidden as necessary

Search and Select data source

Page 27: Database talk for Bits & Bites meeting

Douglas-fir Transcriptome Resources in TreeGenes

Page 28: Database talk for Bits & Bites meeting

Gene Ontology

• Gene annotation system

• Controlled vocabulary that can be applied to all organisms (protein/RNA)

• Used to describe gene products

Page 29: Database talk for Bits & Bites meeting

[Figure: three images, each labeled "bud initiation", one each for Metazoa, Saccharomyces, and Viridiplantae]

Page 30: Database talk for Bits & Bites meeting

What’s in a name?

• The same name can be used to describe different concepts

Page 31: Database talk for Bits & Bites meeting

What’s in a name?

• Glucose synthesis
• Glucose biosynthesis
• Glucose formation
• Glucose anabolism
• Gluconeogenesis

• All refer to the process of making glucose from simpler components

Page 32: Database talk for Bits & Bites meeting

How does GO work?

What information might we want to capture about a gene product?

• What does the gene product do?
• Why does it perform these activities?
• Where does it act?

Page 33: Database talk for Bits & Bites meeting

The 3 Gene Ontologies

• Molecular Function = elemental activity/task
– the tasks performed by individual gene products; examples are carbohydrate binding and ATPase activity

• Biological Process = biological goal or objective
– broad biological goals, such as mitosis or purine metabolism, that are accomplished by ordered assemblies of molecular functions

• Cellular Component = location or complex
– subcellular structures, locations, and macromolecular complexes; examples include nucleus, telomere, and RNA polymerase II holoenzyme

Page 34: Database talk for Bits & Bites meeting

Ontology Structure

Ontologies can be represented as graphs, where the nodes are connected by edges

Nodes = concepts in the ontology

Edges = relationships between the concepts

[Figure: example graph with labeled nodes and an edge]

Page 35: Database talk for Bits & Bites meeting

Ontology Structure

• The Gene Ontology is structured as a hierarchical directed acyclic graph (DAG)

• Terms can have more than one parent and zero, one or more children

• Terms are linked by two relationships
– is-a
– part-of

Page 36: Database talk for Bits & Bites meeting

True Path Rule

• The path from a child term all the way up to its top-level parent(s) must always be true

[Figure: example graph linking the terms cell, cytoplasm, nucleus, chromosome, nuclear chromosome, cytoplasmic chromosome, and mitochondrial chromosome. Legend: is-a, part-of.]
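The DAG structure and the true path rule can be sketched together: each term may have several parents, and a gene annotated to a term is implicitly annotated to every ancestor of that term. The edge list below is an assumed reading of the figure's terms, not a copy of the real Gene Ontology, and `gene_x` is a hypothetical gene.

```python
from collections import defaultdict

# (child, relationship, parent) edges; illustrative only.
EDGES = [
    ("nucleus", "part-of", "cell"),
    ("cytoplasm", "part-of", "cell"),
    ("nuclear chromosome", "is-a", "chromosome"),
    ("nuclear chromosome", "part-of", "nucleus"),
    ("cytoplasmic chromosome", "is-a", "chromosome"),
    ("cytoplasmic chromosome", "part-of", "cytoplasm"),
    ("mitochondrial chromosome", "is-a", "cytoplasmic chromosome"),
]

# A term can have more than one parent, so parents are stored as a set.
PARENTS = defaultdict(set)
for child, _relationship, parent in EDGES:
    PARENTS[child].add(parent)


def ancestors(term):
    """Return every term reachable by following parent edges upward."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in PARENTS.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def propagate(annotations):
    """Apply the true path rule: add all ancestors of each annotated term."""
    return {
        gene: set(terms) | {a for term in terms for a in ancestors(term)}
        for gene, terms in annotations.items()
    }


full = propagate({"gene_x": {"mitochondrial chromosome"}})
print(sorted(full["gene_x"]))
```

If any of the implied ancestor annotations would be false for a gene product, the ontology path itself is wrong, which is the check the true path rule provides.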

Page 37: Database talk for Bits & Bites meeting

What’s in a GO term?

term: gluconeogenesis

id: GO:0006094

definition: The formation of glucose from noncarbohydrate precursors, such as pyruvate, amino acids and glycerol.
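One way to picture a GO term in code is a small record with an id, a name, and a definition, plus a lookup that resolves alternative names to the same term. The `GOTerm` class and `build_lookup` helper are hypothetical, and treating the names from the earlier "What's in a name?" slide as synonyms of this term is an assumption made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GOTerm:
    id: str
    name: str
    definition: str
    synonyms: tuple = ()


GLUCONEOGENESIS = GOTerm(
    id="GO:0006094",
    name="gluconeogenesis",
    definition=(
        "The formation of glucose from noncarbohydrate precursors, "
        "such as pyruvate, amino acids and glycerol."
    ),
    # Alternative names taken from the earlier slide (assumed synonyms).
    synonyms=(
        "glucose synthesis",
        "glucose biosynthesis",
        "glucose formation",
        "glucose anabolism",
    ),
)


def build_lookup(terms):
    """Map every name and synonym (lowercased) to its term."""
    lookup = {}
    for term in terms:
        for label in (term.name, *term.synonyms):
            lookup[label.lower()] = term
    return lookup


lookup = build_lookup([GLUCONEOGENESIS])
print(lookup["glucose anabolism"].id)
```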

Page 38: Database talk for Bits & Bites meeting

Source of Ontology Assignments

IEA  Inferred from Electronic Annotation
ISS  Inferred from Sequence Similarity
IEP  Inferred from Expression Pattern
IMP  Inferred from Mutant Phenotype
IGI  Inferred from Genetic Interaction
IPI  Inferred from Physical Interaction
IDA  Inferred from Direct Assay
RCA  Inferred from Reviewed Computational Analysis
TAS  Traceable Author Statement
NAS  Non-traceable Author Statement
IC   Inferred by Curator
ND   No biological Data available
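Evidence codes are often used to filter annotations, for example to set aside purely electronic ones. The sketch below assumes annotations are simple (gene, GO id, evidence code) tuples; the gene names and the helper `filter_by_evidence` are hypothetical.

```python
EVIDENCE_CODES = {
    "IEA": "Inferred from Electronic Annotation",
    "ISS": "Inferred from Sequence Similarity",
    "IEP": "Inferred from Expression Pattern",
    "IMP": "Inferred from Mutant Phenotype",
    "IGI": "Inferred from Genetic Interaction",
    "IPI": "Inferred from Physical Interaction",
    "IDA": "Inferred from Direct Assay",
    "RCA": "Inferred from Reviewed Computational Analysis",
    "TAS": "Traceable Author Statement",
    "NAS": "Non-traceable Author Statement",
    "IC": "Inferred by Curator",
    "ND": "No biological Data available",
}

# Hypothetical annotations: (gene, GO id, evidence code).
ANNOTATIONS = [
    ("gene_1", "GO:0006094", "IDA"),
    ("gene_2", "GO:0006094", "IEA"),
    ("gene_3", "GO:0006094", "ISS"),
]


def filter_by_evidence(annotations, exclude=frozenset({"IEA"})):
    """Drop annotations whose evidence code is in `exclude`."""
    unknown = {code for _gene, _term, code in annotations} - set(EVIDENCE_CODES)
    if unknown:
        raise ValueError(f"unknown evidence codes: {sorted(unknown)}")
    return [a for a in annotations if a[2] not in exclude]


kept = filter_by_evidence(ANNOTATIONS)
print(kept)
```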

Page 39: Database talk for Bits & Bites meeting

Ontology Development

Plant Ontology and Trait Ontology

• Plant Ontology
– Structure
  • Needle, Cambium
– Growth stages

• Trait Ontology
– Forest Tree Specific Phenotypes
  • Wood Density

• PATO
– Phenotypic Qualities

Page 40: Database talk for Bits & Bites meeting

Current Ontology Listings: OBO Foundry

Page 41: Database talk for Bits & Bites meeting

Interwebs 101

• Web 1.0 – Hyperlinks
• Web 2.0 – Interactivity, information sharing, user-centered design (wikis, blogs, social media)
• Web 3.0 – Semantic Web
– Data focused
– Answer the limitations of HTML
– HTML describes documents and the links between them. RDF, OWL, and XML, by contrast, can describe specific things
– Machine-readable data and relationships between the data – knowledge processing – deductive reasoning and inference
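The idea of machine-readable relationships and inference can be shown with plain subject-predicate-object triples, the data model behind RDF. This sketch uses Python tuples rather than an RDF library; the predicate names `is_a` and `has_common_name` and both helper functions are assumptions for illustration.

```python
# Each fact is a (subject, predicate, object) triple.
TRIPLES = {
    ("Pinus taeda", "is_a", "Pinus"),
    ("Pinus", "is_a", "conifer"),
    ("conifer", "is_a", "gymnosperm"),
    ("Pinus taeda", "has_common_name", "loblolly pine"),
}


def infer_transitive(triples, predicate):
    """Add implied triples: if A p B and B p C, then A p C."""
    inferred = set(triples)
    changed = True
    while changed:
        changed = False
        links = [(s, o) for s, p, o in inferred if p == predicate]
        for s, o in links:
            for s2, o2 in links:
                if o == s2 and (s, predicate, o2) not in inferred:
                    inferred.add((s, predicate, o2))
                    changed = True
    return inferred


def match(triples, s=None, p=None, o=None):
    """Return triples matching a pattern; None acts as a wildcard."""
    return sorted(
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


inferred = infer_transitive(TRIPLES, "is_a")
print(match(inferred, s="Pinus taeda", p="is_a"))
```

The fact that Pinus taeda is a gymnosperm is never stated directly; it is deduced from the chain of stated triples.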

Page 42: Database talk for Bits & Bites meeting

Web Services Development

Communication within TreeGenes

• Development of Web Services in cooperation with NSF’s iPlant Cyberinfrastructure Project
– Software system to support interoperable machine-to-machine interaction over a network regardless of platform incompatibilities
– Web Services Description Language (WSDL) is implemented to relate operations

• Service Oriented Architecture (SOA): With SOAP, the basic unit of communication is a message.

• Remote Procedure Call (RPC): RPC Web services define a call interface in which the basic unit is the WSDL operation.

• Representational State Transfer (REST): REST uses HTTP by constraining the interface to standard operations (like GET, POST, PUT, DELETE for HTTP). The focus is on interacting with stateful resources, rather than messages or operations.
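The REST constraint can be pictured without any network code: a fixed set of methods applied to resources named by paths. The sketch below is an in-memory stand-in, not the TreeGenes or iPlant services; the `handle` function, the `/samples/1` path, and the stored body are hypothetical, and the status codes follow common HTTP usage.

```python
# In-memory resource store: path -> representation.
RESOURCES = {}


def handle(method, path, body=None):
    """Apply one of the standard operations to the resource at `path`."""
    if method == "GET":
        if path in RESOURCES:
            return 200, RESOURCES[path]
        return 404, None
    if method == "PUT":
        created = path not in RESOURCES
        RESOURCES[path] = body
        return (201 if created else 200), body
    if method == "DELETE":
        if path not in RESOURCES:
            return 404, None
        del RESOURCES[path]
        return 204, None
    return 405, None  # method not allowed by this sketch


# The same small set of methods works for every resource.
print(handle("PUT", "/samples/1", {"species": "Pinus taeda"}))
print(handle("GET", "/samples/1"))
```

An RPC-style design would instead expose one named operation per action; here the operations stay fixed and only the resource paths vary.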

Page 43: Database talk for Bits & Bites meeting

SSWAP Ontology

Creating and Contributing to Existing Servlets for Common Genomic Types

Page 44: Database talk for Bits & Bites meeting
Page 45: Database talk for Bits & Bites meeting

Forest Tree Genetic Stock Center

Page 46: Database talk for Bits & Bites meeting

Bulk Retrieval Window Components

Bulk Retrieval Window

Data & Annotation Selection Fields

Page 47: Database talk for Bits & Bites meeting

TreeGenes Sample Tracking System

Accurately track samples through collection, DNA extraction, and genotyping

Provide a standard and efficient method to collect and store phenotypic data

Provide a public interface to readily query raw genotype, phenotype, and association results (DiversiTree)

Provide interfaces and database backend to support a DNA distribution center (UCD)

Page 48: Database talk for Bits & Bites meeting

Population Genetics

Association Studies, Landscape Genomics

• Currently no other repositories to target association data with geo-referenced data
• dbGaP
• Dryad

• Starting with enforcement at the journal level: Tree Genetics and Genomes

Page 49: Database talk for Bits & Bites meeting
Page 50: Database talk for Bits & Bites meeting

GenSAS development with Content Management

Plone and Drupal

[Figure labels: login/signup panel; data retrieval panel; tool selection panel; task queue panel; query sequence panel]

Page 51: Database talk for Bits & Bites meeting

GenSAS development

Multiple Gene Prediction Tracks

[Figure labels: evidence tracks; control track; function track; custom track; sequence track; overview track; message box]

Page 52: Database talk for Bits & Bites meeting

GenSAS integration with GBrowse

Prototyped with Peach Genome in GDR

Page 53: Database talk for Bits & Bites meeting

Analysis Resources

Custom Databases

Page 54: Database talk for Bits & Bites meeting

Integrating Tools into TreeGenes

Galaxy

Page 55: Database talk for Bits & Bites meeting

Genomic resources

Page 56: Database talk for Bits & Bites meeting

Fluxes of CO2 and H2O: FLUXNET and Ameriflux

Free Air CO2 Enrichment (FACE)

Page 57: Database talk for Bits & Bites meeting

TRY – Global Database of Plant Traits

• Scientists compiled three million traits for 69,000 out of the world's ~300,000 plant species.

• Worldwide collaboration of scientists from 106 research institutions

• TRY is hosted at the Max Planck Institute for Biogeochemistry in Jena (Germany)
– Jointly coordinated with:
  • University of Leipzig (Germany)
  • IMBIV-CONICET (Argentina)
  • Macquarie University (Australia)
  • CNRS and University of Paris-Sud (France)

Page 58: Database talk for Bits & Bites meeting
Page 59: Database talk for Bits & Bites meeting