bits: overview of important biological databases beyond sequences

Basic bioinformatics concepts, databases and tools Module 4 Beyond the sequences Dr. Joachim Jacob http://www.bits.vib.be Updated Nov 2011 http://dl.dropbox.com/u/18352887/BITS_training_material/Link%20to%20mod4-intro_H1_2011_otherRelevantData.pdf

Upload: bits

Post on 18-Nov-2014


DESCRIPTION

Module 4 Other relevant biological data sources beyond sequences Part of training session "Basic Bioinformatics concepts, databases and tools" - http://www.bits.vib.be/training

TRANSCRIPT

Page 1: BITS: Overview of important biological databases beyond sequences

Basic bioinformatics concepts, databases and tools

Module 4

Beyond the sequences

Dr. Joachim Jacob

http://www.bits.vib.be

Updated Nov 2011

http://dl.dropbox.com/u/18352887/BITS_training_material/Link%20to%20mod4-intro_H1_2011_otherRelevantData.pdf

Page 2: BITS: Overview of important biological databases beyond sequences

Module 4 broadens our view

Page 3: BITS: Overview of important biological databases beyond sequences

To understand life, we need not only sequences, but many other concepts

Bioinformatics also involves storing and analyzing:

− Gene information: variations, isoforms, ...

− Expression data

− 3D protein structure data

− Interaction data

− Pathways and networks

“Storing all relevant biological data”

Page 4: BITS: Overview of important biological databases beyond sequences

Schematic view II

GeneA sequence annotations – gene expr – pathway – struct,...

GeneB sequence annotations – gene expr – pathway – struct,...

GeneC sequence annotations – gene expr – pathway – struct,...

analysis

Primary database

Other sequence databases

results

Additional information sources

results

Page 5: BITS: Overview of important biological databases beyond sequences

The indispensable databases

Gene Ontology – structuring

KEGG – biochemical pathways

PDB – structure of proteins

IntAct – interaction data

dbSNP – database of genomic variation

Expression sources – microarray data

Page 6: BITS: Overview of important biological databases beyond sequences

Gene Ontology structures the way we communicate about life

http://www.geneontology.org/teaching_resources/tutorials/2005-09_BiB-journal-tutorial_jlomax.pdf

http://www.arabidopsis.org/help/tutorials/go1.jsp

Gene translation / Protein synthesis / Protein production

Page 7: BITS: Overview of important biological databases beyond sequences

Gene Ontology structures life

http://www.geneontology.org/

Agreement on standardized keywords (often referred to as 'controlled vocabularies'), describing all natural processes in a hierarchical way (ontology).

Keywords are assigned to genes based on different kinds of evidence

Keywords are ordered in a hierarchical, tree-like structure ('directed acyclic graphs')

Three GO 'trees' exist, describing:

"Biological Process"

"Cellular Component"

"Molecular Function"


Page 8: BITS: Overview of important biological databases beyond sequences

A gene can be given different GO terms

Example, cytochrome c:

molecular function: oxidoreductase activity,

biological process: oxidative phosphorylation and induction of cell death,

cellular component: mitochondrial matrix and mitochondrial inner membrane.

In each tree, the terms are organised in a directed acyclic graph: a network consisting of parents and child-terms (as nodes) and lines between them as relationships.
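The parent/child structure can be sketched in a few lines of Python. This is only an illustration: the term names and edges below are a simplified toy graph, not copied from the real ontology files, and the function name `ancestors` is made up for this example. It shows the point that makes GO a directed acyclic graph rather than a tree: a term can have more than one parent.

```python
# Toy fragment of a GO-like graph: child term -> list of parent terms.
# Names and edges are simplified for illustration only.
PARENTS = {
    "mitochondrial inner membrane": ["mitochondrial membrane",
                                     "organelle inner membrane"],
    "mitochondrial membrane": ["organelle membrane"],
    "organelle inner membrane": ["organelle membrane"],
    "organelle membrane": ["cellular component"],
    "cellular component": [],
}


def ancestors(term, parents=PARENTS):
    """Return every term reachable by following parent links upwards."""
    seen = set()
    stack = list(parents.get(term, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


print(sorted(ancestors("mitochondrial inner membrane")))
```

A gene annotated with a child term is implicitly annotated with all of that term's ancestors, which is why tools usually walk the graph upwards like this.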

Page 9: BITS: Overview of important biological databases beyond sequences
Page 10: BITS: Overview of important biological databases beyond sequences

Different evidence codes can assign a degree of confidence to the assignment

http://www.geneontology.org/GO.evidence.shtml

Evidence codes can be grouped by:

Experimental (e.g. IDA – inferred from direct assay)

Computational analysis

Author statement

Curator statement

Inferred from electronic annotation (IEA)

If available, each annotation also has a reference
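As a rough sketch of how evidence codes can be used to filter annotations by confidence: the grouping below follows the categories on this slide, the code lists are not exhaustive, and the annotation tuples (gene, term, code) are made-up examples rather than real database records.

```python
# Evidence codes grouped as on the slide. The lists are not exhaustive.
EVIDENCE_GROUPS = {
    "experimental": {"EXP", "IDA", "IPI", "IMP", "IGI", "IEP"},
    "computational analysis": {"ISS", "ISO", "ISA", "ISM", "IGC", "RCA"},
    "author statement": {"TAS", "NAS"},
    "curator statement": {"IC", "ND"},
    "electronic annotation": {"IEA"},
}


def evidence_group(code):
    """Return the group an evidence code belongs to, or 'unknown'."""
    for group, codes in EVIDENCE_GROUPS.items():
        if code.upper() in codes:
            return group
    return "unknown"


def filter_annotations(annotations, allowed_groups):
    """Keep (gene, term, code) tuples whose code is in an allowed group."""
    return [a for a in annotations if evidence_group(a[2]) in allowed_groups]


# Hypothetical annotations, for illustration only.
annotations = [
    ("geneA", "oxidoreductase activity", "IDA"),
    ("geneB", "oxidoreductase activity", "IEA"),
    ("geneC", "oxidoreductase activity", "TAS"),
]
print(filter_annotations(annotations, {"experimental"}))
```

Dropping IEA annotations is a common way to keep only manually reviewed assignments, at the cost of losing many genes.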

Page 11: BITS: Overview of important biological databases beyond sequences

Different evidence codes can assign a degree of confidence to the assignment

Page 12: BITS: Overview of important biological databases beyond sequences

Gene Ontology structures all genes according to their biological significance

The GO structure and the terms can be explored with a browser called AmiGO.

QuickGO from EBI offers some nice visualisations

Excellent GO-wiki for all your questions

Page 13: BITS: Overview of important biological databases beyond sequences

GO can be used to retrieve all gene (products) related to one specific term

You can search broadly: e.g. an AmiGO search for 'diabetes' leads to the following GO term

http://amigo.geneontology.org/

Page 14: BITS: Overview of important biological databases beyond sequences

GO can be used to retrieve all gene (products) related to one specific term

Amigo search for Diabetes

Page 15: BITS: Overview of important biological databases beyond sequences

GO can be used to retrieve all gene (products) related to one specific term

Amigo search for Diabetes

Page 16: BITS: Overview of important biological databases beyond sequences

GO is also useful to analyze and compare different gene lists

Many tools that work with GO are listed on the GO website.

http://www.geneontology.org/GO.tools.shtml
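One common way such tools analyze a gene list is an over-representation test: is a GO term attached to more genes in the list than expected by chance? The slide does not name a specific method, so the following is only a sketch of one standard option (a one-sided hypergeometric test) using the Python standard library. The function name and the numbers in the example are made up.

```python
from math import comb


def hypergeom_enrichment_p(population, annotated, sample, overlap):
    """P(X >= overlap) for a hypergeometric draw.

    population: number of genes in the background set
    annotated:  background genes carrying the GO term
    sample:     number of genes in the list under study
    overlap:    genes in the list carrying the GO term
    """
    upper = min(annotated, sample)
    total = comb(population, sample)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# Made-up example: 20000 background genes, 200 with the term,
# a list of 100 genes of which 8 carry the term.
print(hypergeom_enrichment_p(20000, 200, 100, 8))
```

In practice many terms are tested at once, so the resulting p-values need a multiple-testing correction before interpretation.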

Page 17: BITS: Overview of important biological databases beyond sequences

Some things to know about GO

For analyses, one can make use of reduced GO sets, the so-called GO slims

– GO slims are subsets of the biologically more relevant GO terms (available per species)

– GO ontologies can be downloaded in .obo format.

Not all information is captured by GO; some needs to be retrieved from other databases

Metabolic pathways: KEGG, …

Phenotype/diseases

• Mapping files exist, e.g. kegg2go

http://www.geneontology.org/GO.slims.shtml
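The .obo format is plain text made of stanzas such as `[Term]`, each followed by `tag: value` lines. A minimal reader might look like the sketch below. It handles only the `id`, `name`, `namespace` and `is_a` style of line shown in the snippet, and the identifiers `EX:0000001` and `EX:0000002` are invented placeholders, not real GO terms.

```python
# Invented two-term snippet in OBO layout, for illustration only.
OBO_SNIPPET = """\
format-version: 1.2

[Term]
id: EX:0000001
name: example parent term
namespace: biological_process

[Term]
id: EX:0000002
name: example child term
namespace: biological_process
is_a: EX:0000001 ! example parent term
"""


def parse_obo_terms(text):
    """Return {term_id: {"name": ..., "is_a": [...], ...}} for [Term] stanzas."""
    terms = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("["):
            # A new stanza starts; only [Term] stanzas are kept.
            current = {"is_a": []} if line == "[Term]" else None
            continue
        if current is None or ":" not in line:
            continue
        tag, _, value = line.partition(":")
        value = value.split("!", 1)[0].strip()  # drop trailing "! comment"
        if tag == "id":
            terms[value] = current
        elif tag == "is_a":
            current["is_a"].append(value)
        else:
            current[tag] = value
    return terms


terms = parse_obo_terms(OBO_SNIPPET)
print(terms["EX:0000002"]["is_a"])
```

For real analyses a maintained OBO parser is the safer choice, since the full format has more tags and escaping rules than this sketch covers.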

Page 18: BITS: Overview of important biological databases beyond sequences

Biological pathways databases organise genes by molecular reactions

3 important databases on biological pathways

http://www.kegg.jp/

http://www.reactome.org/ - EBI

http://metacyc.org

Page 19: BITS: Overview of important biological databases beyond sequences

Proteins with enzymatic function receive an Enzyme Commission (EC) number

http://www.chem.qmul.ac.uk/iubmb/enzyme/

EC 6 Ligases

EC 5 Isomerases

EC 4 Lyases

EC 3 Hydrolases

EC 2 Transferases

EC 1 Oxidoreductases
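An EC number has four dot-separated fields, and the first field is the class listed above. As a small sketch (the function name `ec_class` is made up), the class can be looked up from the first field. Only the six classes on this slide are included, so any class defined after these slides were written is reported as unknown.

```python
# The six top-level classes listed on the slide.
EC_CLASSES = {
    1: "Oxidoreductases",
    2: "Transferases",
    3: "Hydrolases",
    4: "Lyases",
    5: "Isomerases",
    6: "Ligases",
}


def ec_class(ec_number):
    """Return the top-level class name for an EC number like 'EC 1.1.1.1'."""
    digits = ec_number.upper().replace("EC", "").strip()
    first_field = digits.split(".")[0]
    return EC_CLASSES.get(int(first_field), "unknown class")


print(ec_class("EC 1.1.1.1"))  # alcohol dehydrogenase
```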

Page 20: BITS: Overview of important biological databases beyond sequences

IntAct database contains interaction information of proteins

http://www.ebi.ac.uk/intact

Three types of interactions are stored:

Protein-protein

Protein-DNA

Protein-small molecule

Page 21: BITS: Overview of important biological databases beyond sequences

IntAct database represents all interactions as binary: caution!
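One reason for caution: an experiment such as a pull-down reports a group of co-purified proteins, and turning that group into pairs requires an expansion rule. The sketch below shows two commonly described rules, spoke and matrix expansion, with made-up protein names. Which rule a given database export applies is an assumption to verify in that database's documentation.

```python
from itertools import combinations


def spoke_expansion(bait, preys):
    """Pair the bait with every prey; preys are not paired with each other."""
    return [(bait, prey) for prey in preys]


def matrix_expansion(members):
    """Pair every member of the group with every other member."""
    return list(combinations(members, 2))


# Hypothetical pull-down: bait A co-purifies with B, C and D.
print(spoke_expansion("A", ["B", "C", "D"]))
print(matrix_expansion(["A", "B", "C", "D"]))
```

The same experiment yields three pairs under one rule and six under the other, and neither list proves direct physical contact between any two proteins.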

Page 22: BITS: Overview of important biological databases beyond sequences

Interaction networks can be analysed on your computer using Cytoscape

Cytoscape training material on the BITS website
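To get your own interaction pairs into Cytoscape, one simple route is the SIF (simple interaction format) text file, where each line holds a source node, an interaction type and a target node. The sketch below writes such lines. The helper name `to_sif`, the protein names and the choice of `pp` as the interaction label are illustrative assumptions.

```python
def to_sif(pairs, interaction_type="pp"):
    """Format (source, target) pairs as tab-separated SIF lines."""
    return "\n".join(f"{a}\t{interaction_type}\t{b}" for a, b in pairs)


# Hypothetical binary interactions.
print(to_sif([("A", "B"), ("A", "C")]))
```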

Page 23: BITS: Overview of important biological databases beyond sequences

PDB hosts 3-dimensional structural data on molecules

Page 24: BITS: Overview of important biological databases beyond sequences

PDB hosts 3-dimensional structural data on molecules

PDB = Protein Data Bank

http://www.pdb.org/pdb/home/home.do

Only experimentally determined structures, resolved through NMR and X-ray crystallography (or other accurate techniques)

Proteins, DNA, RNA, ligands

Understanding PDB data: tutorial

Page 25: BITS: Overview of important biological databases beyond sequences

PDB files can be read by a lot of different tools to display the structure

Every entry in PDB has its own PDB accession number (four characters: a digit followed by three letters or digits)

The PDB file contains the 3D coordinates of every single atom in the structure, together with the variability of that position (the last two numeric columns: occupancy and temperature factor)

http://www.bits.vib.be/index.php?option=com_content&view=article&id=17203817:protein-structure-analysis-training&catid=81:training-pages&Itemid=190
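The atom lines in a PDB file use fixed column positions, so they are read by slicing rather than by splitting on spaces. The sketch below parses one ATOM line. The sample line is assembled field by field to make the column layout visible, and its values are made up rather than taken from a real entry.

```python
# One ATOM line, built field by field (values are illustrative).
SAMPLE_LINE = (
    "ATOM  "      # cols 1-6   record name
    "    1"       # cols 7-11  atom serial number
    " "           # col  12    blank
    " N  "        # cols 13-16 atom name
    " "           # col  17    alternate location
    "MET"         # cols 18-20 residue name
    " "           # col  21    blank
    "A"           # col  22    chain identifier
    "   1"        # cols 23-26 residue sequence number
    " "           # col  27    insertion code
    "   "         # cols 28-30 blank
    "  27.340"    # cols 31-38 x coordinate
    "  24.430"    # cols 39-46 y coordinate
    "   2.614"    # cols 47-54 z coordinate
    "  1.00"      # cols 55-60 occupancy
    "  9.67"      # cols 61-66 temperature factor (B-factor)
    "          "  # cols 67-76 blank
    " N"          # cols 77-78 element symbol
)


def parse_atom_record(line):
    """Extract the main fields from a fixed-width ATOM/HETATM line."""
    if not line.startswith(("ATOM", "HETATM")):
        raise ValueError("not an ATOM/HETATM record")
    return {
        "serial": int(line[6:11]),
        "atom_name": line[12:16].strip(),
        "res_name": line[17:20].strip(),
        "chain_id": line[21],
        "res_seq": int(line[22:26]),
        "x": float(line[30:38]),
        "y": float(line[38:46]),
        "z": float(line[46:54]),
        "occupancy": float(line[54:60]),
        "b_factor": float(line[60:66]),
    }


atom = parse_atom_record(SAMPLE_LINE)
print(atom["res_name"], atom["x"], atom["b_factor"])
```

A high temperature factor marks an atom whose position is less well defined, which is worth checking before drawing conclusions from a flexible loop.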

Page 26: BITS: Overview of important biological databases beyond sequences

PDB files can be read by a lot of different tools to display the structure

Tools to visualize (and some to analyze structures) (see BITS wiki)

http://www.bits.vib.be/wiki/index.php/Protein_structure

Page 27: BITS: Overview of important biological databases beyond sequences

To find a structure for your protein sequence is to search for similarity

Homology modeling

Similarity on sequence level, projected onto a structure:

BLAST your query against the PDB database with cblast, or at ExPASy

PSI-BLAST - can detect sequences with similar structures (twilight zone!)

If still no success: 3D-jury (a meta approach, including fold recognition and local structure prediction)

Similarity on structural level (aligning structures):

VAST (structure)

DALI (Distance mAtrix aLIgnment)

http://www.ii.uib.no/~slars/bioinfocourse/PDFs/structpred_tutorial.pdf

http://consurf.tau.ac.il/pe/protexpl/psbiores.htm

BITS training on protein structure analysis

Tools at EBI

Page 28: BITS: Overview of important biological databases beyond sequences

Structural information is used to classify proteins

SCOP

Groups proteins based on evolutionary relationships, domain architecture and structural information.

CATH

Manually curated classification of protein domains

Database cross-references in PDB entry

http://scop.mrc-lmb.cam.ac.uk/scop/

http://www.cathdb.info/

Page 29: BITS: Overview of important biological databases beyond sequences

dbSNP is a public-domain archive for simple genetic polymorphisms

Single Nucleotide Polymorphism database (NCBI)

Each dbSNP entry has a code rsxx (RefSNP) or ssxx (submitted SNP)

Types of variation stored:

single-base nucleotide substitutions (also known as single nucleotide polymorphisms or SNPs),

small-scale multi-base deletions or insertions (also called deletion insertion polymorphisms or DIPs),

retroposable element insertions and microsatellite repeat variations (also called short tandem repeats or STRs).

Synchronized with new genome builds

Page 30: BITS: Overview of important biological databases beyond sequences

Expression data can be sequence-based or hybridisation-based

Sequence-based (ESTs - RNA seq - SAGE)

Digital gene expression/northern

Microarray databases (hybridisation-based):

GEO: Gene Expression Omnibus (NCBI)

− Platform: GPLxxxxxxx

− Experiment: GSExxxxxx (= several samples)

− Sample: GSMxxxxxxxx

− Some experiments are curated: GDSxxxxx (online analysis possible)

ArrayExpress (EBI)
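The GEO accession prefixes above are easy to tell apart programmatically. A small sketch follows: the function name is made up, the labels follow this slide's wording, and the accession numbers in the example are arbitrary rather than references to specific records.

```python
import re

# Prefix -> record type, following the list on the slide.
GEO_TYPES = {
    "GPL": "platform",
    "GSE": "experiment (several samples)",
    "GSM": "sample",
    "GDS": "curated dataset",
}

_GEO_PATTERN = re.compile(r"^(GPL|GSE|GSM|GDS)(\d+)$")


def geo_record_type(accession):
    """Return the GEO record type for an accession, or None if unrecognised."""
    match = _GEO_PATTERN.match(accession.strip().upper())
    return GEO_TYPES[match.group(1)] if match else None


# Arbitrary example accessions.
for acc in ["GPL123", "GSE12345", "GSM987654", "GDS1001", "E-MEXP-1"]:
    print(acc, "->", geo_record_type(acc))
```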

Page 31: BITS: Overview of important biological databases beyond sequences

Example of expression data at GEO

Page 32: BITS: Overview of important biological databases beyond sequences

Example of expression data at GEO

Page 33: BITS: Overview of important biological databases beyond sequences

Example of expression data at GEO

Page 34: BITS: Overview of important biological databases beyond sequences

Example at ArrayExpress

Page 35: BITS: Overview of important biological databases beyond sequences

Example at ArrayExpress

Page 36: BITS: Overview of important biological databases beyond sequences

Entrez interconnects the databases at NCBI for easy querying

UniGene: sequences grouped by gene

PopSet: sequence alignments for population studies and phylogeny

Structure: 3D structures (PDB)

Genome: genomic maps of chromosomes and plasmids

UniSTS (Sequence Tagged Sites)

PubMed: literature abstracts (MEDLINE,…)

OMIM (Online Mendelian Inheritance in Man): literature reviews

MeSH (Medical Subject Headings): keywords

Taxonomy
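The Entrez databases can also be queried from a script through NCBI's E-utilities web service, which the slides do not cover. As a hedged sketch, the helper below (a made-up name) only builds an `esearch` URL and does not contact the server. The search term is an arbitrary example.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities service.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db, term, retmax=20):
    """Build an esearch URL for one Entrez database and a search term."""
    query = urlencode({"db": db, "term": term, "retmax": retmax})
    return f"{EUTILS_BASE}/esearch.fcgi?{query}"


print(esearch_url("gene", "cytochrome c", retmax=5))
```

Changing the `db` argument switches between Entrez databases with the same query code, which mirrors how the web interface links them together.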

Page 37: BITS: Overview of important biological databases beyond sequences

Finding relevant data

Page 38: BITS: Overview of important biological databases beyond sequences

Summarizing most important links to discover everything you need ...

Protein data

InterPro (heavily integrated with EBI resources)

http://www.interpro.org

Gene data

Entrez at NCBI: 'Entrez Gene'

http://www.ncbi.nlm.nih.gov/Entrez/

EB-eye Search at EBI: excellent for cross-species searches

http://www.ebi.ac.uk/ebisearch/

Page 39: BITS: Overview of important biological databases beyond sequences

Hold your horses!

Phew, where do I put all this?

Page 40: BITS: Overview of important biological databases beyond sequences

Bioinformatics is all about different data, as versatile as life itself

Due to the strong cross-references between different databases, new databases and relevant info are rapidly integrated in existing databases.

You can discover them by taking time to read the entries.

Page 41: BITS: Overview of important biological databases beyond sequences

New tools are emerging every day to enable you to browse all data sources...

BioGPS, all in one window!

Page 42: BITS: Overview of important biological databases beyond sequences

New tools are emerging every day to enable you to browse all data sources...

Page 43: BITS: Overview of important biological databases beyond sequences

Integrative resources are increasingly being organised on a species basis

EMAGE: database of in situ gene expression in mouse

OMIM: database of diseases in man

Websites providing an interface to integrate all this data are increasingly important

Often organized on a species basis:

− TAIR

− FlyBase

− WormBase

Page 44: BITS: Overview of important biological databases beyond sequences

Organizing biological information by species

By species, why?

There is one biological information resource which stays more or less unchanged per species ...