gene expression warehousing in leipzigdbs.uni-leipzig.de/files/projekte/bioinf/geware_poster.pdf ·...

1
Gene Expression Warehousing in Leipzig Gedruckt im Universitätsrechenzentrum Leipzig System architecture Milestones and perspective January 2002 April/May 2002 June/July 2002 October 2002 December 2002 Project Start, Begin Step 1 Results: Concept warehousing, Approach for a data model; Begin Step 3 Result: Installed Hard- and Software, Begin Step 4 Result: Complete load manage- ment and data access Begin Step 6 Begin Step 7 Dr. Knut Krohn 3 , Markus Eszlinger 3 Prof. Dr. Ralf Paschke 3,4 3 Interdisciplinary Centre for Clinical Research 4 Medical Department III http://www.uni-leipzig.de/~izkf/ http://www.uni-leipzig.de/~innere/ Motivation q Different technologies to detect gene expression § EST clustering, Microarray, SAGE, etc. q (Qualitative/quantitative) gene expression analysis: answers to many questions § determination of gene functions as genome response to changes of environmental conditions, developmental stages, or between different tissues, etc. § co-expression of genes § discovery of new genes q Microarray: most promising technology for gene expression analysis, simultaneous study of thousands of genes, producing huge amounts of valuable data with every single experiment q Data management: major bottleneck in applications for gene expression analysis; impact on effectiveness of microarray experiments q Open problems: § integrating data on experiments with existing, publicly available data, such as sequences, pathways § integrating gene expression data produced by different technologies, experiments § flexible analysis and data mining approaches (statistical evaluation, detection of relevant patterns, ...) Molecular biology research at University of Leipzig q 15 user groups with different research focus; all requiring gene expression analysis § Change detection in signal transduction in thyroid pathologies § Gene expression profiling of brain tumors § Comparative genomics for different primates q Technologies: Affymetrix Oligonucleotide Microarrays (Chip fabrication, Wash, Scan, Data management and analysis) q Experiments: about 300-500 experiment series / years q Current situation: § Service center responsible for array processing and distribution of array results (raw data or Excel sheets) to respective users § Data analysis locally by single users § Limited data management and analysis capabilities provided by standard software (Affymetrix) Related work q Microarray Gene Expression Database (MGED) Group. International consortium, suggestion of standards for storing and representing data on gene expression experiments q Several databases for gene expression data, RDBMS-based, publicly accessible through WWW, high detail level: image and related information for single array spots: § ArrayDB (NHGRI): Sybase. Covering the entire process of array fabrication, wash and scan. Cy3 / Cy5 glass slide data. Probe sequences are linked with external databases, e.g. UniGene, KEGG, using their corresponding identifier. Flexible query tool with support for cross-experiment analysis based on visualization. No clustering algorithms. § GeneX (NCGR): PostgresSQL / Sybase. Integration with external databases (dbEST, KEGG) through hyperlinks. Interoperability by means of GeneXML. Two clustering algorithms: hierarchical and permuation-based. Integrated tool for statistical analysis § Stanford Microarray Database (SMD – University of Stanford): Oracle. Cy3 / Cy5 glass slide data. Data submissions from different scientific publications. Hierarchical clustering and SOM. q Advantages: § Experiences from building databases for gene expression data § Open schemas; great help for constructing a customized data model meeting specific local needs q Drawbacks: § Specifically developed tools for data access and analysis. No ad-hoc query / reporting tools, or integration of commercial tools, e.g. for data mining § Limited analysis capabilities: analysis of data from single experiments; main reason: lack of a uniform experiment / sample annotation mechanism to identify comparable experiments. § No „real“ integration with external data, such as sequences, pathways; navigational access per hyperlinks § No (semantic) integration with other types of gene expression data, e.g. generated by EST cluster profiling References q Brazma, A. et al.: Minimum Information about a Microarray Experiment (MIAME) toward Standards for Microarray Data. Nature Genetics, Vol. 29, 2001 q Ermolaeva, Olga et al.: Data Management and Analysis for Gene Expression Arrays. Nature Genetics, Vol. 20, 1998 q Mangalam, H. et al.: GeneX: An Open Source Gene Expression Database and Integrated Tool Set. IBM System Journal, Vol. 40, No. 2, 2001 q Masys, Daniel: Database Design for Microarray Data. The Pharmacogenomics Journal, Vol. 1, No. 4, 2001 q Sherlock, G. et al.: The Standford Microarray Database. Nucleic Acids Research, Vol.29, No.1, 2001 Goals q Systematic management of expression data generated by (local) microarray experiments q Flexible support for different comparative gene expression studies q Uniform experiment/sample annotation, especially for cross-experiment analysis q Integration with gene expression data from other local (EST cluster profiling) projects q Integration with external data from public databases q Use of tools for analysis, data mining and visualization purposes Project steps 1. Requirement Analysis: Analysis of user requirements, especially functionality, report examples, security levels; Analysis of internal and external source systems; Determining a quantitative structure for daily / monthly / yearly data 2. Warehouse Modeling: Construction of a multi-dimensional data model for gene expression data, construction of ontologies to annotate genes (probe sets), experiments and sample sets; Implementation of the data warehouse in a commercial relational or object-relational DBMS 3. Tool Evaluation: Specification, order and installation of required Warehouse specific hardware (server, net, etc.) incl. software (OS, DBMS, Report tools, etc.); Decision „Make or Buy“ – individual software vs. standard software 4. Warehouse Population: Design and implementation of load management routines for expression data generated by local experiments and related data from public databases 5. Data Access: Design and implementation of user-friendly web interfaces for common analysis, data submission and administration tasks; Integration of commercial and other public domain tools for searching, querying, data mining 6. Warehouse Pilot and Test 7. Roll-out: Evaluation of different analysis procedures for different molecular biological studies Approach of a multidimensional data model Begin Step 2 time Begin Step 5 2001 Interdisciplinary Centre for Bioinformatics Leipzig Interdisciplinary Centre for Clinical Research Leipzig User Group A User Group B Laboratory (Service Center) experiment / tissue sample / chips (probe set) Result Set Result Set experiment / tissue sample / chips (probe set) {tkirsten, hong, dieter, rahm}@informatik.uni-leipzig.de / {krok, eszlingm, pasr}@medizin.uni-leipzig.de Toralf Kirsten 1 , Hong Hai Do 1 , Dr. Dieter Sosna 2 Prof. Dr. Erhard Rahm 1,2 1 Interdisciplinary Centre for Bioinformatics 2 Database Group http://bioinformatik.uni-leipzig.de http://dbs.uni-leipzig.de OLAP Server and Web Server ... Web Client Web Client DB Server Analysis: Data Mining, Statistical Evaluation, Biological OLAP Load Manager (application) file transfer file system on DB Server Data Warehouse central administration Brain Tumor Profiling Thyroid Signal Transduction Primates Comparative Genomics UniGene Database local file system IZKF Leipzig ... local file system Other Institutes file transfer Data Mart Data Mart Data Mart Pathway Databases dbEST Database non public sources public sources Staging Area Error Area Data Warehouse Source Systems

Upload: duongthuy

Post on 03-Feb-2018

223 views

Category:

Documents


1 download

TRANSCRIPT

Page 1: Gene Expression Warehousing in Leipzigdbs.uni-leipzig.de/files/projekte/BIOINF/geware_poster.pdf · Gene Expression Warehousing in Leipzig ... Probe sequences are linked with external

Gene Expression Warehousing in Leipzig

Gedruckt im Universitätsrechenzentrum Leipzig

System architecture

Milestones and perspective

January 2002 April/May 2002 June/July 2002 October 2002 December 2002

Project Start,Begin Step 1

Results:Concept warehousing,Approach for a data model;

Begin Step 3

Result:Installed Hard- and Software,

Begin Step 4

Result:Complete load manage-ment and data access

Begin Step 6

Begin Step 7

Dr. Knut Krohn3, Markus Eszlinger3

Prof. Dr. Ralf Paschke3,4

3Interdisciplinary Centre for Clinical Research 4Medical Department IIIhttp://www.uni-leipzig.de/~izkf/ http://www.uni-leipzig.de/~innere/

Motivationq Different technologies to detect gene expression

§ EST clustering, Microarray, SAGE, etc.

q (Qualitative/quantitative) gene expression analysis: answers to many questions§ determination of gene functions as genome response to changes of environmental conditions, developmental stages, or between

different tissues, etc.§ co-expression of genes§ discovery of new genes

q Microarray: most promising technology for gene expression analysis, simultaneous study of thousands of genes, producing huge amounts of valuable data with every single experiment

q Data management: major bottleneck in applications for gene expression analysis; impact on effectiveness of microarray experiments

q Open problems:§ integrating data on experiments with existing, publicly available data, such as sequences, pathways§ integrating gene expression data produced by different technologies, experiments§ flexible analysis and data mining approaches (statistical evaluation, detection of relevant patterns, ...)

Molecular biology research at University of Leipzigq 15 user groups with different research focus; all requiring gene expression analysis

§ Change detection in signal transduction in thyroid pathologies§ Gene expression profiling of brain tumors§ Comparative genomics for different primates

q Technologies: Affymetrix Oligonucleotide Microarrays (Chip fabrication, Wash, Scan, Data management and analysis)q Experiments: about 300-500 experiment series / yearsq Current situation:

§ Service center responsible for array processing and distribution of array results (raw data or Excel sheets) to respective users§ Data analysis locally by single users§ Limited data management and analysis capabilities provided by standard software (Affymetrix)

Related workq Microarray Gene Expression Database (MGED) Group. International consortium, suggestion of standards for storing

and representing data on gene expression experimentsq Several databases for gene expression data, RDBMS-based, publicly accessible through WWW, high detail level:

image and related information for single array spots: § ArrayDB (NHGRI): Sybase. Covering the entire process of array fabrication, wash and scan. Cy3 / Cy5 glass slide data. Probe

sequences are linked with external databases, e.g. UniGene, KEGG, using their corresponding identifier. Flexible query tool withsupport for cross-experiment analysis based on visualization. No clustering algorithms.

§ GeneX (NCGR): PostgresSQL / Sybase. Integration with external databases (dbEST, KEGG) through hyperlinks. Interoperability bymeans of GeneXML. Two clustering algorithms: hierarchical and permuation-based. Integrated tool for statistical analysis

§ Stanford Microarray Database (SMD – University of Stanford): Oracle. Cy3 / Cy5 glass slide data. Data submissions from different scientific publications. Hierarchical clustering and SOM.

q Advantages: § Experiences from building databases for gene expression data§ Open schemas; great help for constructing a customized data model meeting specific local needs

q Drawbacks: § Specifically developed tools for data access and analysis. No ad-hoc query / reporting tools, or integration of commercial tools, e.g.

for data mining§ Limited analysis capabilities: analysis of data from single experiments; main reason: lack of a uniform experiment / sample annotation

mechanism to identify comparable experiments. § No „real“ integration with external data, such as sequences, pathways; navigational access per hyperlinks§ No (semantic) integration with other types of gene expression data, e.g. generated by EST cluster profiling

Referencesq Brazma, A. et al.: Minimum Information about a Microarray Experiment (MIAME) toward Standards for Microarray Data.

Nature Genetics, Vol. 29, 2001q Ermolaeva, Olga et al.: Data Management and Analysis for Gene Expression Arrays. Nature Genetics, Vol. 20, 1998 q Mangalam, H. et al.: GeneX: An Open Source Gene Expression Database and Integrated Tool Set. IBM System

Journal, Vol. 40, No. 2, 2001q Masys, Daniel: Database Design for Microarray Data. The Pharmacogenomics Journal, Vol. 1, No. 4, 2001q Sherlock, G. et al.: The Standford Microarray Database. Nucleic Acids Research, Vol.29, No.1, 2001

Goalsq Systematic management of expression data generated by (local) microarray experimentsq Flexible support for different comparative gene expression studiesq Uniform experiment/sample annotation, especially for cross-experiment analysisq Integration with gene expression data from other local (EST cluster profiling) projectsq Integration with external data from public databasesq Use of tools for analysis, data mining and visualization purposes

Project steps1. Requirement Analysis: Analysis of user requirements, especially functionality, report

examples, security levels; Analysis of internal and external source systems; Determining a quantitative structure for daily / monthly / yearly data

2. Warehouse Modeling: Construction of a multi-dimensional data model for gene expressiondata, construction of ontologies to annotate genes (probe sets), experiments and sample sets; Implementation of the data warehouse in a commercial relational or object-relational DBMS

3. Tool Evaluation: Specification, order and installation of required Warehouse specific hardware(server, net, etc.) incl. software (OS, DBMS, Report tools, etc.); Decision „Make or Buy“ –individual software vs. standard software

4. Warehouse Population: Design and implementation of load management routines forexpression data generated by local experiments and related data from public databases

5. Data Access: Design and implementation of user-friendly web interfaces for common analysis, data submission and administration tasks; Integration of commercial and other public domaintools for searching, querying, data mining

6. Warehouse Pilot and Test

7. Roll-out: Evaluation of different analysis procedures for different molecular biological studies

Approach of a multidimensional data model

Begin Step 2

time

Begin Step 5

2001

Interdisciplinary Centre for Bioinformatics LeipzigInterdisciplinary Centre for Clinical Research Leipzig

User Group A User Group B

Laboratory(Service Center)

experiment /tissue sample /

chips (probe set)

Result Set Result Set

experiment /tissue sample /

chips (probe set)

{tkirsten, hong, dieter, rahm}@informatik.uni-leipzig.de / {krok, eszlingm, pasr}@medizin.uni-leipzig.de

Toralf Kirsten1, Hong Hai Do1, Dr. Dieter Sosna2

Prof. Dr. Erhard Rahm1,2

1Interdisciplinary Centre for Bioinformatics 2Database Grouphttp://bioinformatik.uni-leipzig.de http://dbs.uni-leipzig.de

OLAP Server and Web Server

...Web ClientWeb Client

DB Server

Analysis:Data Mining, Statistical Evaluation, Biological OLAP

Analysis:Data Mining, Statistical Evaluation, Biological OLAP

LoadManager(application)

filetransfer

file system onDB Server

Data WarehouseData Warehouse

central administration

Brain TumorProfiling

Thyroid Signal Transduction

Primates Comparative

Genomics

UniGene Databaselocal file systemIZKF Leipzig

...local file systemOther Institutes

filetransfer

DataMart

DataMart

DataMart

Pathway Databases

dbEST Database

non public sources public sources

StagingArea

ErrorArea

DataWarehouse

Source SystemsSource Systems