Data Café — A Platform for Creating Biomedical Data Lakes

Pradeeban Kathiravelu 1,2, Ameen Kazerouni 2, Ashish Sharma 2

1 Instituto Superior Técnico, Universidade de Lisboa, Lisboa, Portugal
2 Department of Biomedical Informatics, Emory University, Atlanta, USA

www.sharmalab.info

Upload: pradeeban-kathiravelu
Posted: 20-Feb-2017
Category: Healthcare

TRANSCRIPT

Page 1: Data Café — A Platform For Creating Biomedical Data Lakes

Pradeeban Kathiravelu 1,2, Ameen Kazerouni 2, Ashish Sharma 2

1 Instituto Superior Técnico, Universidade de Lisboa, Lisboa, Portugal
2 Department of Biomedical Informatics, Emory University, Atlanta, USA

www.sharmalab.info

Page 2: Data Café — A Platform For Creating Biomedical Data Lakes

Data Landscape for Precision Medicine

DATA CHARACTERISTICS
• Large number of small datasets
• Structured, semi-structured, unstructured, and ill-formed
• Noisy and fuzzy/uncertain
• Spatial and temporal relationships

DATA MANAGEMENT
• Variety in storage and messaging protocols
• No shared interface

Page 3: Data Café — A Platform For Creating Biomedical Data Lakes

Illustrative Use Case

Execute a Radiogenomics workflow on the diffusion images of GBM patients who received a TMZ + experimental regimen with an overall survival of 18 months or more.

PACS + EMR + AIM + RT + Molecular

Page 4: Data Café — A Platform For Creating Biomedical Data Lakes

Motivation

• Most current solutions require a DBA to migrate the data into a data warehousing environment
  • in order to query and explore all the data at once.
• Setting up such warehouses is costly.
• A unified warehouse offers access to query and explore the data.
• Limitations
  • Scalability and extensibility to incorporate new data sources.
  • A priori knowledge of the data models of the different data sources is required.

Page 5: Data Café — A Platform For Creating Biomedical Data Lakes

BIOMEDICAL DATA LAKES

• Cohort discovery and creation — assembled per study.
• Heterogeneous data collected in a loosely structured fashion.
• Agile and easy to create.
• Integrates with data exploration/visualization via REST APIs.
• Problem- or hypothesis-specific virtual data sets.
• Powered by Drill + HDFS; data sources accessed via APIs.

Page 6: Data Café — A Platform For Creating Biomedical Data Lakes

Data Café

• An agile approach to creating and extending the concept of a star schema
  • to model a problem/hypothesis-specific dataset
  • by leveraging Apache Drill to easily query the data.
• Tackles the limitations of the existing approaches.
• Provides researchers the ability to add new data models and sources.

Page 7: Data Café — A Platform For Creating Biomedical Data Lakes

Core Concepts

Step 1. Given a set of data sources, create a graphical representation of the join attributes.

This graph represents how data is connected across the various data sources.
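Step 1 can be sketched as an edge map between sources; this is a minimal illustration, not Data Café's actual implementation, and the source names (EMR, PACS, AIM, Molecular) and attribute names are assumptions drawn from the use-case slide:

```python
# Sketch of Step 1: the join-attribute graph as an edge map.
# An edge says: "these two sources can be joined on this attribute".
join_graph = {
    ("EMR", "PACS"): "patient_id",
    ("EMR", "Molecular"): "patient_id",
    ("PACS", "AIM"): "study_uid",
}

def join_attribute(source_a, source_b):
    """Return the attribute that joins two sources, or None if no edge exists."""
    return (join_graph.get((source_a, source_b))
            or join_graph.get((source_b, source_a)))
```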

Page 8: Data Café — A Platform For Creating Biomedical Data Lakes

Core Concepts

Step 2. Run a set of parallel queries on the data sources that include the attributes present in the query graph.

In the top figure, our query is of type: {id1: A1 > x and B2 == y}

We run similar queries across C, D, and E and retrieve the set of relevant ids (join attributes).
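The per-source queries above can be sketched as follows; the in-memory rows are illustrative stand-ins for real Drill/Hive/MongoDB queries, and the attribute names (A1, B2) and thresholds follow the slide's example:

```python
from concurrent.futures import ThreadPoolExecutor

# Illustrative stand-ins for two data sources; in practice each query
# would run against a real store rather than an in-memory list.
source_A = [{"id1": 1, "A1": 5}, {"id1": 2, "A1": 12}, {"id1": 3, "A1": 20}]
source_B = [{"id1": 1, "B2": "y"}, {"id1": 2, "B2": "y"}, {"id1": 3, "B2": "n"}]

def query_ids(rows, predicate):
    """Run one source-local query; return the matching join attributes (ids)."""
    return {row["id1"] for row in rows if predicate(row)}

x, y = 10, "y"
with ThreadPoolExecutor() as pool:          # the per-source queries run in parallel
    fut_a = pool.submit(query_ids, source_A, lambda r: r["A1"] > x)
    fut_b = pool.submit(query_ids, source_B, lambda r: r["B2"] == y)
    ids_A, ids_B = fut_a.result(), fut_b.result()
```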

Page 9: Data Café — A Platform For Creating Biomedical Data Lakes

Core Concepts

Step 3. Compute the intersection across the various ids (join attributes). The data of interest can now be obtained using the ids in this intersection.

A subsequent query allows us to stream data from the individual sources in parallel, given the relevant ids (join attributes).
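Step 3 reduces to a set intersection followed by a filtered pull from each source. The ids and rows below are hypothetical; in Data Café the streaming happens per source in parallel:

```python
# Sketch of Step 3: intersect the per-source id sets, then pull only
# the matching records from each source.
ids_A = {2, 3, 5}   # join attributes matched in source A (from Step 2)
ids_B = {1, 2, 5}   # join attributes matched in source B
ids_C = {2, 5, 9}   # join attributes matched in source C

cohort_ids = ids_A & ids_B & ids_C   # ids present in every source

def stream_records(rows, wanted_ids):
    """Yield only the rows whose join attribute is in the cohort."""
    for row in rows:
        if row["id1"] in wanted_ids:
            yield row

source_A = [{"id1": 2, "A1": 12}, {"id1": 3, "A1": 20}, {"id1": 5, "A1": 30}]
cohort_rows = list(stream_records(source_A, cohort_ids))
```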

Page 10: Data Café — A Platform For Creating Biomedical Data Lakes


Data Café Architecture

Page 11: Data Café — A Platform For Creating Biomedical Data Lakes

Apache Drill

• Variety – query a range of non-relational data sources.
• Flexibility.
• Agility – faster insights.
• Scalability.
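As a concrete illustration of how Drill is queried, it exposes a REST endpoint (`/query.json`, port 8047 by default) that accepts plain SQL across heterogeneous sources; the storage-plugin names (`hdfs`, `hive`), file path, and table below are assumptions, not paths from Data Café itself:

```python
import json
from urllib import request

DRILL_URL = "http://localhost:8047/query.json"   # Drill's default REST endpoint

def make_drill_payload(sql):
    """Build the JSON body Drill's REST API expects for a SQL query."""
    return json.dumps({"queryType": "SQL", "query": sql}).encode("utf-8")

# Join a JSON file in HDFS with a Hive table without an ETL step.
# Plugin names, path, and table are illustrative.
sql = (
    "SELECT e.patient_id, p.study_uid "
    "FROM hdfs.`/lake/emr.json` e "
    "JOIN hive.pacs_index p ON e.patient_id = p.patient_id"
)
payload = make_drill_payload(sql)

def run_query(payload):
    """POST the query to a live Drill instance (requires Drill running)."""
    req = request.Request(DRILL_URL, data=payload,
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return json.load(resp)
```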

Page 12: Data Café — A Platform For Creating Biomedical Data Lakes

Evaluation Environment

• Data Café was deployed along with the data sources and Drill in Amazon EC2.
• MongoDB instantiated in EC2 instances.
• Hive on Amazon EMR (Elastic MapReduce).
• EMR HDFS was configured with 3 nodes.
• Various datasets for evaluation:
  • Two synthetic datasets.
  • Clinical data from the TCGA BRCA collection.

Page 13: Data Café — A Platform For Creating Biomedical Data Lakes

Results

• Quick creation of data lakes
  • without prior knowledge of the data schema.
• Very fast execution of large queries
  • with Apache Drill.
• Data Café can be an efficient platform for exploring an integrated data source.
• Constructing the integrated data source may be time consuming.
  • Not on the critical path.
  • Done less frequently than the data queries from HDFS/Hive using Drill.

Page 14: Data Café — A Platform For Creating Biomedical Data Lakes

Conclusion

• A novel platform for integrating multiple data sources
  • without a priori knowledge of the data models of the sources being integrated.
• Indices are used to perform the actual integration
  • enabling a parallelized push of the actual data into HDFS.
• Apache Drill as a fast query execution engine that supports SQL.
• Currently ingesting data from TCGA.

Page 15: Data Café — A Platform For Creating Biomedical Data Lakes

Current State and Future Plans

• Ongoing efforts to evaluate the platform with diverse and heterogeneous data sources.
• Expanding to a larger multi-node distributed cluster.
• Integration with DataScope.
• Multiple data stores and larger data sets.
• Integration with imaging clients such as caMicroscope, as well as archives such as The Cancer Imaging Archive (TCIA).

Page 16: Data Café — A Platform For Creating Biomedical Data Lakes

Acknowledgements

Google Summer of Code 2015
NCIP/Leidos 14X138, caMicroscope — A Digital Pathology Integrative Query System; Ashish Sharma PI, Emory/WUSTL/Stony Brook
NCI U01 [1U01CA187013-01], Resources for development and validation of Radiomic Analyses & Adaptive Therapy; Fred Prior, Ashish Sharma (UAMS, Emory)
The results published here are in part based upon data generated by the TCGA Research Network: http://cancergenome.nih.gov/

Page 17: Data Café — A Platform For Creating Biomedical Data Lakes

For more information including recent updates please visit: www.sharmalab.info

[email protected]