Data Café — A Platform for Creating Biomedical Data Lakes

Pradeeban Kathiravelu 1,2, Ameen Kazerouni 2, Ashish Sharma 2

1 Instituto Superior Técnico, Universidade de Lisboa, Lisboa, Portugal
2 Department of Biomedical Informatics, Emory University, Atlanta, USA

www.sharmalab.info

Upload: pradeeban-kathiravelu
Posted: 20-Feb-2017
Category: Healthcare

TRANSCRIPT

Page 1: Data Café — A Platform For Creating Biomedical Data Lakes

Pradeeban Kathiravelu 1,2, Ameen Kazerouni 2, Ashish Sharma 2

1 Instituto Superior Técnico, Universidade de Lisboa, Lisboa, Portugal
2 Department of Biomedical Informatics, Emory University, Atlanta, USA

www.sharmalab.info

Page 2: Data Café — A Platform For Creating Biomedical Data Lakes

Data Landscape for Precision Medicine

DATA CHARACTERISTICS
• Large number of small datasets
• Structured, semi-structured, unstructured, and ill-formed
• Noisy and fuzzy/uncertain
• Spatial and temporal relationships

DATA MANAGEMENT
• Variety in storage and messaging protocols
• No shared interface

Page 3: Data Café — A Platform For Creating Biomedical Data Lakes

Illustrative Use Case

Execute a Radiogenomics workflow on the diffusion images of GBM patients who received a TMZ + experimental regimen with an overall survival of 18 months or more.

PACS + EMR + AIM + RT + Molecular

Page 4: Data Café — A Platform For Creating Biomedical Data Lakes

Motivation

• Most current solutions require a DBA to migrate the data into a data warehousing environment
  • in order to query and explore all the data at once.
• Setting up such warehouses is costly.
• A unified warehouse offers access to query and explore the data.
• Limitations
  • Scalability and extensibility to incorporate new data sources.
  • A priori knowledge of the data models of the different data sources is required.

Page 5: Data Café — A Platform For Creating Biomedical Data Lakes

BIOMEDICAL DATA LAKES

• Cohort discovery and creation — assembled per study.
• Heterogeneous data collected in a loosely structured fashion.
• Agile and easy to create.
• Integrates with data exploration/visualization via REST APIs.
• Problem- or hypothesis-specific virtual data sets.
• Powered by Drill + HDFS; data sources accessed via APIs.

Page 6: Data Café — A Platform For Creating Biomedical Data Lakes

Data Café

• An agile approach to creating and extending the concept of a star schema
  • to model a problem/hypothesis-specific dataset
  • by leveraging Apache Drill to easily query the data.
• Tackles the limitations of the existing approaches.
• Provides researchers the ability to add new data models and sources.

Page 7: Data Café — A Platform For Creating Biomedical Data Lakes

Core Concepts

Step 1. Given a set of data sources, create a graphical representation of the join attributes.

This graph represents how data is connected across the various data sources.
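Step 1 can be sketched as an edge map between sources; this is a minimal illustration, not Data Café's actual implementation, and the source names (EMR, PACS, AIM, Molecular) and attribute names are assumptions drawn from the use-case slide:

```python
# Sketch of Step 1: the join-attribute graph as an edge map.
# An edge says: "these two sources can be joined on this attribute".
join_graph = {
    ("EMR", "PACS"): "patient_id",
    ("EMR", "Molecular"): "patient_id",
    ("PACS", "AIM"): "study_uid",
}

def join_attribute(source_a, source_b):
    """Return the attribute that joins two sources, or None if no edge exists."""
    return (join_graph.get((source_a, source_b))
            or join_graph.get((source_b, source_a)))
```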

Page 8: Data Café — A Platform For Creating Biomedical Data Lakes

Core Concepts

Step 2. Run a set of parallel queries on the data sources that include the attributes present in the query graph.

In the top figure, our query is of type: {id1: A1 > x and B2 == y}

We run similar queries across C, D, and E and retrieve the set of relevant ids (join attributes).
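The per-source queries above can be sketched as follows; the in-memory rows are illustrative stand-ins for real Drill/Hive/MongoDB queries, and the attribute names (A1, B2) and thresholds follow the slide's example:

```python
from concurrent.futures import ThreadPoolExecutor

# Illustrative stand-ins for two data sources; in practice each query
# would run against a real store rather than an in-memory list.
source_A = [{"id1": 1, "A1": 5}, {"id1": 2, "A1": 12}, {"id1": 3, "A1": 20}]
source_B = [{"id1": 1, "B2": "y"}, {"id1": 2, "B2": "y"}, {"id1": 3, "B2": "n"}]

def query_ids(rows, predicate):
    """Run one source-local query; return the matching join attributes (ids)."""
    return {row["id1"] for row in rows if predicate(row)}

x, y = 10, "y"
with ThreadPoolExecutor() as pool:          # the per-source queries run in parallel
    fut_a = pool.submit(query_ids, source_A, lambda r: r["A1"] > x)
    fut_b = pool.submit(query_ids, source_B, lambda r: r["B2"] == y)
    ids_A, ids_B = fut_a.result(), fut_b.result()
```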

Page 9: Data Café — A Platform For Creating Biomedical Data Lakes

Core Concepts

Step 3. Compute the intersection across the various ids (join attributes). The data of interest can now be obtained using the ids in this intersection.

A subsequent query allows us to stream data from the individual sources in parallel, given the relevant ids (join attributes).
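Step 3 reduces to a set intersection followed by a filtered pull from each source. The ids and rows below are hypothetical; in Data Café the streaming happens per source in parallel:

```python
# Sketch of Step 3: intersect the per-source id sets, then pull only
# the matching records from each source.
ids_A = {2, 3, 5}   # join attributes matched in source A (from Step 2)
ids_B = {1, 2, 5}   # join attributes matched in source B
ids_C = {2, 5, 9}   # join attributes matched in source C

cohort_ids = ids_A & ids_B & ids_C   # ids present in every source

def stream_records(rows, wanted_ids):
    """Yield only the rows whose join attribute is in the cohort."""
    for row in rows:
        if row["id1"] in wanted_ids:
            yield row

source_A = [{"id1": 2, "A1": 12}, {"id1": 3, "A1": 20}, {"id1": 5, "A1": 30}]
cohort_rows = list(stream_records(source_A, cohort_ids))
```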

Page 10: Data Café — A Platform For Creating Biomedical Data Lakes


Data Café Architecture

Page 11: Data Café — A Platform For Creating Biomedical Data Lakes

Apache Drill

• Variety – query a range of non-relational data sources.
• Flexibility.
• Agility – faster insights.
• Scalability.
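As a concrete illustration of how Drill is queried, it exposes a REST endpoint (`/query.json`, port 8047 by default) that accepts plain SQL across heterogeneous sources; the storage-plugin names (`hdfs`, `hive`), file path, and table below are assumptions, not paths from Data Café itself:

```python
import json
from urllib import request

DRILL_URL = "http://localhost:8047/query.json"   # Drill's default REST endpoint

def make_drill_payload(sql):
    """Build the JSON body Drill's REST API expects for a SQL query."""
    return json.dumps({"queryType": "SQL", "query": sql}).encode("utf-8")

# Join a JSON file in HDFS with a Hive table without an ETL step.
# Plugin names, path, and table are illustrative.
sql = (
    "SELECT e.patient_id, p.study_uid "
    "FROM hdfs.`/lake/emr.json` e "
    "JOIN hive.pacs_index p ON e.patient_id = p.patient_id"
)
payload = make_drill_payload(sql)

def run_query(payload):
    """POST the query to a live Drill instance (requires Drill running)."""
    req = request.Request(DRILL_URL, data=payload,
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return json.load(resp)
```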

Page 12: Data Café — A Platform For Creating Biomedical Data Lakes

Evaluation Environment

• Data Café was deployed along with the data sources and Drill in Amazon EC2.
• MongoDB instantiated in EC2 instances.
• Hive on Amazon EMR (Elastic MapReduce).
• EMR HDFS was configured with 3 nodes.
• Various datasets for evaluation:
  • Two synthetic datasets.
  • Clinical data from the TCGA BRCA collection.

Page 13: Data Café — A Platform For Creating Biomedical Data Lakes

Results

• Quick creation of data lakes
  • without prior knowledge of the data schema.
• Very fast execution of large queries
  • with Apache Drill.
• Data Café can be an efficient platform for exploring an integrated data source.
• Constructing the integrated data source may be time consuming.
  • Not on the critical path.
  • Done less frequently than the data queries from HDFS/Hive using Drill.

Page 14: Data Café — A Platform For Creating Biomedical Data Lakes

Conclusion

• A novel platform for integrating multiple data sources
  • without a priori knowledge of the data models of the sources being integrated.
• Indices are used to perform the actual integration
  • enabling a parallelized push of the actual data into HDFS.
• Apache Drill as a fast query execution engine that supports SQL.
• Currently ingesting data from TCGA.

Page 15: Data Café — A Platform For Creating Biomedical Data Lakes

Current State and Future Plans

• Ongoing efforts to evaluate the platform with diverse and heterogeneous data sources.
• Expanding to a larger multi-node distributed cluster.
• Integration with DataScope.
• Multiple data stores and larger data sets.
• Integration with imaging clients such as caMicroscope, as well as archives such as The Cancer Imaging Archive (TCIA).

Page 16: Data Café — A Platform For Creating Biomedical Data Lakes

Acknowledgements

Google Summer of Code 2015
NCIP/Leidos 14X138, caMicroscope — A Digital Pathology Integrative Query System; Ashish Sharma PI, Emory/WUSTL/Stony Brook
NCI U01 [1U01CA187013-01], Resources for development and validation of Radiomic Analyses & Adaptive Therapy; Fred Prior, Ashish Sharma (UAMS, Emory)
The results published here are in part based upon data generated by the TCGA Research Network: http://cancergenome.nih.gov/

Page 17: Data Café — A Platform For Creating Biomedical Data Lakes

For more information including recent updates please visit: www.sharmalab.info

[email protected]