

Page 1: A Federated In-Memory Database Computing Platform Enabling Real-Time Analysis of Big Medical Data

A Federated In-Memory Database Computing Platform Enabling Real-time Analysis of Big Medical Data

Dr.-Ing. Matthieu-P. Schapranow Hasso Plattner Institute, Potsdam, Germany

May 17, 2017

Page 2: A Federated In-Memory Database Computing Platform Enabling Real-Time Analysis of Big Medical Data

Our Motivation: Turn Precision Medicine Into Clinical Routine

■  Can we enable clinicians to make their therapy decisions:

□  Incorporating all available patient specifics,

□  Referencing latest lab results and worldwide medical knowledge, and

□  In an interactive manner during their ward round?

Analyze Genomes: A Federated In-Memory Database Computing Platform

Dr. Schapranow, Intel Tech Talk at SAPPHIRE, May 17, 2017

Page 3: A Federated In-Memory Database Computing Platform Enabling Real-Time Analysis of Big Medical Data


Page 4: A Federated In-Memory Database Computing Platform Enabling Real-Time Analysis of Big Medical Data

Our Vision: Medical Board Incorporating Latest Medical Knowledge


Page 5: A Federated In-Memory Database Computing Platform Enabling Real-Time Analysis of Big Medical Data

Project Time Line

[Timeline figure spanning 2009 to 2017. Labels recoverable from the figure: IMDB Research, SAP HANA launched, Enterprise Software, Oncolyzer, Drug Response Analysis, Medical Knowledge Cockpit, Analyze Genomes Platform, SORMAS.]

Page 6: A Federated In-Memory Database Computing Platform Enabling Real-Time Analysis of Big Medical Data

The Challenge: Distributed Heterogeneous Data Sources

■  Human genome/biological data: 600GB per full genome, 15PB+ in databases of leading institutes

■  Prescription data: 1.5B records from 10,000 doctors and 10M patients (100 GB)

■  Clinical trials: currently more than 30k recruiting on ClinicalTrials.gov

■  Human proteome: 160M data points (2.4GB) per sample, >3TB raw proteome data in ProteomicsDB

■  PubMed database: >23M articles

■  Hospital information systems: often more than 50GB

■  Medical sensor data: scan of a single organ in 1s creates 10GB of raw data

■  Cancer patient records: >160k records at NCT

Page 7: A Federated In-Memory Database Computing Platform Enabling Real-Time Analysis of Big Medical Data

Software Requirements in Life Sciences

■  Requirements

□  Managed services

□  Reproducibility

□  Real-time data analysis

■  Restrictions

□  Data privacy

□  Data locality

□  Volume of big medical data


http://stevedempsen.blogspot.de/2013/08/agile-software-requirements-comic.html

Page 8: A Federated In-Memory Database Computing Platform Enabling Real-Time Analysis of Big Medical Data

Our Approach: AnalyzeGenomes.com In-Memory Computing Platform for Big Medical Data

In-Memory Database

Page 9: A Federated In-Memory Database Computing Platform Enabling Real-Time Analysis of Big Medical Data

Our Approach: AnalyzeGenomes.com In-Memory Computing Platform for Big Medical Data

[Architecture diagram. Labels recoverable from the figure:]

■  In-Memory Database

■  Combined and Linked Data: Genome Data; Cellular Pathways; Genome Metadata; Research Publications; Pipeline and Analysis Models; Drugs and Interactions

■  Indexed Sources

Page 10: A Federated In-Memory Database Computing Platform Enabling Real-Time Analysis of Big Medical Data

Our Approach: AnalyzeGenomes.com In-Memory Computing Platform for Big Medical Data

[Architecture diagram. Labels recoverable from the figure:]

■  In-Memory Database

■  Extensions for Life Sciences: Data Exchange, App Store; Access Control, Data Protection; Fair Use; Statistical Tools; Real-time Analysis; App-spanning User Profiles

■  Combined and Linked Data: Genome Data; Cellular Pathways; Genome Metadata; Research Publications; Pipeline and Analysis Models; Drugs and Interactions

■  Indexed Sources

Page 11: A Federated In-Memory Database Computing Platform Enabling Real-Time Analysis of Big Medical Data

Our Approach: AnalyzeGenomes.com In-Memory Computing Platform for Big Medical Data

[Architecture diagram. Labels recoverable from the figure:]

■  In-Memory Database

■  Extensions for Life Sciences: Data Exchange, App Store; Access Control, Data Protection; Fair Use; Statistical Tools; Real-time Analysis; App-spanning User Profiles

■  Combined and Linked Data: Genome Data; Cellular Pathways; Genome Metadata; Research Publications; Pipeline and Analysis Models; Drugs and Interactions

■  Apps: Drug Response Analysis; Pathway Topology Analysis; Medical Knowledge Cockpit; Oncolyzer; Clinical Trial Recruitment; Cohort Analysis; ...

■  Indexed Sources

Page 12: A Federated In-Memory Database Computing Platform Enabling Real-Time Analysis of Big Medical Data

Our Technology: In-Memory Database Technology

■  Combined column and row store

■  Map/Reduce

■  Single and multi-tenancy

■  Lightweight compression

■  Insert only for time travel

■  Real-time replication

■  Working on integers

■  SQL interface on columns and rows

■  Active/passive data store

■  Minimal projections

■  Group key

■  Reduction of software layers

■  Dynamic multi-threading

■  Bulk load of data

■  Object-relational mapping

■  Text retrieval and extraction engine

■  No aggregate tables

■  Data partitioning

■  Any attribute as index

■  No disk

■  On-the-fly extensibility

■  Analytics on historical data

■  Multi-core/parallelization
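The slide only names "Lightweight compression" and "Working on integers" without saying which scheme is meant. One common way column stores combine the two is dictionary encoding, so the following is a minimal sketch under that assumption, not a description of the platform's internals. The function names and the sample column are hypothetical.

```python
from typing import Sequence


def dictionary_encode(column: Sequence[str]) -> tuple[list[str], list[int]]:
    """Replace each value by its position in a sorted dictionary of distinct values."""
    dictionary = sorted(set(column))
    position = {value: code for code, value in enumerate(dictionary)}
    return dictionary, [position[value] for value in column]


def scan_equals(dictionary: list[str], encoded: list[int], value: str) -> list[int]:
    """Return the row indices where the column equals `value`.

    The lookup in the dictionary happens once; the scan itself compares integers only.
    """
    try:
        code = dictionary.index(value)
    except ValueError:
        return []  # value does not occur in the column at all
    return [row for row, c in enumerate(encoded) if c == code]


# Hypothetical sample column (illustrative values only).
genes = ["BRAF", "KRAS", "BRAF", "TP53", "KRAS", "BRAF"]
dictionary, encoded = dictionary_encode(genes)
rows = scan_equals(dictionary, encoded, "BRAF")
print(dictionary, encoded, rows)
```

Repeated strings are stored once in the dictionary, and the column itself shrinks to small integers, which is what makes the scan cheap.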

Page 13: A Federated In-Memory Database Computing Platform Enabling Real-Time Analysis of Big Medical Data

Scheduling and Execution of Genome Data Processing Pipelines

[Diagram labels: Webservice, Scheduler, Worker (shown twice), In-Memory Database with the tables Tasks and Subtasks.]

Tasks

ID    Pipeline    Params
12    BWA         xyz.fastq
13    Stanford    A_1.fastq
14    Bowtie      xyz.fastq

Subtasks

Task    ID    Job       Status    Params
12      97    Split     done      xyz.fastq
12      98    Import    todo      abc.vcf
12      98    Import    done      abc.vcf

1. Trigger task execution

2. Schedule subtasks

3. Execute subtasks

Page 14: A Federated In-Memory Database Computing Platform Enabling Real-Time Analysis of Big Medical Data

Managed Services provided by Federated In-Memory Database System (FIMDB)

[Architecture diagram. Labels recoverable from the figure:]

■  Cloud Service Provider (Shared Algorithms and Public Reference Data)

■  Hospital or Research Department (Sensitive/Patient Data)

■  Nodes i, j, k, m, n, ..., each running several Workers and an IMDB

■  Scheduler and Relay

■  VPN connection (UDP, TCP)

■  Shared File System (Pool), shown twice, and Shared File System (Global)

Page 15: A Federated In-Memory Database Computing Platform Enabling Real-Time Analysis of Big Medical Data

Genome Data Processing Pipelines: State of the Art

■  Not standardized

■  Not exchangeable

■  Concatenation of bash scripts reading from and writing to files

■  Requires IT expertise for

□  Setup,

□  Error handling, and

□  Efficient processing and parallelization

■  Objective: Model, configure, and execute pipelines without involving IT experts

bwa aln ref.fa sample.fastq | bwa samse ref.fa - sample.fastq | samtools view -Su - | samtools sort …

Page 16: A Federated In-Memory Database Computing Platform Enabling Real-Time Analysis of Big Medical Data

Genome Data Processing Pipelines: Standardized Modeling

■  Graphical modeling notation

■  Compliant with BPMN 2.0 extended by

□  Modular structure

□  Degree of parallelization

□  Parameters and variables

■  Model descriptions (XPDL) are stored in IMDB

■  Model instances are transformed into graph structure executed by our worker framework
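To make "transformed into graph structure executed by our worker framework" more concrete, here is a minimal sketch of a pipeline held as a dependency graph and grouped into waves of jobs that could run in parallel. It uses Python's standard graphlib module; the job names and the dictionary layout are hypothetical and do not reflect the platform's actual XPDL-derived structure.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each job maps to the set of jobs it depends on.
pipeline = {
    "split": set(),
    "align_part_1": {"split"},
    "align_part_2": {"split"},
    "merge": {"align_part_1", "align_part_2"},
    "import_results": {"merge"},
}


def execution_waves(graph):
    """Group jobs into waves; all jobs within one wave have no open dependencies."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()  # also raises CycleError if the graph is not a DAG
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)  # mark the whole wave as finished
    return waves


waves = execution_waves(pipeline)
for number, jobs in enumerate(waves, start=1):
    print(number, jobs)
```

In this toy graph the degree of parallelization is visible as the width of the second wave: both alignment parts become ready once the split has finished.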

Page 17: A Federated In-Memory Database Computing Platform Enabling Real-Time Analysis of Big Medical Data

Genome Data Processing Pipelines: XML Process Definition Language


Page 18: A Federated In-Memory Database Computing Platform Enabling Real-Time Analysis of Big Medical Data

Database Structure

[Schema figure showing the tables PIPELINES.MODELS and PIPELINES.PIPELINES]

Page 19: A Federated In-Memory Database Computing Platform Enabling Real-Time Analysis of Big Medical Data

Genome Data Processing Pipelines: Traditional vs. Optimized Approach

■  Results are imported into IMDB

■  Optimization reduced execution time by >50%

Page 20: A Federated In-Memory Database Computing Platform Enabling Real-Time Analysis of Big Medical Data

Reproducibility: Modeling of Data Analysis Pipelines

1.  Design time (researcher, process expert)

□  Definition of parameterized process model

□  Uses graphical editor and jobs from repository

2.  Configuration time (researcher, lab assistant)

□  Select model and specify parameters, e.g. aln opts

□  Results in model instance stored in repository

3.  Execution time (researcher)

□  Select model instance

□  Specify execution parameters, e.g. input files
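The three phases on this slide map naturally onto three small record types. The following is a sketch under that assumption; the class names, fields, and the option value are hypothetical and only illustrate how fixing parameters at configuration time keeps later runs reproducible.

```python
from dataclasses import dataclass


@dataclass
class ProcessModel:
    """Design time: a parameterized model assembled from repository jobs."""
    name: str
    jobs: tuple[str, ...]
    parameters: tuple[str, ...]  # names of parameters left open by the designer


@dataclass
class ModelInstance:
    """Configuration time: a model with every open parameter bound to a value."""
    model: ProcessModel
    values: dict[str, str]

    def __post_init__(self):
        missing = set(self.model.parameters) - set(self.values)
        if missing:
            raise ValueError(f"unbound parameters: {sorted(missing)}")


@dataclass
class Execution:
    """Execution time: a stored model instance plus run-specific input files."""
    instance: ModelInstance
    input_files: tuple[str, ...]

    def describe(self) -> str:
        opts = ", ".join(f"{k}={v}" for k, v in sorted(self.instance.values.items()))
        return f"{self.instance.model.name}({opts}) on {list(self.input_files)}"


# Illustrative values only.
model = ProcessModel("alignment", jobs=("aln", "samse", "sort"), parameters=("aln_opts",))
instance = ModelInstance(model, {"aln_opts": "-t 8"})
run = Execution(instance, ("sample.fastq",))
print(run.describe())
```

Two executions that share the same model instance differ only in their input files, which is the property the slide's separation of phases is aiming at.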

Page 21: A Federated In-Memory Database Computing Platform Enabling Real-Time Analysis of Big Medical Data

App Example: Medical Knowledge Cockpit for Patients and Clinicians

■  Query-oriented search interface

■  Seamless integration of patient specifics, e.g. from EMR

■  Parallel search in international knowledge bases, e.g. for biomarkers, literature, cellular pathways, and clinical trials

Page 22: A Federated In-Memory Database Computing Platform Enabling Real-Time Analysis of Big Medical Data

Medical Knowledge Cockpit for Patients and Clinicians: Pathway Topology Analysis

■  Today, search in pathways is limited to “is a certain element contained?”

■  Integrated >1.5k pathways from international sources, e.g. KEGG, HumanCyc, and WikiPathways, into HANA

■  Implemented graph-based topology exploration and ranking based on patient specifics

■  Enables interactive identification of possible dysfunctions affecting the course of a therapy before its start

[Figure: Unified access to multiple formerly disjoint data sources]

[Figure: Pathway analysis of genetic variants with graph engine]
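The slide does not describe how the topology exploration and ranking work internally. As one plausible reading, the sketch below treats each pathway as a directed graph, follows the edges downstream of a patient's variant genes, and ranks pathways by how many elements are affected. The pathways are toy data not taken from KEGG, HumanCyc, or WikiPathways, and the scoring rule and function names are assumptions.

```python
from collections import deque

# Toy pathways as directed graphs (gene -> downstream genes); illustrative only.
PATHWAYS = {
    "pathway_A": {"EGFR": ["KRAS"], "KRAS": ["BRAF"], "BRAF": ["MEK1"], "MEK1": []},
    "pathway_B": {"TP53": ["MDM2"], "MDM2": [], "BRAF": []},
}


def downstream(graph, start):
    """Breadth-first search: every node reachable from `start`, excluding `start`."""
    seen, queue = {start}, deque([start])
    while queue:
        for nxt in graph.get(queue.popleft(), []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen - {start}


def rank_pathways(pathways, patient_variants):
    """Score a pathway by its variant genes plus everything downstream of them."""
    scores = {}
    for name, graph in pathways.items():
        hits = [gene for gene in patient_variants if gene in graph]
        affected = set(hits)
        for gene in hits:
            affected |= downstream(graph, gene)
        scores[name] = len(affected)
    # Highest score first; ties broken by pathway name for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


ranking = rank_pathways(PATHWAYS, ["KRAS", "TP53"])
print(ranking)
```

Unlike a pure containment check, this kind of traversal also surfaces elements that are not mutated themselves but sit downstream of a variant.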

Page 23: A Federated In-Memory Database Computing Platform Enabling Real-Time Analysis of Big Medical Data

Medical Knowledge Cockpit for Patients and Clinicians: Publications

■  Interactively explore relevant publications, e.g. PDFs

■  Improved ease of exploration, e.g. by highlighted medical terms and relevant concepts

Page 24: A Federated In-Memory Database Computing Platform Enabling Real-Time Analysis of Big Medical Data

App Example: Real-time Assessment of Clinical Trial Candidates

■  Supports trial design and recruitment process through statistical data analysis

■  Real-time matching and clustering of patients and clinical trial inclusion/exclusion criteria

■  Reassessment of already screened or participating citizens to reduce recruitment costs

■  Integrates smoothly with the

Real-time assessment of clinical trial candidates
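Matching patients against inclusion and exclusion criteria can be sketched as a set of simple predicates. This covers only the matching part, not the clustering, and the record fields, criteria, and sample data are all hypothetical rather than taken from the app.

```python
from dataclasses import dataclass


@dataclass
class Patient:
    patient_id: str
    age: int
    diagnoses: set[str]
    biomarkers: set[str]


@dataclass
class Trial:
    trial_id: str
    min_age: int
    max_age: int
    required_diagnoses: set[str]   # inclusion: all must be present
    excluded_biomarkers: set[str]  # exclusion: none may be present


def is_eligible(patient: Patient, trial: Trial) -> bool:
    """A patient qualifies if every inclusion rule holds and no exclusion rule fires."""
    return (
        trial.min_age <= patient.age <= trial.max_age
        and trial.required_diagnoses <= patient.diagnoses
        and not (trial.excluded_biomarkers & patient.biomarkers)
    )


def match(patients, trials):
    """Map each trial id to the ids of all eligible patients."""
    return {
        t.trial_id: [p.patient_id for p in patients if is_eligible(p, t)]
        for t in trials
    }


# Illustrative records only.
patients = [
    Patient("P1", 54, {"melanoma"}, {"BRAF V600E"}),
    Patient("P2", 71, {"melanoma"}, set()),
    Patient("P3", 45, {"colorectal cancer"}, {"KRAS G12D"}),
]
trials = [
    Trial("T1", 18, 65, {"melanoma"}, {"KRAS G12D"}),
    Trial("T2", 18, 80, {"melanoma"}, {"BRAF V600E"}),
]
candidates = match(patients, trials)
print(candidates)
```

Because the check is a pure function of patient and trial records, rerunning it over already screened patients when a new trial opens is just another call to `match`.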


Page 25: A Federated In-Memory Database Computing Platform Enabling Real-Time Analysis of Big Medical Data

Where to find additional information?

■  Online: Visit we.analyzegenomes.com for latest research results, slides, videos, tools, and publications

■  Offline: High-Performance In-Memory Genome Data Analysis: In-Memory Data Management Research, Springer, ISBN: 978-3-319-03034-0, 2014

■  In Person: Visit us at the HPI booth 200!

■  Join us for Intel Tech Talks at SAPPHIRE booth 669!

□  May 17, 1:00pm: A Federated In-Memory Database Computing Platform Enabling Real-time Analysis of Big Medical Data

□  May 18, 3:00pm: In-Memory Apps For Precision Medicine

Page 26: A Federated In-Memory Database Computing Platform Enabling Real-Time Analysis of Big Medical Data

Keep in contact with us!


Dr. Matthieu-P. Schapranow

Program Manager E-Health & Life Sciences

Hasso Plattner Institute

August-Bebel-Str. 88, 14482 Potsdam, Germany

[email protected]

http://we.analyzegenomes.com/