Enabling Data Sharing in Biomedical Research: Integrating Data for analysis, Anonymization, and Sharing (iDASH). Aziz A. Boxwala, MD, PhD, Division of Biomedical Informatics, UCSD. 1U54GM095327. 10/25/2010


TRANSCRIPT

Page 1: Aziz A. Boxwala, MD, PhD Division of Biomedical Informatics UCSD 1U54GM095327   10/25/ 2010

Enabling Data Sharing in Biomedical Research

Integrating Data for analysis, Anonymization, and Sharing (iDASH)

Aziz A. Boxwala, MD, PhD, Division of Biomedical Informatics, UCSD

1U54GM095327 10/25/2010

Page 2:

Sharing Biomedical Data

– Today
  • Public repositories (mostly non-clinical)
  • Limited DUAs, public fear
  • Data ‘transmitted’ by FedEx
– Tomorrow
  • Annotated public databases
  • Certified trust network
  • Consented sharing and use

Sharing Computational Resources

– Today
  • Computer scientists looking for data, biomedical and behavioral scientists looking for analytics
  • Processed data not shared
  • Massive storage and high performance computing limited to a few institutions
– Tomorrow
  • Teams working to solve a problem (e.g., human genome project)
  • Processed anonymized data shared for verification and algorithmic improvement
  • Secure biomedical/behavioral cloud available to all

Page 3:

Challenges

• Data integration
• Maintenance of research subject’s privacy
• Respect for research subject’s autonomy
• Data analysis due to novel science
• Lack of infrastructure

Page 4:

Challenges

• Data integration
• Maintenance of research subject’s privacy
• Respect for research subject’s autonomy
• Data analysis due to novel science
• Lack of infrastructure

Page 5:

Integrating Data (from different biological levels)

[Figure. Labels: Genotype (genome), transcription, RNA (transcriptome), translation, Protein (proteome), Phenotype (clinical data); also biomarkers, labs, registries, population.]

Page 6:

Integrating Data (from different institutions)

[Figure: a researcher's request is routed to several institutional record systems.]

• Sites: UCSD (Epic), UC Irvine (Eclipsys), UC Davis (Epic), UCSF (GE), Community Partners, Remote Monitor DB
• Local identifiers shown: MRN 23212, MRN 43244, MRN 6554, MRN 4433, MRN 234512
• Request about individual I: ID matching function
• Request for data D: data matching function (map D onto data dictionaries)
• Authorization: researcher is authorized to get data D about I for reason R
• Result: return data D
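The request flow in this figure can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the iDASH implementation: the names `ID_MATCH`, `DATA_DICTIONARY`, `SITE_DB`, `AUTHORIZED`, and `request_data` are hypothetical, the field names and lab values are invented, and the pairing of MRNs with sites is made up for the example.

```python
# Hypothetical sketch of the request flow in the figure. All names and values
# are illustrative; which site holds which MRN is invented for this example.

# ID matching function: study-wide ID -> local MRN at each site.
ID_MATCH = {
    "I-001": {"UCSD": "23212", "UC Davis": "6554", "UCSF": "4433"},
}

# Data matching function: requested element D -> field name in each site's
# data dictionary.
DATA_DICTIONARY = {
    "UCSD": {"hemoglobin": "HGB"},
    "UC Davis": {"hemoglobin": "Hgb_lab"},
    "UCSF": {"hemoglobin": "LAB_HEMOGLOBIN"},
}

# Stand-in site databases keyed by (MRN, local field name).
SITE_DB = {
    "UCSD": {("23212", "HGB"): 13.1},
    "UC Davis": {("6554", "Hgb_lab"): 12.8},
    "UCSF": {("4433", "LAB_HEMOGLOBIN"): 13.4},
}

# "Researcher is authorized to get data D about I for reason R".
AUTHORIZED = {("dr_lee", "hemoglobin", "I-001", "medication safety")}


def request_data(researcher, element, individual, reason):
    """Return {site: value} for data element D about individual I."""
    if (researcher, element, individual, reason) not in AUTHORIZED:
        raise PermissionError("request is not covered by an authorization")
    results = {}
    for site, mrn in ID_MATCH.get(individual, {}).items():
        local_field = DATA_DICTIONARY.get(site, {}).get(element)
        if local_field is None:
            continue  # this site's dictionary has no mapping for D
        value = SITE_DB.get(site, {}).get((mrn, local_field))
        if value is not None:
            results[site] = value
    return results


print(request_data("dr_lee", "hemoglobin", "I-001", "medication safety"))
```

The authorization check runs before any identifier is resolved, so an unauthorized request never learns which sites hold records for the individual.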

Page 7:

[Figure: layered architecture diagram]

Layers: Application Layer; Search Engine Layer; Data Structures Layer; Computation/Query Layer; Data Layer

Components: Web Search Client; Registration Client; User Request Manager; NIF Search Coordinator; Results Display Manager; Application Logic; Resource Registry Manager; Information Resource Registry; Mediator Registry; Index Engine; Index Manager; Web Index; Data Index; W. Cat. Manager; Web Result Ranker; W. Result Postprocessor; Post Clustering Engine; Keyword Query Processor; Source Query Wrappers; Data Mediator; Data Integrator; Ontology Manager; OntoQuest; NIFSTD Ontology; NIF Literature (Textpresso); XML Source; Relational DB; RDF DB; Pathways DB; Web

Page 8:

Current Query Architecture

[Figure: query architecture diagram]

Components: Data Ingestion and Transformation; Ontology Ingestion and Transformation; Data Reader (three instances); OWL Reader; OBO Reader; RDFS Reader; Query Parser; Query Planner; Subquery Dispatcher; Execution Engine; Keyword Query Processor; Relational Query Processor; Tree Query Processor; Graph Query Processor; OntoQuest; Index Structures; Model-Partitioned Data Store/Service; Ontology Repository; Semantic & Assn. Catalogs; Result Ranking; Application-Level Post-processing

• How to store, index and query ontologies efficiently?
• Managing different forms of ontology
• Managing multiple inter-mapped ontologies
• How are data-ontology mappings specified?

Page 9:

Challenges

• Data integration
• Maintenance of research subject’s privacy
• Respect for research subject’s autonomy
• Data analysis due to novel science
• Lack of infrastructure

Page 10:

The HIPAA Identifiers

1. Names
2. All geographical subdivisions smaller than a State, except for the initial three digits of a zip code
3. Dates (except year) directly related to an individual, including birth date, admission date, discharge date, date of death and all ages over 89 and all elements of dates (including year) indicative of such age, except that such ages and elements may be aggregated into a single category of age 90 or older
4. Phone numbers
5. Fax numbers
6. Electronic mail addresses
7. Social Security numbers
8. Medical record numbers
9. Health plan beneficiary numbers
10. Account numbers
11. Certificate/license numbers
12. Vehicle identifiers and serial numbers, including license plate numbers
13. Device identifiers and serial numbers
14. Web Universal Resource Locators (URLs)
15. Internet Protocol (IP) address numbers
16. Biometric identifiers, including finger and voice prints
17. Full face photographic images and any comparable images
18. Any other unique identifying number, characteristic, or code
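Items 2 and 3 describe generalizations rather than outright removal, which a short Python sketch can make concrete. This is an illustration only, assuming a record with hypothetical keys `zip`, `birth_date`, and `admission_date`; the function name `generalize` is also an assumption. The regulation attaches a population condition to the three-digit zip rule that this sketch ignores.

```python
from datetime import date


def generalize(record: dict, today: date) -> dict:
    """Apply the zip, date, and age generalizations from items 2 and 3.

    Hypothetical helper: the record keys are assumptions for this sketch,
    and the population condition on three-digit zip codes is not checked.
    """
    birth = record["birth_date"]
    # Age in completed years as of `today`.
    had_birthday = (today.month, today.day) >= (birth.month, birth.day)
    age = today.year - birth.year - (0 if had_birthday else 1)
    over_89 = age > 89
    return {
        # Item 2: keep only the initial three digits of the zip code.
        "zip3": record["zip"][:3],
        # Item 3: keep only the year of a date, and drop even the year
        # when it would indicate an age over 89.
        "birth_year": None if over_89 else birth.year,
        "admission_year": record["admission_date"].year,
        # Item 3: ages over 89 are aggregated into one category.
        "age": "90+" if over_89 else age,
    }


example = {
    "zip": "92093",
    "birth_date": date(1915, 3, 2),
    "admission_date": date(2010, 6, 14),
}
print(generalize(example, today=date(2010, 10, 25)))
```

The remaining identifiers in the list (names, phone numbers, record numbers, and so on) have no generalized form and would simply be dropped.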

Page 11:

HIPAA data sets

• De-identified data set
  – Does not include 18 identifiers
• Limited data set
  – Can include the following identifiers:
    • Geographic data: town, city, State and zip code, but no street address.
    • Dates: A limited data set can include dates relating to an individual (e.g., birth date, admission and discharge date).
    • Other unique identifiers: A limited data set can include any unique identifying number, characteristic or code other than those specified in the list of 16 identifiers that are expressly disallowed
• Fully identified data set
  – All identifiers allowed
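The three tiers can be expressed as a simple classification over the kinds of fields a data set contains. The sketch below is a simplification under stated assumptions: the tag names, the two sets, and `classify` are hypothetical, and a real determination depends on the content of the data, not just on column labels.

```python
# Hypothetical tags for the 16 direct identifiers that a limited data set
# may not contain (the 18 identifiers minus dates and other unique codes,
# with geography narrowed to street address).
DIRECT_IDENTIFIERS = {
    "name", "street_address", "phone", "fax", "email", "ssn", "mrn",
    "health_plan_number", "account_number", "license_number",
    "vehicle_id", "device_id", "url", "ip_address", "biometric",
    "face_photo",
}

# Hypothetical tags for identifiers that a limited data set may contain
# but a de-identified data set may not.
LIMITED_ONLY = {"town", "city", "zip", "full_date", "other_unique_code"}


def classify(field_tags: set) -> str:
    """Classify a data set by the identifier tags of its fields."""
    if field_tags & DIRECT_IDENTIFIERS:
        return "fully identified"
    if field_tags & LIMITED_ONLY:
        return "limited"
    return "de-identified"


print(classify({"lab_value", "birth_year", "zip3"}))
print(classify({"lab_value", "full_date", "city"}))
print(classify({"lab_value", "full_date", "mrn"}))
```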

Page 12:

IRB concerns

Page 13:
Page 14:

Limiting results to counts

• No inherent privacy:

[Figure: two panels, "Original" and "Reconstructed"]
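One way to see why counts alone give no inherent privacy is a differencing attack: two exact counts that differ only by one person reveal that person's value. The slide's own reconstruction figure is not recoverable, so the following is a substitute illustration with invented toy records; `RECORDS`, `count`, and `is_target` are hypothetical names.

```python
# Toy records, invented for illustration.
RECORDS = [
    {"age": 34, "zip3": "920", "diagnosis": True},
    {"age": 71, "zip3": "921", "diagnosis": True},
    {"age": 52, "zip3": "920", "diagnosis": False},
    {"age": 71, "zip3": "920", "diagnosis": False},
]


def count(predicate):
    """The only interface offered to the analyst: an exact count."""
    return sum(1 for r in RECORDS if predicate(r))


def is_target(r):
    """Attributes the attacker already knows about one individual."""
    return r["age"] == 71 and r["zip3"] == "921"


# Two count queries that differ only in whether the target is included.
with_target = count(lambda r: r["diagnosis"])
without_target = count(lambda r: r["diagnosis"] and not is_target(r))

# The difference is the target's diagnosis, although no row was returned.
target_has_diagnosis = (with_target - without_target) == 1
print(target_has_diagnosis)
```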

Page 15:

Serving result counts

• Allows:
  – Cohort finding
  – Exploration
• Need:
  – Perturbation

[Figure: query Q → estimated count + noise → count returned]
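The perturbation step can be sketched as adding random noise to the count before returning it. The slide says only "noise"; the Laplace distribution, the `scale` parameter, and the rounding and clamping below are assumptions of this sketch, as is the function name `noisy_count`.

```python
import random


def noisy_count(true_count: int, scale: float = 2.0, rng=None) -> int:
    """Return the count plus Laplace noise, rounded and clamped at zero.

    Laplace noise is an assumption of this sketch. It is generated as the
    difference of two independent exponential draws with mean `scale`.
    """
    rng = rng or random.Random()
    noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    return max(0, round(true_count + noise))


rng = random.Random(0)
print([noisy_count(100, rng=rng) for _ in range(5)])
```

A larger `scale` hides individual contributions better but makes the returned counts less useful for cohort finding, so the value is a policy choice rather than a technical constant.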

Page 16:
Page 17:

Truly privacy preserving data

• Yields information about distribution independent of any individual data point
• How: Sampling from robust representation of joint probability distribution

[Figure: Original → (learn) → Robust distribution → (sample) → Privacy preserving]
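The learn-then-sample idea can be illustrated for categorical data. The sketch below estimates a joint distribution with additive smoothing over the full product of attribute domains and then draws synthetic rows from it. The smoothing choice, the names `learn_distribution` and `sample_rows`, and the toy rows are assumptions; this does not by itself provide a formal privacy guarantee and is not presented as the method used by the authors.

```python
import itertools
import random
from collections import Counter


def learn_distribution(rows, domains, alpha=1.0):
    """Estimate a smoothed joint distribution over all attribute combinations.

    `alpha` is an additive smoothing constant (an assumption of this sketch)
    that gives every cell nonzero probability, so no cell's presence in the
    output depends on a single original row.
    """
    counts = Counter(rows)
    cells = list(itertools.product(*domains))
    total = len(rows) + alpha * len(cells)
    return {cell: (counts[cell] + alpha) / total for cell in cells}


def sample_rows(dist, n, rng):
    """Draw n synthetic rows from the learned distribution."""
    cells = list(dist)
    weights = [dist[c] for c in cells]
    return rng.choices(cells, weights=weights, k=n)


# Toy original data, invented for illustration: (sex, smoker).
original = [("F", "no"), ("F", "no"), ("M", "yes"), ("M", "no"), ("F", "yes")]
domains = [("F", "M"), ("yes", "no")]

dist = learn_distribution(original, domains)
synthetic = sample_rows(dist, n=10, rng=random.Random(0))
print(synthetic)
```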

Page 18:

Source Anonymization

• Multiple participating data sources (PDSs) contribute data to a central processing unit (CPU)
  – Cryptographic anonymization cloud:

Page 19:

Challenges

• Data integration
• Maintenance of research subject’s privacy
• Respect for research subject’s autonomy
• Data analysis due to novel science
• Lack of infrastructure

Page 20:

Informed Consent

Page 21:

Informed Consent

• Biospecimen and data repositories are creating archives for future, possibly unforeseen types of research

• Does this create challenges in adhering to the autonomy (right to self-determination) principle of biomedical ethics?

• We want to enable subjects to have better control over their participation in research

• Different consents within the same repository will create a challenge for investigators in selecting subjects
  – Matching research aims to consented uses
  – Selection biases

Page 22:

Electronic Informed Consent Management

• Create an informed consent ontology that can represent various dimensions of subject’s consent for research
• Develop an electronic informed consent registry that documents the subjects’ consents
  – Enables subjects to update consent
• Create a mediator that can resolve an investigator’s request for samples, data, or subject participation against the consented uses
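The registry and mediator described above can be sketched as follows. The consent dimensions (`allowed_uses`, `allow_commercial`), the class names, and the `mediate` function are hypothetical stand-ins for whatever the consent ontology would actually define.

```python
from dataclasses import dataclass, field


@dataclass
class Consent:
    """One subject's consent; the dimensions here are illustrative only."""
    subject_id: str
    allowed_uses: set = field(default_factory=set)  # e.g. {"genetic"}
    allow_commercial: bool = False


class ConsentRegistry:
    """Documents consents and lets a subject replace an earlier consent."""

    def __init__(self):
        self._consents = {}

    def record(self, consent: Consent) -> None:
        # Recording again for the same subject acts as an update.
        self._consents[consent.subject_id] = consent

    def all(self):
        return list(self._consents.values())


def mediate(registry, research_area: str, commercial: bool = False):
    """Return IDs of subjects whose consent covers the investigator's request."""
    return sorted(
        c.subject_id
        for c in registry.all()
        if research_area in c.allowed_uses
        and (c.allow_commercial or not commercial)
    )


registry = ConsentRegistry()
registry.record(Consent("S1", {"cardiovascular", "genetic"}))
registry.record(Consent("S2", {"cardiovascular"}, allow_commercial=True))
print(mediate(registry, "genetic"))
print(mediate(registry, "cardiovascular", commercial=True))
```

Because the mediator filters on consent before any data is touched, the selection-bias concern on the previous slide shows up directly: the set of eligible subjects changes with each research aim.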

Page 23:

Challenges

• Data integration
• Maintenance of research subject’s privacy
• Respect for research subject’s autonomy
• Data analysis due to novel science
• Lack of infrastructure

Page 24:

Data Analysis Library

• Genome Data
  – Compression
  – Genome query language
• Pattern recognition
• Computing with streams
• Rare events

Page 25:

Challenges

• Data integration
• Maintenance of research subject’s privacy
• Respect for research subject’s autonomy
• Data analysis due to novel science
• Lack of infrastructure

Page 26:

Data Publishing and Computational Resources

• Mismatches
  – Data availability
  – Computational resources and expertise
• iDASH services
  – Data acquisition, annotation, storage, dissemination
  – Scientific workflow execution
  – Governance and policy framework for data access control
  – Accessible via web portal and API

Page 27:

Biomedical CyberInfrastructure Architecture

Rich Services developed by Ingolf Krueger and colleagues

Page 28:
Page 29:

Driving Biological Projects

• Kawasaki Disease Research
• Anticoagulant Medication Safety
• Remote Monitoring of Behavior

Page 30:

Kawasaki Disease (PI: Jane Burns)

• Aim 1: To sequence size-selected cDNA from whole blood from KD patients and age-similar children with acute adenovirus infection to identify miRNA abundance patterns and to relate these patterns to disease state and to KD clinical outcome

• Aim 2: To selectively sequence genomic DNA regions in the pathway genes of interest to identify rare genetic variants that may play a functional role in disease susceptibility and outcome

• Aim 3: To create a KD data warehouse and web-based data analysis system aimed at facilitating discoveries using clinical and molecular data

Page 31:

Anticoagulant Medication Monitoring (PI: Fred Resnic)

• Aim 1: To determine baseline expectations for bleeding events for prasugrel, dabigatran, clopidogrel, and warfarin in eligible patients

• Aim 2: To evaluate the usefulness of aggregating information from 3 healthcare centers in an automated risk-adjusted medication safety monitoring tool that alerts for unsafe use of medications in particular cohorts of patients

Page 32:

Monitoring Sedentary Behavior (PI: Greg Norman)

• Phase 1
  – physical activity behavior pattern recognition and feedback device and test for Device Limiting Failures (DLFs) with 12 adults for two-week cycles using a Phase I clinical trial approach.
• Phase 2
  – efficacy testing of the prototype with iterative improvement/retesting in 30 sedentary adults with outcomes of accelerometer-measured activity and sedentary time evaluated against controls for a 6-week intervention period.
• Phase 3
  – pilot randomized trial with 48 sedentary adults receiving either the intervention device or assessments only for a 3-month period evaluated with accelerometer-measured activity and sedentary time.

Page 33:

New science: new computational needs

• DBP1
  – Genetic data compression
  – Pattern recognition
  – Data integration from different biological levels
• DBP2
  – Data integration from different institutions
    • aggregated results from three medical centers that serve different types of patients (BWH, VA TN, UCSD)
  – Rare event detection
• DBP3
  – Pattern recognition from streaming data from personal monitoring
  – Integration of spatial, temporal, physiological, and behavioral data

Page 34:

iDASH Team

Page 35:

Thank you

[email protected]