data mining for genomic- phenomic correlations · 2010. 8. 23. · documentation. discrete core....

27
Data Mining for Genomic- Phenomic Correlations Joyce C. Niland, Ph.D. Associate Director & Chair, Information Sciences Rebecca Nelson, Ph.D. Lead, Data Mining Section City of Hope National Medical Center Duarte, California, USA

Upload: others

Post on 02-Jan-2021

8 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Data Mining for Genomic- Phenomic Correlations · 2010. 8. 23. · Documentation. Discrete Core. Data Elements from Patient Care. Abstraction of Core Phenomic ... Capture (EDC) Management

Data Mining for Genomic- Phenomic Correlations

Joyce C. Niland, Ph.D.Associate Director & Chair, Information Sciences

Rebecca Nelson, Ph.D.Lead, Data Mining Section

City of Hope National Medical CenterDuarte, California, USA

Page 2: Data Mining for Genomic- Phenomic Correlations · 2010. 8. 23. · Documentation. Discrete Core. Data Elements from Patient Care. Abstraction of Core Phenomic ... Capture (EDC) Management

City of Hope National Medical Center, Duarte, CaliforniaCity of Hope National Medical Center, Duarte, California

Page 3: Data Mining for Genomic- Phenomic Correlations · 2010. 8. 23. · Documentation. Discrete Core. Data Elements from Patient Care. Abstraction of Core Phenomic ... Capture (EDC) Management

City of Hope National Medical CenterCity of Hope National Medical Center

Founded in 1913Founded in 1913

StateState--ofof--thethe--art care to patients with cancer & other lifeart care to patients with cancer & other life--threatening diseases (e.g. diabetes)threatening diseases (e.g. diabetes)

Leading edge research into the causes, prevention, and Leading edge research into the causes, prevention, and cure of such diseasescure of such diseases

Promising new therapeutic agents being taken from Promising new therapeutic agents being taken from ““benchbench”” to to ““bedsidebedside”” (translational research)(translational research)Over 400 ongoing clinical trials, 1/3 initiated at City of HopeOver 400 ongoing clinical trials, 1/3 initiated at City of Hope

Human genome has expanded scope and objectivesHuman genome has expanded scope and objectivesof translational researchof translational research

Page 4: Data Mining for Genomic- Phenomic Correlations · 2010. 8. 23. · Documentation. Discrete Core. Data Elements from Patient Care. Abstraction of Core Phenomic ... Capture (EDC) Management

Improved diagnosis of diseaseImproved diagnosis of diseaseEarlier detection of genetic riskEarlier detection of genetic risk

Repairing defective genes with healthy onesRepairing defective genes with healthy ones

New drugs based on information about genesNew drugs based on information about genes

Page 5: Data Mining for Genomic- Phenomic Correlations · 2010. 8. 23. · Documentation. Discrete Core. Data Elements from Patient Care. Abstraction of Core Phenomic ... Capture (EDC) Management

Recent Examples of Genomic-Phenomic Data Mining Supported by Biostatistics

Predictors of Genetic Susceptibility to Heart Disease Post-Bone Marrow Transplant

Pathogenesis of Radiation-induced Breast Cancer

Gene Expression in Prostate Cancer Tumors

Prognostic Biomarkers for Stage I-III Renal Cell Carcinoma

Validation of Biomarkers for Tumor Initiating Cells in Brain Cancer

Expression of DNA Repair in Normal versusTumor Cell Genes

Page 6: Data Mining for Genomic- Phenomic Correlations · 2010. 8. 23. · Documentation. Discrete Core. Data Elements from Patient Care. Abstraction of Core Phenomic ... Capture (EDC) Management

The 3 (4?) I’s of Data Mining Process

Integrate Data Sources

Include Appropriate Samples

Identify (Infer?) Subjects Wishes

Data Warehouse

Cohort Assembly

Honest Broker

Page 7: Data Mining for Genomic- Phenomic Correlations · 2010. 8. 23. · Documentation. Discrete Core. Data Elements from Patient Care. Abstraction of Core Phenomic ... Capture (EDC) Management

Integrating Data Systems to Support Biomedical Research

Biomedical research is an increasingly complex collaborative undertaking

Requires integration of data, rules, processes, and vocabularies from many different source systems

Most information systems developed independentlyOperational systems, created to meet different functional and departmental needs of an institution

Page 8: Data Mining for Genomic- Phenomic Correlations · 2010. 8. 23. · Documentation. Discrete Core. Data Elements from Patient Care. Abstraction of Core Phenomic ... Capture (EDC) Management

Operational Systems Versus Data Warehousing

On-Line Transaction Processing (OLTP): Focuses on an organization’s day-to-day business needs (electronic medical record, financial systems, clinical trial management systems)

On-Line Analytic Processing (OLAP):Retrieves, analyzes, reports, and shares data from disparate systems, vendors & departments (DW)

8

Page 9: Data Mining for Genomic- Phenomic Correlations · 2010. 8. 23. · Documentation. Discrete Core. Data Elements from Patient Care. Abstraction of Core Phenomic ... Capture (EDC) Management

Data Warehousing Concept

9

Large, centralized, and longitudinal store of data to facilitate organization-wide consolidated reporting and analysis

o Multiple source databases

o Central coordination and management via metadata repository

o Multiple target “data marts” (aggregated datasets for efficient querying and analysis)

Page 10: Data Mining for Genomic- Phenomic Correlations · 2010. 8. 23. · Documentation. Discrete Core. Data Elements from Patient Care. Abstraction of Core Phenomic ... Capture (EDC) Management

Clinical Research

Basic Science

Eclypsis

ElectronicMedicalRecord(EMR)

Caregiver Documentation

Discrete CoreData Elements from Patient Care

Abstraction of Core Phenomic Dataon All COH Patients

MedidataElectronic DataCapture (EDC)

Management ofPatients onClinical Trials

Extract, Transform & Load (ETL)

COH DataCOH DataWarehouse:Warehouse:

Phenomicand

GenomicInformation

forReporting &Data Mining

MIDASObservationalData System

ETL

HospitalQA

Reporting

HypothesisGeneration &

Grant Proposals

Disease Cluster

Reporting &Analysis

Genotype –PhenotypeCorrelativeAnalyses

Data Validation &Quality Assurance (QA) M

ETADATA

LAYER

City of Hope (COH) Data Warehouse Overview

Patient Care

SourceData Systems

Sunquest

SurgicalInfo System

(SIS)Radiology

Info System(RIS)

SafeTraceTransfusion

Medicine

CoPath

Financials Trendstar

Financial Data

ETL

OLAP “Cubes”

OLTP OLAP

10

CancerRegistry

Labware

BiospecimenRepository

Solexa

GeneAnalysis

Microarray

HighThroughput

AnalysisSample Data

Genomic / Proteomic Results

ETL

ETL

ETL

ETL indicates in progress

Page 11: Data Mining for Genomic- Phenomic Correlations · 2010. 8. 23. · Documentation. Discrete Core. Data Elements from Patient Care. Abstraction of Core Phenomic ... Capture (EDC) Management

While protecting personally identifiable information and proprietary research data:

o Decision support to administrators

o Screening of patients for eligibility

o Measurement of quality of care & outcomes

o Query capabilities to investigators

o Data mining to generate new hypotheses, facilitate new discoveries

Utility of a Data Warehouse

11

Page 12: Data Mining for Genomic- Phenomic Correlations · 2010. 8. 23. · Documentation. Discrete Core. Data Elements from Patient Care. Abstraction of Core Phenomic ... Capture (EDC) Management

Technical & Business Metadata Directories

Data SourcesTechnical nameData type & lengthCreation, expiration datesSource systemData ‘steward’

MappingsRules for merging/filtering

Validation RulesMissing value fieldsData integrity, consistency

Transformation RulesDerivation of valuesData summaries

Data DefinitionField names, aliasesDescription of data meaning

Data DirectivesInstructions for data collectionGuidelines for data coding

QueriesSynonymsClassification coding

ReportsList of reports that use term

Security InformationAuthorization to access

*Database Administrator Perspective*Database Administrator Perspective **Database User Perspective**Database User Perspective

MMEETTAADDAATTAA

RREEPPOOSSIITTOORRYY

Technical Metadata:*Technical Metadata:* Business Metadata:**Business Metadata:**

Page 13: Data Mining for Genomic- Phenomic Correlations · 2010. 8. 23. · Documentation. Discrete Core. Data Elements from Patient Care. Abstraction of Core Phenomic ... Capture (EDC) Management

Cohort Assembly

Inclusion of subjects with appropriate phenomic characteristics ANDavailable tissue

>360,000 specimens logged in CoPathsystem, going back to 1955

Critical to integrate tissue sample data into the data warehouseBroken down into “Class of Case” to describe type of specimen

Page 14: Data Mining for Genomic- Phenomic Correlations · 2010. 8. 23. · Documentation. Discrete Core. Data Elements from Patient Care. Abstraction of Core Phenomic ... Capture (EDC) Management

CoPath Specimens by Year and Type

0

5000

10000

15000

20000

25000

30000

1955

1958

1960

1962

1964

1966

1968

1970

1972

1974

1976

1978

1980

1982

1984

1986

1988

1990

1992

1994

1996

1998

2000

2002

2004

2006

2008

Routine SurgicalOtherOutside ConsultsHematopathologyCytologyCard ClassBlood

Page 15: Data Mining for Genomic- Phenomic Correlations · 2010. 8. 23. · Documentation. Discrete Core. Data Elements from Patient Care. Abstraction of Core Phenomic ... Capture (EDC) Management

Integrate “Best” Source of Tissue Annotation Data

Standardized formatted path reports now available from pathologist

‘Synoptic Report’ includes:Path T StageNodes ExaminedNodes PositivePath N StagePath M StageMarginsHistologyGrade

Page 16: Data Mining for Genomic- Phenomic Correlations · 2010. 8. 23. · Documentation. Discrete Core. Data Elements from Patient Care. Abstraction of Core Phenomic ... Capture (EDC) Management

Concordance Between CoPath Synoptic Reports & Cancer Registry

0.99

0.820.91 0.88 0.84

0.00

0.10

0.20

0.30

0.40

0.50

0.60

0.70

0.80

0.90

1.00

Prostate Breast Colon andRectum

Kidney Lung

Page 17: Data Mining for Genomic- Phenomic Correlations · 2010. 8. 23. · Documentation. Discrete Core. Data Elements from Patient Care. Abstraction of Core Phenomic ... Capture (EDC) Management

Comparing Path Data Sources: Synoptic Report vs. CNExT

Synoptic Report - Pathology

Surgery specific

Only 22% of cancer cases

Non-strict reporting rules

No confirmation by treating MD

CNExT – Cancer Registry

Patient specific

100% of cancer cases

Strict reporting rules

Confirmed by treating MD

Page 18: Data Mining for Genomic- Phenomic Correlations · 2010. 8. 23. · Documentation. Discrete Core. Data Elements from Patient Care. Abstraction of Core Phenomic ... Capture (EDC) Management

Text Mining to Identify Cases

Text mining may be needed if neither data source specific enough

Example: Diagnosis of Unknown Primary Sites for Metastatic Tumors via 92-Gene RT-PCR

Needed to find tissues samples labeled as purely metastatic, with no link to original cancer site

Page 19: Data Mining for Genomic- Phenomic Correlations · 2010. 8. 23. · Documentation. Discrete Core. Data Elements from Patient Care. Abstraction of Core Phenomic ... Capture (EDC) Management

Enabling Investigators to Conduct Cohort Searches via i2b2

i2b2: Integrating Biology and the BedsideFacile user interface to allow investigators to search for cohorts on their own

Executes advanced queries against meta-database to identify available subjects, tissue samples, or biospecimens matching query criteria if feasible number of cases returned, then submit IRB protocol for approval

Biostatistics Division first needs to provide ‘Honest Broker’ service to eliminate any dissenting patients

Page 20: Data Mining for Genomic- Phenomic Correlations · 2010. 8. 23. · Documentation. Discrete Core. Data Elements from Patient Care. Abstraction of Core Phenomic ... Capture (EDC) Management

Definition of “Honest Broker”

Impartial party and process to determine whether patient’s wishes would be violated by:

Analyzing their data arising from standard care, orStudying their discard tissue samples

Moving to single “General Research Consent” for all patients going forward

However many different consents used for various studies in the past, must be considered

Page 21: Data Mining for Genomic- Phenomic Correlations · 2010. 8. 23. · Documentation. Discrete Core. Data Elements from Patient Care. Abstraction of Core Phenomic ... Capture (EDC) Management

Protocols Requiring “Honest Broker” Process

Total ProtocolsN=3,299

Protocols with Consent for Specific Interventions

N=2,314

Non-intervention StudiesN= 985

Consent RequiredN=365

Consent Not RequiredN=620

No Honest Broker Honest Broker Required

Page 22: Data Mining for Genomic- Phenomic Correlations · 2010. 8. 23. · Documentation. Discrete Core. Data Elements from Patient Care. Abstraction of Core Phenomic ... Capture (EDC) Management

Patient Consent Status

Consent Status N Percent

Consented 70,133 92.5

Dissented 5,637 7.4

Consent Withdrawn(Dissented) 25 < 1.0

Total 75,795

Page 23: Data Mining for Genomic- Phenomic Correlations · 2010. 8. 23. · Documentation. Discrete Core. Data Elements from Patient Care. Abstraction of Core Phenomic ... Capture (EDC) Management

Honest Broker Algorithmic Approach

Use computerized algorithms to evaluate: Consent TypeParticipation Type

Any “no” response to a consent/participationtype related to objectives of the study do not include in cohort

Note: Consents change over timeMay require different algorithms dependingon protocol version

Page 24: Data Mining for Genomic- Phenomic Correlations · 2010. 8. 23. · Documentation. Discrete Core. Data Elements from Patient Care. Abstraction of Core Phenomic ... Capture (EDC) Management

Participation Type Field Examplesagree to allow the collection and storage of samples from tissue removed during surgical resection of breast tissue (performed as part of my routine clinical care), and for the specimens to be linked to the clinical data obtained from my medical records for the purposes of cancer research

I agree to allow the collection of clinical data to be stored in a clinical research database. I understand that this clinical database will be updated with data obtained from my ongoing medical care at City of Hope

I agree to have my medical information and my blood stored for future research

I agree to have my specimens stored for future research purposes

I agree to have my specimens stored for future research purposes.

I agree to have my/my child's tissue or specimens stored for future research purposes.

I agree to have my/my child’s name and contact information released to an agency under contract with COH, one of which is Examined Management Services, Inc. (EMSI)

I agree to have tissue stored for future research

I agree to have tissue stored for future research.

I agree to participate in the focus group

I agree to provide a blood sample to be used as part of this study

I agree to provide a urine sample to be used as part of this study

I agree to the collection and storage of samples from tissue removed during biopsy or surgical resection of prostate cancer (performed as part of my routine clinical care) for the purposes of cancer research

I agree to the collection of blood samples during routine clinical care that will be stored in blood bank for the purposes of cancer research

I agree to udnergo research blood sample collection at the first 3 proposed time points as part of this study

I agree to undergo research blood sample collection at all proposed time points as part of this study

I do not agree to have my child’s name and contact information released to an agency under contract with COH, one of which is Examined Management Services, Inc. (EMSI)

I will allow additional needle biopsy specimens to be obtained during a routine and required diagnostic procedure for the purposes of cancer research

Page 25: Data Mining for Genomic- Phenomic Correlations · 2010. 8. 23. · Documentation. Discrete Core. Data Elements from Patient Care. Abstraction of Core Phenomic ... Capture (EDC) Management

Participation Type Responses

218 different participation types across all non-interventional studies

Multiple participation types within a given consent form require complex coding algorithms:

Example of participations within 1 consent form:

I allow my data to be collected for research=‘Yes’

I allow a routine sample to be used for research=‘Yes’

I allow an extra sample to be collected for research=‘No’

I allow my clinical and sample data to be linked=‘No’

Page 26: Data Mining for Genomic- Phenomic Correlations · 2010. 8. 23. · Documentation. Discrete Core. Data Elements from Patient Care. Abstraction of Core Phenomic ... Capture (EDC) Management

ConclusionsAdvances in personalized medicine will require genome-phenome correlations

Data warehousing is an optimal approach

Business metadata are critical for valid use of data

Requires complex computational algorithms to

Assemble appropriate cohort

Mine data, particularly non-standardized, text

Ensure valid linkages and data extraction

Protect patient privacy wishes

Page 27: Data Mining for Genomic- Phenomic Correlations · 2010. 8. 23. · Documentation. Discrete Core. Data Elements from Patient Care. Abstraction of Core Phenomic ... Capture (EDC) Management