TRANSCRIPT
Case Study for Clinical Relevancy: Asthma
Scott T. Weiss, M.D., M.S.
BRIGHAM AND WOMEN’S HOSPITAL
HARVARD MEDICAL SCHOOL
Professor of Medicine, Harvard Medical School
Director, Center for Genomic Medicine
Director, Program in Bioinformatics
Associate Director, Channing Laboratory
Brigham and Women’s Hospital
Boston, MA
Outline
• Context: focus on process and data
• Overview of Asthma DBP
• Smoking as an example of the data issues
• Predicting COPD in those with asthma
• Predicting asthma exacerbations
• Genetic prediction of asthma exacerbations: current status
• DNA collection
• Lessons Learned
• Conclusions
Context
• Channing Lab - extensive genetics & pharmacogenetics resources focused on airways diseases
• Faculty with clinical, epidemiology, genetic, and bioinformatics training and experience
• Multidisciplinary, collaborative research track record
• Good i2b2 driver: from bench to clinic
• Strong focus and direction for Cores
Broad Goals of Channing Program in Predictive Medicine
• Genetic variation → clinical practice
– Disease risk (asthma diagnosis)
– Natural history (exacerbations)
– Individual response to medication (pharmacogenetics)
• Develop predictive tests (genetic and nongenetic) in Channing populations
• Validate these tests in the Partners asthma cohort (PAC), at least as proof of concept
I2B2 Airways DBP: Overview
RPDR
Partners Clinical Services
Extract data from Airways Disease patients
Extract relevant quantitative and coded phenotypes
Extract important phenotypes from text: NLP
Predict clinical outcomes after adjustment for covariates
RPDR: Recruit, validate, genotype
Develop statistical models
Before we start
• Numerous important covariates
– e.g. age, tobacco, comorbidities, medications
• Adjust outcomes for covariates
• Some (e.g. age, gender, Dx, encounter) readily available
– Obtained through Core 4
• Others require substantial effort, e.g. medications, tobacco use, comorbid conditions
• Collaboration - NLP experts in Core 1
Phenotypes from text
• Extract specific data items
– Medication
– Smoking status
– Diagnoses (Co-morbidity)
• Extract findings to assist with case selection
• Extract findings to assist with clinical predictions
Smoking Status- Examples
HOSPITAL COURSE: ... It was recommended that she receive …We also added Lactinax, oral form of Lactobacillus acidophilus to attempt a repopulation of her gut.
SH: widow,lives alone,2 children,no tob/alcohol.
BRIEF RESUME OF HOSPITAL COURSE: 63 yo woman with COPD, 50 pack-yr tobacco (quit 3 wks ago), spinal stenosis, ...
SOCIAL HISTORY: Negative for tobacco, alcohol, and IV drug abuse.
SOCIAL HISTORY: The patient is a nonsmoker. No alcohol.
SOCIAL HISTORY: The patient is married with four grown daughters, uses tobacco, has wine with dinner.
SOCIAL HISTORY: The patient lives in rehab, married. Unclear smoking history from the admission note…
Labels shown on the slide: Smoker; Non-Smoker; Past Smoker; ???; Hard to pick (two examples)
Smoking - Text Processing
Manually classified
No. Cases: 2796
No. Classes: 5
No. Attributes: 50
Cases per class:
– Current smoker: 1010
– Past smoker: 952
– Never smoked: 427
– Denies smoking: 146
– Control cases: 261
Smoking Status
• Raw sample ~ 20,000 reports
• Feature extraction > 3000
• Feature selection 25 - 1000
• “Gold standard” sample cases ~ 2,800
• Correct classification rate 46 - 81% (compared to Gold Standard)
Preliminary results
Smoking Status
Data Set           Classification Method   Test Cases   No. Features   % Correctly Classified
One-gram           Naïve Bayes             Split 2/3    50             78.02
Bi-gram            Naïve Bayes             Split 2/3    25             70.73
Bi-gram            SVM                     Split 2/3    25             49.57
Stemmed one-gram   Naïve Bayes             CV 10x       231            80.46
Stemmed one-gram   Naïve Bayes             CV 10x       917            80.92
More …
Further rows whose column alignment is not recoverable (all Split 2/3; methods listed: Naïve Bayes, SVM, SVM; feature counts listed: 25, 25, 50): One-gram 79.70; Tri-gram 65.05; Tri-gram 44.63
Baseline performance
Increasing and combining features should improve performance
Preliminary results
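The comparison above pairs word n-gram features with Naïve Bayes and SVM classifiers. As a rough illustration only, and not the pipeline used in the study, a multinomial Naïve Bayes classifier over word n-grams might look like the sketch below. The training snippets, labels, class name, and function names are all hypothetical.

```python
import math
import re
from collections import Counter, defaultdict


def ngrams(text, n=1):
    """Lowercase the text, split it into word tokens, and return word n-grams."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self, n=1):
        self.n = n
        self.class_counts = Counter()               # documents per class
        self.feature_counts = defaultdict(Counter)  # n-gram counts per class
        self.vocab = set()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            self.class_counts[label] += 1
            for gram in ngrams(text, self.n):
                self.feature_counts[label][gram] += 1
                self.vocab.add(gram)
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_counts.items():
            counts = self.feature_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(doc_count / total_docs)  # log prior
            for gram in ngrams(text, self.n):
                if gram in self.vocab:  # ignore n-grams never seen in training
                    score += math.log((counts[gram] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical toy training data, loosely modelled on the note snippets above.
train_texts = [
    "The patient is a nonsmoker. No alcohol.",
    "Negative for tobacco, alcohol, and IV drug abuse.",
    "The patient uses tobacco, has wine with dinner.",
    "Smokes one pack per day.",
    "50 pack-yr tobacco, quit 3 wks ago.",
    "Former smoker, quit ten years ago.",
]
train_labels = ["non-smoker", "non-smoker", "smoker", "smoker",
                "past smoker", "past smoker"]

model = NaiveBayes(n=1).fit(train_texts, train_labels)
prediction = model.predict("She quit smoking years ago.")
```

Passing `n=2` or `n=3` to `NaiveBayes` switches the features to bi-grams or tri-grams; with only a handful of training notes most longer n-grams are unseen, which is one plausible reason sparse feature sets score lower.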
Data Mining Pipeline
“Raw” Patient Data
Data Extraction: Text Processing (word/pattern filters, stemming, lexicon matching, parsing, …)
“Smart Data”: Medications, Smoking status, Co-morbidity
Feature Analysis: Classification, Clustering, Statistical Analysis, …
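The text-processing stage lists word/pattern filters and lexicon matching. A minimal sketch of that idea for smoking status is given below, assuming a small hand-written lexicon of regular expressions. The patterns, the labels, and the function name `smoking_status` are hypothetical and far simpler than a production NLP system.

```python
import re

# Hypothetical lexicon: regular expressions mapped to a smoking-status label.
# Order matters: the first matching entry wins, so evidence of quitting is
# checked before negations, and negations before generic tobacco mentions.
SMOKING_LEXICON = [
    (re.compile(r"\b(quit|former smoker|ex-smoker)\b", re.I), "past smoker"),
    (re.compile(r"\b(no tob|non-?smoker|negative for tobacco|denies (smoking|tobacco))", re.I),
     "non-smoker"),
    (re.compile(r"\b(uses tobacco|smokes|current smoker|pack-yr|pack[- ]years?)\b", re.I),
     "smoker"),
]


def smoking_status(note):
    """Return the label of the first lexicon pattern found in the note, else 'unknown'."""
    for pattern, label in SMOKING_LEXICON:
        if pattern.search(note):
            return label
    return "unknown"


examples = [
    "SH: widow,lives alone,2 children,no tob/alcohol.",
    "63 yo woman with COPD, 50 pack-yr tobacco (quit 3 wks ago), spinal stenosis",
    "SOCIAL HISTORY: The patient is married with four grown daughters, uses tobacco.",
    "SOCIAL HISTORY: The patient lives in rehab, married. Unclear smoking history.",
]
statuses = [smoking_status(note) for note in examples]
```

Notes that match no pattern come back as `unknown`, which mirrors the "hard to pick" cases: a rule list like this is brittle, and that is part of the motivation for the statistical classifiers.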
Asthma Preceding COPD
• Significant overlap of asthma and COPD DX
• Common denominator = smoking
• Asthma is known to precede and predict the development of COPD independent of smoking
• Could we develop a multivariate clinical predictor that would predict which asthmatics would get COPD?
Study Design
Source: Partners Healthcare Research Patient Data Repository (RPDR).
RPDR: clinical data repository for researchers, covering MGH, BWH, etc.
Training: 9349 asthmatics (843 COPD, 8506 controls), first encounter 1988-1998.
Test: a future set of 992 asthmatics (46 COPD, 946 controls), first encounter 1999-2002.
Data Collection
Criteria: Patients observed for at least 5 years, at least 18 years old at the first encounter, and with race, sex, height, weight, and smoking available.
Comorbidities: International Classification of Diseases, 9th Revision (ICD-9) codes recorded as admission diagnosis or ER primary diagnosis (104 comorbidities).
COPD: ICD-9 code for “Chronic Bronchitis”, “Emphysema”, or “Chronic Airways Obstruction, not otherwise specified.”
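The inclusion criteria and the COPD definition above amount to a simple filter over patient records. The sketch below is one way to express them, not the study's actual code. The `Patient` record and its field names are hypothetical, and the numeric ICD-9 categories (491, 492, 496) are an assumption: the slide names the diagnoses but not the codes.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Assumed ICD-9 categories for the three diagnoses named on the slide:
# 491 chronic bronchitis, 492 emphysema, 496 chronic airway obstruction.
COPD_ICD9_PREFIXES = ("491", "492", "496")


@dataclass
class Patient:
    """Hypothetical record; field names are illustrative only."""
    age_at_first_encounter: int
    years_observed: float
    race: Optional[str] = None
    sex: Optional[str] = None
    height: Optional[float] = None
    weight: Optional[float] = None
    smoking: Optional[str] = None
    icd9_codes: Tuple[str, ...] = ()


def is_eligible(p: Patient) -> bool:
    """Observed >= 5 years, adult at first encounter, all required fields present."""
    required = (p.race, p.sex, p.height, p.weight, p.smoking)
    return (p.years_observed >= 5
            and p.age_at_first_encounter >= 18
            and all(value is not None for value in required))


def has_copd(p: Patient) -> bool:
    """True if any recorded ICD-9 code falls in an assumed COPD category."""
    return any(code.startswith(COPD_ICD9_PREFIXES) for code in p.icd9_codes)


case = Patient(45, 6.0, "white", "F", 165.0, 70.0, "past", ("493.90", "491.21"))
too_young = Patient(17, 10.0, "white", "M", 170.0, 65.0, "never", ("493.90",))
missing_smoking = Patient(50, 8.0, "black", "F", 160.0, 72.0, None, ("496",))
```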
Analysis
Model: A Bayesian network was generated from the training set of 9349 asthmatics (843 COPD, 8506 controls) first encountered between 1988 and 1998, using 104 comorbidities plus race, gender, age, and smoking.
Results: The risk of COPD is modulated by gender, race, and smoking history, and 14 comorbidities: viral and chlamydial infections, diabetes mellitus, volume depletion, acute myocardial infarction, intermediate coronary syndrome, cardiac dysrhythmias, heart failure, acute upper respiratory infections, acute bronchitis and bronchiolitis, pneumonia, early or threatened labor, normal delivery, shortness of breath, respiratory distress.
Network Model
Validation
Propagation: a Bayesian network can compute the probability distribution of any variable given an instance of some or all the other variables.
Test data: a future set of 992 asthmatics (46 COPD, 946 controls) first encounter from 1999-2002.
Prediction: for each patient, predict the probability of COPD given the other elements in the network (co-morbidities and demographics).
Validation: compare the predicted with the observed COPD status.
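Propagation can be illustrated on a toy network. The sketch below assumes a three-node chain (smoking → COPD → shortness of breath) with made-up conditional probabilities; it is not the network learned in the study. It computes P(COPD | evidence) by summing the joint distribution over every unobserved variable.

```python
from itertools import product

# Hypothetical conditional probability tables (illustrative numbers only).
P_SMOKER = 0.3
P_COPD_GIVEN_SMOKER = {True: 0.2, False: 0.05}
P_SOB_GIVEN_COPD = {True: 0.6, False: 0.1}  # SOB = shortness of breath


def joint(smoker, copd, sob):
    """Joint probability of one full assignment under the chain structure."""
    p = P_SMOKER if smoker else 1 - P_SMOKER
    p *= P_COPD_GIVEN_SMOKER[smoker] if copd else 1 - P_COPD_GIVEN_SMOKER[smoker]
    p *= P_SOB_GIVEN_COPD[copd] if sob else 1 - P_SOB_GIVEN_COPD[copd]
    return p


def prob_copd(evidence):
    """P(COPD = True | evidence); evidence maps 'smoker' and/or 'sob' to booleans."""
    numerator = denominator = 0.0
    for smoker, copd, sob in product([True, False], repeat=3):
        world = {"smoker": smoker, "sob": sob}
        if any(world[name] != value for name, value in evidence.items()):
            continue  # assignment contradicts the observed evidence
        p = joint(smoker, copd, sob)
        denominator += p
        if copd:
            numerator += p
    return numerator / denominator


prior = prob_copd({})                                   # no evidence
with_symptom = prob_copd({"sob": True})                 # smoking unobserved
with_both = prob_copd({"smoker": True, "sob": True})    # all evidence observed
```

Enumeration is exponential in the number of variables, so a network with 14 comorbidities plus demographics would use a dedicated inference algorithm; the point here is only that the same function handles "some or all" of the other variables being observed.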
Predictive Validation
One variable at a time
Asthma Exacerbations
• Asthma attacks involve worsening of asthma symptoms including bronchoconstriction and inflammatory response
• Major cause of morbidity and mortality in asthma
• 11.7 million Americans have an exacerbation every year (3.9 million children)
• In US children, exacerbations are the third leading cause of hospitalizations (198,000 occurrences per year)
• Cost of asthma exacerbations: US = $4 billion, Partners = $20 million
RPDR Exacerbation Prediction
[Figure: ROC curves, sensitivity vs. specificity (both axes 0 to 1), for any ER/hosp visits, >2 ER/hosp visits, and >3 ER/hosp visits; the value .67 is marked on the plot]
Genetic Prediction of Asthma Exacerbation
Objective: Predict asthma exacerbation from genetic data
Subjects: 290 CAMP participants
• Not on steroids
• Followed for 10+ years
• Have genetic data available
Phenotype
• Case: Reported overnight hospitalization(s) (n=83)
• Control: No overnight hospitalizations or ER visits (n=207)
Genotype: 2443 SNPs from 349 candidate genes
• In Hardy-Weinberg equilibrium among controls
• Minor allele frequency > 0.05
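The two SNP filters above can be sketched from genotype counts. This is a minimal illustration under stated assumptions: Hardy-Weinberg equilibrium is checked with a one-degree-of-freedom chi-square test, the significance cutoff `hwe_alpha` is hypothetical (the slide gives none), and both filters are applied to the same set of counts. The function names are illustrative.

```python
import math


def minor_allele_frequency(n_aa, n_ab, n_bb):
    """Frequency of the rarer allele, from counts of the AA, AB, and BB genotypes."""
    total_alleles = 2 * (n_aa + n_ab + n_bb)
    freq_a = (2 * n_aa + n_ab) / total_alleles
    return min(freq_a, 1 - freq_a)


def hwe_p_value(n_aa, n_ab, n_bb):
    """Chi-square (1 df) p-value for departure from Hardy-Weinberg proportions."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)
    q = 1 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    if min(expected) == 0:
        return 1.0  # monomorphic SNP: nothing to test
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of a chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(chi2 / 2))


def passes_qc(control_counts, maf_min=0.05, hwe_alpha=0.05):
    """Keep a SNP if MAF > maf_min and HWE is not rejected (hwe_alpha is assumed)."""
    return (minor_allele_frequency(*control_counts) > maf_min
            and hwe_p_value(*control_counts) >= hwe_alpha)


in_equilibrium = (81, 18, 1)   # matches p = 0.9, q = 0.1 exactly
no_heterozygotes = (50, 0, 50)  # strong departure from HWE
rare_variant = (98, 2, 0)       # MAF = 0.01
```

For small expected counts an exact test is generally preferred over the chi-square approximation; the approximation is used here only to keep the sketch short.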
Exacerbation Model
132 of 2443 SNPs in 55 of 349 genes predict exacerbation
Validation
Method: Prediction on fitted values
Result: Area under the ROC curve (AUROC) is 0.97
AUROC = 0.97
AUROC measures accuracy as trade-off between sensitivity and specificity
AUROC Rating
0.5 - 0.6 Fail
0.6 - 0.7 Poor
0.7 - 0.8 Fair
0.8 - 0.9 Good
0.9 - 1.0 Excellent
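AUROC equals the probability that a randomly chosen case receives a higher predicted score than a randomly chosen control. The sketch below computes it by direct pairwise comparison and maps the result onto the rating table above. The function names are illustrative, and two details are assumptions: each band includes its lower bound, and values below 0.5 (not covered by the table) are reported as "Fail".

```python
def auroc(scores, labels):
    """Fraction of (case, control) pairs ranked correctly; ties count as half."""
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for case_score in cases:
        for control_score in controls:
            if case_score > control_score:
                wins += 1.0
            elif case_score == control_score:
                wins += 0.5
    return wins / (len(cases) * len(controls))


def rating(auc):
    """Map an AUROC value to the rating bands of the table."""
    bands = [(0.9, "Excellent"), (0.8, "Good"), (0.7, "Fair"), (0.6, "Poor")]
    for lower_bound, name in bands:
        if auc >= lower_bound:
            return name
    return "Fail"  # 0.5 - 0.6 in the table; lower values are assumed to fail too


# Hypothetical predicted probabilities for two cases (1) and two controls (0).
example_auc = auroc([0.9, 0.4, 0.6, 0.2], [1, 1, 0, 0])
```

In the toy example three of the four case-control pairs are ordered correctly, so `example_auc` is 0.75.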
Cross-Validation
Method: 20-fold cross-validation to test robustness
1. Data are split into 20 groups
2. One group is held out as the independent test set and the remaining 19 are used to quantify (fit) the model
3. Step 2 is repeated until each group has served as the independent set
Result: AUROC is 0.84 (good)
AUROC = 0.84
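The three cross-validation steps can be sketched as an index generator. This is a generic k-fold split, assuming a random shuffle with a fixed seed; how the study actually assigned its 290 subjects to groups is not stated. The function name `k_fold_indices` is illustrative.

```python
import random


def k_fold_indices(n_samples, k=20, seed=0):
    """Yield (train_indices, test_indices) pairs, one per fold."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)          # assumed: random assignment
    folds = [indices[i::k] for i in range(k)]     # step 1: split into k groups
    for held_out, test_idx in enumerate(folds):   # steps 2-3: rotate the held-out group
        train_idx = [j for m, fold in enumerate(folds) if m != held_out for j in fold]
        yield train_idx, test_idx


splits = list(k_fold_indices(290, k=20))
```

Each subject lands in exactly one held-out group, so every prediction used for the cross-validated AUROC comes from a model that never saw that subject during fitting. This is why the cross-validated figure is lower, and more trustworthy, than the one obtained on fitted values.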
Partners Asthma DNA collection #1
• Recruit Partners asthma patients
• Partners Asthma Center, NWH, MGH
• High quality spirometric phenotyping
• Blood for DNA extraction and storage
• Children and adults
• High cost (>$1000/subject)
• Low intensity: in 6 months only 100 subjects recruited
• Doctors and patients need education
Partners Asthma DNA collection #2
• Recruit Partners asthma cohort patients
• Leverage CRIMSON blood samples
• Leverage data mart for phenotype data
• Blood for DNA extraction and storage
• Children and adults, cases and controls
• Low cost (<$30/subject)
• High intensity: in 9 months >3000 subjects recruited
Figure 1
Data Flow for Asthma DBP
Channing: sends ADMPN# to RPDR
RPDR: converts ADMPN# to MRN and sends it to pathology
Pathology (Crimson): MRN → Crimson ID# ≡ ADMPN; sends the sample back to Channing for DNA extraction
Figure 1 Legend: The deidentified data file is analyzed by Channing and subjects for DNA collection are selected. The file is sent to RPDR, converted back to MR#, and sent to Crimson. Samples are identified and given a Crimson ID# ≡ ADMPN, and the sample is sent back to Channing.
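The identifier hand-offs in Figure 1 can be mimicked with two lookups. The sketch below is schematic only: the function names, the dictionaries, and the ID values are hypothetical, and the real systems are certainly more involved. What it shows is that the MRN exists only between RPDR and Crimson, while Channing works with ADMPN-labelled data throughout.

```python
def rpdr_reidentify(selected_admpn, admpn_to_mrn):
    """RPDR step: convert the deidentified ADMPN numbers chosen by Channing to MRNs."""
    return {admpn: admpn_to_mrn[admpn]
            for admpn in selected_admpn if admpn in admpn_to_mrn}


def crimson_collect(mrn_by_admpn, sample_by_mrn):
    """Crimson step: locate samples by MRN and relabel them with the ADMPN-equivalent ID."""
    return [{"crimson_id": admpn, "sample": sample_by_mrn[mrn]}
            for admpn, mrn in mrn_by_admpn.items() if mrn in sample_by_mrn]


# Hypothetical identifiers for illustration.
admpn_to_mrn = {"A1": "M100", "A2": "M200", "A3": "M300"}  # held by RPDR only
sample_by_mrn = {"M100": "tube-7", "M300": "tube-9"}       # held by Crimson only
selected = ["A1", "A2"]                                    # chosen by Channing

returned_to_channing = crimson_collect(
    rpdr_reidentify(selected, admpn_to_mrn), sample_by_mrn)
```

Subject `A2` has no available sample in this toy data, so only `A1` comes back, and the returned record carries no MRN.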
[Figure: Recruitment for DBP from Crimson at BWH: Asthma Cases by Utilization and Race. X-axis: May, May-Jun, May-Jul, May-Aug, May-Sept, May-Oct, May-Nov, May-Dec, May-Jan, May-Feb, May-Mar. Series: Hi Cauc, Lo Cauc, Hi Af Am, Lo Af Am.]
[Figure: Recruitment for DBP from Crimson at BWH: Asthma Cases and Controls by Race. X-axis: May, May-Jun, May-Jul, May-Aug, May-Sept, May-Oct, May-Nov, May-Dec, May-Jan, May-Feb, May-Mar. Series: Cauc Asthma, Cauc Controls, Af Am Asthma, Af Am Controls.]
Summary of Samples to 04/07/08
High Caucasian: 59
Low Caucasian: 454
High African American: 111
Low African American: 222
Controls Caucasian: 1,341
Controls African American: 880
Running total:
Lessons learned 1
• Get what you ask for
• Regular meetings, regular meetings
• Negotiate your demands
• Tools are not enough
• Leverage your peers
• Recruiting patients is hard work
• IRB is hard work
Lessons learned 2
• You can never have enough statistics or bioinformatics
• Genotyping and its technologies are secondary
• The RPDR data are dirty!
• Listen to Shawn
• Be flexible
Summary: Airways disease as a driver for i2b2
• “Typical” complex disease challenge
• Big impact on health care system
• Potential for large clinical impact
• Core 1: Extracting phenotypes from free text; statistical models
• Core 2: Viewer for CRC
• Core 4: Data provisioning
Conclusions
• The stronger the existing program, the more successful the I2B2 collaboration
• Communication is key
• Fit the question to the data not the other way around
• Data access will be an issue for the future
Collaborators (and what they did)
• Scott, Zak, John, and Susanne: money, project management, IRB, and big picture
• Ross: Channing bioinformatics, file structures, geek to geek translation with the cores, beta testing, 850 collection, IRB, links to other genetic bioinformatics tools and projects
• Shawn and Vivian: asthma and control data mart
• Anne, LJ, James: nongenetic predictors in CAMP
• Marco and Blanca: nongenetic predictors in PAC
• Marco and Blanca: genetic predictors in CAMP
• Marco and Blanca: genetic predictors in PAC
• Lynn: Crimson
Acknowledgments:
Ross Lazarus
Blanca E. Himes
Marco F. Ramoni
Isaac Kohane
Shawn Murphy
Susanne Churchill
Anne Fuhlbrigge
LJ Wei
James Sigornivitch
Lynn Bry