analyzing structured and unstructured data in electronic
TRANSCRIPT
Data Science Workshop at Stockholm University: Jing Zhao, Aron Henriksson - December 4, 2014Analyzing Structured and Unstructured Data in Electronic Health Records
Analyzing Structured and Unstructured Data
in Electronic Health Records
Jing Zhao & Aron HenrikssonDepartment of Computer and Systems Sciences (DSV)
Stockholm University
Data Science Workshop at Stockholm University: Jing Zhao, Aron Henriksson - December 4, 2014Analyzing Structured and Unstructured Data in Electronic Health Records
• Administrative and clinical patient data collected systematically and routinely
• Capture and integrate data on all aspects of care over time
• Comprise various data types
• Sensitive data
Electronic Health Records
Data Science Workshop at Stockholm University: Jing Zhao, Aron Henriksson - December 4, 2014Analyzing Structured and Unstructured Data in Electronic Health Records
• Administrative and clinical patient data collected systematically and routinely
• Capture and integrate data on all aspects of care over time
• Comprise various data types
• Sensitive data
Electronic Health Records
• Health records from Karolinska University Hospital
‣ 5 years: 2006-2010
‣ A large variety of clinics in the Stockholm region
‣ 1M patients
‣ 10K diagnosis codes, 3M diagnoses
‣ 1.3K drugs, 2M prescriptions
‣ 700 clinical measurement types, 15M measurements
‣ 4M token types, 1.5B tokens
Data Science Workshop at Stockholm University: Jing Zhao, Aron Henriksson - December 4, 2014Analyzing Structured and Unstructured Data in Electronic Health Records
• Under-reporting (especially under-coding) still exists
• High dimensionality and sparsity
• Integrating various types of data
• Clinical notes noisy
Benefits & Challenges of Using EHRs
• Large amounts of longitudinal observations
• Holistic perspective of patients’ clinical conditions
• Systematically collected and archived data
• Clinical notes complementing structured information
Data Science Workshop at Stockholm University: Jing Zhao, Aron Henriksson - December 4, 2014Analyzing Structured and Unstructured Data in Electronic Health Records
Clinical Text - A Peculiar Genre• A flexible communication tool, documenting healthcare across
‣ space — between clinicians
‣ time — “memos”
• Written under serious time constraints
Data Science Workshop at Stockholm University: Jing Zhao, Aron Henriksson - December 4, 2014Analyzing Structured and Unstructured Data in Electronic Health Records
Clinical Text - A Peculiar Genre• A flexible communication tool, documenting healthcare across
‣ space — between clinicians
‣ time — “memos”
• Written under serious time constraintstelegraphic, ungrammatical sentences non-standard shorthand misspellings
synonyms multiword terms negation
Data Science Workshop at Stockholm University: Jing Zhao, Aron Henriksson - December 4, 2014Analyzing Structured and Unstructured Data in Electronic Health Records
Clinical Text - A Peculiar Genre
Types Tokens Type/Token Ratio
General Corpus
SUC 3.0 0.1M 1M 0.1
Medical Corpora
Vårdguiden 0.05M 2.8M 0.0160
Läkartidningen 0.4M 20M 0.0230
Stockholm EPR Corpus 3.9M 1582M 0.0024
Stockholm EPR Corpus: 72% verbless and 10% subjectless sentences!
11% in SUC 3.0 15% in Läkartidningen
0.4% in SUC 3.0 0.05% in Läkartidningen
Data Science Workshop at Stockholm University: Jing Zhao, Aron Henriksson - December 4, 2014Analyzing Structured and Unstructured Data in Electronic Health Records
Clinical Language Processing: Challenges
part-of-speech taggers named entity recognizers
syntactic parsers co-reference resolvers
general domain clinical domain
Data Science Workshop at Stockholm University: Jing Zhao, Aron Henriksson - December 4, 2014Analyzing Structured and Unstructured Data in Electronic Health Records
Clinical Language Processing: Challenges• Text data, modeled as a bag of words, is very high-dimensional and sparse
‣ 4M types 4M features?
‣ Exacerbated by lexical variability of concepts
• Alternatives to the bag-of-words approach?
patient patiant pat pathological fever pain cold …Ex1 3 1 0 2 1 0 2 …
Ex2 0 0 1 1 0 0 1 …
Ex3 1 0 0 0 3 0 0 …
synonymyhomonymy
patient pathological
Data Science Workshop at Stockholm University: Jing Zhao, Aron Henriksson - December 4, 2014Analyzing Structured and Unstructured Data in Electronic Health Records
Distributed Word Representations• Model lexical semantics based on co-occurrence information
‣ words that appear in similar contexts tend to have similar meanings (Zellig Harris, John R. Firth, Ludwig Wittgenstein, …)
• Vector representations in reduced-dimensional semantic space (d << vocabulary size)
itchy asthma chemotherapy …allergy 87 44 2 …
hypersensitivity 46 55 4 …
cancer 0 8 84 …
Data Science Workshop at Stockholm University: Jing Zhao, Aron Henriksson - December 4, 2014Analyzing Structured and Unstructured Data in Electronic Health Records
Random Indexing• Efficient and scalable algorithm for creating semantic spaces
‣ No dimensionality reduction applied to term-term matrix
‣ Incrementally obtains co-occurence information in pre-reduced semantic vectors
• Two types of vectors (of dimensionality d << vocabulary size):
‣ index vectors — used only in construction phase
‣ semantic vectors — word meaning representations; make up semantic space
• Each unique word wj assigned an index vector IVj and a semantic vector SVj
‣ IV: sparse, randomly assigned a small number of 1s and -1s
‣ SV: sum of IVs of the words that wj co-occurs with within a given window size
Data Science Workshop at Stockholm University: Jing Zhao, Aron Henriksson - December 4, 2014Analyzing Structured and Unstructured Data in Electronic Health Records
⟨0,0,0,1,0,0,0,0,0,-1,0,0,0,0,0,0,0,0,-1,0,1,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,-1,0,0,0,0,-1,0,0,0,0⟩SVtremor
patient experiences a tremor in right hand from time to time 2+2 context window
vector addition
IVexperiences⟨0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,-1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0⟩
IVa⟨0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,-1,0,0,0,0⟩
IVright
⟨0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,-1,0,0,0,0,0,0,0,0⟩ IVin
⟨0,0,0,0,0,0,0,0,0,-1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0⟩
Data Science Workshop at Stockholm University: Jing Zhao, Aron Henriksson - December 4, 2014Analyzing Structured and Unstructured Data in Electronic Health Records
1. Medical terminology development
‣ handling lexical variability by identifying synonyms
‣ distributionally similar words are candidate synonyms
2. Assignment of diagnosis codes
‣ semantic space comprising diagnosis codes and words
‣ discover distributional similarities between words used in notes and diagnosis codes assigned to them
‣ recommend codes to assign based on words used in clinical notes
3. Creating features for named entity recognition
‣ exploit large amounts of unlabeled text data to support learning with small (labeled) datasets
Applications of Distributional Semantics
Henriksson, A. Semantic Spaces of Clinical Text — Leveraging Distributional Semantics for Natural Language Processing of Electronic Health Records. Licentiate Thesis of Philosophy, Department of Computer and Systems Sciences (DSV), Stockholm University, 2013.
Data Science Workshop at Stockholm University: Jing Zhao, Aron Henriksson - December 4, 2014Analyzing Structured and Unstructured Data in Electronic Health Records
• Task: identify protected health information (PHI) in clinical text
• Proposed method presupposes the existence of two resources:
‣ an annotated (named entity) corpus
‣ a (large) unannotated corpus
100 records from 5 clinicsconsensus of 3 annotators
8 PHI classes:First Name, Last Name, Age,Health Care Unit, Location,
Full Date, Date Part,Phone Number
(1) (II)
Prototype Vectors for Named Entity Recognition
Henriksson, A., Dalianis, H., Kowalski, S. Generating Features for Named Entity Recognition by Learning Prototypes in Semantic Space: The Case of De-Identifying Health Records. In Proceedings of IEEE International Conference on Bioinformatics and Biomedicine (BIBM), November 2-5, 2014, Belfast, UK.
Data Science Workshop at Stockholm University: Jing Zhao, Aron Henriksson - December 4, 2014Analyzing Structured and Unstructured Data in Electronic Health Records
(1) (1I) (III)
Prototype Vectors for Named Entity Recognition
Henriksson, A., Dalianis, H., Kowalski, S. Generating Features for Named Entity Recognition by Learning Prototypes in Semantic Space: The Case of De-Identifying Health Records. In Proceedings of IEEE International Conference on Bioinformatics and Biomedicine (BIBM), November 2-5, 2014, Belfast, UK.
Data Science Workshop at Stockholm University: Jing Zhao, Aron Henriksson - December 4, 2014Analyzing Structured and Unstructured Data in Electronic Health Records
• Prototype vector: abstract representation of a target (named entity) class
• Centroid of annotated instances’ semantic vectors
‣ words annotated as belonging to a particular class
• Centroid defined as column-wise median values
‣ reduce impact of noise
d1 d2 d3 d4 d5 d6 d7 …John -14.5 12.0 0 110.1 -25.4 120.0 -2.2 …Anna -10.7 0 4.0 75.4 -5.5 12.0 0 …Peter 40.4 5.0 10.2 50.4 -5.0 42.5 -12.0 …
…Prototype -10.7 5.0 4.0 75.4 -5.5 42.0 -2.2
Prototype Vectors for Named Entity Recognition
Henriksson, A., Dalianis, H., Kowalski, S. Generating Features for Named Entity Recognition by Learning Prototypes in Semantic Space: The Case of De-Identifying Health Records. In Proceedings of IEEE International Conference on Bioinformatics and Biomedicine (BIBM), November 2-5, 2014, Belfast, UK.
Data Science Workshop at Stockholm University: Jing Zhao, Aron Henriksson - December 4, 2014Analyzing Structured and Unstructured Data in Electronic Health Records
Prototype Vectors for Named Entity Recognition
• Use prototype vectors to generate features for all instances in dataset
• Binary features: True if cosine similarity between prototype vector and semantic vector > threshold
• Threshold based on distances between labeled instances’ semantic vectors and their corresponding prototype vector
• Set to maximize Fβ-score on training data
‣ Positive examples: instances belonging to one named entity
‣ Negative examples: instances belonging to all other named entities
Henriksson, A., Dalianis, H., Kowalski, S. Generating Features for Named Entity Recognition by Learning Prototypes in Semantic Space: The Case of De-Identifying Health Records. In Proceedings of IEEE International Conference on Bioinformatics and Biomedicine (BIBM), November 2-5, 2014, Belfast, UK.
Data Science Workshop at Stockholm University: Jing Zhao, Aron Henriksson - December 4, 2014Analyzing Structured and Unstructured Data in Electronic Health Records
Prototype Vectors for Named Entity Recognition
• Comparison of two feature sets provided to the learning algorithm
‣ Baseline: set of standard orthographic and syntactic features
‣ Baseline features + distributional semantic features
• Significant improvements obtained with generated features
• Further improvements obtained when combining multiple semantic spaces created with different models
Context vectors
Order vectors
Direction vectors
Henriksson, A., Dalianis, H., Kowalski, S. Generating Features for Named Entity Recognition by Learning Prototypes in Semantic Space: The Case of De-Identifying Health Records. In Proceedings of IEEE International Conference on Bioinformatics and Biomedicine (BIBM), November 2-5, 2014, Belfast, UK.
Data Science Workshop at Stockholm University: Jing Zhao, Aron Henriksson - December 4, 2014Analyzing Structured and Unstructured Data in Electronic Health Records
• Traditional data sources
‣ Pre-marketing surveillance✦ Clinical trials
‣ Post-marketing surveillance✦ Spontaneous reports
• Emerging alternatives for post-marketing surveillance
‣ Electronic health records
‣ Social media
Analyzing Structured Data for Pharmacovigilance
Limited sample size and observation period
Largely under-reportedReliability and complianceNo exposure information
High recording ratePatient medical historyExposure information recorded
Data Science Workshop at Stockholm University: Jing Zhao, Aron Henriksson - December 4, 2014Analyzing Structured and Unstructured Data in Electronic Health Records
Using EHRs is not Unproblematic!
ADE Diag1 Drug1 CM1 Diag2 Drug2 CM2 …
P1 yes 1 0 NA 0 1 5.6 …
P2 yes NA 0 12 1 0 ? …
P3 no 0 ? 30 1 NA NA …
Data Science Workshop at Stockholm University: Jing Zhao, Aron Henriksson - December 4, 2014Analyzing Structured and Unstructured Data in Electronic Health Records
ADE Diag1 Drug1 CM1 Diag2 Drug2 CM2 …
P1 yes 1 0 NA 0 1 5.6 …
P2 yes NA 0 12 1 0 ? …
P3 no 0 ? 30 1 NA NA …
Data Science Workshop at Stockholm University: Jing Zhao, Aron Henriksson - December 4, 2014Analyzing Structured and Unstructured Data in Electronic Health Records
ADE Diag1 Drug1 CM1 Diag2 Drug2 CM2 …
P1 yes 1 0 NA 0 1 5.6 …
P2 yes NA 0 12 1 0 ? …
P3 no 0 ? 30 1 NA NA …
F25.1
C10AA01
F25.2
Data Science Workshop at Stockholm University: Jing Zhao, Aron Henriksson - December 4, 2014Analyzing Structured and Unstructured Data in Electronic Health Records
ADE Diag1 Drug1 CM1 Diag2 Drug2 CM2 …
P1 yes 1 0 NA 0 1 5.6 …
P2 yes NA 0 12 1 0 ? …
P3 no 0 ? 30 1 NA NA …
Drug1 was prescribed to P3 more than once
CM2 was taken P2 more than once
F25.1
C10AA01
F25.2
Data Science Workshop at Stockholm University: Jing Zhao, Aron Henriksson - December 4, 2014Analyzing Structured and Unstructured Data in Electronic Health Records
• Adapting disproportionality methods for detecting ADE signals
• Using machine learning algorithms to detect missing ADE codes in patients’ medical history
ADE Detection Using Structured EHRs
Data Science Workshop at Stockholm University: Jing Zhao, Aron Henriksson - December 4, 2014Analyzing Structured and Unstructured Data in Electronic Health Records
Adapting Disproportionality Methods to EHR Data
event other events
drug a b
other drugs c d
Disproportionality methods:PRR, ROR, BCPNN, GPS
Event level: count drug-event pairs
Patient level: count distinct patients who experienced the same drug-event pair
Disproportionality methods are widely applied for drug safety signal detection. They are designed for analyzing data from Spontaneous Reports (SRs). Electronic Patient Records (EPRs), in which comprehensive information of each patient is covered, are complementary information sources for detecting Adverse Drug Events.
Disproportionality methods
Data sourceStockholm EPR Corpus 700,000 patients 9,832 unique diagnosis codes 1,312 unique drugs
Applying Methods for Signal Detectionin Spontaneous Reports to Electronic Patient RecordsJing Zhao, Isak Karlsson, Lars Asker, Henrik BoströmDept. of Computer and Systems Sciences, Stockholm University, Sweden{ jingzhao, isak-kar, asker, henrik.bostrom }@dsv.su.se
SRs EPRs
Top 50 Top 100 Top 200O P M A P O P M A P O P M A P
PRR (EL) 0.12 0.25 0.14 0.20 0.12 0.17PRR (PL) 0.14 0.16 0.08 0.15 0.08 0.12
ROR (EL) 0.14 0.26 0.15 0.21 0.12 0.19ROR (PL) 0.16 0.18 0.10 0.17 0.09 0.13
BCPNN (EL) 0.20 0.26 0.15 0.25 0.12 0.20BCPNN (PL) 0.14 0.13 0.11 0.14 0.09 0.12
GPS (EL) 0.20 0.27 0.15 0.24 0.12 0.20GPS (PL) 0.04 0.17 0.10 0.10 0.08 0.10
How the disproportionality methods are applied, event-level or patient-level, has a higher impact on the performance than which disproportionality method is employed.
OP: Overall Precision
MAP: Mean Average Precision
report 1
report 2
report 3
patient 1
patient 2
patient 3
patient 4
drugevent
a bc d
• PRR = aa + b / c
c + d
• ROR = ab / c
d
•
•
BCPNN
GPS
Event y Other eventsDrug x
Other drugs
} Bayesian approaches
Two counting strategies
Event-Level (EL): count drug-event pairs
Patient-Level (PL): count distinct patients who experienced the same drug- event pair
Zhao, J., Karlsson, I., Asker, L. and Boström, H. Applying Methods for Signal Detection in Spontaneous Reports to Electronic Patient Records. In 19th Knowledge Discovery and Data Mining (KDD) Conference’s Workshop on Data Mining for Healthcare (DMH), August 11-14, 2013, Chicago, USA.
Data Science Workshop at Stockholm University: Jing Zhao, Aron Henriksson - December 4, 2014Analyzing Structured and Unstructured Data in Electronic Health Records
Adapting Disproportionality Methods to EHR Data
0.00#
0.05#
0.10#
0.15#
0.20#
0.25#
PRR# ROR# BCPNN# GPS#
OP#(EL)#
OP#(PL)#
MAP#(EL)#
MAP#(PL)#
Zhao, J., Karlsson, I., Asker, L. and Boström, H. Applying Methods for Signal Detection in Spontaneous Reports to Electronic Patient Records. In 19th Knowledge Discovery and Data Mining (KDD) Conference’s Workshop on Data Mining for Healthcare (DMH), August 11-14, 2013, Chicago, USA.
Data Science Workshop at Stockholm University: Jing Zhao, Aron Henriksson - December 4, 2014Analyzing Structured and Unstructured Data in Electronic Health Records
Using Machine Learning Algorithms to Detect Missing ADE Codes in Patients’ Medical History
Headache!Drug induced headache!
Data Science Workshop at Stockholm University: Jing Zhao, Aron Henriksson - December 4, 2014Analyzing Structured and Unstructured Data in Electronic Health Records
Two Studies: Experimental Setup
Data Science Workshop at Stockholm University: Jing Zhao, Aron Henriksson - December 4, 2014Analyzing Structured and Unstructured Data in Electronic Health Records
Two Studies: Experimental Setup
Data Science Workshop at Stockholm University: Jing Zhao, Aron Henriksson - December 4, 2014Analyzing Structured and Unstructured Data in Electronic Health Records
Exploiting Concept Hierarchies of Clinical Codes
C: Cardiovascular system
C10: Lipid modifying agents
C10A: Lipid modifying agents, plain
C10AA: HMG CoA reductase inhibitors
C10AA01: Simvastatin
F: Mental and behavioural disorders
F2: Schizophrenia, schizotypal and delusional disorders
F25: Schizoaffective disorders
F25.1: Schizoaffective disorder, depressive type
ATC
C
C10 F
C10A F2
C10AA F25
C10AA01 F25.1
Bottom-Up Level-WiseFeature Addition
ICD Top-Down Level-WiseFeature Addition
TD L1
TD L2
TD L3
TD L4
TD L5
BU L1
BU L2
BU L3
BU L4
BU L5
Zhao, J., Henriksson, A. and Boström, H. Detecting Adverse Drug Events Using Concept Hierarchies of Clinical Codes. In Proceedings of IEEE International Conference on Healthcare Informatics (ICHI), September 15-17, 2014, Verona, Italy.
Data Science Workshop at Stockholm University: Jing Zhao, Aron Henriksson - December 4, 2014Analyzing Structured and Unstructured Data in Electronic Health Records
54
32
1Bottom−Up vs Top−Down
Levels in Feature Set
Mea
n R
ank
1 Level 2 Levels 3 Levels 4 Levels All Levels
Bottom−Up (accuracy)Bottom−Up (AUC)Top−Down (accuracy)Top−Down (AUC)
Exploiting Concept Hierarchies of Clinical Codes
Zhao, J., Henriksson, A. and Boström, H. Detecting Adverse Drug Events Using Concept Hierarchies of Clinical Codes. In Proceedings of IEEE International Conference on Healthcare Informatics (ICHI), September 15-17, 2014, Verona, Italy.
Data Science Workshop at Stockholm University: Jing Zhao, Aron Henriksson - December 4, 2014Analyzing Structured and Unstructured Data in Electronic Health Records
Multiple Representations of Clinical Measurements
Majority vote Baseline
Mean Average of repeated measurements
Standard deviation SD of repeated measurements
Slope Difference between first and last measurement over time span
Existence Whether a measurement exists
Count How many times a measurement was repeated
Combined Combination of above
Zhao, J., Henriksson, A., Asker, L., Boström, H. Detecting Adverse Drug Events with Multiple Representations of Clinical Measurements. In Proceedings of IEEE International Conference on Bioinformatics and Biomedicine (BIBM), November 2-5, 2014, Belfast, UK.
Data Science Workshop at Stockholm University: Jing Zhao, Aron Henriksson - December 4, 2014Analyzing Structured and Unstructured Data in Electronic Health Records
Accuracy AUCMajority vote 79.33 0.5
Mean 81.97 0.67Standard deviation 81.41 0.66
Slope 80.72 0.63Existence 82.30 0.68
Count 82.67 0.70Combined 82.92 0.72
Multiple Representations of Clinical Measurements
Zhao, J., Henriksson, A., Asker, L., Boström, H. Detecting Adverse Drug Events with Multiple Representations of Clinical Measurements. In Proceedings of IEEE International Conference on Bioinformatics and Biomedicine (BIBM), November 2-5, 2014, Belfast, UK.
Data Science Workshop at Stockholm University: Jing Zhao, Aron Henriksson - December 4, 2014Analyzing Structured and Unstructured Data in Electronic Health Records
7678
8082
84
Performance of Multiple Classifiers with Different Representations
1:8
Accu
racy
(%)
OneR J48 JRip SMO IBk Ada Bag RF
Mean SD Slope Existence Count Combined0.
500.
600.
70
Classifier
AUC
OneR J48 JRip SMO Log IBk Ada Bag NB RF
Mean SD Slope Existence Count Combined
Multiple Representations of Clinical Measurements
Zhao, J., Henriksson, A., Asker, L., Boström, H. Detecting Adverse Drug Events with Multiple Representations of Clinical Measurements. In Proceedings of IEEE International Conference on Bioinformatics and Biomedicine (BIBM), November 2-5, 2014, Belfast, UK.
Data Science Workshop at Stockholm University: Jing Zhao, Aron Henriksson - December 4, 2014Analyzing Structured and Unstructured Data in Electronic Health Records
Combining Structured and Unstructured EHRs
Data Science Workshop at Stockholm University: Jing Zhao, Aron Henriksson - December 4, 2014Analyzing Structured and Unstructured Data in Electronic Health Records
Thanks for your attention!A short demo
aDEX & aDET