analyzing structured and unstructured data in electronic

Data Science Workshop at Stockholm University: Jing Zhao, Aron Henriksson - December 4, 2014Analyzing Structured and Unstructured Data in Electronic Health Records

Analyzing Structured and Unstructured Data

in Electronic Health Records

Jing Zhao & Aron HenrikssonDepartment of Computer and Systems Sciences (DSV)

Stockholm University


• Administrative and clinical patient data collected systematically and routinely

• Capture and integrate data on all aspects of care over time

• Comprise various data types

• Sensitive data

Electronic Health Records


• Administrative and clinical patient data collected systematically and routinely

• Capture and integrate data on all aspects of care over time

• Comprise various data types

• Sensitive data

Electronic Health Records

• Health records from Karolinska University Hospital

‣ 5 years: 2006-2010

‣ A large variety of clinics in the Stockholm region

‣ 1M patients

‣ 10K diagnosis codes, 3M diagnoses

‣ 1.3K drugs, 2M prescriptions

‣ 700 clinical measurement types, 15M measurements

‣ 4M token types, 1.5B tokens


• Under-reporting (especially under-coding) still exists

• High dimensionality and sparsity

• Integrating various types of data

• Clinical notes noisy

Benefits & Challenges of Using EHRs

• Large amounts of longitudinal observations

• Holistic perspective of patients’ clinical conditions

• Systematically collected and archived data

• Clinical notes complementing structured information


Clinical Text - A Peculiar Genre• A flexible communication tool, documenting healthcare across

‣ space — between clinicians

‣ time — “memos”

• Written under serious time constraints


Clinical Text - A Peculiar Genre• A flexible communication tool, documenting healthcare across

‣ space — between clinicians

‣ time — “memos”

• Written under serious time constraintstelegraphic, ungrammatical sentences non-standard shorthand misspellings

synonyms multiword terms negation


Clinical Text - A Peculiar Genre

Types Tokens Type/Token Ratio

General Corpus

SUC 3.0 0.1M 1M 0.1

Medical Corpora

Vårdguiden 0.05M 2.8M 0.0160

Läkartidningen 0.4M 20M 0.0230

Stockholm EPR Corpus 3.9M 1582M 0.0024

Stockholm EPR Corpus: 72% verbless and 10% subjectless sentences!

11% in SUC 3.0 15% in Läkartidningen

0.4% in SUC 3.0 0.05% in Läkartidningen


Clinical Language Processing: Challenges

part-of-speech taggers named entity recognizers

syntactic parsers co-reference resolvers

general domain clinical domain


Clinical Language Processing: Challenges• Text data, modeled as a bag of words, is very high-dimensional and sparse

‣ 4M types 4M features?

‣ Exacerbated by lexical variability of concepts

• Alternatives to the bag-of-words approach?

patient patiant pat pathological fever pain cold …Ex1 3 1 0 2 1 0 2 …

Ex2 0 0 1 1 0 0 1 …

Ex3 1 0 0 0 3 0 0 …

synonymyhomonymy

patient pathological


Distributed Word Representations• Model lexical semantics based on co-occurrence information

‣ words that appear in similar contexts tend to have similar meanings (Zellig Harris, John R. Firth, Ludwig Wittgenstein, …)

• Vector representations in reduced-dimensional semantic space (d << vocabulary size)

itchy asthma chemotherapy …allergy 87 44 2 …

hypersensitivity 46 55 4 …

cancer 0 8 84 …


Random Indexing• Efficient and scalable algorithm for creating semantic spaces

‣ No dimensionality reduction applied to term-term matrix

‣ Incrementally obtains co-occurence information in pre-reduced semantic vectors

• Two types of vectors (of dimensionality d << vocabulary size):

‣ index vectors — used only in construction phase

‣ semantic vectors — word meaning representations; make up semantic space

• Each unique word wj assigned an index vector IVj and a semantic vector SVj

‣ IV: sparse, randomly assigned a small number of 1s and -1s

‣ SV: sum of IVs of the words that wj co-occurs with within a given window size


⟨0,0,0,1,0,0,0,0,0,-1,0,0,0,0,0,0,0,0,-1,0,1,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,-1,0,0,0,0,-1,0,0,0,0⟩SVtremor

patient experiences a tremor in right hand from time to time 2+2 context window

vector addition

IVexperiences⟨0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,-1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0⟩

IVa⟨0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,-1,0,0,0,0⟩

IVright

⟨0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,-1,0,0,0,0,0,0,0,0⟩ IVin

⟨0,0,0,0,0,0,0,0,0,-1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0⟩


1. Medical terminology development

‣ handling lexical variability by identifying synonyms

‣ distributionally similar words are candidate synonyms

2. Assignment of diagnosis codes

‣ semantic space comprising diagnosis codes and words

‣ discover distributional similarities between words used in notes and diagnosis codes assigned to them

‣ recommend codes to assign based on words used in clinical notes

3. Creating features for named entity recognition

‣ exploit large amounts of unlabeled text data to support learning with small (labeled) datasets

Applications of Distributional Semantics

Henriksson, A. Semantic Spaces of Clinical Text — Leveraging Distributional Semantics for Natural Language Processing of Electronic Health Records. Licentiate Thesis of Philosophy, Department of Computer and Systems Sciences (DSV), Stockholm University, 2013.


• Task: identify protected health information (PHI) in clinical text

• Proposed method presupposes the existence of two resources:

‣ an annotated (named entity) corpus

‣ a (large) unannotated corpus

100 records from 5 clinicsconsensus of 3 annotators

8 PHI classes:First Name, Last Name, Age,Health Care Unit, Location,

Full Date, Date Part,Phone Number

(1) (II)

Prototype Vectors for Named Entity Recognition

Henriksson, A., Dalianis, H., Kowalski, S. Generating Features for Named Entity Recognition by Learning Prototypes in Semantic Space: The Case of De-Identifying Health Records. In Proceedings of IEEE International Conference on Bioinformatics and Biomedicine (BIBM), November 2-5, 2014, Belfast, UK.


(1) (1I) (III)




• Prototype vector: abstract representation of a target (named entity) class

• Centroid of annotated instances’ semantic vectors

‣ words annotated as belonging to a particular class

• Centroid defined as column-wise median values

‣ reduce impact of noise

d1 d2 d3 d4 d5 d6 d7 …John -14.5 12.0 0 110.1 -25.4 120.0 -2.2 …Anna -10.7 0 4.0 75.4 -5.5 12.0 0 …Peter 40.4 5.0 10.2 50.4 -5.0 42.5 -12.0 …

…Prototype -10.7 5.0 4.0 75.4 -5.5 42.0 -2.2





• Use prototype vectors to generate features for all instances in dataset

• Binary features: True if cosine similarity between prototype vector and semantic vector > threshold

• Threshold based on distances between labeled instances’ semantic vectors and their corresponding prototype vector

• Set to maximize Fβ-score on training data

‣ Positive examples: instances belonging to one named entity

‣ Negative examples: instances belonging to all other named entities




• Comparison of two feature sets provided to the learning algorithm

‣ Baseline: set of standard orthographic and syntactic features

‣ Baseline features + distributional semantic features

• Significant improvements obtained with generated features

• Further improvements obtained when combining multiple semantic spaces created with different models

Context vectors

Order vectors

Direction vectors



• Traditional data sources

‣ Pre-marketing surveillance✦ Clinical trials

‣ Post-marketing surveillance✦ Spontaneous reports

• Emerging alternatives for post-marketing surveillance

‣ Electronic health records

‣ Social media

Analyzing Structured Data for Pharmacovigilance

Limited sample size and observation period

Largely under-reportedReliability and complianceNo exposure information

High recording ratePatient medical historyExposure information recorded


Using EHRs is not Unproblematic!

ADE Diag1 Drug1 CM1 Diag2 Drug2 CM2 …

P1 yes 1 0 NA 0 1 5.6 …

P2 yes NA 0 12 1 0 ? …

P3 no 0 ? 30 1 NA NA …



P1 yes 1 0 NA 0 1 5.6 …

P2 yes NA 0 12 1 0 ? …

P3 no 0 ? 30 1 NA NA …



P1 yes 1 0 NA 0 1 5.6 …

P2 yes NA 0 12 1 0 ? …

P3 no 0 ? 30 1 NA NA …

F25.1

C10AA01

F25.2



P1 yes 1 0 NA 0 1 5.6 …

P2 yes NA 0 12 1 0 ? …

P3 no 0 ? 30 1 NA NA …

Drug1 was prescribed to P3 more than once

CM2 was taken P2 more than once

F25.1

C10AA01

F25.2


• Adapting disproportionality methods for detecting ADE signals

• Using machine learning algorithms to detect missing ADE codes in patients’ medical history

ADE Detection Using Structured EHRs


Adapting Disproportionality Methods to EHR Data

event other events

drug a b

other drugs c d

Disproportionality methods:PRR, ROR, BCPNN, GPS

Event level: count drug-event pairs

Patient level: count distinct patients who experienced the same drug-event pair

Disproportionality methods are widely applied for drug safety signal detection. They are designed for analyzing data from Spontaneous Reports (SRs). Electronic Patient Records (EPRs), in which comprehensive information of each patient is covered, are complementary information sources for detecting Adverse Drug Events.

Disproportionality methods

Data sourceStockholm EPR Corpus 700,000 patients 9,832 unique diagnosis codes 1,312 unique drugs

Applying Methods for Signal Detectionin Spontaneous Reports to Electronic Patient RecordsJing Zhao, Isak Karlsson, Lars Asker, Henrik BoströmDept. of Computer and Systems Sciences, Stockholm University, Sweden{ jingzhao, isak-kar, asker, henrik.bostrom }@dsv.su.se

SRs EPRs

Top 50 Top 100 Top 200O P M A P O P M A P O P M A P

PRR (EL) 0.12 0.25 0.14 0.20 0.12 0.17PRR (PL) 0.14 0.16 0.08 0.15 0.08 0.12

ROR (EL) 0.14 0.26 0.15 0.21 0.12 0.19ROR (PL) 0.16 0.18 0.10 0.17 0.09 0.13

BCPNN (EL) 0.20 0.26 0.15 0.25 0.12 0.20BCPNN (PL) 0.14 0.13 0.11 0.14 0.09 0.12

GPS (EL) 0.20 0.27 0.15 0.24 0.12 0.20GPS (PL) 0.04 0.17 0.10 0.10 0.08 0.10

How the disproportionality methods are applied, event-level or patient-level, has a higher impact on the performance than which disproportionality method is employed.

OP: Overall Precision

MAP: Mean Average Precision

report 1

report 2

report 3

patient 1

patient 2

patient 3

patient 4

drugevent

a bc d

• PRR = aa + b / c

c + d

• ROR = ab / c

d

•

•

BCPNN

GPS

Event y Other eventsDrug x

Other drugs

} Bayesian approaches

Two counting strategies

Event-Level (EL): count drug-event pairs

Patient-Level (PL): count distinct patients who experienced the same drug- event pair

Zhao, J., Karlsson, I., Asker, L. and Boström, H. Applying Methods for Signal Detection in Spontaneous Reports to Electronic Patient Records. In 19th Knowledge Discovery and Data Mining (KDD) Conference’s Workshop on Data Mining for Healthcare (DMH), August 11-14, 2013, Chicago, USA.


Adapting Disproportionality Methods to EHR Data

0.00#

0.05#

0.10#

0.15#

0.20#

0.25#

PRR# ROR# BCPNN# GPS#

OP#(EL)#

OP#(PL)#

MAP#(EL)#

MAP#(PL)#

Zhao, J., Karlsson, I., Asker, L. and Boström, H. Applying Methods for Signal Detection in Spontaneous Reports to Electronic Patient Records. In 19th Knowledge Discovery and Data Mining (KDD) Conference’s Workshop on Data Mining for Healthcare (DMH), August 11-14, 2013, Chicago, USA.


Using Machine Learning Algorithms to Detect Missing ADE Codes in Patients’ Medical History

Headache!Drug induced headache!


Two Studies: Experimental Setup


Exploiting Concept Hierarchies of Clinical Codes

C: Cardiovascular system

C10: Lipid modifying agents

C10A: Lipid modifying agents, plain

C10AA: HMG CoA reductase inhibitors

C10AA01: Simvastatin

F: Mental and behavioural disorders

F2: Schizophrenia, schizotypal and delusional disorders

F25: Schizoaffective disorders

F25.1: Schizoaffective disorder, depressive type

ATC

C

C10 F

C10A F2

C10AA F25

C10AA01 F25.1

Bottom-Up Level-WiseFeature Addition

ICD Top-Down Level-WiseFeature Addition

TD L1

TD L2

TD L3

TD L4

TD L5

BU L1

BU L2

BU L3

BU L4

BU L5

Zhao, J., Henriksson, A. and Boström, H. Detecting Adverse Drug Events Using Concept Hierarchies of Clinical Codes. In Proceedings of IEEE International Conference on Healthcare Informatics (ICHI), September 15-17, 2014, Verona, Italy.


54

32

1Bottom−Up vs Top−Down

Levels in Feature Set

Mea

n R

ank

1 Level 2 Levels 3 Levels 4 Levels All Levels

Bottom−Up (accuracy)Bottom−Up (AUC)Top−Down (accuracy)Top−Down (AUC)

Exploiting Concept Hierarchies of Clinical Codes

Zhao, J., Henriksson, A. and Boström, H. Detecting Adverse Drug Events Using Concept Hierarchies of Clinical Codes. In Proceedings of IEEE International Conference on Healthcare Informatics (ICHI), September 15-17, 2014, Verona, Italy.


Multiple Representations of Clinical Measurements

Majority vote Baseline

Mean Average of repeated measurements

Standard deviation SD of repeated measurements

Slope Difference between first and last measurement over time span

Existence Whether a measurement exists

Count How many times a measurement was repeated

Combined Combination of above

Zhao, J., Henriksson, A., Asker, L., Boström, H. Detecting Adverse Drug Events with Multiple Representations of Clinical Measurements. In Proceedings of IEEE International Conference on Bioinformatics and Biomedicine (BIBM), November 2-5, 2014, Belfast, UK.


Accuracy AUCMajority vote 79.33 0.5

Mean 81.97 0.67Standard deviation 81.41 0.66

Slope 80.72 0.63Existence 82.30 0.68

Count 82.67 0.70Combined 82.92 0.72




7678

8082

84

Performance of Multiple Classifiers with Different Representations

1:8

Accu

racy

(%)

OneR J48 JRip SMO IBk Ada Bag RF

Mean SD Slope Existence Count Combined0.

500.

600.

70

Classifier

AUC

OneR J48 JRip SMO Log IBk Ada Bag NB RF

Mean SD Slope Existence Count Combined




Combining Structured and Unstructured EHRs


Thanks for your attention!A short demo

aDEX & aDET

https://drive.google.com/file/d/0B5E5Mh0dI65GeVNmazVxcHV1aEk/view?usp=sharing

analyzing structured and unstructured data in electronic

Documents