1© 2018 Medidata Solutions, Inc. – Proprietary and Confidential
How Machine Learning Can Automate Data Standardization in Clinical Trials
Fanyi Zhang, Medidata Solutions, Inc.
November 2018 at Phuse, Frankfurt
The Big Data Revolution in Healthcare and Clinical Research?
[Figure labels: Disparate Data; ?; Solutions (Analyze, Decide, Ideate); 70 - 80%; 20 - 30%]
Standardizing clinical data is mandated by the FDA
[Screenshot labels: Demographics; Gender; Age]
“In 2014, FDA mandated that data collection in clinical trials adhere to Study Data Tabulation Model (SDTM) developed by Clinical Data Interchange Standards Consortium (CDISC)”
eCRF Example*: Adverse Event
eCRF Example*: Demographics 1
eCRF Example*: Demographics 2
*Screenshots from Medidata Rave
MEDS: One of the Industry’s Largest Give-to-Get Clinical and Operational Data
>14K studies
>4M trial subjects
57K sponsor/site relationships
99% in study count*
73% in study count*
90% in number of study sites*
*Growth over a 3-year period
Standardize Data Across Multiple Clinical Trials
From Rule-Based Engines to Auto-SDTM
100K+ Case Report Forms
30+ Domains, 155+ Variables
[Diagram labels: DBs; Analytics; Domains (Demographics, Adverse Events, Study Exposure); Variables (Age, Age Unit, AE Name, Start Date, Treatment Name, Dose)]
MEDS-SDTM Standard
30+ Domains, 155+ Variables
Demographics: Age, Age Unit, Gender, Race, Ethnicity
Adverse Event: Term Verbatim, Severity, Outcome, Action Taken, Start Date
Study Exposure: Treatment Name, Start Date, End Date, Dose, Dose Unit
Other domains shown: Lab, Medical History, Biomarker
Human (Expert) in the Loop Machine Learning
Faster in speed, better in quality
1. Data Source(s): human experts formulate definitions; rule-based labeling + human annotations
2. Supervised-learning
3. Model assessment and interpretation
4. Human experts review and annotate
5. Learning
[Diagram labels: F1, F2, F3, F4]
1. Create Initial Labels
Example from DM domain
Find Demographic forms: search for form names that contain “demographic”, “DM”, etc.
Find Age fields: search for field names that contain “age”, etc.
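The keyword search described on this slide can be sketched roughly as follows. This is a minimal illustration, not the presenter's rule engine: the function names, the regular expressions, and the fallback label "OTH" for unmatched items are assumptions made for the example. Only the keywords “demographic”, “DM”, and “age” come from the slide.

```python
import re

# Assumed patterns: match "demographic" anywhere, and "DM" / "age" only when
# they are not embedded inside a longer word (e.g. "Admission", "Dosage").
DM_FORM_PATTERN = re.compile(r"demographic|(?<![A-Za-z])DM(?![A-Za-z])", re.IGNORECASE)
AGE_FIELD_PATTERN = re.compile(r"(?<![A-Za-z])age(?![A-Za-z])", re.IGNORECASE)


def label_form(form_name: str) -> str:
    """Return 'DM' if the form name looks like a demographics form, else 'OTH'."""
    return "DM" if DM_FORM_PATTERN.search(form_name) else "OTH"


def label_field(form_name: str, field_name: str) -> str:
    """Return 'AGE' for age-like fields on demographics forms, else 'OTH'."""
    if label_form(form_name) == "DM" and AGE_FIELD_PATTERN.search(field_name):
        return "AGE"
    return "OTH"


if __name__ == "__main__":
    for form, field in [("Demographics", "Age"), ("DM_01", "Dosage"), ("Vital Signs", "Age")]:
        print(form, "|", field, "->", label_form(form), label_field(form, field))
```

Rules like these are noisy by design. For instance, this sketch would also tag a field named "Age Unit" as an age field, which is the kind of error the human annotations in this step are meant to catch.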
2. Supervised-learning
Bag of Words, Feature transformation, Model selection
Form Classifier in 3-fold Cross Validation
Method*                                  CV Accuracy   Build time (minutes)
Logistic Regression w/ regularization    0.95          7.5
Linear SVM                               0.95          6.6
SGD Classifier                           0.95          7.7
Random Forest                            0.96          7.8
XGBoost Classifier                       0.96          5.1
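A comparison of this kind could be set up along the following lines with scikit-learn. This is a sketch under stated assumptions, not the presenter's code: the toy texts and labels are made up, TF-IDF is assumed as the "feature transformation", hyperparameters are arbitrary, and XGBoost is left out to avoid an extra dependency. The scores it produces on toy data say nothing about the figures in the table.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Made-up form texts; real inputs would be eCRF form and field names.
texts = [
    "demographics age gender race ethnicity",
    "subject demographics date of birth sex",
    "demography age unit gender",
    "adverse event term severity outcome",
    "adverse events action taken start date",
    "ae verbatim term severity",
    "study exposure treatment name dose",
    "exposure dose unit start date end date",
    "drug exposure treatment dose",
]
labels = ["demographics"] * 3 + ["adverse_event"] * 3 + ["exposure"] * 3

# Candidate models named on the slide (hyperparameters are assumptions).
candidates = {
    "Logistic Regression w/ regularization": LogisticRegression(C=1.0, max_iter=1000),
    "Linear SVM": LinearSVC(),
    "SGD Classifier": SGDClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
}


def compare_models(texts, labels, cv=3):
    """Return the mean cross-validated accuracy for each candidate model."""
    results = {}
    for name, clf in candidates.items():
        # Bag of words -> TF-IDF weighting -> classifier, refit inside every fold.
        pipe = make_pipeline(CountVectorizer(), TfidfTransformer(), clf)
        scores = cross_val_score(pipe, texts, labels, cv=cv, scoring="accuracy")
        results[name] = scores.mean()
    return results


if __name__ == "__main__":
    for name, acc in compare_models(texts, labels).items():
        print(f"{name}: {acc:.2f}")
```

Wrapping the vectorizer and the classifier in one pipeline keeps the vocabulary from being learned on held-out folds, so the cross-validation estimate is not inflated by leakage.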
[Figure panels: Predicted Probabilities; Keywords Ranking; Raw Text with Highlighted Words]
[Class labels shown: Exposure Form vs. Not Exposure Form; Exposure Form vs. Vital Signs Form; …]
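The slide does not say how the keyword ranking and highlighting were produced. For a linear bag-of-words model, one simple approach is to score each word by its count times its learned weight and to mark the top-scoring words in the raw text. The sketch below illustrates that idea only. The weights are hypothetical, hand-written numbers standing in for coefficients of a trained "Exposure Form" classifier, and the function names are made up.

```python
import re
from collections import Counter

# Hypothetical per-word weights for the class "Exposure Form".
WEIGHTS = {"exposure": 2.1, "dose": 1.8, "treatment": 1.2, "pulse": -1.5, "weight": -0.9}


def rank_keywords(text, weights, top_k=3):
    """Rank words by |count * weight|, largest contribution first."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    contributions = {w: c * weights[w] for w, c in counts.items() if w in weights}
    ranked = sorted(contributions.items(), key=lambda kv: abs(kv[1]), reverse=True)
    return ranked[:top_k]


def highlight(text, keywords):
    """Wrap each occurrence of a keyword in square brackets."""
    if not keywords:
        return text
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(k) for k in keywords) + r")\b", re.IGNORECASE
    )
    return pattern.sub(lambda m: "[" + m.group(0) + "]", text)


if __name__ == "__main__":
    raw = "Study Exposure: treatment name, dose, dose unit"
    top = rank_keywords(raw, WEIGHTS, top_k=2)
    print(top)
    print(highlight(raw, [word for word, _ in top]))
```

Tools that explain non-linear models work differently, so this should be read as an illustration of the three panels rather than a description of the tool shown.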
3. Model assessment and interpretation
Do we need more labeled data?
[Figure: Model Performance (F1-score, axis 0.3 to 1) vs. Sample Count, with curves for Domain 1, Domain 2, and Domain 3]
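The y-axis metric in this figure, the F1-score, is the harmonic mean of precision and recall. A minimal sketch of a per-domain F1 follows. It assumes each domain is scored one-vs-rest, which the slide does not state, and the function name and example labels are made up. Evaluating it on models trained with increasing sample counts would give the points of a curve like the one shown.

```python
def f1_for_class(y_true, y_pred, positive):
    """F1-score for one class, treating every other label as negative."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision or recall is zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    y_true = ["DM", "DM", "AE", "AE", "DM"]
    y_pred = ["DM", "AE", "AE", "AE", "DM"]
    print(f1_for_class(y_true, y_pred, "DM"))
```

A curve that has flattened for a domain suggests more labels for that domain will add little, while a curve that is still rising suggests more labeled data is worth collecting.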
4. Human experts review and annotate
Both Classifier and RBL* agree
Classifier = RBL* = VS**
Text in the raw data …
*RBL: Rule-based labels
**VS: vital sign forms
4. Human experts review and annotate
Classifier is correct and RBL* is incorrect
Classifier = LB**, RBL* = OTH**
Text in the raw data …
[Figure labels: hematology; NOT LB; LB]
*RBL: Rule-based labels
**LB: lab forms
**OTH: other / unlabeled forms
4. Human experts review and annotate
Classifier is incorrect and RBL* is correct
Classifier = LB**, RBL* = BI**
Text in the raw data …
*RBL: Rule-based labels
**LB: lab forms
**BI: biomarker forms
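The three review cases (agreement, classifier right, rule-based label right) suggest a simple way to prioritize expert attention: compare the two label sources and queue the disagreements. The sketch below assumes that workflow, which the slides imply but do not spell out. The form IDs and the function name are hypothetical.

```python
def triage(form_ids, classifier_labels, rule_labels):
    """Split forms into those where both label sources agree and those where they differ."""
    agree, disagree = [], []
    for form_id, clf_label, rbl_label in zip(form_ids, classifier_labels, rule_labels):
        record = (form_id, clf_label, rbl_label)
        if clf_label == rbl_label:
            agree.append(record)
        else:
            # Either source may be the wrong one, so an expert decides.
            disagree.append(record)
    return agree, disagree


if __name__ == "__main__":
    # The three cases from the slides, with made-up form IDs.
    ids = ["form_a", "form_b", "form_c"]
    clf = ["VS", "LB", "LB"]
    rbl = ["VS", "OTH", "BI"]
    agree, disagree = triage(ids, clf, rbl)
    print("agree:", agree)
    print("needs review:", disagree)
```

Agreement does not guarantee correctness, so a review process might still sample from the agreeing forms as well.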
5. Learning
Model improved continuously over 4 rounds of iterations
[Figure: Model Performance (F1-score, axis 0.70 to 1) vs. the i-th Review Round (r1, r2, r3, r4)]