Strata Rx 2013 - Data Driven Drugs: Predictive Models to Improve Product Quality in Pharmaceuticals
DESCRIPTION
Like most of healthcare and life science, pharmaceutical companies are undergoing a data-driven transformation. The industry-wide need to reduce the cost of developing, manufacturing and distributing drugs while bringing new products to market is not a novel challenge. However, the ability to process and analyze large amounts of data using cutting-edge massively parallel processing (MPP) technologies means innovation is no longer limited to the traditional hypothesis-driven approaches we have come to expect. New technologies and approaches make it possible to incorporate all available data, structured and unstructured. At Pivotal, it is the goal of our data science practice to demonstrate the capabilities of the technologies we offer. We focus on building predictive models by combining the vast and variable data that is available to elicit action or generate insights. In our talk we will focus on a use case in pharmaceutical manufacturing, wherein we created a predictive model to produce more consistent, high-quality products and drive decisions to abandon lots with expected poor outcomes. In addition, we demonstrate how we used machine learning to cleanse data, to improve efficiencies in data collection by identifying low information-content measurements, and to incorporate under-utilized data sources in manufacturing. Beyond this use case, we will discuss our vision of using machine learning in all areas of the industry, from research through distribution, to drive change.
TRANSCRIPT
A NEW PLATFORM FOR A NEW ERA
2 © Copyright 2013 Pivotal. All rights reserved.
Data Driven Drugs: Predictive Models to Improve Product Quality in Pharmaceuticals
Sarah Aerni, PhD
Senior Data Scientist at Pivotal
Strata RX, September 26, 2013
The Quantified Patient
• Medications
• Family History
• Molecular Diagnostics
• Lab tests
• Clinical Narratives
• Imaging
• Environment
• Genetics
• Medical History
• Sensors & Mobile
RICH DATA SOURCES
• Molecular data
– Cellular drug screens
– Animal models
• Clinical data including notes, images, markers (e.g. genomics, lab results)
• Sensor and assay data
• Internal and partner/purchased external data
• Contact center data
• Patient registries, public and federal data, clinical partnerships
Clinical Trials
Manufacturing
Marketing
Distribution and surveillance
Drug discovery + development
Data driven drugs: From discovery to delivery
Data integration
How Pivotal can enable industries to extract new value from data sources
Successful transformation into a data-driven enterprise requires a paradigm shift
• Bring available data sources to a central location
Integration of a variety of data leads to new insights
• Analyze large volumes of variable data for richer models
Building models without data movement reduces time to insight
• Share data, insights and ideas
Leveraging various expertise will lead to more relevant business insights
Data > Application
DATA IS THE NEW CENTER OF GRAVITY
Traditional Analytics Processes
If you think databases are only good for storing data
[Diagram labels: Time-to-Insights; sample; forecast; In-memory statistics tool; In-memory optimization tool; solution]
Cloud Fabric
Data Fabric
Application Fabric
Scale-out storage: HDFS/Object
Languages & Frameworks
Ingest & Query: very high-capacity & in-memory Analytics Services
Cloud Abstraction (portability)
Automation: App Provisioning & Life-cycle
Service Registry
Pivotal One: Heritage
vFabric GemFire
Performance Through Parallelism
• Automatic parallelization
– Load and query like any database
– Automatically distributed tables across nodes
– No need for manual partitioning or tuning
• Analytics Optimized
– Analytics-oriented query optimization
• Extremely scalable MPP shared-nothing architecture
– All nodes can scan and process in parallel
– Linear scalability by adding nodes
[Diagram labels: Loading; ETL; File Systems; Interconnect; Database; Storage; Compute; External Sources: Loading, streaming, etc.]
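The parallelism described on this slide can be illustrated with a toy sketch. It assumes rows are assigned to nodes by hashing a distribution key, a common approach in shared-nothing MPP databases; the slides do not describe the actual mechanism. The function names (`node_for`, `distribute`, `parallel_sum`), the node count and the sample rows are all hypothetical, and the per-node work is simulated sequentially rather than run on real nodes.

```python
import zlib
from collections import defaultdict


def node_for(key, n_nodes):
    """Map a distribution key to a node index with a stable hash (CRC32 here)."""
    return zlib.crc32(str(key).encode("utf-8")) % n_nodes


def distribute(rows, key_field, n_nodes):
    """Split rows across nodes; each node only ever holds its own slice."""
    nodes = defaultdict(list)
    for row in rows:
        nodes[node_for(row[key_field], n_nodes)].append(row)
    return nodes


def parallel_sum(nodes, value_field):
    """Each node computes a partial sum over its local rows; partials are then combined."""
    partials = [sum(r[value_field] for r in local) for local in nodes.values()]
    return sum(partials)


# Hypothetical sensor readings keyed by lot id (illustrative values only).
rows = [{"lot_id": i, "reading": float(i % 7)} for i in range(1000)]
nodes = distribute(rows, "lot_id", n_nodes=4)
total = parallel_sum(nodes, "reading")
```

Because no node needs another node's rows to compute its partial result, adding nodes shrinks each slice, which is the intuition behind the scalability claim on the slide.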
Pivotal HD Architecture
Legend: Apache; Pivotal HD Enterprise; HAWQ
• HDFS
• HBase
• Pig, Hive, Mahout
• Map Reduce
• Sqoop, Flume
• Resource Management & Workflow: Yarn, Zookeeper
• Command Center: Deploy, Configure, Monitor, Manage
• Hadoop Virtualization (HVE)
• Data Loader
• HAWQ – Advanced Database Services: Xtension Framework, Catalog Services, Query Optimizer, Dynamic Pipelining, ANSI SQL + Analytics
Leveraging healthcare data to drive predictive and precision care
Decision support, Precision care, Cohort identification
• Lab tests
• Genetics
• Environment
• Medications
• Clinical Narratives
• Imaging
Unified data supporting unified risk evaluation, decision-making, etc.
Acting on full patient and medical profile
Analytics with Pivotal
A single address for everything analytics
[Diagram labels: Time-to-Insights; Forecasting; Regression; Classification; Optimization; Clustering]
Analytics Ecosystem
In-database analytics
COMMERCIAL
• SAS/ACCESS
• SAS Scoring Accelerator
• SAS High Performance Analytics
OPEN SOURCE
• PL/R, PL/Python, PL/Java
• MADlib
MADlib: Machine Learning at Scale
Collaborators
Manufacturing
Data-driven approaches to tuning a drug manufacturing process
Customer
A major pharmaceutical company
Business Problem
Predict potency and antigen levels of live virus vaccines based on manufacturing sensor data and manual data collected throughout the process.
Challenges
• Customer’s data model was not optimal for running analytical queries
• Manual data quality issues
• Data capture was performed with varying consistency due to high cost associated with manual data collection
Solution
• Introduced a new data model to make data accessible and enable analytics
• Built automated outlier detection/correction methods to address manual data entry quality issues
• Devised imputation methods to deal with data completeness issues
• Built predictive models with high accuracy
Predicting potency in vaccine manufacturing
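The slide mentions imputation methods for incomplete data without saying which ones were used. As a minimal sketch of the general idea, the snippet below fills missing measurements with the per-field median of the observed values. Median imputation is an assumption chosen for simplicity, not the method from the talk, and the field names and values are hypothetical.

```python
import statistics


def impute_median(rows, fields):
    """Return copies of rows with None in `fields` replaced by that field's median."""
    medians = {}
    for field in fields:
        observed = [r[field] for r in rows if r.get(field) is not None]
        if observed:  # a field with no observed values is left untouched
            medians[field] = statistics.median(observed)
    filled = []
    for row in rows:
        new_row = dict(row)  # never mutate the caller's data
        for field, median in medians.items():
            if new_row.get(field) is None:
                new_row[field] = median
        filled.append(new_row)
    return filled


# Hypothetical manually collected measurements; None marks a skipped entry.
lots = [
    {"lot": "A", "ph": 7.1, "temp_c": 36.9},
    {"lot": "B", "ph": None, "temp_c": 37.2},
    {"lot": "C", "ph": 7.3, "temp_c": None},
    {"lot": "D", "ph": 7.2, "temp_c": 37.0},
]
completed = impute_median(lots, ["ph", "temp_c"])
```

A simple fill like this keeps every lot available for modeling, at the cost of understating variability in the imputed fields.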
Building predictive models to improve outcomes in manufacturing of vaccines
Manufacturing steps: Cell expansion, Virus propagation, Pooling into final product
Future Looking Predictive Models
Backward Looking Models
Warning! Entered value not in expected range
[Figure: plots with axes labelled Duration of step, Time, Counts and Temp]
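The warning shown on this slide suggests checking each manual entry against an expected range at entry time. A minimal sketch of such a check follows. It assumes the range is derived from past entries as the median plus or minus a multiple of the scaled median absolute deviation (MAD); the slides do not say how the expected range was defined, and the function names, the factor `k` and the sample temperatures are hypothetical.

```python
import statistics


def expected_range(history, k=3.0):
    """Plausible range from past entries: median +/- k * scaled MAD."""
    med = statistics.median(history)
    mad = statistics.median(abs(v - med) for v in history)
    spread = 1.4826 * mad  # scaling makes MAD comparable to a standard deviation
    return med - k * spread, med + k * spread


def check_entry(value, history, k=3.0):
    """Return None if the value looks plausible, otherwise a warning message."""
    low, high = expected_range(history, k)
    if low <= value <= high:
        return None
    return (
        f"Warning! Entered value {value} not in expected range "
        f"[{low:.2f}, {high:.2f}]"
    )


# Hypothetical historical temperature readings for one process step.
history = [36.8, 37.0, 37.1, 36.9, 37.2, 37.0, 36.9, 37.1]
ok = check_entry(37.05, history)   # plausible entry
flagged = check_entry(3.7, history)  # likely a misplaced decimal point
```

The median and MAD are used here because they are not distorted by the very entry errors the check is meant to catch. If the history has almost no variation the range collapses, so a real system would need a minimum width.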
Enabling predictive models through rearchitecting
Challenges
• Accessibility
– Certain parts of the data have never been used in any predictive modeling because they are extremely hard to query
• Data Integrity
– Manual data entries are prone to errors. There is no immediate feedback to examine the validity of the values entered
• Data Completeness
– Manual data entry is time consuming. There is no feedback on which data is most useful for improving efficiency and quality, and hence no prioritization of what data should be collected
Purpose-built data models for rapid data querying and exploration
Automated data cleansing techniques
Opportunities to eliminate collection of incomplete or non-predictive data
Creating automated methods for detection and correction
Identifying and correcting data integrity problems
• Data integrity problems cause challenges in modeling
• Sources of variation in entries of measurements
– Variable units of measurement
– Manual data entry errors
• Approach: Detect the optimal threshold to separate two distributions
[Figure: histograms with y-axis Frequency. Panel "all data": x-axis 1 to 23. Panel "lower half", newVals[seq(1, maxBreak, 1)]: x-axis 0.12 to 0.22. Panel "upper half", newVals[seq(maxBreak + 1, length(newVals), 1)]: x-axis 12 to 24.]
Creating automated methods for detection and correction
Identifying and correcting data integrity problems
[Figure: histograms of all data, lower half and upper half, with the labels Background and Foreground]
Creating automated methods for detection and correction
Identifying and correcting data integrity problems
[Figure: "Histogram of c(loh, uph)" (x-axis c(loh, uph), 12 to 24; y-axis Frequency) and "cleaned histogram with multiplier = 100" (x-axis 12 to 24), shown alongside the all data, lower half and upper half histograms]
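The detection-and-correction idea on these slides can be sketched as follows. The slides say only that an optimal threshold separating two distributions is detected, and the figure title mentions a multiplier of 100. This sketch assumes an Otsu-style criterion (split the sorted values where the between-group variance is largest), which may differ from the method actually used. The function names and sample values are hypothetical.

```python
def best_split(values):
    """Split sorted values into two groups with maximal between-group variance.

    Returns (threshold, size_of_lower_group).
    """
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    best_score, best_i = -1.0, None
    left_sum = 0.0
    for i in range(1, n):  # i is the size of the lower group
        left_sum += xs[i - 1]
        if xs[i] == xs[i - 1]:
            continue  # cannot place a threshold between equal values
        mean_low = left_sum / i
        mean_high = (total - left_sum) / (n - i)
        score = i * (n - i) * (mean_low - mean_high) ** 2
        if score > best_score:
            best_score, best_i = score, i
    if best_i is None:
        raise ValueError("all values are identical; nothing to split")
    threshold = (xs[best_i - 1] + xs[best_i]) / 2.0
    return threshold, best_i


def correct_units(values, threshold, multiplier=100.0):
    """Rescale entries below the threshold so both groups share one unit."""
    return [v * multiplier if v < threshold else v for v in values]


# Hypothetical entries: some recorded in a unit 100 times larger than the rest.
raw = [0.14, 18.0, 0.17, 15.5, 21.0, 0.20, 13.0, 16.5, 0.15, 19.0]
threshold, n_low = best_split(raw)
cleaned = correct_units(raw, threshold)
```

After correction the values form a single distribution, which matches the intent of the "cleaned histogram" panel. In practice the multiplier would be confirmed against the known units before any value is rewritten.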
Building models: First, start with the answer
How to build models that solve the right problem
• Model form, how do we pick the right one?
– How do we deal with correlated features?
– Accuracy or interpretability?
• Available data
– Thousands of features, without expert guidance how do we choose the right ones?
– What data do we want to use to predict? When is the right time for an intervention?
Cell expansion
Virus propagation
Pooling into final product
Approach: Use historical data to build a model predicting potency of a final product using data from the manufacturing process
Model generation and evaluation
Predicting vaccine potency using manufacturing data
• Feature engineering and transformation
– Enabled by rapid in-database processing
• Experimentation with model forms
– Partial least squares
– Random forest
– Regularized regression
• Interpretation of model results for insight generation
– Use cross-validation framework to assess variable importance
[Figure: scatter plot of Predicted Potency against True Potency, both axes from 12.0 to 13.5]
Test R2 = 0.742, Train R2 = 0.823
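To make the train and test R2 figures concrete, here is a minimal sketch of k-fold cross-validation for a regularized regression. It is deliberately reduced to a single feature so that ridge regression has a closed form in plain Python. The models in the talk used many features and are not reproduced here. The data is synthetic, and the function names, penalty `lam` and fold count are assumptions.

```python
import random


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def fit_ridge_1d(x, y, lam):
    """One-feature ridge regression with an unpenalized intercept."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / (sxx + lam)  # lam shrinks the slope toward zero
    return slope, my - slope * mx


def cross_validate(x, y, k=5, lam=1.0, seed=0):
    """Return (mean train R2, mean test R2) over k folds."""
    idx = list(range(len(x)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    train_scores, test_scores = [], []
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        slope, intercept = fit_ridge_1d(
            [x[i] for i in train], [y[i] for i in train], lam
        )
        for subset, scores in ((train, train_scores), (fold, test_scores)):
            preds = [intercept + slope * x[i] for i in subset]
            scores.append(r_squared([y[i] for i in subset], preds))
    return sum(train_scores) / k, sum(test_scores) / k


# Synthetic stand-in for one process measurement and log potency.
rng = random.Random(42)
x = [rng.uniform(0.0, 1.0) for _ in range(200)]
y = [12.0 + 1.5 * v + rng.gauss(0.0, 0.2) for v in x]
train_r2, test_r2 = cross_validate(x, y)
```

Comparing the two scores is the point of the exercise: a test R2 well below the train R2 signals overfitting, which is why both numbers are reported on the slide.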
Sample model insights
Interpreting the utility of a measure obtained during manufacturing based on model outcomes
[Figure: two plots with y-axis Log of Potency, 12.0 to 13.0. Panel "Assayed value": x-axis SP1 Total Viable Cells Harvested Per Sq. Cm, 0.20 to 0.45, Correlation = -0.45. Panel "Duration of a step": x-axis SP2 Total Trypsinization Exposure Time of per CCS, 12 to >=16, Correlation = 0.38.]
• Some features may reveal tunable parameters to alter potency, others may simply be markers
• Features consistently absent from models may be uninformative for predicting potency
• Opportunities to provide real-time feedback on data entry errors and predicted potency outcomes
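The correlations quoted in the figure are presumably Pearson correlations between a process measurement and log potency; the slides do not state the type. The sketch below computes that statistic and ranks features by its absolute value. The feature names and all numbers are hypothetical, and nothing here reproduces the values from the talk.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    var_x = sum((a - mx) ** 2 for a in xs)
    var_y = sum((b - my) ** 2 for b in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def rank_features(features, target):
    """Sort (name, correlation) pairs by absolute correlation, strongest first."""
    scored = [(name, pearson(vals, target)) for name, vals in features.items()]
    return sorted(scored, key=lambda item: abs(item[1]), reverse=True)


# Hypothetical per-lot measurements and log potency (illustrative only).
log_potency = [12.9, 12.7, 12.6, 12.4, 12.3, 12.1]
features = {
    "viable_cells_per_sq_cm": [0.22, 0.27, 0.25, 0.35, 0.38, 0.44],
    "exposure_time_min": [15.0, 12.5, 14.0, 13.5, 13.0, 12.0],
}
ranking = rank_features(features, log_potency)
```

As the slide notes, a strong correlation does not say whether a feature is a tunable cause or merely a marker, so a ranking like this is a starting point for discussion with process experts rather than a conclusion.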
Data-driven drugs
Opportunities for data mining across the pharmaceutical industry
Data driven drugs: From discovery to delivery
• Data repurposing
New value exists in leveraging historical data across drugs and stages
– Adverse events for new clinical indications
• Data discovery
External and publicly available datasets can augment proprietary sources
– Twitter data to forecast demand
• Data collection
Obtaining new data from different sources drives additional value
– Mobile and sensor data to measure patient adherence and outcomes
Clinical Trials
Manufacturing
Marketing
Distribution and surveillance
Drug discovery + development
Leveraging Data to Improve Demand Forecasts
Supply, Distr., Patients
Publicly Available Resources
Monitoring Patient Populations
Self-Reporting
Sales Data: Analyze orders from customers
• Hospitals
• Doctor’s Offices
• Surgery Centers
• Laboratories
• Pharmacies
Use of telehealth to provide tight glucose control
Promising Advancements in Diabetes Studies
Intervention
EMR
Biochemical Measurements
Genomics
Lifestyle
Multiple potential points of failure, requires use of analytics at every step
Launching a successful diabetes management program
Program steps: Increase Awareness; Patient Enrollment; Comparative Effectiveness; Remote Patient Monitoring; Design Interventions; Measure Impact on Population
Analytics tasks:
• Campaign optimization
• Identify influencers
• Identify highest impact channels
• Stochastic entity resolution
• A/B testing to design best engagement platform
• Best channel per cohort
• Resource allocation decisions
• Predict risk of negative outcome for next 3 months
• Best therapy for each cohort: Medication, Delivery Method, Monitoring Method
• Medication adherence
• Careful design of experiment to quantify the impact
• Measure engagement
• Attribution models
• Churn prediction
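One of the tasks listed above is A/B testing of engagement platforms. As a minimal sketch of how such a comparison is commonly evaluated, the snippet below runs a two-sided two-proportion z-test on enrollment rates. The choice of test is an assumption, since the slides do not describe the analysis, and the counts and function name are hypothetical.

```python
import math


def two_proportion_z_test(success_a, n_a, success_b, n_b):
    """Two-sided z-test for a difference between two proportions.

    Returns (z, p_value); z is positive when variant B has the higher rate.
    """
    p_a = success_a / n_a
    p_b = success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n_a + 1.0 / n_b))
    if se == 0:
        raise ValueError("standard error is zero; rates are all 0 or all 1")
    z = (p_b - p_a) / se
    # Two-sided p-value under the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


# Hypothetical enrollment counts for two versions of an engagement platform.
z, p_value = two_proportion_z_test(success_a=120, n_a=1000, success_b=150, n_b=1000)
```

The test only says whether the observed difference is larger than chance alone would explain. Deciding sample sizes and stopping rules before the experiment starts is part of the careful experiment design the slide calls for.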
Interdisciplinary collaboration of data scientists essential to success
Launching a successful diabetes management program
[Diagram: the program steps and analytics tasks from the previous slide, with a legend of disciplines: Marketing, Healthcare, Web Analytics, Optimization, General ML]
Pivotal Labs rapid application development
• Rheumatoid arthritis remote patient monitoring system
– Self-reporting
– Intuitive user interface
https://itunes.apple.com/us/app/myra/id563338979?mt=8