societal impact of applied data science on the big data stack
TRANSCRIPT
Product Decisions through Big Data
Center for Data Science
Ankur Teredesai
University of Washington Tacoma
March 14th, 2015
Health Informatics
• Bioinformatics
• Health and Wellness
• Predictive Analytics
Geo-Spatial Data Management
• Distributed Systems
• Databases
• Geo-Spatial
• Embedded Systems
Intelligent Systems
• Machine Learning
• Data Mining
• Computation Intelligence
• Computer Vision
Social Computing
• Web
• Devices
• Mobile Networks
• UX / UI
Big Data Security
• Cryptology
• Secure Machine Learning
Big Data Infrastructure
• Engineering
• Dev-Ops
Center for Data Science: Societal Impact
A Big Data Project Blueprint:
Machine Learning Analytics
Engineering
Features
Algorithm Scalability
ELT
Integrate Sources
Constraints
Deploy Models, APIs, Apps
Data Struggles
Data Mining: 1989 - 2010
• Data science and applications move and transform sizeable amounts of data out of the native database or file systems.
[Diagram labels: Applications; SQL/ODBC/JDBC Data Access; four Distributed Database nodes (Multi-Core, Columnar, Key-Value); Data Science using R, SAS, SPSS, Weka, Mahout. Flow labels: HIGH VOLUME, HIGH LATENCY, HIGH VOLUME]
Big Data Science
• Application ecosystem integration
• Data science uses native data representation and inherent distribution and parallelism
• Minimal data movement
• Rapid application development using data science constructs
[Diagram labels: Applications; SQL/ODBC/JDBC Data Access; Data Science (internal algorithms for clustering, classification, regression); Distributed Database (Multi-Core, Columnar, Key-Value). Flow labels: LOWER VOLUME, LOWER LATENCY, HIGH VOLUME, LOW LATENCY]
Big Data Science Components
A Short History of (Big) Data Technology
1970: Codd publishes "A Relational Model of Data for Large Shared Data Banks"
1985: Copeland: Decomposition Storage Model (essentially the first columnar store)
1989: Shared-Nothing Architecture
2004: Google: MapReduce
2005: C-Store (eventually Vertica), layers WS/RS
2007: Materialization optimizations in columnar stores and Hadoop implementation
2005-07: Star-Schema Benchmark + Hadoop
2008: Attempts to backport columnar advances to row storage, not very effective
Today: BIG DATA
Technology Decisions
Columnar Vs Relational Storage Technologies
Infinite scale using commodity hardware
Private or Public Cloud
Massively Distributed and Parallel Architecture: Hadoop
Stream Query Processing for trillions of events and petabytes of data
Real-time classification and clustering: Approximate scoring and segmentation + Reporting and Data Visualization
Flat Files CSV Claims X12 Clinical HL7
Distance Compute Library
Instance Selection
RNGE Drop 3
Fuzzy Rough Set Approximation
CHF Risk of Readmission
Geo Routing
Random Forests KNN
Industry Partners and Domain Experts
Other Solutions
HDFS NUMA
MPI Grappa
Census US Gov Unstructured CCD
Bayesian Networks
Support Vector Machines
Cost of Chronic Interventions
Age/Gender Prediction
Malware Analytics
Personalized Cancer Therapy
ETL Tools
Raw Data from Sources (SID, OSHPD, HCUP, Edifecs, MHS, CMS, LINCS, Industry)
Sqoop
iTornado
Routing Service With Real World Severe Weather
Demo paper in ACM SIGSPATIAL 2014 (Best Demo Paper award)
Fatality statistics by weather-related hazards: http://www.nws.noaa.gov, June 2014.
COMA
Road Network Compression For Map Matching
ACM SIGSPATIAL IWGS 2014
PreGo
Dynamic Multi-Preference Routing
                   Single Attribute    Multiple Attribute
Time-Homogeneous   Dijkstra, A*        Stewart et al. 91
Time-Variant       Betsy et al. 07     ?
[Figure: example graph with nodes s, a, b, c, d, e, f, g, h. Each edge is annotated with two five-element vectors T and R, for example T=[1,2,3,4,5], R=[1,2,3,4,5] or T=[5,1,3,4,5], R=[7,1,2,4,5]. Pair labels shown in the figure: <0,0>, <1,1>, <2,2>, <3,4>, <4,4>, <5,7>]
Special Needs Education: Teacher Trainer Effectiveness Analysis
Purpose
To support streamlined data collection and performance evaluation across the State Needs Projects.
Project Stakeholders
• Office of the Superintendent of Public Instruction
• Center for Data Science
Data Dashboard
• Customized Surveys
• Training Registration
• Survey Management
• Report Generation
• Geographic Distribution Maps
• Demographic Reports
Brad Porter, Aniruddha Desai, Yitao Li, David Hazel, Michelle Maike, Greg Benner, Ankur Teredesai, Leslie Pyper, Vickie Green
Systems Biology
Predictive Models and software
Applications: Personalized medicine, drug discovery
Focus: Develop machine learning methods and tools to effectively integrate multiple big data sources in biology.
A Flying Hadoop Cluster
Detecting Malware Activity based on Automatically Generated Domains
Command & Control
xyz.com xyz.com
Infected node
Partnering with NIARA we obtained a large dataset of Automatically Generated Domains.
Based on the intercepted domain features we are able to identify the malware infecting a network.
(March 2012)
Reduce CHF Readmission
• Will this heart failure patient get readmitted within 30 days?
• Yes or No (binary classification)
Readmission? Machine Learning?
Joint NSF / NIH Solicitation on Health Care and Big Data
Affordable Care Act => Avoidable Costs
Readmissions are AVOIDABLE
[Chart labels: 20%, 32% (30 days, 60 days); 75%, 25% (Non CHF, CHF)]
• Readmissions national cost $17 billion annually
• 76% considered avoidable
Readmissions
Congestive Heart Failure (CHF)
Source: www.presidency.ucsb.edu, cdc.gov, tmz.com
CHF ROR: 30-Day Hospital Readmission Risk Prediction
[Diagram, two phases. Building the model: Patient, Features Vectors, Class Labels (No readmission / Readmission), Machine Learning Algorithms. Scoring the tuple: New patient, Features Vector, predicted class (No readmission / Readmission)]
Some of the Steps
• Data Understanding and Integration
• Data Cleaning
• Data Transformation
Extracting data from Epic: 16 data marts and 200 views (Heart Failure Inpatient Summary, Encounter.Flowsheet, PatientEncounterHospital)
vs
Public Data: State Inpatient Dataset 2009-2012
Variety and Volume (2/3 V's of Big Data)
• Washington State Inpatient Data
• Admission-level claims
• ~400 attributes
  • Demographics
  • ICD9 diagnosis codes
  • ICD9 procedure codes
  • Charges
• Admissions by year
  • 2009: 652,702
  • 2010: 651,783
  • 2011: 648,079
  • 2012: 648,092
AGE  ZIP    RACE  ATYPE  NCHRONIC  LOS  FEMALE  DXCCS1  PRCCS1  TOTCHG
52   98122  1     3      12        3    0       153     212     56,511
87   98109  1     3      7         1    1       162     -       12,687
26   98028  4     3      1         30   1       139     195     127,300
Pre Admission
- Demographics: marital status, age, racial group, gender
- Vital Sign: pulse rate, blood pressure, respiration rate, BMI
- Prior Hospitalization: number of prior admissions, prior length of stay
Post Admission
+ Demographics
+ Vital Sign
+ Prior Hospitalization
- Lab Test: sodium level, glucose level, hemoglobin level, creatinine level, hematocrit level, neutrophils level, ejection fraction, BUN level
Pre-Discharge
+ Demographics
+ Vital Sign
+ Prior Hospitalization
+ Lab Test
- Diagnosis Information: number of secondary diagnoses, chronic systolic heart failure, acute kidney failure, chest pain, hyperpotassemia, bronchopneumonia, other chronic pulmonary heart diseases, syncope and collapse, …
Discharge
+ Demographics
+ Prior Hospitalization
- Comorbidities: acute coronary syndrome, asthma, COPD, ulcer, dialysis, dementia, arrhythmias, malnutrition, vascular, depression
- Discharge/Admit codes: admit/discharge type, severity of illness, risk of mortality
- Utilization Information: operating room, CT scan, emergency room, CCU
(Dec 2012) Initial Models
Data integration
Feature Construction
Predictive modeling
• Logistic Regression
• Naïve Bayes
• Support Vector Machines
[Bar chart: Area Under the Curve (AUC). Yale Model (Comparative Baseline): 0.6; Amarasingham et al.: 0.72; Our current Result: 0.64]
Several rejects: KDD Industry Track 2013, AMIA 2013, JAMIA 2013
(July 2013) (much better) & Some Papers
§ Improved data exploration
§ S.-C. Chin, K. Zolfaghar, S. Basu Roy, A. Teredesai, and P. Amoroso, "Divide-n-Discover: Discretization based Data Exploration Framework for Healthcare Analytics," 7th International Conference on Health Informatics (HEALTHINF Short Paper), Angers, France, 2014
§ Improved modeling effort
§ N. Meadem, N. Verbiest, K. Zolfaghar, J. Agarwal, S.-C. Chin, S. Basu Roy, A. Teredesai, D. Hazel, P. Amoroso, and L. Reed, "Predicting Risk of Readmission for Congestive Heart Failure Patients," Workshop on Data Mining for Healthcare (DMH), Chicago, IL, 2013
[Bar chart: Area Under the Curve (AUC). Yale Model (Comparative Baseline): 0.6; Amarasingham et al.: 0.72; Our 2012 Result: 0.64; Our current Result: 0.74]
(Dec 2013) Prototype or a possible Product? & yes, More Papers
§ Successful deployment
§ K. Zolfaghar, J. Agarwal, D. Sistla, S.-C. Chin, S. Basu Roy, and N. Verbiest, "Risk-O-Meter: An Intelligent Clinical Risk Calculator," 19th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), Chicago, IL, 2013
§ K. Zolfaghar, N. Meadem, A. Teredesai, S. Basu Roy, S.-C. Chin, and B. Muckian, "Big data solutions for predicting risk-of-readmission for congestive heart failure patients," BigData Conference 2013: 64-71
Multi Layer Classifier: Automatically Detecting Classification Windows
[Diagram: two-layer classifier. Layer question 1: "Will patient ever readmit?" (YES / NO). Layer question 2: "Will patient readmit within 30 days?" (YES / NO). Classifier labels in the diagram: KNN; LR, NB, SVM, KNN]
• 32% of all data
• Only 5% of the patients who return within 30 days are filtered out
Generalizing the 30, 60, 90 Day readmission
§ Automatic design of time prediction hierarchy
§ Feature selection and factor analysis at each layer
§ Different classification algorithms in each layer, satisfying different quality metrics
Automatic design of prediction hierarchy
Simple 3 Layer Example
• Stage 1: Design a predictive model for the patients who are likely to come back within a time window of (X, K), where X is the maximum number of days until next readmission
• Stage 2: Design a predictive model for time window of (K, 30)
• Stage 3: Design a predictive model for time window of <30 days of readmission
HOW TO AUTOMATICALLY DETECT THE MIDDLE CUTPOINT K?
Hill Climbing Algorithm to Detect K
§ Generate a random number K between X and 30
§ Compute C1 = Centroid(X, K), C2 = Centroid(K+1, 30)
§ Compute KLCurrent = KLDiv(C1, C2)
§ K' = K+i, K'' = K-i
§ Find a point K2 between (K', K''), and check
§ If KLDiv(Centroid(X, K2), Centroid(K2, 30)) > KLCurrent
§ If the above condition is satisfied, then K = K2
§ KLCurrent = KLDiv(Centroid(X, K2), Centroid(K2, 30))
§ Repeat the above steps until no further check is possible
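The search above can be sketched in Python under some loud assumptions that the slide does not state: each record is a hypothetical `(days_to_readmission, feature_vector)` pair, a centroid is the mean feature vector of a window, centroids are normalised to sum to 1 before the KL divergence is taken, the two windows are `[low, K]` and `[K+1, high]`, and the step `i` is fixed. All function names are illustrative.

```python
import math
import random

def centroid(records, lo, hi):
    """Mean feature vector of records whose days-to-readmission is in [lo, hi]."""
    rows = [feats for days, feats in records if lo <= days <= hi]
    if not rows:
        return None
    return [sum(col) / len(rows) for col in zip(*rows)]

def kl_div(p, q, eps=1e-9):
    """KL divergence of two non-negative vectors, each normalised to sum to 1."""
    sp = sum(p) + eps * len(p)
    sq = sum(q) + eps * len(q)
    total = 0.0
    for a, b in zip(p, q):
        pa, qb = (a + eps) / sp, (b + eps) / sq
        total += pa * math.log(pa / qb)
    return total

def split_score(records, low, k, high):
    """Divergence between the centroids of [low, k] and [k+1, high]."""
    c1, c2 = centroid(records, low, k), centroid(records, k + 1, high)
    if c1 is None or c2 is None:
        return float("-inf")
    return kl_div(c1, c2)

def hill_climb_k(records, low, high, step=1, seed=0):
    """Start from a random cutpoint and move to a neighbour while it improves."""
    rng = random.Random(seed)
    k = rng.randint(low + 1, high - 1)
    best = split_score(records, low, k, high)
    improved = True
    while improved:
        improved = False
        for k2 in (k - step, k + step):
            if low < k2 < high:
                score = split_score(records, low, k2, high)
                if score > best:
                    k, best, improved = k2, score, True
    return k, best
```

Like any hill climb, this can stop at a local optimum; restarting from several random values of K is a common mitigation.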
Calculating the Probability of 30 day RoR
P(readmit ≤ 30) = P(≤ 30 | ≤ K) × P(≤ K | Y) × P(Y)
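As a worked illustration of the product above, the short Python sketch below multiplies the three layer outputs. It assumes that Y denotes the event "the patient is ever readmitted" (the first layer of the classifier), which the slide does not spell out; the function name and the numbers are hypothetical.

```python
def prob_readmit_within_30(p_ever, p_within_k_given_ever, p_within_30_given_k):
    """Chain the three layer probabilities into P(readmit <= 30)."""
    for p in (p_ever, p_within_k_given_ever, p_within_30_given_k):
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must lie in [0, 1]")
    return p_within_30_given_k * p_within_k_given_ever * p_ever

# Illustrative numbers only: P(Y) = 0.5, P(<= K | Y) = 0.4, P(<= 30 | <= K) = 0.5
example = prob_readmit_within_30(0.5, 0.4, 0.5)
```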
Risk-O-Meter: Distinguishing Features
• Users: healthcare providers and patients (current systems: only healthcare providers)
• Result explanation and exploration (current systems: need deep domain knowledge)
• Handles incomplete patient input
All in one package: Risk-O-Meter (KDD 2013)
ChroniRisk: Continuous Readmission Risk Assessment Across Continuum of Care*
Pre Admission, Post Admission, Pre-Discharge, Discharge, Post-Discharge Care Management Pipeline
"White Gap"
PCP, HF Service, Care Management, Payer
78%*
42%*
Service Line EMR, PCP Tools, Psycho-social risk scoring
2013 HF Readmission Statistics
• 7.1 M Readmits
• 5.3 M Avoidable
• $13,000 each
• $13 B opportunity cost
Patient Encounters Scored: +18,000 (HF cohort)
Risk: Done
Cost: Done
Next?
Actionable Interventions
If we can predict, can we recommend?
R. Liu, K. Zolfaghar, S.-C. Chin, S. Basu Roy, and A. Teredesai, "A Framework to Recommend Interventions for 30-Day Heart Failure Readmission Risk," 2014 IEEE International Conference on Data Mining (ICDM), 2014, pp. 911-916. DOI: 10.1109/ICDM.2014.89
A real and common Chronic Readmission
75-year old, female
Chronic pulmonary disease, depression, hypertension and diastolic heart failure
Risk levels: High Risk, Medium Risk, Low Risk
Readmit!
• Intervention Plan 1: Major Operating Room, Chest X-ray and others
• Intervention Plan 2: Echocardiology, CCU and others
• Intervention Plan 3: Emergency Room and others
Risk will be lower when the interventions are performed
The patient is not readmitted
Intervention Rule Generation
[Diagram: Bayesian network with nodes Readmission, Age, Gender, Pneumonia (DX486), Acute respiratory failure (DX51881), CHF (DX4280), Cont inv mec ven <96 hrs (PR9671), Venous cath NEC (PR3893), Packed cell transfusion (PR9904)]
Rule Repository
• Valid Rule 1: Female, Diabetes, Major Operating Room, Chest X-ray and others
• Valid Rule 2: Male, Hypertension, Echocardiology, CCU and others
• Invalid Rule 3: Female, Depression, Emergency Room and others
• Invalid Rule 4: Male, COPD, Emergency Room and others
Pipeline: Bayesian Network Construction → Intervention Rule Generation → Intervention Recommendation → Evaluation
• Compute patient risk using only non-procedural attributes
• Compute patient risk using procedural attributes
• Compare the difference between the two probabilities
• Store the rules where the risk is reduced after introducing the procedures
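The rule generation steps can be sketched as follows. This is only an illustration: the deck computes risk with a learned Bayesian network, whereas the `risk` function here is a toy logistic stand-in with hypothetical weights, and every attribute name is made up for the example.

```python
import math

# Hypothetical log-odds weights standing in for the learned Bayesian network.
WEIGHTS = {
    "female": 0.1, "diabetes": 0.6, "depression": 0.4,
    "chest_xray": -0.5, "echocardiology": -0.3, "emergency_room": 0.7,
}
BIAS = -1.0

def risk(attributes):
    """Toy readmission risk: logistic function of the summed attribute weights."""
    z = BIAS + sum(WEIGHTS.get(a, 0.0) for a in attributes)
    return 1.0 / (1.0 + math.exp(-z))

def generate_rules(profiles, procedure_sets):
    """Keep (profile, procedures, reduction) where adding procedures lowers risk."""
    rules = []
    for profile in profiles:
        base = risk(profile)                   # non-procedural attributes only
        for procs in procedure_sets:
            with_procs = risk(profile | procs)  # procedural attributes added
            if with_procs < base:               # store only risk-reducing rules
                rules.append((profile, procs, base - with_procs))
    return rules
```

For example, `generate_rules([frozenset({"female", "diabetes"})], [frozenset({"chest_xray"}), frozenset({"emergency_room"})])` keeps only the chest X-ray rule under these toy weights.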
Recommendation for New Patient
The new patient's attributes are matched against the Rule Repository to obtain the top 3 intervention plans:
• Intervention Plan 1: Major Operating Room, Chest X-ray and others
• Intervention Plan 2: Echocardiology, CCU and others
• Intervention Plan 3: Emergency Room and others
Summarize
Summarized Intervention Plan: Major Operating Room, Echocardiology, Chest X-ray and others
The Rule Repository is HUGE (over 30k rules)! Parallel solution!
• Compute similarity between the established attribute profile and a given patient profile
• Identify rules where the established attribute profile is most similar to the patient input
• Recommend interventions extracted from the established rules
Validation – Data Highlights
• State Inpatient Database (SID) of Washington State, heart failure cohort: year 2010 (67,967 patients) for training and year 2011 (52,021 patients) for testing
• 3,908 diagnosis codes and 2,049 procedure codes are involved.
• Feature selection is performed using the chi-square test.
Selected features:
• Demographics: Age, Gender, Race
• Comorbidity & Diagnosis: 21 comorbidities and 90 diagnoses
• Utilization & Interventions: 21 health service utilization flags and 70 interventions
• Others: Length of stay, number of diagnoses and interventions
High Dimensional
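To illustrate the chi-square feature selection mentioned above, here is a small pure-Python sketch for binary features against a binary readmission label. The function names and the top-k ranking step are assumptions for illustration; the deck does not describe its exact selection procedure.

```python
def chi_square_2x2(feature, label):
    """Pearson chi-square statistic for a binary feature vs. a binary label."""
    n = len(feature)
    observed = [[0, 0], [0, 0]]
    for f, y in zip(feature, label):
        observed[int(bool(f))][int(bool(y))] += 1
    row_totals = [sum(row) for row in observed]
    col_totals = [observed[0][j] + observed[1][j] for j in range(2)]
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            if expected > 0:
                stat += (observed[i][j] - expected) ** 2 / expected
    return stat

def rank_features(columns, label, top_k):
    """Names of the top_k features with the largest chi-square statistic."""
    scores = {name: chi_square_2x2(values, label) for name, values in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```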
• Extract patients from the test set who were not readmitted within 30 days
• Compute the evaluation metrics between the recommended interventions and the actual interventions
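One of the reported metrics is the Jaccard index between recommended and actual interventions. A minimal sketch of that comparison, assuming each test patient contributes a hypothetical `(recommended, actual)` pair of intervention sets and that the per-patient scores are simply averaged:

```python
def jaccard_index(recommended, actual):
    """Overlap of two intervention sets: |intersection| / |union|."""
    recommended, actual = set(recommended), set(actual)
    union = recommended | actual
    return len(recommended & actual) / len(union) if union else 1.0

def mean_jaccard(pairs):
    """Average Jaccard index over (recommended, actual) pairs."""
    if not pairs:
        raise ValueError("need at least one patient")
    return sum(jaccard_index(r, a) for r, a in pairs) / len(pairs)
```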
Validation – Experiment Results
[Four bar charts comparing Linear Regression, Hill-Climbing, Grow-Shrink and Hybrid: Hits (axis 0 to 400), Jaccard Index (axis 0.34 to 0.4), Accuracy (axis 0.93 to 0.942), True Positive Rate (axis 0.45 to 0.65)]
Back to the Chronic Readmission Case
75-year old, female
Chronic pulmonary disease, depression, hypertension and diastolic heart failure
No-readmit!
Cardiac catheterization lab, CT scan, echo-cardiology, echo-cardiogram
Accountable Care Organizations: Cost/Charge Prediction
James Marquardt, Stacey Newman, Deepa Hattarki, Rajagopalan Srinivasan, Shanu Sushmita, Prabhu Ram, Viren Prasad, David Hazel, Archana Ramesh, Martine De Cock, and Ankur Teredesai, "HealthSCOPE: An Interactive Distributed Data Mining Framework for Scalable Prediction of Healthcare Costs," IEEE Data Mining Conference Demo Track, 2014, pp. 1227-1230. DOI: 10.1109/ICDMW.2014.45
Motivation: ACO Cost Prediction
QUESTIONS
• What are healthcare costs for the assigned population in 2015?
• Why is the cost so high or low?
• How does the cost distribute across demographics?
[Diagram labels: QUESTIONS, DATA, DATA SCIENCE, APPLICATIONS]
[Diagram labels. Data: Demographics, Diagnosis Codes, Procedure Codes, Drugs, Lab Results, Charges, Vitals, grouped as Clinical and Claims. Sources: SID, OSHPD, MEPS; MultiCare Collaboration. Modeling: Population Predictive Modeling, Individual Predictive Modeling. Applications: Feature Prioritization, Health Prediction, Care Management. Reference: Chandola et al., KDD 2013]
Cost/Charge Prediction: Problem Description
• Goal → predict the future healthcare cost of individuals based on their past medical and cost information.
• Supervised machine learning problem.
• Input:
  • Previous health information (e.g. diagnosis, comorbidities, etc.)
  • General demographics (age, gender, race)
  • Previous healthcare cost
  • {X} = (x1, x2, x3, ..., xp)
• Output:
  • Y = Future healthcare cost
Four Scenarios for predicting cost
• Three months of historical data (medical, demographic and cost) → cost of following nine months (1Q)
• Six months of historical data (medical, demographic and cost) → cost of following six months (2Q)
• Nine months of historical data (medical, demographic and cost) → cost of following three months (3Q)
• Twelve months of historical data (medical, demographic and cost) → cost of following twelve months (4Q)
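The four scenarios amount to different history/target splits of a beneficiary's monthly cost series. A minimal sketch, assuming costs arrive as a hypothetical list of monthly totals and that the target is the summed cost of the following window:

```python
# (months of history used as input, months of cost to predict), per the list above
SCENARIOS = {"1Q": (3, 9), "2Q": (6, 6), "3Q": (9, 3), "4Q": (12, 12)}

def split_history_target(monthly_costs, scenario):
    """Return (history window, total cost of the following window)."""
    history_months, target_months = SCENARIOS[scenario]
    needed = history_months + target_months
    if len(monthly_costs) < needed:
        raise ValueError(f"scenario {scenario} needs {needed} months of data")
    history = monthly_costs[:history_months]
    target = sum(monthly_costs[history_months:needed])
    return history, target
```

In practice the history window would also carry the medical and demographic attributes; only the cost component is shown here.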
Non-Gaussian Distribution of Healthcare Costs
This makes it a challenging and interesting research problem.
Existing Cost Prediction Methods
• Limited to rule-based or multiple linear regression methods
• Rule-based methods
  • Require domain knowledge
  • Expensive
• Multiple linear regression
  • Multi-collinearity issue
  • Sensitive to extreme values (outliers)
• Evaluation
  • Estimate the mean cost of the given sampling distribution.
  • Often in-sample data is used to report predictive performance.
  • R2 evaluation metric (not a true indicator)
Our Contributions
• Investigate the utility of state-of-the-art machine learning algorithms for the cost prediction problem.
• We empirically evaluate three algorithms:
  • Regression Trees
  • M5 Model Trees
  • Random Forest
Regression Tree
[Diagram: example regression tree with Yes/No splits on "Age > 60?", "Has Asthma?" and "Gender = Female?"; leaf predictions 21,000, 46,000, 62,000 and 85,000]
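A regression tree predicts by following Yes/No splits down to a leaf that holds a constant cost. The sketch below reuses the questions and leaf values from the diagram, but the arrangement of splits and leaves is an assumption, since the transcript does not preserve the tree's layout.

```python
from dataclasses import dataclass
from typing import Callable, Union

@dataclass
class Leaf:
    cost: float  # constant prediction stored at the leaf

@dataclass
class Split:
    question: str
    test: Callable[[dict], bool]
    yes: "Node"
    no: "Node"

Node = Union[Leaf, Split]

def predict(node, patient):
    """Walk from the root to a leaf, answering one question per level."""
    while isinstance(node, Split):
        node = node.yes if node.test(patient) else node.no
    return node.cost

# Assumed layout: age at the root, asthma and gender on the second level.
TREE = Split(
    "Age > 60?", lambda p: p["age"] > 60,
    yes=Split("Has Asthma?", lambda p: p["asthma"],
              yes=Leaf(85_000), no=Leaf(62_000)),
    no=Split("Gender = Female?", lambda p: p["gender"] == "F",
             yes=Leaf(46_000), no=Leaf(21_000)),
)
```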
M5 Model Tree
[Diagram: example model tree with Yes/No splits on "Age > 60?", "Has Asthma?" and "Gender = Female?"]
Random Forest
[Diagram: three example trees, each with Yes/No splits and leaf predictions 21,000, 46,000, 62,000 and 85,000. Tree 1 splits on "Had Procedure X?", "Age > 18?" and "Gender = Male?". Tree 2 splits on "# Admits > 3?", "Race = White?" and "Has CHF?". Tree 3 splits on "Age > 60?", "Has Asthma?" and "Gender = Female?"]
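For regression, a random forest averages the predictions of its trees. The sketch below hard-codes three small trees using the split questions from the diagram; which branch leads to which leaf value is an assumption, and a real forest would learn each tree from a bootstrap sample with random feature subsets rather than being written by hand.

```python
from statistics import mean

def tree_1(p):
    if p["had_procedure_x"]:
        return 85_000 if p["age"] > 18 else 62_000
    return 46_000 if p["gender"] == "M" else 21_000

def tree_2(p):
    if p["admits"] > 3:
        return 85_000 if p["race"] == "White" else 62_000
    return 46_000 if p["has_chf"] else 21_000

def tree_3(p):
    if p["age"] > 60:
        return 85_000 if p["asthma"] else 62_000
    return 46_000 if p["gender"] == "F" else 21_000

def forest_predict(trees, patient):
    """Random forest regression output: the mean of the per-tree predictions."""
    return mean(tree(patient) for tree in trees)

PATIENT = {"had_procedure_x": True, "age": 70, "gender": "F",
           "admits": 1, "race": "White", "has_chf": True, "asthma": False}
```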
Evaluation Metrics
• Mean Absolute Error (MAE)
• Root Mean Squared Error (RMSE)
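Both metrics follow their standard definitions: MAE is the mean of |actual - predicted| and RMSE is the square root of the mean squared difference. A small sketch with illustrative function names:

```python
import math

def mae(actual, predicted):
    """Mean Absolute Error: average of |actual - predicted|."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root Mean Squared Error: sqrt of the average squared difference."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))
```

Because RMSE squares each error, a few very expensive patients inflate it far more than MAE, which matters for a heavily skewed cost distribution.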
MAE Results – SID Data (3Q Scenario)
[Bar chart: MAE ($), axis 0 to 30,000. Models: Average Baseline, Previous Cost Regression, Multiple Linear Regression, Regression Tree, Random Forest, Model Tree. Groups: Baselines, Advanced Models]
MAE Results – MEPS Data
[Bar chart: MAE ($), axis 0 to 14,000. Models: Average Baseline, Previous Cost Regression, Multiple Linear Regression, Regression Tree, Random Forest, Model Tree. Groups: Baselines, Advanced Models]
Prediction Error Results – M5 Model Trees
[Chart: Error ($), axis 0 to 90,000, showing MAE and RMSE for the 1Q, 2Q, 3Q and 4Q scenarios]
Error Distribution: WA State SID Data
For a large fraction of the population (75%), we were able to predict with higher accuracy using these algorithms.
[Chart: Maximum Prediction Error ($), axis 0 to 18,000, against Portion of Population (0%, 25%, 50%, 75%) for Multiple Linear Regression, Regression Tree, Random Forest and Model Tree]
Sub-Population Cost Prediction
[Diagram: Population → Sub-Population → Prediction → Future Healthcare Cost]
Sub-populations:
• Congestive heart failure (CHF)
• Diabetes
• COPD
• Asthma
• Coronary artery disease (CAD)
• Age 65+
Most difficult cohort to predict
[Bar chart: MAE ($), axis 0 to 35,000, for the Asthma, Diabetes, CHF, COPD, Coronary and Over 65 cohorts, comparing model trees with linear regression]
Engineering the Solutions: Risk-O-Readmission & Cost-as-a-Service
---------- Forwarded message ----------
From: Windows Azure Pass System Admin <[email protected]>
Date: Thu, Nov 7, 2013 at 10:50 AM
Subject: Gifting Letter for Windows Azure Research Pass
To: "Ankur M. Teredesai" <[email protected]>
Cc: "Azure4Research (RFP External)" <[email protected]>
Dear Ankur M. Teredesai,
We have approved your application for a Windows Azure Research Pass Grant. In order to receive your pass, download the Microsoft gifting letter from the following link:
Risk-‐of-‐Readmission as a Service
Scale Issues: Cost Prediction as a Service
[Architecture diagram, steps ① to ④]
• Web App for ACOs: sends Beneficiary Claims (Population Batch/Individual); receives Predicted, Previous year, Historic population Costs + population statistics
• Web App for Individual: sends Beneficiary Claims for individual ①; receives Predicted cost for the individual ④
• Cost Prediction API: Individual Beneficiary Feature Vector in, Individual Beneficiary Predicted Cost out
• Cost Prediction Engine: Model Selector over a Model Bank deployed on ADAPA (A: Linear Regression, B: Regression Trees, C: M5 Model Trees)
• Big Data Stack: R, Spark
• Data Sources: WA-SID Claims / MEPS Survey (for training)
Apache Spark
[Diagram: Apache Spark on HDFS. A Master node holds the Driver and the RDD; two slave nodes each run Spark with an in-memory data partition (Partition 1, Partition 2); HDFS stores Data Partition 1, Data Partition 2 and replica partitions]
Weighted k-NN for Regression
[Diagram: a test instance is sent to Node 1 (Data Partition 1) and Node 2 (Data Partition 2). Each node computes its local kNN (kNN1, kNN2). The 2k candidates are grouped and sorted, the top k are kept, and their weighted average gives the Predicted Cost]
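The flow in the diagram can be sketched in plain Python, with each partition processed in turn instead of on separate Spark nodes. Inverse-distance weighting and Euclidean distance are assumptions; the slide only says "weighted average". All names are illustrative.

```python
import heapq
import math

def local_knn(partition, query, k):
    """k nearest (distance, cost) pairs inside one data partition."""
    scored = [(math.dist(features, query), cost) for features, cost in partition]
    return heapq.nsmallest(k, scored)

def weighted_knn_predict(partitions, query, k, eps=1e-9):
    """Merge per-partition candidates, keep the global top k, average by weight."""
    candidates = []
    for partition in partitions:          # on a cluster, one partition per node
        candidates.extend(local_knn(partition, query, k))
    top_k = heapq.nsmallest(k, candidates)  # "group & sort", then top k
    weights = [1.0 / (dist + eps) for dist, _ in top_k]
    return sum(w * cost for w, (_, cost) in zip(weights, top_k)) / sum(weights)

PARTITIONS = [
    [((0.0,), 100.0), ((10.0,), 900.0)],   # data partition 1
    [((2.0,), 300.0), ((20.0,), 5000.0)],  # data partition 2
]
```

Each partition only ships k candidates, so the merge step handles at most k times the number of partitions rows regardless of the training set size.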
Rough Set
• Rough set theory is an ML framework that is especially suitable for information systems with inconsistencies.
• Rough set theory handles discrete attributes.
• Lower approximation: instances that necessarily belong to the class
• Upper approximation: instances that possibly belong to the class
Patient  Age ≥ 50  Alcohol Disorder Visit  Cost
P1       Yes       Yes                     High
P2       Yes       Yes                     High
P3       Yes       No                      Low
P4       Yes       No                      High
P5       No        No                      Low
P6       No        Yes                     High
Similar patients that belong to different classes (P3 and P4)!
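Applied to the six-patient table, the two approximations of the "High" cost class can be computed as below. This is a minimal sketch of the standard crisp definitions; the data structure and function name are illustrative.

```python
from collections import defaultdict

# (Age >= 50, Alcohol Disorder Visit) -> cost class, copied from the table
PATIENTS = {
    "P1": (("Yes", "Yes"), "High"),
    "P2": (("Yes", "Yes"), "High"),
    "P3": (("Yes", "No"), "Low"),
    "P4": (("Yes", "No"), "High"),
    "P5": (("No", "No"), "Low"),
    "P6": (("No", "Yes"), "High"),
}

def approximations(table, target):
    """Lower and upper approximation of the class `target`."""
    blocks = defaultdict(list)  # patients with identical attribute values
    for name, (attributes, label) in table.items():
        blocks[attributes].append((name, label))
    lower, upper = set(), set()
    for members in blocks.values():
        names = {name for name, _ in members}
        labels = {label for _, label in members}
        if labels == {target}:   # every look-alike is in the class
            lower |= names
        if target in labels:     # at least one look-alike is in the class
            upper |= names
    return lower, upper
```

P3 and P4 share the same attribute values but differ in class, so they appear only in the upper approximation of "High".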
Fuzzy Rough Set
• Uses fuzzy logic to handle continuous attributes.
• Similarity matrix contains values between 0 and 1.
• Inconsistent instances are highly related but have a different class.
Patient  Age  Alcohol Disorder Visits  Cost
P1       52   1                        $13335
P2       59   4                        $277966
P3       55   0                        $8139
P4       50   0                        $66058
P5       34   0                        $5815
P6       26   1                        $38526
Similarity matrix:
     P1    P2    P3    P4    P5    P6
P1   1     0.52  0.83  0.84  0.60  0.61
P2   0.52  1.00  0.44  0.36  0.12  0.13
P3   0.83  0.44  1     0.92  0.68  0.44
P4   0.84  0.36  0.92  1     0.76  0.51
P5   0.60  0.12  0.68  0.76  1     0.75
P6   0.61  0.13  0.44  0.51  0.75  1
Fuzzy Rough Set
• Let r_{j,i} be the degree of similarity of instances i and j.
• Let c_i be the degree to which instance i belongs to the class.
• Then the degree to which instance j belongs to the:
  • Lower approximation of the class is: min{ max(1 - r_{j,i}, c_i) | i = 1,...,n }
  • Upper approximation of the class is: max{ min(r_{j,i}, c_i) | i = 1,...,n }
• Current implementations can handle only up to 100,000 instances because they keep the similarity matrix in memory.
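The two formulas can be evaluated directly on the six-patient similarity matrix. In this sketch the class memberships `C` are an assumption: they are taken as crisp 0/1 values from the High/Low labels of the earlier rough set table, which the slides do not explicitly connect to this matrix.

```python
# Similarity matrix r[j][i] for P1..P6, copied from the slide.
R = [
    [1.00, 0.52, 0.83, 0.84, 0.60, 0.61],
    [0.52, 1.00, 0.44, 0.36, 0.12, 0.13],
    [0.83, 0.44, 1.00, 0.92, 0.68, 0.44],
    [0.84, 0.36, 0.92, 1.00, 0.76, 0.51],
    [0.60, 0.12, 0.68, 0.76, 1.00, 0.75],
    [0.61, 0.13, 0.44, 0.51, 0.75, 1.00],
]
# Assumed membership in the "High cost" class: P1, P2, P4, P6 high; P3, P5 low.
C = [1.0, 1.0, 0.0, 1.0, 0.0, 1.0]

def lower_approximation(r, c):
    """For each j: min over i of max(1 - r[j][i], c[i])."""
    n = len(c)
    return [min(max(1.0 - r[j][i], c[i]) for i in range(n)) for j in range(n)]

def upper_approximation(r, c):
    """For each j: max over i of min(r[j][i], c[i])."""
    n = len(c)
    return [max(min(r[j][i], c[i]) for i in range(n)) for j in range(n)]

LOWER = lower_approximation(R, C)
UPPER = upper_approximation(R, C)
```

Under these assumed labels, P4 gets a lower-approximation degree of only 0.08 because it is 0.92 similar to P3, which is in the other class; P3 in turn gets an upper-approximation degree of 0.92. This is the "highly related but different class" inconsistency described above.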
Implementation
• The construction of the similarity matrix can be done in a parallel manner, making each of K compute nodes calculate n/K columns of the similarity matrix.
• No need to store the similarity matrix as a whole.
• The construction of the similarity matrix does not have to be finished before (partial) computation of the lower and upper approximations can begin.
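The idea can be sketched by letting each "node" fold its share of columns into running minima and maxima, so that no full similarity matrix is ever stored. The nodes are simulated sequentially here, and the similarity function (mean over attributes of 1 - |difference| / attribute range) is an assumption, although it does appear consistent with values on the earlier slide such as 0.52 for P1 and P2.

```python
def similarity(a, b, ranges):
    """Assumed similarity: mean over attributes of 1 - |a - b| / range."""
    return sum(1.0 - abs(x - y) / r for x, y, r in zip(a, b, ranges)) / len(ranges)

def partial_bounds(data, c, columns, ranges):
    """One node's work: fold its columns into running lower/upper bounds."""
    n = len(data)
    lower, upper = [1.0] * n, [0.0] * n
    for i in columns:                      # one similarity column at a time
        for j in range(n):
            r = similarity(data[j], data[i], ranges)
            lower[j] = min(lower[j], max(1.0 - r, c[i]))
            upper[j] = max(upper[j], min(r, c[i]))
    return lower, upper

def fuzzy_rough_parallel(data, c, num_nodes):
    """Combine the per-node partial results with element-wise min and max."""
    n = len(data)
    ranges = [(max(col) - min(col)) or 1.0 for col in zip(*data)]
    lower, upper = [1.0] * n, [0.0] * n
    for node in range(num_nodes):          # each iteration would run on its own node
        lo, up = partial_bounds(data, c, range(node, n, num_nodes), ranges)
        lower = [min(a, b) for a, b in zip(lower, lo)]
        upper = [max(a, b) for a, b in zip(upper, up)]
    return lower, upper

DATA = [(52, 1), (59, 4), (55, 0), (50, 0), (34, 0), (26, 1)]  # age, visits
C = [1.0, 1.0, 0.0, 1.0, 0.0, 1.0]  # assumed crisp "High cost" membership
```

Because min and max are associative, the result does not depend on how the columns are divided among nodes.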
Implementation: Lower Approximation and Upper Approximation
[Diagram: computation split across Node 1 and Node 2]
Spark vs MPI Fuzzy Rough Set
Readmission Application
• Android
• Windows Phone
• Patient View
  • What is my risk?
• Doctor View
  • Who are my risky patients?
  • Alerts
  • Interventions
Healthcare Scalable COst Prediction Engine (HealthSCOPE)
http://healthscope.cloudapp.net/hscope-dev/aco/
Milestones: Readmission Risk
AUC: accuracy measure (Area Under the Curve)
0.6 AUC: Yale Model (Baseline)
0.64 AUC: UW 2012 Result
Ensemble method, hierarchical classification
Dec 2012
0.74 AUC: UW 2014 Result
Lab results + new algorithm (AdaBoost)
Feb 2014
QlikView Readmission App
Dec 2013
Machine Learning Process to Target New Chronic Diseases
Aug 2014 -> Moving Forward
Integrating care pathway
March 2014
Bayesian Network Learning
Real Time Care Factors & Pathways
July 2014
with EPIC
Post-Discharge (Clinical data)
June 2013
Risk-O-Meter Development + Big Data Efforts
Pre-Admission (Clinical data)
Post-Discharge (Claim data)
Post-Admission (Clinical data)
IEEE Big Data: REF #3
KDD: REF #1 & 2
HEALTHINF: REF #4 & 5
KDD: REF #6
ICDM 2014: REF #6
Problem Exploration
Milestones: Cost Prediction
• H-SCOPE I (June 2014): SID data
• H-SCOPE II (August 2014): Population View (ACO), OSHPD data application
• H-SCOPE III (Sept. 2014): ADAPA scoring engine, Spark framework
• H-SCOPE IV (Nov. 2014): SID + MEPS data
• H-SCOPE V (Dec. 2014): Five cohort
• HealthSCOPE VI (July 2015)
• H-SCOPE VII: AHRQ private data
Aug 2015 -> Moving Forward
Other timeline labels: Admit Level; Beneficiary Level; Beneficiary View; Four Future Scenario; Sub-Population; M5 Model Trees, Random Forest, Regression Trees; Deep Learning; Time & Cost of Hospital Readmission; Time, Cost and Illness (Alignment) Prediction
Venues: ICDM 2014, KDD-2015, AMIA-2015, WWW-DigitalHealth-2015
Milestones: Merging Threads
AUC: accuracy measure (Area Under the Curve)
Timeline: 2012, 2013, 2014, 2015, 2016 and beyond
Risk of Readmission (Clinical, Sociological & Claims)
2014 2015
Cost Prediction (Claims and secondary data sources)
2015
Risk & Cost Convergence
Our Sincere Thanks for Your Support!