societal impact of applied data science on the big data stack

Product Decisions through Big Data

Center for Data Science

Ankur Teredesai

University of Washington Tacoma

March 14th, 2015

Health Informatics
• Bioinformatics
• Health and Wellness
• Predictive Analytics

Geo-Spatial Data Management
• Distributed Systems
• Databases
• Geo-Spatial
• Embedded Systems

Intelligent Systems
• Machine Learning
• Data Mining
• Computational Intelligence
• Computer Vision

Social Computing
• Web
• Devices
• Mobile Networks
• UX / UI

Big Data Security
• Cryptology
• Secure Machine Learning

Big Data Infrastructure
• Engineering
• Dev-Ops

Center for Data Science: Societal Impact

A Big Data Project Blueprint:

[Figure: project blueprint spanning Machine Learning, Analytics, and Engineering. Labels: Features; Algorithm; Scalability; ELT; Integrate Sources; Constraints; Deploy Models, APIs, Apps; Data Struggles.]

Data Mining: 1989 - 2010

• Data Science and Applications move and transform sizeable amounts of data out of the native database or file systems.

Applications

SQL/ODBC/JDBC Data Access

Distributed Database: Multi-Core, Columnar, Key-Value




Data Science using R, SAS, SPSS, Weka, MAHOUT

HIGH VOLUME

HIGH LATENCY

HIGH VOLUME

Application Ecosystem Integration

Data Science uses native data representation and inherent distribution and parallelism

Minimal data movement

Rapid Application development using data science constructs


Big Data Science

Application Ecosystem Integration

Applications

SQL/ODBC/JDBC Data Access

Data Science
• Internal algorithms for clustering, classification, regression


LOWER VOLUME

LOWER LATENCY

HIGH VOLUME

LOW LATENCY

Big Data Science Components

A Short History of (Big) Data Technology

1970: Codd publishes "A Relational Model of Data for Large Shared Data Banks"

1985: Copeland – Decomposition Storage Model (essentially the first columnar store)

1989: Shared-Nothing Architecture

2004: Google – MapReduce

2005: C-Store (eventually Vertica), layers WS/RS

2007: Materialization Optimizations in Columnar Stores and Hadoop Implementation

2005-07: Star-Schema Benchmark + Hadoop

2008: Attempts to backport columnar advances to row storage, not very effective

Today: BIG DATA

Technology Decisions


Columnar Vs Relational Storage Technologies

Infinite scale using commodity hardware

Private or Public Cloud

Massively Distributed and Parallel Architecture: Hadoop

Stream Query Processing for trillions of events and petabytes of data

Real-time classification and clustering: Approximate scoring and segmentation + Reporting and Data Visualization

Flat Files CSV Claims X12 Clinical HL7

Distance Compute Library

Instance Selection

RNGE Drop 3

Fuzzy Rough Set Approximation

CHF Risk of Readmission

Geo Routing

Random Forests KNN

Industry Partners and Domain Experts

Other Solutions

HDFS NUMA

MPI Grappa

Census US Gov Unstructured CCD

Bayesian Networks

Support Vector Machines

Cost of Chronic Interventions

Age/Gender Prediction

Malware Analytics

Personalized Cancer Therapy

ETL Tools

Raw Data from Sources (SID, OSHPD, HCUP, Edifecs, MHS, CMS, LINCS, Industry)

Sqoop

iTornado

Routing Service With Real World Severe Weather

Demo Paper in ACM SIGSPATIAL 2014 (Best Demo Paper Award)

Fatalities Stats by Weather Related Hazards, http://www.nws.noaa.gov, June 2014.

COMA

Road Network Compression For Map Matching

ACM SIGSPATIAL IWGS 2014

PreGo

Dynamic Multi-Preference Routing

                    Single Attribute    Multiple Attribute
Time-Homogeneous    Dijkstra, A*        Stewart et al 91
Time-Variant        Betsy et al 07      ?

[Figure: example road network with nodes s, a, b, c, d, e, f, g, h. Nodes carry labels such as <0,0>, <2,2>, <3,4>, <5,7>, and each edge carries two five-element vectors T and R, e.g. T=[1,2,3,4,5] R=[1,2,3,4,5] or T=[5,1,3,4,5] R=[7,1,2,4,5].]

Special Needs Education: Teacher Trainer Effectiveness Analysis

Customized Surveys

Training Registration

Survey Management

Purpose

To support streamlined data collection and performance evaluation across the State Needs Projects.

Project Stakeholders

Office of the Superintendent of Public Instruction

Center for Data Science

Data Dashboard

Report Generation

Geographic Distribution Maps

Demographic Reports

Brad Porter, Aniruddha Desai, Yitao Li, David Hazel, Michelle Maike, Greg Benner, Ankur Teredesai, Leslie Pyper, Vickie Green

Systems Biology


Predictive Models and software

Applications: Personalized medicine, drug discovery

Focus: Develop machine learning methods and tools to effectively integrate multiple big data sources in biology.

A Flying Hadoop Cluster


Detecting Malware Activity based on Automatically Generated Domains

Command & Control

xyz.com xyz.com

Infected node

Partnering with NIARA we obtained a large dataset of Automatically Generated Domains.

Based on the intercepted domain features we are able to identify the malware infecting a network.

(March 2012)
• Will this Heart Failure patient get readmitted within 30 days?
• Yes or No (Binary Classification)

Reduce CHF Readmission

Readmission? Machine Learning?

Joint NSF / NIH Solicitation on Health Care and Big Data

Affordable Care Act => Avoidable Costs

Readmissions are AVOIDABLE

[Charts: readmission values of 20% and 32% over the 30-day and 60-day windows; a pie chart with 75% and 25% slices labeled Non CHF and CHF.]

• Readmissions national cost $17 billion annually

• 76 % considered avoidable


Readmissions

Congestive Heart Failure (CHF)

Source: www.presidency.ucsb.edu, cdc.gov, tmz.com

CHF ROR: 30-Day Hospital Readmission Risk Prediction

[Figure: patient records are converted into feature vectors with class labels (No readmission / Readmission). Machine learning algorithms build the model; a new patient's feature vector is then scored as No readmission or Readmission.]

Some of the Steps

Data Understanding and Integration

Data Cleaning

Data Transformation

Extracting data from Epic - 16 data marts and 200 views:
• Heart Failure Inpatient Summary
• Encounter.Flowsheet
• PatientEncounterHospital

vs

Public Data: State Inpatient Dataset 2009-2012

AGE  ZIP    RACE  ATYPE  NCHRONIC  LOS  FEMALE  DXCCS1  PRCCS1  TOTCHG
52   98122  1     3      12        3    0       153     212     56,511
87   98109  1     3      7         1    1       162     -       12,687
26   98028  4     3      1         30   1       139     195     127,300

• Washington State Inpatient Data
• Admission level Claims
• ~400 attributes
  • Demographics
  • ICD9 Diagnosis codes
  • ICD9 Procedure codes
  • Charges
• Admissions by year
  • 2009 – 652702
  • 2010 – 651783
  • 2011 – 648079
  • 2012 – 648092

Variety and Volume (2/3 V’s of Big Data)

Pre Admission
- Demographics: Marital status, Age, Racial group, Gender
- Vital Sign: Pulse rate, Blood pressure, Respiration rate, BMI
- Prior Hospitalization: Number of prior admissions, Prior length of stay

Post Admission
+ Demographics
+ Vital Sign
+ Prior Hospitalization
- Lab Test: Sodium level, Glucose level, Hemoglobin level, Creatinine level, Hematocrit level, Neutrophils level, Ejection Fraction, BUN level

Pre-Discharge
+ Demographics
+ Vital Sign
+ Prior Hospitalization
+ Lab Test
- Diagnosis Information: Number of secondary diagnosis, Chronic systolic heart failure, Acute kidney failure, Chest pain, Hyper potassemia, Bronchopneumonia, Other chronic pulmonary heart diseases, Syncope and collapse, …

Discharge
+ Prior Hospitalization
+ Demographics
- Comorbidities: Acute coronary syndrome, Asthma, COPD, Ulcer, Dialysis, Dementia, Arrhythmias, Malnutrition, Vascular, Depression
- Discharge/Admit codes: Admit/Discharge type, Severity of illness, Risk of mortality
- Utilization Information: Operating room, CT scan, Emergency Room, CCU

(Dec 2012) Initial Models


Data integration

Feature Construction

Predictive modeling

• Logistic Regression
• Naïve Bayes
• Support Vector Machines

[Chart: Area Under the Curve (AUC). Yale Model (comparative baseline): 0.60; Amarasingham et al.: 0.72; our current result: 0.64.]

Several Rejects: KDD Industry Track 2013, AMIA 2013, JAMIA 2013

2012

(July 2013) (much better) & Some Papers

§ Improved data exploration

§ S.-C. Chin, K. Zolfaghar, S. Basu Roy, A. Teredesai, and P. Amoroso, "Divide-n-Discover -- Discretization based Data Exploration Framework for Healthcare Analytics," 7th International Conference on Health Informatics (HEALTHINF Short Paper), Angers, France, 2014

§ N. Meadem, N. Verbiest, K. Zolfaghar, J. Agarwal, S.-C. Chin, S. Basu Roy, A. Teredesai, D. Hazel, P. Amoroso, and L. Reed, "Predicting Risk of Readmission for Congestive Heart Failure Patients," Workshop on Data Mining for Healthcare (DMH), Chicago, IL, 2013

[Chart: Area Under the Curve (AUC). Yale Model (comparative baseline): 0.60; Amarasingham et al.: 0.72; our 2012 result: 0.64; our current result: 0.74.]

§ Improved Modeling Effort

(Dec 2013) Prototype or a possible Product? & yes, More Papers

§ Successful Deployment

§ K. Zolfaghar, J. Agarwal, D. Sistla, S.-C. Chin, S. Basu Roy, and N. Verbiest, "Risk-O-Meter: An Intelligent Clinical Risk Calculator," 19th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), Chicago, IL, 2013

§ Kiyana Zolfaghar, Naren Meadem, Ankur Teredesai, Senjuti Basu Roy, Si-Chi Chin, Brian Muckian: Big data solutions for predicting risk-of-readmission for congestive heart failure patients. BigData Conference 2013: 64-71

Multi Layer Classifier : Automatically Detecting Classification Windows

Will patient ever readmit?

Will patient readmit within 30 days?

YES NO

YES NO

KNN

LR, NB, SVM, KNN

32% of all data

Only 5% of patients that return within 30 days are filtered out

Generalizing the 30, 60, 90 Day readmission

§ Automatic design of time prediction hierarchy
§ Feature selection and factor analysis at each layer
§ Different classification algorithms in each layer, each satisfying different quality metrics

Automatic design of prediction hierarchy


Simple 3 Layer Example

• Stage 1: Design a predictive model for the patients who are likely to come back within a time window of (X, K), where X is the maximum number of days until next readmission
• Stage 2: Design a predictive model for time window of (K, 30)
• Stage 3: Design a predictive model for time window of <30 days of readmission

HOW TO AUTOMATICALLY DETECT THE MIDDLE CUTPOINT K?


Hill Climbing Algorithm to Detect K

§ Generate a random number K between X and 30
§ Compute C1 = Centroid(X, K), C2 = Centroid(K+1, 30)
§ Compute KLCurrent = KLDiv(C1, C2)
§ K' = K+i, K'' = K-i
§ Find a point K2 between (K', K''), and check
§ If KLDiv(Centroid(X, K2), Centroid(K2, 30)) > KLCurrent
§ If the above condition is satisfied, then K = K2
§ KLCurrent = KLDiv(Centroid(X, K2), Centroid(K2, 30))
§ Repeat the above steps until no further check is possible
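The hill-climbing steps above can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation. It assumes each patient is a `(days_to_readmission, feature_vector)` pair with non-negative features, that a centroid is the mean feature vector of a window, and that the search moves to a neighbouring cutpoint only when the KL divergence between the two window centroids increases. The names `centroid`, `kl_div`, `split_score`, and `detect_k` are hypothetical.

```python
import math
import random

def centroid(patients, lo, hi):
    """Mean feature vector of patients readmitted between day lo and day hi (inclusive)."""
    rows = [features for days, features in patients if lo <= days <= hi]
    if not rows:
        return None
    return [sum(col) / len(rows) for col in zip(*rows)]

def kl_div(p, q, eps=1e-9):
    """KL divergence between two centroids, each smoothed and normalized to sum to 1."""
    sp = sum(p) + eps * len(p)
    sq = sum(q) + eps * len(q)
    return sum(((a + eps) / sp) * math.log(((a + eps) / sp) / ((b + eps) / sq))
               for a, b in zip(p, q))

def split_score(patients, k, x_max):
    """Divergence between the (30, k] window and the (k, x_max] window."""
    near = centroid(patients, 31, k)
    far = centroid(patients, k + 1, x_max)
    if near is None or far is None:
        return float("-inf")  # an empty window is never a useful split
    return kl_div(far, near)

def detect_k(patients, x_max, step=1, seed=0):
    """Hill-climb from a random cutpoint; assumes x_max > 32."""
    rng = random.Random(seed)
    k = rng.randint(32, x_max - 1)
    current = split_score(patients, k, x_max)
    while True:
        neighbours = [c for c in (k - step, k + step) if 31 < c < x_max]
        best = max(neighbours, key=lambda c: split_score(patients, c, x_max), default=None)
        if best is None or split_score(patients, best, x_max) <= current:
            return k, current  # no neighbour improves the divergence
        k, current = best, split_score(patients, best, x_max)
```

Like any hill climber, this stops at a local optimum, so in practice one might restart from several random values of K.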


Calculating the Probability of 30 day RoR

P(readmit ≤ 30) = P(≤ 30 | ≤ K) × P(≤ K | Y) × P(Y)
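As a small worked illustration of the chained probability above, the sketch below multiplies the three layer outputs. It assumes Y denotes the event that the patient is ever readmitted (the first layer of the hierarchy); the function name and the numbers in the usage note are hypothetical, not results from the study.

```python
def p_readmit_within_30(p_ever, p_within_k_given_ever, p_within_30_given_within_k):
    """Chain the per-layer probabilities: P(<=30 | <=K) * P(<=K | Y) * P(Y)."""
    for p in (p_ever, p_within_k_given_ever, p_within_30_given_within_k):
        if not 0.0 <= p <= 1.0:
            raise ValueError("each layer must output a probability in [0, 1]")
    return p_within_30_given_within_k * p_within_k_given_ever * p_ever
```

For example, if the three layers output 0.5, 0.5, and 0.4, the 30-day risk of readmission is 0.5 × 0.5 × 0.4 = 0.1.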

Risk-O-Meter Distinguishing Features

Risk-O-Meter
• Users: healthcare providers and patients
• Result explanation and exploration
• Handles incomplete patient input

Current Systems
• Users: only healthcare providers
• Need deep domain knowledge

All in one Package – Risk-O-Meter (KDD 2013)

Pre Admission, Post Admission, Pre-Discharge, Discharge

Post-Discharge Care

Management Pipeline

"White Gap": PCP, HF Service, Care Management

Payer

ChroniRisk: Continuous Readmission Risk Assessment Across Continuum of Care*

78%*

42%*

Service Line EMR, PCP Tools, Psycho-social risk scoring

2013 HF Readmission Statistics
• 7.1 M Readmits
• 5.3 M Avoidable
• $13,000 each
• $13 B opportunity cost

Patient Encounters Scored

+18,000 (HF cohort)

Risk – Done
Cost – Done
Next?

Actionable Interventions

If we can predict, can we recommend?

Rui Liu, Kiyana Zolfaghar, S.-C. Chin, Senjuti Basu Roy, Ankur Teredesai, "A Framework to Recommend Interventions for 30-Day Heart Failure Readmission Risk," 2014 IEEE International Conference on Data Mining (ICDM), 2014, pp. 911-916, DOI: 10.1109/ICDM.2014.89

A real and common Chronic Readmission

75-year old, female

Chronic pulmonary disease, depression, hypertension and diastolic heart failure

High Risk

Medium Risk

Low Risk

Readmit!

Intervention Plan 1

Major Operating Room, Chest X-ray and others

Intervention Plan 2

Echocardiology, CCU and others

Intervention Plan 3

Emergency Room and others

Risk will be lower when the interventions are performed

The patient is not readmitted

Intervention Rule Generation

Readmission

Age Gender

Pneumonia (DX486)

Acute respiratory failure (DX51881)

CHF (DX4280)

Cont inv mec ven <96 hrs (PR9671)

Venous cath NEC (PR3893)

Packed cell transfusion (PR9904)

Rule Repository

Valid Rule 1

Female, Diabetes, Major Operating Room, Chest X-ray and others

Valid Rule 2

Male, Hypertension, Echocardiology, CCU and others

Invalid Rule 3

Female, Depression, Emergency Room and others

Invalid Rule 4

Male, COPD, Emergency Room and others

Bayesian Network Construction


Intervention Recommendation

Evaluation

Compute patient risk using only non-procedural attributes

Compute patient risk using procedural attributes

Compare the difference between the two probabilities

Store the rules where the risk is reduced after introducing the procedures
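The rule-generation loop described above might look roughly like the sketch below. It is only an illustration: `toy_risk_model` is a hypothetical stand-in for inference over the learned Bayesian network, its numbers are invented, and `generate_rules` is an assumed name.

```python
def toy_risk_model(attributes, procedures):
    """Hypothetical stand-in for Bayesian-network inference; returns P(readmission)."""
    risk = 0.5
    if "Chest X-ray" in procedures:
        risk -= 0.1
    if "Major Operating Room" in procedures:
        risk -= 0.1
    if "Emergency Room" in procedures:
        risk += 0.1
    return risk

def generate_rules(profiles, risk_model, min_reduction=0.0):
    """Keep a (attributes, procedures) rule only when the procedures lower the risk."""
    rules = []
    for attributes, procedures in profiles:
        base = risk_model(attributes, frozenset())   # non-procedural attributes only
        treated = risk_model(attributes, procedures)  # procedural attributes added
        reduction = base - treated
        if reduction > min_reduction:
            rules.append({
                "attributes": frozenset(attributes),
                "interventions": frozenset(procedures),
                "risk_reduction": reduction,
            })
    return rules

profiles = [
    ({"Female", "Diabetes"}, {"Major Operating Room", "Chest X-ray"}),
    ({"Female", "Depression"}, {"Emergency Room"}),
]
rules = generate_rules(profiles, toy_risk_model)
```

With this toy model only the first profile is stored as a valid rule, since the second one raises the estimated risk.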

Recommendation for New Patient

Intervention Plan 1

Major Operating Room, Chest X-ray and others

Intervention Plan 2

Echocardiology, CCU and others

Intervention Plan 3

Emergency Room and others

Top 3 intervention plans

Rule Repository

New Patient Attributes

Summarized Intervention Plan

Major Operating Room, Echocardiology, Chest X-ray and others

Summarize

The Rule Repository is HUGE! (over 30k rules)

Parallel Solution!




Evaluation

Compute similarity between established attribute profile and a given patient profile

Identify rules where the established attribute is most similar to the patient input

Recommend interventions extracted from the established rules
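A minimal sketch of this lookup follows. The slides do not name the similarity measure, so Jaccard similarity over attribute sets is an assumption here, and the repository entries are taken from the example rules shown earlier; `jaccard` and `recommend` are hypothetical names.

```python
def jaccard(a, b):
    """Jaccard similarity between two attribute sets (assumed similarity measure)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def recommend(patient_attributes, rule_repository, top_n=3):
    """Return the intervention plans of the top_n rules most similar to the patient."""
    ranked = sorted(
        rule_repository,
        key=lambda rule: jaccard(patient_attributes, rule["attributes"]),
        reverse=True,
    )
    return [rule["interventions"] for rule in ranked[:top_n]]

rule_repository = [
    {"attributes": {"Female", "Diabetes"},
     "interventions": ["Major Operating Room", "Chest X-ray"]},
    {"attributes": {"Male", "Hypertension"},
     "interventions": ["Echocardiology", "CCU"]},
]
plans = recommend({"Female", "Diabetes", "Depression"}, rule_repository, top_n=1)
```

Because each rule is scored independently, the ranking step is easy to spread across workers, which matters when the repository holds tens of thousands of rules.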

Validation – Data Highlights

• State Inpatient Database (SID) of Washington State heart failure cohort in year 2010 (67967 patients) for training and 2011 (52021 patients) for testing

• 3908 diagnosis and 2049 procedure codes are involved.

• Feature Selection is performed using chi-square test.

High Dimensional
• Demographics: Age, Gender, Race
• Comorbidity & Diagnosis: 21 comorbidities and 90 diagnoses
• Utilization & Interventions: 21 health service utilization flags and 70 interventions
• Others: Length of Stay, # of diagnoses and interventions





Evaluation

Extract patients from the test set who were not readmitted within 30 days

Compute the evaluation metrics between the recommended interventions and the actual interventions

Validation – Experiment Results

[Charts: Hits, Jaccard Index, Accuracy, and True Positive Rate, each compared across Linear Regression, Hill-Climbing, Grow-Shrink, and Hybrid.]




Evaluation

Back to the Chronic Readmission Case

75-year old, female

Chronic pulmonary disease, depression, hypertension and diastolic heart failure

No-readmit!

Cardiac catheterization lab, CT scan, echo-cardiology, echo-cardiogram

Cardiac catheterization lab, CT scan, echo-cardiology, echo-cardiogram

Accountable Care Organizations

Cost/Charge Prediction

James Marquardt, Stacey Newman, Deepa Hattarki, Rajagopalan Srinivasan, Shanu Sushmita, Prabhu Ram, Viren Prasad, David Hazel, Archana Ramesh, Martine De Cock, Ankur Teredesai, "HealthSCOPE: An Interactive Distributed Data Mining Framework for Scalable Prediction of Healthcare Costs," IEEE Data Mining Conference Demo Track, 2014 IEEE International Conference on Data Mining, 2014, pp. 1227-1230, DOI: 10.1109/ICDMW.2014.45

What are healthcare costs for assigned population in 2015?

Why is the cost so high or low?

How does the cost distribute across demographics?

QUESTIONS

DATA

SCIENCE

DATA

APPLICATIONS

Motivation: ACO Cost Prediction

[Figure: data inputs. Demographics, Diagnosis Codes, Procedure Codes, Drugs, Lab Results, Charges, and Vitals, grouped into Claims (sources: SID, OSHPD, MEPS) and Clinical (source: MultiCare Collaboration).]

Population Predictive Modeling

Feature Prioritization Health Prediction

Care Management

Individual Predictive Modeling

Chandola et. al, KDD 2013

Cost/Charge Prediction: Problem Description
• Goal → predict the future healthcare cost of individuals based on their past medical and cost information.
• Supervised machine learning problem.
• Input:
  • Previous health information (e.g. diagnosis, comorbidities, etc.)
  • General demographics (age, gender, race)
  • Previous healthcare cost
  • {X} = (x1, x2, x3, ..., xp)
• Output:
  • Y = Future healthcare cost

Four Scenarios for predicting cost

• Three Months of Historical data (Medical, Demographic and Cost) → Cost of Following Nine months (1Q)

• Six Months of Historical data (Medical, Demographic and Cost) → Cost of Following Six months (2Q)

• Nine Months of Historical data (Medical, Demographic and Cost) → Cost of Following Three months (3Q)

• Twelve Months of Historical data (Medical, Demographic and Cost) → Cost of Following Twelve months (4Q)

Non-Gaussian Distribution of Healthcare Costs

Makes it a challenging and interesting problem for research

Existing Cost Prediction Methods
• Limited to Rule based or Multiple Linear Regression methods
• Rule Based methods
  • Requires domain knowledge
  • Expensive
• Multiple Linear Regression
  • Multi-collinearity Issue
  • Sensitive to extreme values (outliers)
• Evaluation
  • Estimate the mean cost of the given sampling distribution.
  • Often in-sample data used to report predictive performance.
  • R² evaluation metric (not a true indicator)

Our Contributions

• Investigate the utility of state-of-the-art machine learning algorithms for the cost prediction problem.
• We empirically evaluate three algorithms:
  • Regression Trees
  • M5 Model Trees
  • Random Forest

Regression Tree

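[Figure: example regression tree. Internal nodes test Age > 60?, Has Asthma?, and Gender = Female? with Yes/No branches; the four leaves predict costs of 21,000, 46,000, 62,000, and 85,000.]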

M5 Model Tree

[Figure: example M5 model tree with splits on Age > 60?, Has Asthma?, and Gender = Female? and Yes/No branches.]

Random Forest

[Figure: example random forest of three trees. Splits include Had Procedure X?, Age > 18?, Gender = Male?, # Admits > 3?, Race = White?, Has CHF?, Age > 60?, Has Asthma?, and Gender = Female?; each tree's leaves predict costs of 21,000, 46,000, 62,000, and 85,000.]

Evaluation Metrics

• Mean Absolute Error (MAE)

• Root Mean Squared Error (RMSE)
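For reference, both metrics can be computed as in the sketch below, using their standard definitions. The function names are illustrative, and the dollar amounts in the usage note are made up.

```python
import math

def mae(actual, predicted):
    """Mean Absolute Error: average of |actual - predicted|."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root Mean Squared Error: square root of the average squared error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))
```

For actual costs of 100, 200, and 300 and predictions of 110, 190, and 330, the absolute errors are 10, 10, and 30, giving an MAE of about 16.67 and an RMSE of about 19.15. RMSE penalizes the single large miss more heavily, which is why the two are usually reported together for skewed cost data.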


MAE Results – SID Data (3Q Scenario)

[Chart: MAE ($), axis 0 to 30,000. Baselines: Average Baseline, Previous Cost Regression, Multiple Linear Regression. Advanced Models: Regression Tree, Random Forest, Model Tree.]

MAE Results – MEPS Data

[Chart: MAE ($), axis 0 to 14,000. Baselines: Average Baseline, Previous Cost Regression, Multiple Linear Regression. Advanced Models: Regression Tree, Random Forest, Model Tree.]

Prediction Error Results – M5 Model Trees

[Chart: MAE and RMSE error ($), axis 0 to 90,000, for the 1Q, 2Q, 3Q, and 4Q scenarios.]

Error Distribution: WA State SID Data

For a large fraction of the population (75%), we were able to predict with higher accuracy using these algorithms.

[Chart: maximum prediction error ($), axis 0 to 18,000, against portion of population (0% to 75%) for Multiple Linear Regression, Regression Tree, Random Forest, and Model Tree.]

Sub-Population Cost Prediction

[Figure: the population is split into sub-populations, each with its own prediction of future healthcare cost.]

Congestive heart failure (CHF)

Diabetes

COPD

Asthma

Coronary artery disease (CAD)

Age 65+

Most difficult cohort to predict

[Chart: MAE ($), axis 0 to 35,000, for model trees vs linear regression across the Asthma, Diabetes, CHF, COPD, Coronary, and Over 65 cohorts.]

Engineering the Solutions: Risk-O-Readmission & Cost-as-a-Service

---------- Forwarded message ----------
From: Windows Azure Pass System Admin <[email protected]>
Date: Thu, Nov 7, 2013 at 10:50 AM
Subject: Gifting Letter for Windows Azure Research Pass
To: "Ankur M. Teredesai" <[email protected]>
Cc: "Azure4Research (RFP External)" <[email protected]>

Dear Ankur M. Teredesai,

We have approved your application for a Windows Azure Research Pass Grant. In order to receive your pass, download the Microsoft gifting letter from the following link:

Risk-of-Readmission as a Service

Scale Issues: Cost Prediction as a Service

[Figure: service architecture with numbered steps ① to ④. A Web App for ACOs and a Web App for Individual call the Cost Prediction API. Inputs are beneficiary claims (population batch or individual; step ① for an individual), which become an individual beneficiary feature vector. A Model Selector in the Cost Prediction Engine chooses among (A) Linear Regression, (B) Regression Trees, and (C) M5 Model Trees, held in a Model Bank deployed on ADAPA. Outputs are the individual beneficiary's predicted cost (step ④ for an individual) and, for ACOs, predicted, previous-year, and historic population costs plus population statistics. Models are trained on the Big Data Stack (R, Spark) from the data sources: WA-SID Claims / MEPS Survey.]


Apache Spark

[Figure: Apache Spark on HDFS. The master node runs the driver and holds the RDD; each slave node runs Spark over an in-memory data partition (Partition 1, Partition 2), backed by HDFS data partitions and their replicas.]

Weighted k-NN for Regression

[Figure: each node (Node 1, Node 2) computes the k nearest neighbors of the test instance within its own data partition (kNN1, kNN2). The resulting 2k candidates are grouped and sorted to keep the top k, and the predicted cost is their weighted average.]
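The partition-then-merge scheme can be sketched in plain Python as below. This is an illustrative single-process simulation rather than the Spark implementation: the partitions are ordinary lists, Euclidean distance and inverse-distance weights are assumptions, and `local_knn` and `predict_cost` are hypothetical names.

```python
import heapq
import math

def local_knn(partition, query, k):
    """On one node: the k nearest (distance, cost) pairs within this partition."""
    return heapq.nsmallest(k, ((math.dist(x, query), cost) for x, cost in partition))

def predict_cost(partitions, query, k, eps=1e-9):
    """Merge per-partition candidates, keep the global top k, and average by weight."""
    candidates = [pair for part in partitions for pair in local_knn(part, query, k)]
    top_k = heapq.nsmallest(k, candidates)          # group and sort, keep top k
    weights = [1.0 / (d + eps) for d, _ in top_k]   # closer neighbours weigh more
    return sum(w * cost for w, (_, cost) in zip(weights, top_k)) / sum(weights)

partition_1 = [((0.0, 0.0), 100.0), ((10.0, 10.0), 1000.0)]
partition_2 = [((0.0, 2.0), 300.0), ((20.0, 20.0), 5000.0)]
predicted = predict_cost([partition_1, partition_2], (0.0, 1.0), k=2)
```

Taking k candidates from every partition guarantees the global top k is among them, so the merged answer matches what a single-node k-NN over all the data would return.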

Rough Set

• Rough set theory is an ML framework that is especially suitable for information systems with inconsistencies.
• Rough set theory handles discrete attributes.
• Lower approximation: instances that necessarily belong to the class
• Upper approximation: instances that possibly belong to the class

Patient  Age ≥ 50  Alcohol Disorder  Visit Cost
P1       Yes       Yes               High
P2       Yes       Yes               High
P3       Yes       No                Low
P4       Yes       No                High
P5       No        No                Low
P6       No        Yes               High

Similar Patients but belong to different classes!
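The two approximations can be computed for the table above with a short sketch like the following. It groups patients that are indistinguishable on the two condition attributes and then checks each group against the class "High"; the function and variable names are illustrative.

```python
from collections import defaultdict

# (Age >= 50, Alcohol Disorder) -> Visit Cost, copied from the table above
table = {
    "P1": (("Yes", "Yes"), "High"),
    "P2": (("Yes", "Yes"), "High"),
    "P3": (("Yes", "No"), "Low"),
    "P4": (("Yes", "No"), "High"),
    "P5": (("No", "No"), "Low"),
    "P6": (("No", "Yes"), "High"),
}

def approximations(table, target):
    """Return (lower, upper) approximations of the class `target`."""
    blocks = defaultdict(set)  # patients with identical condition attributes
    for name, (attributes, _) in table.items():
        blocks[attributes].add(name)
    members = {name for name, (_, label) in table.items() if label == target}
    lower, upper = set(), set()
    for block in blocks.values():
        if block <= members:   # every look-alike is in the class
            lower |= block
        if block & members:    # at least one look-alike is in the class
            upper |= block
    return lower, upper

lower, upper = approximations(table, "High")
```

Here the lower approximation of "High" is {P1, P2, P6} and the upper approximation is {P1, P2, P3, P4, P6}; P3 and P4 fall in the gap because they look identical but carry different labels.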

Fuzzy Rough Set

• Uses fuzzy logic to handle continuous attributes.
• Similarity matrix contains values between 0 and 1.
• Inconsistent instances are highly related but have a different class.

Patient  Age  Alcohol Disorder Visits  Cost
P1       52   1                        $13335
P2       59   4                        $277966
P3       55   0                        $8139
P4       50   0                        $66058
P5       34   0                        $5815
P6       26   1                        $38526

     P1    P2    P3    P4    P5    P6
P1   1     0.52  0.83  0.84  0.60  0.61
P2   0.52  1.00  0.44  0.36  0.12  0.13
P3   0.83  0.44  1     0.92  0.68  0.44
P4   0.84  0.36  0.92  1     0.76  0.51
P5   0.60  0.12  0.68  0.76  1     0.75
P6   0.61  0.13  0.44  0.51  0.75  1

Fuzzy Rough Set

• Let r_{j,i} be the degree of similarity of instances i and j.

• Let c_i be the degree to which instance i belongs to the class.

• Then the degree to which instance j belongs to the:

• Lower approximation of the class is: min{max(1 - r_{j,i}, c_i) | i = 1,...,n}

• Upper approximation of the class is: max{min(r_{j,i}, c_i) | i = 1,...,n}

• Current implementations can handle only up to 100,000 instances because they keep the similarity matrix in memory.
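As a worked example of the two formulas, the sketch below applies them to the six-patient similarity matrix shown earlier. The class memberships `c` are an assumption: they reuse the crisp "High" labels from the rough set table (P1, P2, P4, P6). Function names are illustrative.

```python
# Similarity matrix r[j][i] for P1..P6, copied from the slide
r = [
    [1.00, 0.52, 0.83, 0.84, 0.60, 0.61],
    [0.52, 1.00, 0.44, 0.36, 0.12, 0.13],
    [0.83, 0.44, 1.00, 0.92, 0.68, 0.44],
    [0.84, 0.36, 0.92, 1.00, 0.76, 0.51],
    [0.60, 0.12, 0.68, 0.76, 1.00, 0.75],
    [0.61, 0.13, 0.44, 0.51, 0.75, 1.00],
]
# Assumed membership in the class "High" (1 = member), for P1..P6
c = [1, 1, 0, 1, 0, 1]

def lower_approximation(r, c, j):
    """min over i of max(1 - r[j][i], c[i])"""
    return min(max(1 - r[j][i], c[i]) for i in range(len(c)))

def upper_approximation(r, c, j):
    """max over i of min(r[j][i], c[i])"""
    return max(min(r[j][i], c[i]) for i in range(len(c)))

lower = [lower_approximation(r, c, j) for j in range(len(c))]
upper = [upper_approximation(r, c, j) for j in range(len(c))]
```

Under these assumed labels, P1 belongs to the lower approximation of "High" only to degree 0.17 because the low-cost patient P3 is 0.83 similar to it, while P3 reaches 0.92 in the upper approximation because of its similarity to the high-cost patient P4.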

Fuzzy Rough Set

Upper approximation: max{min(r_{j,i}, c_i) | i = 1,...,n}

Fuzzy Rough Set

Lower approximation: min{max(1 - r_{j,i}, c_i) | i = 1,...,n}

Implementation

• The construction of the similarity matrix can be done in a parallel manner, making each of K compute nodes calculate n/K columns of the similarity matrix.

• No need to store the similarity matrix as a whole.

• The construction of the similarity matrix does not have to be finished before (partial) computation of the lower and upper approximations can begin.
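The column-wise scheme above can be simulated in a single process as in the sketch below. It is not the Spark or MPI code: the workers are plain loop iterations, the one-attribute similarity function is an assumption, and all names are hypothetical. Each worker folds its columns straight into running min and max values, so no similarity entry is ever stored.

```python
def similarity(a, b, span):
    """Assumed similarity on one numeric attribute: 1 minus the scaled distance."""
    return max(0.0, 1.0 - abs(a - b) / span)

def partial_bounds(values, membership, columns, span):
    """One worker: fold only its assigned columns i into every instance j."""
    lower = [1.0] * len(values)  # neutral element for min
    upper = [0.0] * len(values)  # neutral element for max
    for i in columns:
        for j, value_j in enumerate(values):
            r = similarity(value_j, values[i], span)  # computed, used, discarded
            lower[j] = min(lower[j], max(1.0 - r, membership[i]))
            upper[j] = max(upper[j], min(r, membership[i]))
    return lower, upper

def approximations(values, membership, workers):
    """Split the n columns over `workers`, then reduce the partial results."""
    n = len(values)
    span = (max(values) - min(values)) or 1.0
    chunks = [range(w, n, workers) for w in range(workers)]
    partials = [partial_bounds(values, membership, chunk, span) for chunk in chunks]
    lower = [min(p[0][j] for p in partials) for j in range(n)]
    upper = [max(p[1][j] for p in partials) for j in range(n)]
    return lower, upper

ages = [52, 59, 55, 50, 34, 26]    # Age column from the fuzzy rough set table
is_high = [1, 1, 0, 1, 0, 1]       # assumed class memberships
lower, upper = approximations(ages, is_high, workers=2)
```

Because min and max are associative, reducing the per-worker partial results gives exactly the same answer as a single pass over all columns, whatever the number of workers.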

Node 1 Node 2

Implementation - Lower Approximation

Upper Approximation

Spark vs MPI Fuzzy Rough Set


Readmission Application

• Android
• Windows Phone
• Patient View
  • What is my risk?
• Doctor View
  • Who are my risky patients?
  • Alerts
  • Interventions

http://healthscope.cloudapp.net/hscope-dev/aco/

Healthcare Scalable COst Prediction Engine (HealthSCOPE)

Milestones: Readmission Risk

AUC – Accuracy measure (Area Under Curve)

Problem Exploration
• 0.6 AUC: Yale Model (Baseline)
• 0.64 AUC: UW 2012 Result. Ensemble method, hierarchical classification (Dec 2012)
• Risk-o-Meter Development + Big data Efforts (June 2013)
• QlikView Readmission App (Dec 2013)
• 0.74 AUC: UW 2014 Result. Lab results + New Algorithm (Adaboost) (Feb 2014)
• Integrating care pathway, Bayesian Network Learning (March 2014)
• Real Time Care Factors & Pathways with EPIC (July 2014)
• Machine Learning Process to Target New Chronic Diseases (Aug 2014 -> Moving Forward)

Data stages: Pre-Admission (Clinical data), Post-Admission (Clinical data), Post-Discharge (Clinical data), Post-Discharge (Claim data)

Publications: IEEE Big Data (REF #3), KDD (REF #1 & 2), HEALTHINF (REF #4 & 5), KDD (REF #6), ICDM 2014 (REF #6)

Milestones: Cost Prediction

• H-SCOPE I (June 2014): SID Data
• H-SCOPE II (August 2014): Population View (ACO), OSHPD Data Application
• H-SCOPE III (Sept. 2014): Adapa Scoring Engine, Spark Framework
• H-SCOPE IV (Nov. 2014): SID + MEPS data
• H-SCOPE V (Dec. 2014): Five Cohort
• HealthSCOPE VI (July 2015)
• H-SCOPE VII: AHRQ Private data
• Aug 2015 -> Moving Forward

Other labels: Admit Level; Beneficiary Level; Beneficiary View; Four Future Scenario; Sub-Population; M5 Model Trees; Random Forest; Regression Trees; Deep Learning; Time & Cost of Hospital Readmission; Time, Cost and Illness (Alignment) Prediction

Venues: ICDM 2014, KDD-2015, AMIA-2015, WWW-DigitalHealth-2015

Milestones: Merging Threads

2012, 2013, 2014, 2015, 2016 and beyond

Risk of Readmission (Clinical, Sociological & Claims)

2014 2015

Cost Prediction (Claims and secondary data sources)

2015

Risk & Cost Convergence




Our Sincere Thanks for Your Support!
