Societal Impact of Applied Data Science on the Big Data Stack

Product Decisions through Big Data Center for Data Science Ankur Teredesai University of Washington Tacoma 1 March 14th, 2015

Upload: uw-center-for-data-science

Post on 14-Jul-2015


TRANSCRIPT

Page 1: Societal Impact of Applied Data Science on the Big Data Stack

Product Decisions through Big Data

Center for Data Science
Ankur Teredesai

University of Washington Tacoma

March 14th, 2015

Page 2: Societal Impact of Applied Data Science on the Big Data Stack

Center for Data Science: Societal Impact

Health Informatics
• Bioinformatics
• Health and Wellness
• Predictive Analytics

Geo-Spatial Data Management
• Distributed Systems
• Databases
• Geo-Spatial
• Embedded Systems

Intelligent Systems
• Machine Learning
• Data Mining
• Computation Intelligence
• Computer Vision

Social Computing
• Web
• Devices
• Mobile Networks
• UX / UI

Big Data Security
• Cryptology
• Secure Machine Learning

Big Data Infrastructure
• Engineering
• Dev-Ops

Page 3: Societal Impact of Applied Data Science on the Big Data Stack

A Big Data Project Blueprint:

• Machine Learning
• Analytics
• Engineering
• Features
• Algorithm
• Scalability
• ELT
• Integrate Sources
• Constraints
• Deploy Models
• APIs
• Apps
• Data Struggles

Page 4: Societal Impact of Applied Data Science on the Big Data Stack

Data Mining: 1989 - 2010

• Data Science and Applications move and transform sizeable amounts of data out of the native database or file systems.

Application Ecosystem Integration (diagram):
• Applications
• SQL/ODBC/JDBC Data Access
• Distributed Database (Multi-Core, Columnar, Key-Value), shown four times
• Data Science using R, SAS, SPSS, Weka, Mahout
• Flow labels: HIGH VOLUME, HIGH LATENCY, HIGH VOLUME

Page 5: Societal Impact of Applied Data Science on the Big Data Stack

Big Data Science

• Data Science uses native data representation and inherent distribution and parallelism
• Minimal data movement
• Rapid application development using data science constructs

Application Ecosystem Integration (diagram): Big Data Science Components
• Applications
• SQL/ODBC/JDBC Data Access
• Data Science: internal algorithms for clustering, classification, regression
• Distributed Database (Multi-Core, Columnar, Key-Value)
• Flow labels: LOWER VOLUME, LOWER LATENCY, HIGH VOLUME, LOW LATENCY

Page 6: Societal Impact of Applied Data Science on the Big Data Stack

A Short History of (Big) Data Technology

• 1970: Codd publishes "A Relational Model of Data for Large Shared Data Banks"
• 1985: Copeland, Decomposition Storage Model (essentially the first columnar store)
• 1989: Shared-Nothing Architecture
• 2004: Google, MapReduce
• 2005: C-Store (eventually Vertica), layers WS/RS
• 2007: Materialization optimizations in columnar stores and Hadoop implementation
• 2005-07: Star-Schema Benchmark + Hadoop
• 2008: Attempts to backport columnar advances to row storage, not very effective
• Today: BIG DATA

Page 7: Societal Impact of Applied Data Science on the Big Data Stack

Technology Decisions

• Columnar vs. Relational Storage Technologies
• Infinite scale using commodity hardware
• Private or Public Cloud
• Massively Distributed and Parallel Architecture: Hadoop
• Stream Query Processing for trillions of events and petabytes of data
• Real-time classification and clustering: approximate scoring and segmentation + reporting and data visualization

Page 8: Societal Impact of Applied Data Science on the Big Data Stack

(Diagram: Center for Data Science project stack)

• Raw Data from Sources (SID, OSHPD, HCUP, Edifecs, MHS, CMS, LINCS, Industry)
• Flat Files (CSV), Claims (X12), Clinical (HL7), Census (US Gov), Unstructured (CCD)
• ETL Tools, Sqoop
• HDFS, NUMA, MPI, Grappa
• Distance Compute Library, Instance Selection (RNGE, Drop 3), Fuzzy Rough Set Approximation
• Random Forests, KNN, Bayesian Networks, Support Vector Machines
• CHF Risk of Readmission, Geo Routing, Cost of Chronic Interventions, Age/Gender Prediction, Malware Analytics, Personalized Cancer Therapy, Other Solutions
• Industry Partners and Domain Experts

Page 9: Societal Impact of Applied Data Science on the Big Data Stack

iTornado

Routing  Service  With  Real  World  Severe  Weather

Demo paper in ACM SIGSPATIAL 2014 (Best Demo Paper Award)

Fatalities stats by weather-related hazards: http://www.nws.noaa.gov, June 2014.

Page 10: Societal Impact of Applied Data Science on the Big Data Stack

COMA

Road  Network  Compression  For  Map  Matching

ACM  SigSpatial IWGS  2014

Page 11: Societal Impact of Applied Data Science on the Big Data Stack

PreGo

Dynamic  Multi-­‐Preference  Routing

                    Single Attribute    Multiple Attribute
Time-Homogeneous    Dijkstra, A*        Stewart et al. 91
Time-Variant        Betsy et al. 07     ?

(Figure: example road network with nodes s, a, b, c, d, e, f, g, h. Labels such as <0,0>, <1,1>, <2,2>, <3,4>, <4,4> and <5,7> appear in the graph, and edges carry vectors T and R, e.g. T=[1,2,3,4,5], R=[1,2,3,4,5] and T=[5,1,3,4,5], R=[7,1,2,4,5].)

Page 12: Societal Impact of Applied Data Science on the Big Data Stack

Special Needs Education: Teacher Trainer Effectiveness Analysis

Purpose: To support streamlined data collection and performance evaluation across the State Needs Projects.

Project Stakeholders
• Office of the Superintendent of Public Instruction
• Center for Data Science

• Customized Surveys
• Training Registration
• Survey Management
• Data Dashboard
• Report Generation
• Geographic Distribution Maps
• Demographic Reports

Brad Porter, Aniruddha Desai, Yitao Li, David Hazel, Michelle Maike, Greg Benner, Ankur Teredesai, Leslie Pyper, Vickie Green

Page 13: Societal Impact of Applied Data Science on the Big Data Stack

Systems  Biology

13

Predictive  Models  and  software

Applications:  Personalized  medicine,  drug  discovery

Focus:  Develop  machine  learning  methods  and  tools  to  effectively  integrate  multiple  big  data  sources  in  biology.

Page 14: Societal Impact of Applied Data Science on the Big Data Stack

A  Flying  Hadoop Cluster

14

Page 15: Societal Impact of Applied Data Science on the Big Data Stack

Detecting  Malware  Activity  based  on  Automatically  Generated  Domains

Command  &  Control  

xyz.com xyz.com

Infected  node

Partnering  with  NIARA  we  obtained  a  large  dataset  of  Automatically  Generated  Domains.  

Based    on  the  intercepted  domain  features  we  are  able  to  identify  the  malware  infecting  a  network.  

Page 16: Societal Impact of Applied Data Science on the Big Data Stack

Reduce CHF Readmission

(March 2012)
• Will this Heart Failure patient get readmitted within 30 days?
• Yes or No (Binary Classification)

Readmission? Machine Learning?

Joint NSF / NIH Solicitation on Health Care and Big Data

Page 17: Societal Impact of Applied Data Science on the Big Data Stack

Readmissions: Congestive Heart Failure (CHF)

Affordable Care Act => Avoidable Costs
Readmissions are AVOIDABLE

(Charts: 20% and 32%, labeled 30 days and 60 days; 75% and 25%, labeled Non CHF and CHF.)

• Readmissions national cost $17 billion annually
• 76% considered avoidable

Source: www.presidency.ucsb.edu, cdc.gov, tmz.com

Page 18: Societal Impact of Applied Data Science on the Big Data Stack

CHF ROR: 30-Day Hospital Readmission Risk Prediction

• Building the model: patient feature vectors with class labels (No readmission / Readmission) are passed to machine learning algorithms.
• Scoring the tuple: the feature vector of a new patient is scored as No readmission or Readmission.

Page 19: Societal Impact of Applied Data Science on the Big Data Stack

19

Some of the Steps

• Data Understanding and Integration
• Data Cleaning
• Data Transformation

Extracting data from Epic: 16 data marts and 200 views: Heart Failure Inpatient Summary, Encounter.Flowsheet, PatientEncounterHospital

vs  

Page 20: Societal Impact of Applied Data Science on the Big Data Stack

Public Data: State Inpatient Dataset 2009-2012

AGE  ZIP    RACE  ATYPE  NCHRONIC  LOS  FEMALE  DXCCS1  PRCCS1  TOTCHG
52   98122  1     3      12        3    0       153     212     56,511
87   98109  1     3      7         1    1       162     -       12,687
26   98028  4     3      1         30   1       139     195     127,300

• Washington State Inpatient Data
• Admission-level claims
• ~400 attributes
  • Demographics
  • ICD9 diagnosis codes
  • ICD9 procedure codes
  • Charges
• Admissions by year
  • 2009: 652702
  • 2010: 651783
  • 2011: 648079
  • 2012: 648092

Page 21: Societal Impact of Applied Data Science on the Big Data Stack

Variety and Volume (2/3 V's of Big Data)

Pre Admission
- Demographics: Marital status, Age, Racial group, Gender
- Vital Sign: Pulse rate, Blood pressure, Respiration rate, BMI
- Prior Hospitalization: Number of prior admissions, Prior length of stay

Post Admission
+ Demographics
+ Vital Sign
+ Prior Hospitalization
- Lab Test: Sodium level, Glucose level, Hemoglobin level, Creatinine level, Hematocrit level, Neutrophils level, Ejection Fraction, BUN level

Pre-Discharge
+ Demographics
+ Vital Sign
+ Prior Hospitalization
+ Lab Test
- Diagnosis Information: Number of secondary diagnosis, Chronic systolic heart failure, Acute kidney failure, Chest pain, Hyperpotassemia, Bronchopneumonia, Other chronic pulmonary heart diseases, Syncope and collapse, …

Discharge
+ Prior Hospitalization
+ Demographics
- Comorbidities: Acute coronary syndrome, Asthma, COPD, Ulcer, Dialysis, Dementia, Arrhythmias, Malnutrition, Vascular, Depression
- Discharge/Admit codes: Admit/Discharge type, Severity of illness, Risk of mortality
- Utilization Information: Operating room, CT scan, Emergency Room, CCU

Page 22: Societal Impact of Applied Data Science on the Big Data Stack

(Dec 2012) Initial Models

• Data integration
• Feature Construction
• Predictive modeling
  • Logistic Regression
  • Naïve Bayes
  • Support Vector Machines

(Chart: Area Under the Curve (AUC). Yale Model (Comparative Baseline): 0.60; Amarasingham et al.: 0.72; Our current Result: 0.64.)

Several Rejects: KDD Industry Track 2013, AMIA 2013, JAMIA 2013

Page 23: Societal Impact of Applied Data Science on the Big Data Stack

(July 2013) (much better) & Some Papers

§ Improved data exploration

§ S.-C. Chin, K. Zolfaghar, S. Basu Roy, A. Teredesai, and P. Amoroso, "Divide-n-Discover -- Discretization based Data Exploration Framework for Healthcare Analytics," 7th International Conference on Health Informatics (HEALTHINF Short Paper), Angers, France, 2014

§ N. Meadem, N. Verbiest, K. Zolfaghar, J. Agarwal, S.-C. Chin, S. Basu Roy, A. Teredesai, D. Hazel, P. Amoroso, and L. Reed, "Predicting Risk of Readmission for Congestive Heart Failure Patients," Workshop on Data Mining for Healthcare (DMH), Chicago, IL, 2013

§ Improved Modeling Effort

(Chart: Area Under the Curve (AUC). Yale Model (Comparative Baseline): 0.60; Amarasingham et al.: 0.72; Our 2012 Result: 0.64; Our current Result: 0.74.)

Page 24: Societal Impact of Applied Data Science on the Big Data Stack

(Dec 2013) Prototype or a possible Product? & yes, More Papers

§ Successful Deployment

§ K. Zolfaghar, J. Agarwal, D. Sistla, S.-C. Chin, S. Basu Roy, and N. Verbiest, "Risk-O-Meter: An Intelligent Clinical Risk Calculator," 19th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), Chicago, IL, 2013

§ Kiyana Zolfaghar, Naren Meadem, Ankur Teredesai, Senjuti Basu Roy, Si-Chi Chin, Brian Muckian: Big data solutions for predicting risk-of-readmission for congestive heart failure patients. BigData Conference 2013: 64-71

Page 25: Societal Impact of Applied Data Science on the Big Data Stack

Multi Layer Classifier: Automatically Detecting Classification Windows

• Will patient ever readmit? (YES / NO)
• Will patient readmit within 30 days? (YES / NO)

Classifiers shown: KNN; LR, NB, SVM, KNN

• 32% of all data
• Only 5% of patients that return within 30 days are filtered out

Page 26: Societal Impact of Applied Data Science on the Big Data Stack

Generalizing the 30, 60, 90 Day readmission

§ Automatic design of time prediction hierarchy
§ Feature selection and factor analysis at each layer
§ Different classification algorithms in each layer, satisfying different quality metrics

Page 27: Societal Impact of Applied Data Science on the Big Data Stack

Automatic  design  of  prediction  hierarchy

27

Page 28: Societal Impact of Applied Data Science on the Big Data Stack

Simple 3 Layer Example

• Stage 1: Design a predictive model for the patients who are likely to come back within a time window of (X, K), where X is the maximum number of days until next readmission
• Stage 2: Design a predictive model for time window of (K, 30)
• Stage 3: Design a predictive model for time window of <30 days of readmission

HOW TO AUTOMATICALLY DETECT THE MIDDLE CUTPOINT K?

Page 29: Societal Impact of Applied Data Science on the Big Data Stack

Hill Climbing Algorithm to Detect K

§ Generate a random number K between X and 30
§ Compute C1 = Centroid(X, K), C2 = Centroid(K+1, 30)
§ Compute KLCurrent = KLDiv(C1, C2)
§ K' = K + i, K'' = K - i
§ Find a point K2 between (K', K''), and check
§ If KLDiv(Centroid(X, K2), Centroid(K2, 30)) > KLCurrent
§ If the above condition is satisfied, then K = K2
§ KLCurrent = KLDiv(Centroid(X, K2), Centroid(K2, 30))
§ Repeat the above steps until no further check is possible
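The hill-climbing steps can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation. It assumes each patient is a pair (days until readmission, non-negative feature vector), that a centroid is the mean feature vector of a day window, and that the two windows are [lo, K] and [K+1, hi] with lo = 30 and hi = X. The names `centroid`, `kl_div`, `score` and `detect_k` are hypothetical, and only the two neighbours K - i and K + i are tried at each step.

```python
import math
import random


def centroid(rows, lo, hi):
    """Mean feature vector of patients readmitted between day lo and day hi (inclusive)."""
    selected = [features for days, features in rows if lo <= days <= hi]
    if not selected:
        return None
    dim = len(selected[0])
    return [sum(f[d] for f in selected) / len(selected) for d in range(dim)]


def kl_div(p, q, eps=1e-9):
    """KL divergence between two centroids, each normalised to sum to 1 (assumed convention)."""
    p = [v + eps for v in p]
    q = [v + eps for v in q]
    sp, sq = sum(p), sum(q)
    return sum((a / sp) * math.log((a / sp) / (b / sq)) for a, b in zip(p, q))


def score(rows, lo, k, hi):
    """Divergence between the centroids of the windows [lo, k] and [k + 1, hi]."""
    c1 = centroid(rows, lo, k)
    c2 = centroid(rows, k + 1, hi)
    if c1 is None or c2 is None:
        return float("-inf")
    return kl_div(c1, c2)


def detect_k(rows, lo, hi, step=1, seed=0):
    """Hill-climb on the cut point K, starting from a random K strictly between lo and hi."""
    if hi - lo < 2:
        raise ValueError("need at least one candidate cut point between lo and hi")
    rng = random.Random(seed)
    k = rng.randint(lo + 1, hi - 1)
    current = score(rows, lo, k, hi)
    improved = True
    while improved:
        improved = False
        for k2 in (k - step, k + step):
            if lo < k2 < hi:
                candidate = score(rows, lo, k2, hi)
                if candidate > current:
                    k, current, improved = k2, candidate, True
    return k, current
```

Like any hill climber, this sketch can stop at a local maximum, so several random restarts (different `seed` values) may be worth trying.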

29

Page 30: Societal Impact of Applied Data Science on the Big Data Stack

Calculating the Probability of 30 day RoR

P(readmit ≤ 30) = P(≤ 30 | ≤ K) × P(≤ K | Y) × P(Y)
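As a small worked example of this chain of conditional probabilities, the sketch below multiplies the outputs of the three layers. It assumes Y denotes the event "the patient is ever readmitted" (the first question of the multi-layer classifier); the function name and the numbers are illustrative only.

```python
def prob_readmit_30(p_ever, p_k_given_ever, p_30_given_k):
    """P(readmit <= 30) = P(<= 30 | <= K) * P(<= K | Y) * P(Y)."""
    for p in (p_ever, p_k_given_ever, p_30_given_k):
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must lie in [0, 1]")
    return p_30_given_k * p_k_given_ever * p_ever


# Illustrative numbers: 50% ever readmit, 40% of those within K days,
# and 50% of those within 30 days.
example = prob_readmit_30(0.5, 0.4, 0.5)
```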

Page 31: Societal Impact of Applied Data Science on the Big Data Stack

Risk-O-Meter: Distinguishing Features

Risk-O-Meter
• Users: healthcare providers and patients
• Result explanation and exploration
• Handles incomplete patient input

Current Systems
• Users: only healthcare providers
• Need deep domain knowledge

Page 32: Societal Impact of Applied Data Science on the Big Data Stack

All  in  one  Package  – Risk-­‐O-­‐Meter  (KDD  2013)

32

Page 33: Societal Impact of Applied Data Science on the Big Data Stack

ChroniRisk: Continuous Readmission Risk Assessment Across Continuum of Care*

Pre Admission, Post Admission, Pre-Discharge, Discharge

Post-Discharge Care Management Pipeline: "White Gap", PCP, HF Service, Care Management, Payer

78%*

42%*

Service Line EMR, PCP Tools, Psycho-social risk scoring

2013 HF Readmission Statistics
• 7.1 M Readmits
• 5.3 M Avoidable
• $13,000 each
• $13 B opportunity cost

Patient Encounters Scored: +18,000 (HF cohort)

Page 34: Societal Impact of Applied Data Science on the Big Data Stack

Risk: Done
Cost: Done
Next?

Actionable Interventions
If we can predict, can we recommend?

A Framework to Recommend Interventions for 30-Day Heart Failure Readmission Risk, Rui Liu, Kiyana Zolfaghar, SC Chin, Senjuti Basu Roy, Ankur Teredesai, Data Mining (ICDM), 2014 IEEE International Conference on, DOI: 10.1109/ICDM.2014.89, Publication Year: 2014, Page(s): 911-916

Page 35: Societal Impact of Applied Data Science on the Big Data Stack

A real and common Chronic Readmission

Patient: 75-year old, female. Chronic pulmonary disease, depression, hypertension and diastolic heart failure.

High Risk / Medium Risk / Low Risk

Readmit!

• Intervention Plan 1: Major Operating Room, Chest X-ray and others
• Intervention Plan 2: Echocardiology, CCU and others
• Intervention Plan 3: Emergency Room and others

Page 36: Societal Impact of Applied Data Science on the Big Data Stack

Intervention Rule Generation

Pipeline: Bayesian Network Construction, Intervention Rule Generation, Intervention Recommendation, Evaluation

• Risk will be lower when the interventions are performed
• The patient is not readmitted

Bayesian network nodes (figure): Readmission, Age, Gender, Pneumonia (DX486), Acute respiratory failure (DX51881), CHF (DX4280), Cont inv mec ven <96 hrs (PR9671), Venous cath NEC (PR3893), Packed cell transfusion (PR9904)

Rule Repository
• Valid Rule 1: Female, Diabetes, Major Operating Room, Chest X-ray and others
• Valid Rule 2: Male, Hypertension, Echocardiology, CCU and others
• Invalid Rule 3: Female, Depression, Emergency Room and others
• Invalid Rule 4: Male, COPD, Emergency Room and others

Steps
• Compute patient risk using only non-procedural attributes
• Compute patient risk using procedural attributes
• Compare the difference between the two probabilities
• Store the rules where the risk is reduced after introducing the procedures
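The rule-generation steps can be sketched as below. This is an illustration under stated assumptions, not the published method: `risk` is a hypothetical callable standing in for inference on the learned Bayesian network, and `toy_risk` uses made-up numbers purely so the example runs.

```python
def generate_rules(patients, risk, min_drop=0.0):
    """Keep (profile, procedures) pairs whose estimated risk drops once procedures are added.

    patients: iterable of (profile, procedures) pairs of frozensets.
    risk: hypothetical callable (profile, procedures) -> readmission probability.
    """
    rules = []
    for profile, procedures in patients:
        risk_without = risk(profile, frozenset())   # non-procedural attributes only
        risk_with = risk(profile, procedures)       # procedural attributes included
        drop = risk_without - risk_with
        if drop > min_drop:
            rules.append({"profile": profile, "procedures": procedures, "risk_drop": drop})
    return rules


def toy_risk(profile, procedures):
    """Made-up stand-in for Bayesian network inference; the numbers are illustrative only."""
    value = 0.2 + 0.1 * len(profile)
    if "Chest X-ray" in procedures:
        value -= 0.15
    if "Emergency Room" in procedures:
        value += 0.05
    return min(max(value, 0.0), 1.0)


patients = [
    (frozenset({"Female", "Diabetes"}), frozenset({"Major Operating Room", "Chest X-ray"})),
    (frozenset({"Female", "Depression"}), frozenset({"Emergency Room"})),
]
rules = generate_rules(patients, toy_risk)
```

With these toy numbers only the first pair is stored, because only there does the estimated risk fall once the procedures are included.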

Page 37: Societal Impact of Applied Data Science on the Big Data Stack

Recommendation for New Patient

Inputs: New Patient Attributes, Rule Repository

Top 3 intervention plans
• Intervention Plan 1: Major Operating Room, Chest X-ray and others
• Intervention Plan 2: Echocardiology, CCU and others
• Intervention Plan 3: Emergency Room and others

Summarized Intervention Plan: Major Operating Room, Echocardiology, Chest X-ray and others

The Rule Repository is HUGE! (over 30k rules) Parallel Solution!

Steps
• Compute similarity between established attribute profile and a given patient profile
• Identify rules where the established attribute is most similar to the patient input
• Recommend interventions extracted from the established rules
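A minimal sketch of the recommendation steps follows. The slide does not say which similarity measure is used, so Jaccard similarity over attribute sets is an assumption here, as are the rule layout (a dict with `profile` and `procedures`) and the function names.

```python
def jaccard(a, b):
    """Jaccard similarity of two attribute collections."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend(patient_attrs, rules, top_n=3):
    """Rank rules by profile similarity and merge the procedures of the top_n rules."""
    ranked = sorted(rules, key=lambda r: jaccard(patient_attrs, r["profile"]), reverse=True)
    top = ranked[:top_n]
    summary = []
    for rule in top:
        for procedure in rule["procedures"]:
            if procedure not in summary:   # keep first occurrence, preserve order
                summary.append(procedure)
    return top, summary


rule_repository = [
    {"profile": {"Female", "Diabetes"}, "procedures": ["Major Operating Room", "Chest X-ray"]},
    {"profile": {"Male", "Hypertension"}, "procedures": ["Echocardiology", "CCU"]},
    {"profile": {"Female", "Depression"}, "procedures": ["Chest X-ray", "CT scan"]},
]
top_rules, plan = recommend({"Female", "Depression", "Hypertension"}, rule_repository, top_n=2)
```

Because each rule is scored independently, the ranking step could be split across workers for a repository of tens of thousands of rules.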

Page 38: Societal Impact of Applied Data Science on the Big Data Stack

Validation: Data Highlights

• State Inpatient Database (SID) of Washington State, heart failure cohort: year 2010 (67967 patients) for training and 2011 (52021 patients) for testing
• 3908 diagnosis and 2049 procedure codes are involved.
• Feature selection is performed using chi-square test.

High Dimensional

Demographics                  Age, Gender, Race
Comorbidity & Diagnosis       21 comorbidities and 90 diagnosis
Utilization & Interventions   21 health service utilization flags and 70 interventions
Others                        Length of Stay, # of diagnosis and interventions

Evaluation
• Extract patients from the test set who were not readmitted within 30 days
• Compute the evaluation metrics between the recommended interventions and the actual interventions

Page 39: Societal Impact of Applied Data Science on the Big Data Stack

Validation: Experiment Results

(Charts: Hits, Jaccard Index, Accuracy and True Positive Rate, each comparing Linear Regression, Hill-Climbing, Grow-Shrink and Hybrid. Axis ranges: Hits 0 to 400; Jaccard Index 0.34 to 0.40; Accuracy 0.930 to 0.942; True Positive Rate 0.45 to 0.65.)

Page 40: Societal Impact of Applied Data Science on the Big Data Stack

Back to the Chronic Readmission Case

Patient: 75-year old, female. Chronic pulmonary disease, depression, hypertension and diastolic heart failure.

No-readmit!

Cardiac catheterization lab, CT scan, echo-cardiology, echo-cardiogram

Page 41: Societal Impact of Applied Data Science on the Big Data Stack

Accountable Care Organizations: Cost/Charge Prediction

HealthSCOPE: An Interactive Distributed Data Mining Framework for Scalable Prediction of Healthcare Costs, Marquardt James, Newman Stacey, Hattarki Deepa, Srinivasan Rajagopalan, Sushmita Shanu, Ram Prabhu, Prasad Viren, Hazel David, Ramesh Archana, De Cock Martine, Teredesai Ankur, IEEE Data Mining Conference Demo Track, 2014 IEEE International Conference on, DOI: 10.1109/ICDMW.2014.45, Publication Year: 2014, Page(s): 1227-1230

Page 42: Societal Impact of Applied Data Science on the Big Data Stack

Motivation: ACO Cost Prediction

QUESTIONS
• What are healthcare costs for assigned population in 2015?
• Why is the cost so high or low?
• How does the cost distribute across demographics?

DATA
• Demographics, Diagnosis Codes, Procedure Codes, Drugs, Lab Results, Charges, Vitals
• Claims (Sources: SID, OSHPD, MEPS)
• Clinical (Source: MultiCare Collaboration)

DATA SCIENCE / APPLICATIONS
• Population Predictive Modeling
• Individual Predictive Modeling
• Feature Prioritization
• Health Prediction
• Care Management

Chandola et al., KDD 2013

Page 43: Societal Impact of Applied Data Science on the Big Data Stack

Cost/Charge Prediction: Problem Description

• Goal → predict the future healthcare cost of individuals based on their past medical and cost information.
• Supervised machine learning problem.
• Input:
  • Previous health information (e.g. diagnosis, comorbidities, etc.)
  • General demographics (age, gender, race)
  • Previous healthcare cost
  • {X} = (x1, x2, x3, ..., xp)
• Output:
  • Y = Future healthcare cost

Page 44: Societal Impact of Applied Data Science on the Big Data Stack

Four Scenarios for predicting cost

• Three Months of Historical data (Medical, Demographic and Cost) → Cost of Following Nine months (1Q)
• Six Months of Historical data (Medical, Demographic and Cost) → Cost of Following Six months (2Q)
• Nine Months of Historical data (Medical, Demographic and Cost) → Cost of Following Three months (3Q)
• Twelve Months of Historical data (Medical, Demographic and Cost) → Cost of Following Twelve months (4Q)
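One way to read the four scenarios is as (history length, prediction horizon) pairs in months. The sketch below splits a per-patient list of monthly costs accordingly. The `SCENARIOS` table mirrors the slide, while the function name and the choice to sum costs per window are assumptions for illustration.

```python
# (months of history, months to predict) for each scenario on the slide
SCENARIOS = {"1Q": (3, 9), "2Q": (6, 6), "3Q": (9, 3), "4Q": (12, 12)}


def split_costs(monthly_costs, scenario):
    """Return (historical cost, future cost) totals for one patient under a scenario."""
    history, horizon = SCENARIOS[scenario]
    if len(monthly_costs) < history + horizon:
        raise ValueError("not enough months of data for scenario " + scenario)
    past = monthly_costs[:history]
    future = monthly_costs[history:history + horizon]
    return sum(past), sum(future)


# Illustrative patient with costs 1, 2, ..., 24 over 24 months
costs = list(range(1, 25))
past_1q, future_1q = split_costs(costs, "1Q")
```

Note that the 4Q scenario needs 24 months of data per patient, while the other three need 12.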

Page 45: Societal Impact of Applied Data Science on the Big Data Stack

Non-Gaussian Distribution of Healthcare Costs

Makes it a challenging and interesting problem for research

Page 46: Societal Impact of Applied Data Science on the Big Data Stack

Existing Cost Prediction Methods

• Limited to rule-based or multiple linear regression methods
• Rule-based methods
  • Require domain knowledge
  • Expensive
• Multiple Linear Regression
  • Multi-collinearity issue
  • Sensitive to extreme values (outliers)
• Evaluation
  • Estimate the mean cost of the given sampling distribution.
  • Often in-sample data used to report predictive performance.
  • R2 evaluation metric (not a true indicator)

Page 47: Societal Impact of Applied Data Science on the Big Data Stack

Our Contributions

• Investigate the utility of state-of-the-art machine learning algorithms for the cost prediction problem.
• We empirically evaluate three algorithms:
  • Regression Trees
  • M5 Model Trees
  • Random Forest

Page 48: Societal Impact of Applied Data Science on the Big Data Stack

Regression Tree

(Figure: example regression tree with the tests "Age > 60?", "Has Asthma?" and "Gender = Female?", Yes/No branches, and leaf values 21,000, 46,000, 62,000 and 85,000.)

Page 49: Societal Impact of Applied Data Science on the Big Data Stack

M5 Model Tree

(Figure: example M5 model tree with the tests "Age > 60?", "Has Asthma?" and "Gender = Female?" and Yes/No branches.)

Page 50: Societal Impact of Applied Data Science on the Big Data Stack

Random Forest

(Figure: three example trees, each with Yes/No branches and leaf values 21,000, 46,000, 62,000 and 85,000. One tree tests "Had Procedure X?", "Age > 18?" and "Gender = Male?"; one tests "# Admits > 3?", "Race = White?" and "Has CHF?"; one tests "Age > 60?", "Has Asthma?" and "Gender = Female?".)

Page 51: Societal Impact of Applied Data Science on the Big Data Stack

Evaluation Metrics

• Mean Absolute Error (MAE)
• Root Mean Squared Error (RMSE)
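The formulas for the two metrics did not survive in the transcript. The sketch below uses their standard definitions, MAE = (1/n) Σ |y_i - ŷ_i| and RMSE = sqrt((1/n) Σ (y_i - ŷ_i)²); the function names are illustrative.

```python
import math


def mae(actual, predicted):
    """Mean absolute error between actual and predicted costs."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean squared error; penalises large errors more heavily than MAE."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))
```

Because healthcare costs are heavily skewed, a few very expensive patients can dominate RMSE, so RMSE can sit well above MAE for the same model.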

Page 52: Societal Impact of Applied Data Science on the Big Data Stack

MAE Results: SID Data (3Q Scenario)

(Chart: MAE ($), axis 0 to 30,000. Baselines: Average Baseline, Previous Cost Regression, Multiple Linear Regression. Advanced Models: Regression Tree, Random Forest, Model Tree.)

Page 53: Societal Impact of Applied Data Science on the Big Data Stack

MAE Results: MEPS Data

(Chart: MAE ($), axis 0 to 14,000. Baselines: Average Baseline, Previous Cost Regression, Multiple Linear Regression. Advanced Models: Regression Tree, Random Forest, Model Tree.)

Page 54: Societal Impact of Applied Data Science on the Big Data Stack

Prediction Error Results: M5 Model Trees

(Chart: Error ($), axis 0 to 90,000, showing MAE and RMSE for the 1Q, 2Q, 3Q and 4Q scenarios.)

Page 55: Societal Impact of Applied Data Science on the Big Data Stack

Error Distribution: WA State SID Data

For a large fraction of the population (75%), we were able to predict with higher accuracy using these algorithms.

(Chart: Maximum Prediction Error ($), axis 0 to 18,000, against Portion of Population (0%, 25%, 50%, 75%) for Multiple Linear Regression, Regression Tree, Random Forest and Model Tree.)

Page 56: Societal Impact of Applied Data Science on the Big Data Stack

Sub-Population Cost Prediction

(Diagram: Population, Sub-Population, Prediction, Future Healthcare Cost.)

Sub-populations:
• Congestive heart failure (CHF)
• Diabetes
• COPD
• Asthma
• Coronary artery disease (CAD)
• Age 65+

Page 57: Societal Impact of Applied Data Science on the Big Data Stack

Most difficult cohort to predict

(Chart: MAE ($), axis 0 to 35,000, for the Asthma, Diabetes, CHF, COPD, Coronary and Over 65 cohorts, comparing model trees and linear regression.)

Page 58: Societal Impact of Applied Data Science on the Big Data Stack

Engineering the Solutions: Risk-O-Readmission & Cost-As-a-Service

Page 59: Societal Impact of Applied Data Science on the Big Data Stack

---------- Forwarded message ----------
From: Windows Azure Pass System Admin <[email protected]>
Date: Thu, Nov 7, 2013 at 10:50 AM
Subject: Gifting Letter for Windows Azure Research Pass
To: "Ankur M. Teredesai" <[email protected]>
Cc: "Azure4Research (RFP External)" <[email protected]>

Dear Ankur M. Teredesai,

We have approved your application for a Windows Azure Research Pass Grant. In order to receive your pass, download the Microsoft gifting letter from the following link:

Page 60: Societal Impact of Applied Data Science on the Big Data Stack

Risk-of-Readmission as a Service

Page 61: Societal Impact of Applied Data Science on the Big Data Stack

Scale Issues: Cost Prediction as a Service

(Architecture diagram)
• Data Sources: WA-SID Claims / MEPS Survey (for training)
• Big Data Stack: R, Spark
• Cost Prediction Engine: Model Bank deployed on ADAPA, with a Model Selector over (A) Linear Regression, (B) Regression Trees, (C) M5 Model Trees
• Cost Prediction API
• Web App for ACOs: Beneficiary Claims (Population Batch/Individual) in; Predicted, Previous year, Historic population Costs + population statistics out
• Web App for Individual: Beneficiary Claims for individual in; Predicted cost for the individual out
• Individual Beneficiary Feature Vector, Individual Beneficiary Predicted Cost

Page 63: Societal Impact of Applied Data Science on the Big Data Stack

Apache Spark

(Diagram: Apache Spark on HDFS. The Master runs the Driver with an RDD; the Slave nodes each run Spark and hold an in-memory data partition (Partition 1, Partition 2). HDFS holds the data partitions and replica data partitions.)

Page 64: Societal Impact of Applied Data Science on the Big Data Stack

Weighted k-NN for Regression

(Diagram: the Test Instance is sent to Node 1 (Data Partition 1) and Node 2 (Data Partition 2). Each node computes its local kNN (kNN1, kNN2). The 2k NN candidates go through Group & Sort, the Top k are kept as the kNN, and their Weighted Average gives the Predicted Cost.)
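The partitioned flow in the diagram can be sketched in plain Python as below. This is a single-process illustration rather than Spark code. It assumes Euclidean distance and inverse-distance weights, neither of which the slide specifies, and the function names are hypothetical.

```python
import heapq
import math


def local_knn(partition, query, k):
    """Per-node step: the k nearest (distance, cost) pairs within one data partition."""
    scored = [(math.dist(features, query), cost) for features, cost in partition]
    return heapq.nsmallest(k, scored)


def predict_cost(partitions, query, k, eps=1e-9):
    """Merge the local candidates, keep the global top k, and take a weighted average."""
    candidates = []
    for partition in partitions:            # in a cluster, each partition runs on its own node
        candidates.extend(local_knn(partition, query, k))
    top_k = heapq.nsmallest(k, candidates)  # the "group & sort, top k" step
    weights = [1.0 / (distance + eps) for distance, _ in top_k]
    weighted_sum = sum(w * cost for w, (_, cost) in zip(weights, top_k))
    return weighted_sum / sum(weights)


partitions = [
    [((0.0,), 10.0), ((10.0,), 100.0)],   # data partition 1
    [((2.0,), 30.0), ((20.0,), 500.0)],   # data partition 2
]
predicted = predict_cost(partitions, (1.0,), k=2)
```

Taking k candidates from every partition is enough to recover the exact global top k, since any global neighbor is also among the k nearest within its own partition.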

Page 65: Societal Impact of Applied Data Science on the Big Data Stack

Rough Set

• Rough set theory is an ML framework that is especially suitable for information systems with inconsistencies.
• Rough set theory handles discrete attributes.
• Lower approximation: instances that necessarily belong to the class
• Upper approximation: instances that possibly belong to the class

Patient   Age ≥ 50   Alcohol Disorder Visit   Cost
P1        Yes        Yes                      High
P2        Yes        Yes                      High
P3        Yes        No                       Low
P4        Yes        No                       High
P5        No         No                       Low
P6        No         Yes                      High

Similar patients but belong to different classes!
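To make the two approximations concrete, the sketch below computes them for the "High" cost class from the six-patient table. Patients with identical attribute values form one block; the lower approximation collects blocks that are entirely "High", the upper approximation collects blocks containing at least one "High" patient. The function and variable names are illustrative.

```python
from collections import defaultdict

# Rows from the slide: (patient, age >= 50, alcohol disorder visit, cost class)
TABLE = [
    ("P1", "Yes", "Yes", "High"),
    ("P2", "Yes", "Yes", "High"),
    ("P3", "Yes", "No", "Low"),
    ("P4", "Yes", "No", "High"),
    ("P5", "No", "No", "Low"),
    ("P6", "No", "Yes", "High"),
]


def approximations(table, target):
    """Return (lower, upper) approximations of the target class as sets of patient ids."""
    blocks = defaultdict(list)
    for patient, age, visit, cost in table:
        blocks[(age, visit)].append((patient, cost))   # indiscernible patients share a block
    lower, upper = set(), set()
    for members in blocks.values():
        names = {patient for patient, _ in members}
        labels = {cost for _, cost in members}
        if labels == {target}:
            lower |= names        # every patient in the block is in the class
        if target in labels:
            upper |= names        # at least one patient in the block is in the class
    return lower, upper


lower, upper = approximations(TABLE, "High")
```

P3 and P4 share the same attribute values but differ in class, so both land in the upper approximation and neither in the lower one.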

Page 66: Societal Impact of Applied Data Science on the Big Data Stack

Fuzzy Rough Set

• Uses fuzzy logic to handle continuous attributes.
• Similarity matrix contains values between 0 and 1.
• Inconsistent instances are highly related but have a different class.

Patient   Age   Alcohol Disorder Visits   Cost
P1        52    1                         $13335
P2        59    4                         $277966
P3        55    0                         $8139
P4        50    0                         $66058
P5        34    0                         $5815
P6        26    1                         $38526

      P1     P2     P3     P4     P5     P6
P1    1      0.52   0.83   0.84   0.60   0.61
P2    0.52   1.00   0.44   0.36   0.12   0.13
P3    0.83   0.44   1      0.92   0.68   0.44
P4    0.84   0.36   0.92   1      0.76   0.51
P5    0.60   0.12   0.68   0.76   1      0.75
P6    0.61   0.13   0.44   0.51   0.75   1

Page 67: Societal Impact of Applied Data Science on the Big Data Stack

Fuzzy Rough Set

• Let r_{j,i} be the degree of similarity of instances i and j.
• Let c_i be the degree to which instance i belongs to the class.
• Then the degree to which instance j belongs to the:
  • Lower approximation of the class is: min{ max(1 - r_{j,i}, c_i) | i = 1, ..., n }
  • Upper approximation of the class is: max{ min(r_{j,i}, c_i) | i = 1, ..., n }
• Current implementations can handle only up to 100,000 instances because they keep the similarity matrix in memory.
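The two formulas can be checked on the six-patient similarity matrix from the earlier Fuzzy Rough Set slide. The crisp membership vector `C` is an assumption for illustration: it borrows the High/Low labels of the discrete Rough Set example (P1, P2, P4, P6 as "High"), which the deck does not explicitly tie to this matrix.

```python
# Similarity matrix r[j][i] for P1..P6, copied from the slide's table
R = [
    [1.00, 0.52, 0.83, 0.84, 0.60, 0.61],
    [0.52, 1.00, 0.44, 0.36, 0.12, 0.13],
    [0.83, 0.44, 1.00, 0.92, 0.68, 0.44],
    [0.84, 0.36, 0.92, 1.00, 0.76, 0.51],
    [0.60, 0.12, 0.68, 0.76, 1.00, 0.75],
    [0.61, 0.13, 0.44, 0.51, 0.75, 1.00],
]
# Assumed crisp membership in the "High" cost class (1.0 = member)
C = [1.0, 1.0, 0.0, 1.0, 0.0, 1.0]


def lower_approx(r, c, j):
    """min over i of max(1 - r[j][i], c[i])."""
    return min(max(1.0 - r[j][i], c[i]) for i in range(len(c)))


def upper_approx(r, c, j):
    """max over i of min(r[j][i], c[i])."""
    return max(min(r[j][i], c[i]) for i in range(len(c)))


p1_lower, p1_upper = lower_approx(R, C, 0), upper_approx(R, C, 0)
p3_lower, p3_upper = lower_approx(R, C, 2), upper_approx(R, C, 2)
```

Under these assumed labels, P1's lower membership is pulled down by its high similarity (0.83) to P3, a non-member, while P3's upper membership is high because it closely resembles P4 (0.92), a member.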

Page 68: Societal Impact of Applied Data Science on the Big Data Stack

Fuzzy Rough Set

Upper approximation: max{ min(r_{j,i}, c_i) | i = 1, ..., n }

Page 69: Societal Impact of Applied Data Science on the Big Data Stack

Fuzzy Rough Set

Lower approximation: min{ max(1 - r_{j,i}, c_i) | i = 1, ..., n }

Page 70: Societal Impact of Applied Data Science on the Big Data Stack

Implementation

• The construction of the similarity matrix can be done in a

parallel manner, making each of K compute nodes calculate n/K columns of the similarity matrix.

• No need to store the similarity matrix as a whole. • The construction of the similarity matrix

does not have to befinished before (partial) computation of

the lower and upperapproximations can begin.

Node  1 Node  2
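A minimal single-process sketch of this idea follows. Each "node" scans only its own block of n/K columns, computes similarities on the fly without storing them, and keeps a running min and max per instance. The partial results are then merged with an element-wise min and max. The similarity formula, the class vector `c`, and all function names are assumptions for illustration. A real deployment would run each call to `partial_approximations` on a separate Spark or MPI worker.

```python
def similarity(a, b, ranges):
    """Assumed similarity: mean of 1 - |a - b| / range over the attributes."""
    parts = [1 - abs(x - y) / rng for x, y, rng in zip(a, b, ranges)]
    return sum(parts) / len(parts)

def partial_approximations(rows, c, ranges, columns):
    """One worker's share: visit only the instances i in `columns`, for every j."""
    lower = [1.0] * len(rows)   # neutral element for min on [0, 1]
    upper = [0.0] * len(rows)   # neutral element for max on [0, 1]
    for i in columns:
        for j, row_j in enumerate(rows):
            r_ji = similarity(row_j, rows[i], ranges)  # used once, never stored
            lower[j] = min(lower[j], max(1 - r_ji, c[i]))
            upper[j] = max(upper[j], min(r_ji, c[i]))
    return lower, upper

def combine(partials):
    """Merge per-worker results with element-wise min (lower) and max (upper)."""
    lowers, uppers = zip(*partials)
    lower = [min(vals) for vals in zip(*lowers)]
    upper = [max(vals) for vals in zip(*uppers)]
    return lower, upper

# (age, alcohol-disorder visits) per patient; c is a hypothetical class labeling.
rows = [(52, 1), (59, 4), (55, 0), (50, 0), (34, 0), (26, 1)]
c = [1, 1, 0, 1, 0, 1]
ranges = [max(r[k] for r in rows) - min(r[k] for r in rows) for k in range(2)]

K = 2                      # number of simulated compute nodes
n = len(rows)
blocks = [range(k * n // K, (k + 1) * n // K) for k in range(K)]
partials = [partial_approximations(rows, c, ranges, block) for block in blocks]
lower, upper = combine(partials)
print(lower)
print(upper)
```

Because min and max are associative, the merge gives the same answer as a single pass over the full matrix, and memory per worker stays proportional to n rather than n squared.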

Page 71: Societal Impact of Applied Data Science on the Big Data Stack

Implementation: Lower Approximation

Upper  Approximation

Page 72: Societal Impact of Applied Data Science on the Big Data Stack

Spark vs. MPI: Fuzzy Rough Set

Page 73: Societal Impact of Applied Data Science on the Big Data Stack

Cost Prediction as a Service

[Architecture diagram; labels as recovered from the slide]

• Data Sources: WA-SID Claims / MEPS Survey (for training)
• Big Data Stack: Spark, R
• Cost Prediction Engine: Model Bank deployed on ADAPA, with a Model Selector
• Models: (A) Linear Regression, (B) Regression Trees, (C) M5 Model Trees
• Cost Prediction API: Individual Beneficiary Feature Vector in, Individual Beneficiary Predicted Cost out
• Web App for ACOs: Beneficiary Claims (Population Batch/Individual) in; predicted, previous-year, and historic population costs + population statistics out
• Web App for Individual: Beneficiary Claims for individual in; predicted cost for the individual out
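The slides do not describe how the Model Selector chooses among the models in the bank. One plausible reading, sketched here purely as an assumption, is that it picks the model with the lowest error on held-out data. The two models below are toy stand-ins, not the deployed ADAPA models, and every name and number is hypothetical.

```python
def mean_absolute_error(model, examples):
    """Average absolute difference between predicted and actual cost."""
    return sum(abs(model(x) - y) for x, y in examples) / len(examples)

def select_model(model_bank, validation):
    """Return the name of the model with the lowest held-out error (assumed rule)."""
    return min(model_bank,
               key=lambda name: mean_absolute_error(model_bank[name], validation))

# Hypothetical stand-ins for two entries of the model bank.
model_bank = {
    "A: linear regression": lambda x: 1000.0 * x[0] + 500.0,
    "B: regression tree": lambda x: 2000.0 if x[0] < 3 else 9000.0,
}

# Hypothetical held-out pairs: (feature vector, actual yearly cost).
validation = [((1,), 1600.0), ((2,), 2400.0), ((5,), 5600.0)]

best = select_model(model_bank, validation)
predicted = model_bank[best]((4,))   # score one new beneficiary feature vector
print(best, predicted)
```

Keeping the bank as a mapping from name to callable means a new model can be added without touching the selection logic.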

Page 74: Societal Impact of Applied Data Science on the Big Data Stack

Readmission  Application

• Android
• Windows Phone
• Patient View
  • What is my risk?
• Doctor View
  • Who are my risky patients?
  • Alerts
  • Interventions


Page 75: Societal Impact of Applied Data Science on the Big Data Stack


http://healthscope.cloudapp.net/hscope-­‐dev/aco/

Healthcare  Scalable  COst  Prediction  Engine  (HealthSCOPE)

Page 76: Societal Impact of Applied Data Science on the Big Data Stack

Milestones: Readmission Risk

[Timeline figure; labels as recovered from the slide]

• Baseline: Yale Model, 0.6 AUC
• Dec 2012: UW 2012 Result, 0.64 AUC (ensemble method, hierarchical classification)
• June 2013: Risk-o-Meter Development + Big Data Efforts
• Dec 2013: QlikView Readmission App
• Feb 2014: UW 2014 Result, 0.74 AUC (lab results + new algorithm, AdaBoost)
• March 2014: Integrating care pathway; Bayesian Network Learning
• July 2014: Real Time Care Factors & Pathways with EPIC
• Aug 2014 -> Moving Forward: Machine Learning Process to Target New Chronic Diseases
• Data stages: Pre-Admission (clinical data), Post-Admission (clinical data), Post-Discharge (clinical data), Post-Discharge (claim data)
• Publications: IEEE Big Data (REF #3); KDD (REF #1 & 2); HEALTHINF (REF #4 & 5); KDD (REF #6); ICDM 2014 (REF #6)

AUC: accuracy measure (Area Under Curve)
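The AUC values quoted for the readmission models are areas under the ROC curve. This equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one, so 0.5 corresponds to chance and 1.0 to a perfect ranking. A minimal sketch of that pairwise definition follows. The labels and scores are hypothetical and the function name `auc` is illustrative.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (ties count as 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical data: 1 = readmitted, 0 = not readmitted, with model risk scores.
labels = [0, 0, 1, 1]
scores = [0.10, 0.40, 0.35, 0.80]
print(auc(labels, scores))
```

The pairwise form is quadratic in the number of cases. For large cohorts a rank-based computation gives the same value in O(n log n).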

Page 77: Societal Impact of Applied Data Science on the Big Data Stack

Milestones: Cost Prediction

[Timeline figure; labels as recovered from the slide]

• June 2014: H-SCOPE I, SID Data
• August 2014: H-SCOPE II, Population View (ACO), OSHPD Data Application
• Sept. 2014: H-SCOPE III, Adapa Scoring Engine, Spark Framework
• Nov. 2014: H-SCOPE IV, SID + MEPS data
• Dec. 2014: H-SCOPE V, Five Cohort
• July 2015: HealthSCOPE VI
• H-SCOPE VII: AHRQ Private data
• Aug 2015 -> Moving Forward
• Models: M5 Model Trees, Random Forest, Regression Trees
• Other labels: Problem Exploration; Admit Level; Beneficiary Level; Beneficiary View; Four Future Scenario; Sub-Population; Deep Learning; Time & Cost of Hospital Readmission; Time, Cost and Illness (Alignment) Prediction
• Venues: ICDM 2014, KDD-2015, AMIA-2015, WWW-DigitalHealth-2015

Page 78: Societal Impact of Applied Data Science on the Big Data Stack

Milestones: Merging Threads

[Timeline figure: 2012, 2013, 2014, 2015, 2016 and beyond]

• Risk of Readmission (Clinical, Sociological & Claims)
• Cost Prediction (Claims and secondary data sources): 2014, 2015
• Risk & Cost Convergence: 2015

AUC: accuracy measure (Area Under Curve)

Page 79: Societal Impact of Applied Data Science on the Big Data Stack

[Big data stack diagram; labels as recovered from the slide]

• Raw Data from Sources (SID, OSHPD, HCUP, Edifecs, MHS, CMS, LINCS, Industry)
• Data formats: Flat Files CSV, Claims X12, Clinical HL7, Census US Gov, Unstructured CCD
• Ingestion: ETL Tools, Sqoop
• Platform: HDFS, NUMA, MPI, Grappa
• Libraries: Distance Compute Library; Instance Selection (RNGE, Drop 3); Fuzzy Rough Set Approximation; Random Forests; KNN; Bayesian Networks; Support Vector Machines
• Applications: CHF Risk of Readmission, Geo Routing, Cost of Chronic Interventions, Age/Gender Prediction, Malware Analytics, Personalized Cancer Therapy, Other Solutions
• Industry Partners and Domain Experts

Page 80: Societal Impact of Applied Data Science on the Big Data Stack

[Same stack diagram and labels as on the previous slide, with Personalized Cancer Therapy and CHF Risk of Readmission swapping positions.]

Page 81: Societal Impact of Applied Data Science on the Big Data Stack


Our Sincere Thanks for Your Support!