p.m.pardalos - on big data application

35
Big Data Case Study Proposed Model Results and Conclusions On Big Data Applications P. M. Pardalos 1 , 2 1 Center for Applied Optimization Department of Industrial and Systems Engineering University of Florida 2 Laboratory of Algorithms and Technologies for Networks Analysis (LATNA) National Research University Higher School of Economics ItForum, 2013 P. M. Pardalos, On Big Data Applications

Upload: ekaterina-morozova

Post on 31-Oct-2014

607 views

Category:

Technology


0 download

DESCRIPTION

 

TRANSCRIPT

Page 1: P.M.Pardalos - On Big Data Application

Big DataCase Study

Proposed ModelResults and Conclusions

On Big Data Applications

P. M. Pardalos1, 2

1Center for Applied OptimizationDepartment of Industrial and Systems Engineering

University of Florida

2Laboratory of Algorithms and Technologies for Networks Analysis (LATNA)National Research University Higher School of Economics

ItForum, 2013

P. M. Pardalos, On Big Data Applications

Page 2: P.M.Pardalos - On Big Data Application

Big DataCase Study

Proposed ModelResults and Conclusions

P. M. Pardalos, On Big Data Applications

Page 3: P.M.Pardalos - On Big Data Application

Big DataCase Study

Proposed ModelResults and Conclusions

How big will the data be tomorrow?

Cisco: Global ’Net Traffic to Surpass 1 Zettabyte in 2016http://cacm.acm.org/news/150041-cisco-global-net-traffic-to-surpass-1-zettabyte-in-2016/fulltext

P. M. Pardalos, On Big Data Applications

Page 4: P.M.Pardalos - On Big Data Application

Big DataCase Study

Proposed ModelResults and Conclusions

What is BigData?

The proliferation of massive datasets brings with it a seriesof special computational challenges.This data avalanche arises in a wide range of scientific andcommercial applications.With advances in computer and information technologies,many of these challenges are beginning to be addressed.

P. M. Pardalos, On Big Data Applications

Page 5: P.M.Pardalos - On Big Data Application

Big DataCase Study

Proposed ModelResults and Conclusions

HandBook of Massive Datasets

Handbook of Massive Data Sets,co-editors: J. Abello, P.M. Pardalos,and M. Resende, Kluwer AcademicPublishers, (2002).

P. M. Pardalos, On Big Data Applications

Page 6: P.M.Pardalos - On Big Data Application

Big DataCase Study

Proposed ModelResults and Conclusions

Data Mining: the practice of searching through largeamounts of computerized data to find useful patterns ortrends.Optimization: An act, process, or methodology of makingsomething (as a design, system, or decision) as fullyperfect, functional, or effective as possible; specifically :the mathematical procedures (as finding the maximum of afunction) involved in this.

P. M. Pardalos, On Big Data Applications

Page 7: P.M.Pardalos - On Big Data Application

Big DataCase Study

Proposed ModelResults and Conclusions

Hard drive Cost

P. M. Pardalos, On Big Data Applications

Page 8: P.M.Pardalos - On Big Data Application

Big DataCase Study

Proposed ModelResults and Conclusions

Hard drive Capacity

P. M. Pardalos, On Big Data Applications

Page 9: P.M.Pardalos - On Big Data Application

Big DataCase Study

Proposed ModelResults and Conclusions

Processing power

P. M. Pardalos, On Big Data Applications

Page 10: P.M.Pardalos - On Big Data Application

Big DataCase Study

Proposed ModelResults and Conclusions

Some Examples of Massive Datasets

Internet DataWeather DataTelecommunications DataMedical DataDemographics DataFinancial Data

P. M. Pardalos, On Big Data Applications

Page 11: P.M.Pardalos - On Big Data Application

Big DataCase Study

Proposed ModelResults and Conclusions

Handling Massive Datasets

New computing technologies are needed to handle massivedatasets

Input-Output Parallel SystemsExternal Memory AlgorithmsQuantum Computing

P. M. Pardalos, On Big Data Applications

Page 12: P.M.Pardalos - On Big Data Application

Big DataCase Study

Proposed ModelResults and Conclusions

Study Objective

An example in biomedicine

Acute Kidney Injury (AKI) is one of common complicationsin post-operative patients.The in-hospital mortality rate for patients with AKI may beas high as 60%.An elevated serum creatinine (sCr) level in blood iscommonly recognized as an indicator of AKI.The degree to which patterns of sCr change areassociated with in-hospital mortality is unknown.

P. M. Pardalos, On Big Data Applications

Page 13: P.M.Pardalos - On Big Data Application

Big DataCase Study

Proposed ModelResults and Conclusions

Study Objective

Study Outline

Objectives:

To develop a comprehensive model for assessing themortality risk in post-operative patientsTo establish a quantitative relationship between sCr patternand mortality risk

Results:

A probabilistic model was designed to assess mortality riskin post-operative patientsThe model provided high discriminative capacity andaccuracy (ROC = 0.92)A quantitative association between sCr time series andmortality risk was establishedSimple and informative sCr risk factors were derived

P. M. Pardalos, On Big Data Applications

Page 14: P.M.Pardalos - On Big Data Application

Big DataCase Study

Proposed ModelResults and Conclusions

Study Objective

Data Set

We performed a retrospective study involving patientsadmitted to Shands Hospital (Gainesville, FL) from 2000through 2010.For each patient who underwent a surgery detailed clinicaland outcome data were collected.The analysis included 60,074 patients who underwentin-patient surgery at Shands Hospital during a 10-yearsperiod.

P. M. Pardalos, On Big Data Applications

Page 15: P.M.Pardalos - On Big Data Application

Big DataCase Study

Proposed ModelResults and Conclusions

Probabilistic score for mortality riskRisk Factors EstimationCreatinine time series analysis

Logistic Function

Probabilistic score 1

P(C = 1|X = x) =

(1 + exp

(w0 +

m∑i=1

wigi(xi)

))−1

; (1)

C ∈ {0,1} is an outcome;X = (X1, . . . ,Xm) - the risk factors;gi - nonlinear function associated with the i th factorw0 = lnP(C = 1)/P(C = 0) is the a priori log odds ratio.{wi}, i > 0 - weights indicating importance of the risk factors

1S. Saria, A. Rajani,J. Gould,D. Koller,A. Penn, Integration of EarlyPhysiological Responses Predicts Later Illness Severity in Preterm Infants,Sci Transl Med, 2010

P. M. Pardalos, On Big Data Applications

Page 16: P.M.Pardalos - On Big Data Application

Big DataCase Study

Proposed ModelResults and Conclusions

Probabilistic score for mortality riskRisk Factors EstimationCreatinine time series analysis

Nonlinear Risk Factors Estimation

From the probabilistic score equation we can derive

w0 +m∑

i=1

wi · gi(xi) = logP(X = x |C = 0)P(X = x |C = 1)

+ logP(C = 0)P(C = 1)

Under assumption of risk factors independence:

w0 +m∑

i=1

wi · gi(xi) =m∑

i=1

logP(Xi = xi |C = 0)P(Xi = xi |C = 1)

+ logP(C = 0)P(C = 1)

.

For continuous factors P(Xi = xi |C) impliesP(Xi ∈ [xi − ε, xi + ε]|C) where ε is sufficiently small

P. M. Pardalos, On Big Data Applications

Page 17: P.M.Pardalos - On Big Data Application

Big DataCase Study

Proposed ModelResults and Conclusions

Probabilistic score for mortality riskRisk Factors EstimationCreatinine time series analysis

Model Parameters

maxw

∑k∈{0,1}

∑j: C j =k

lnP(C j = k |Xi = x ji , . . . ,Xi = x j

m);

j denotes patients in the data set;C j ∈ {0,1} - outcome for the j th patient;x j

i - value of i th risk factor for the j th patient.

We solved this problem approximately with interior pointmethod implemented in MATLAB

P. M. Pardalos, On Big Data Applications

Page 18: P.M.Pardalos - On Big Data Application

Big DataCase Study

Proposed ModelResults and Conclusions

Probabilistic score for mortality riskRisk Factors EstimationCreatinine time series analysis

Nonlinear risk functions

We get

gi(xi) = lnP(Xi = xi |C = 0)P(Xi = xi |C = 1)

;

w0 = lnP(C = 1)P(C = 0)

;

wi = 1, i = 1, . . . ,m.

The probabilities P(Xi = xi |C), i = 1, . . . ,m were learnedfrom the data using maximum likelihood principle.

P. M. Pardalos, On Big Data Applications

Page 19: P.M.Pardalos - On Big Data Application

Big DataCase Study

Proposed ModelResults and Conclusions

Probabilistic score for mortality riskRisk Factors EstimationCreatinine time series analysis

Discrete risk factors estimation

For discrete valued factors

gi(x) = ln

(#{j : C j = 1, x j

i = x}#{j : C j = 1}

#{j : C j = 0}#{j : C j = 1, x j

i = x}

).

P. M. Pardalos, On Big Data Applications

Page 20: P.M.Pardalos - On Big Data Application

Big DataCase Study

Proposed ModelResults and Conclusions

Probabilistic score for mortality riskRisk Factors EstimationCreatinine time series analysis

Continuous Risk Factors Estimation

For continuous factors we defined

gi(x) = lnf 0i (x)f 1i (x)

,

f ki (x) = ϕ∗(x − h∗, θ∗),

{ϕ∗,h∗, θ∗} = arg maxϕ∈Φ,θ,h

Lki (ϕ, θ, h),

Lki (ϕ, θ, h) =

∑j:C j =k

lnϕ(x j

i − h, θ)Fϕ(l2i , θ)− Fϕ(l1i , θ)

.

P. M. Pardalos, On Big Data Applications

Page 21: P.M.Pardalos - On Big Data Application

Big DataCase Study

Proposed ModelResults and Conclusions

Probabilistic score for mortality riskRisk Factors EstimationCreatinine time series analysis

Continuous Risk Factors Estimation

1 2 3 4 5 6 70

50

100

150

200

maxCr/minCr value

His

togr

am o

f val

ues,

C =

1

Observed valuesLearned Inverse Gaussian distribution

1 2 3 4 5 6 70

0.5

1

1.5

2x 10

4

maxCr/minCr value

His

togr

am o

f val

ues,

C =

0

Observed valuesLearned Loglogistic distribution

1 2 3 4 5 6 70

50

100

150

200

recentCr/minCr value

His

togr

am o

f val

ues,

C =

1

Histogram of observed valuesLearned Lognormal distribution

1 2 3 4 5 6 70

1

2

3x 10

4

recentCr/minCr value

His

togr

am o

f val

ues,

C =

0

Histogram of observed valuesLearned LogLogistic distribution

P. M. Pardalos, On Big Data Applications

Page 22: P.M.Pardalos - On Big Data Application

Big DataCase Study

Proposed ModelResults and Conclusions

Probabilistic score for mortality riskRisk Factors EstimationCreatinine time series analysis

Age Risk Factor

20 40 60 80 1000

20

40

60

Age

Dis

trib

utio

n of

val

ues,

C =

1

20 40 60 80 1000

1000

2000

3000

Age

Dis

trib

utio

n of

val

ues,

C =

0

P. M. Pardalos, On Big Data Applications

Page 23: P.M.Pardalos - On Big Data Application

Big DataCase Study

Proposed ModelResults and Conclusions

Probabilistic score for mortality riskRisk Factors EstimationCreatinine time series analysis

Which creatinine (Cr) characteristics are ”good”?

Absolute Cr value is not very informative(maximal Cr value)/(minimal Cr value) works goodRecent Cr value is important

P. M. Pardalos, On Big Data Applications

Page 24: P.M.Pardalos - On Big Data Application

Big DataCase Study

Proposed ModelResults and Conclusions

Probabilistic score for mortality riskRisk Factors EstimationCreatinine time series analysis

Cr plots in patients with elevated Cr level

P. M. Pardalos, On Big Data Applications

Page 25: P.M.Pardalos - On Big Data Application

Big DataCase Study

Proposed ModelResults and Conclusions

Probabilistic score for mortality riskRisk Factors EstimationCreatinine time series analysis

Our Speculations

The risk is defined by a function R(RC,OD),RC stands for recent conditionOD stands for overall damage suffered during AKIOne should look for features describing these two factors

P. M. Pardalos, On Big Data Applications

Page 26: P.M.Pardalos - On Big Data Application

Big DataCase Study

Proposed ModelResults and Conclusions

Resulting nonlinear risk functions

P. M. Pardalos, On Big Data Applications

Page 27: P.M.Pardalos - On Big Data Application

Big DataCase Study

Proposed ModelResults and Conclusions

Classification accuracy evaluation

70/30 cross-validation analysis has been performedRandom split into 70% training subset and 30% testingsubsetThe average over 100 runs results are reported

P. M. Pardalos, On Big Data Applications

Page 28: P.M.Pardalos - On Big Data Application

Big DataCase Study

Proposed ModelResults and Conclusions

Learned risk factors weights

0.2 0.4 0.6 0.8 1

procedure

admission_type

doctor

morbidity

age

maxCr/minCr

recentCr/minCr

Odd

s ra

tio

20 30 40 50 60 70 80 900

2

4

Age

1 1.5 2 2.5 30

5

10

recentCr/minCr value

1 1.5 2 2.5 30

1

2

maxCr/minCr value

P. M. Pardalos, On Big Data Applications

Page 29: P.M.Pardalos - On Big Data Application

Big DataCase Study

Proposed ModelResults and Conclusions

The odds ratio based on Cr factors

1.5 2 2.5 3

1.5

2

2.5

3

maxCr/minCr value

rece

ntC

r/m

inC

r va

lue

2

4

6

8

10

12

14

P. M. Pardalos, On Big Data Applications

Page 30: P.M.Pardalos - On Big Data Application

Big DataCase Study

Proposed ModelResults and Conclusions

The Resulting ROC Curves

0 0.2 0.4 0.6 0.8 10

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

False positive rate (1 − specificity)

Tru

e po

sitiv

e ra

te (

sens

itivi

ty)

Both Cr factors, AUC = 0.85maxCr/minCr, AUC = 0.84recentCr/minCr, 0.76

0 0.2 0.4 0.6 0.8 10

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

False positive rate (1 − specificity)

Tru

e po

sitiv

e ra

te (

sens

itivi

ty)

Preop. + both Cr factors, AUC = 0.92Preop. + maxCr/minCr, AUC = 0.91Preop. + recentCr/minCr, AUC = 0.91Preop. only, AUC = 0.83

P. M. Pardalos, On Big Data Applications

Page 31: P.M.Pardalos - On Big Data Application

Big DataCase Study

Proposed ModelResults and Conclusions

Cr time series patterns

admission discharge1

2

3

4Odds ratios: No recovery

time

sCr/

min

Cr

1.79 (1.7,1.85)

6.46 (6.04,6.91)

21.39 (19.42,23.5)

admission discharge1

2

3

4Odds ratios: Partial recovery

time

sCr/

min

Cr

0.79 (0.75,0.83)

2.34 (2.26,2.49)

8.3 (7.85,9.04)

admission discharge1

2

3

4Odds ratios: Full recovery

time

sCr/

min

Cr

0.48 (0.44,0.53)

0.63 (0.57,0.73)

0.81 (0.71,0.99)

admission discharge1

2

3

4Odds ratios: Reference patterns

time

sCr/

min

Cr 1 (0.88,1.22)

1.12

1 (0.93,1.12)

1.241 (0.96,1.03) 1.32

0.22 (0.17,0.25)

P. M. Pardalos, On Big Data Applications

Page 32: P.M.Pardalos - On Big Data Application

Big DataCase Study

Proposed ModelResults and Conclusions

Model Calibration

20 40 60 80 1000

0.1

0.2

0.3

0.4

0.5

0.6

0.7

Percentile

Mor

talit

y ris

k

Probabilistic model score compared to mortality risks derived from the data

P. M. Pardalos, On Big Data Applications

Page 33: P.M.Pardalos - On Big Data Application

Big DataCase Study

Proposed ModelResults and Conclusions

Risk factors included Prob. model AUC (95% CI)Preop. 0.835 (0.817-0.854)

Hosmer-Lemeshow statistics s = 95.06; p = 0.57Preop. + maxCr/minCr 0.900 (0.888-0.911)

Hosmer-Lemeshow statistics s = 84.33; p = 0.84Preop. + recentCr/minCr 0.910 (0.897-0.923)

Hosmer-Lemeshow statistics s = 72.19; p = 0.98Preop. + both sCr factors 0.917 (0.906-0.929)

Hosmer-Lemeshow statistics s = 68.94; p = 0.99maxCr/minCr only 0.834 (0.816-0.852)

Hosmer-Lemeshow statistics s = 1,818; p = 0recentCr/minCr only 0.791 (0.764-0.818)

Hosmer-Lemeshow statistics s = 112.88; p = 0.14both sCr factors 0.855 (0.837-0.873)

Hosmer-Lemeshow statistics s = 137.31; p = 0.01P. M. Pardalos, On Big Data Applications

Page 34: P.M.Pardalos - On Big Data Application

Big DataCase Study

Proposed ModelResults and Conclusions

Conclusions

Relatively high classification accuracy was obtained usingprobabilistic modelCreatinine time series do provide complimentaryinformation on patient’s risk of deathPresumably AKI effect on risk of death depends on twofactors:

Kidney recent conditionOverall kidney damage suffered since surgery

P. M. Pardalos, On Big Data Applications

Page 35: P.M.Pardalos - On Big Data Application

Big DataCase Study

Proposed ModelResults and Conclusions

The End

Thank you!

P. M. Pardalos, On Big Data Applications