Trend Analysis in Stulong Data The Gerstner laboratory for intelligent decision making and control Jiří Kléma, Lenka Nováková, Filip Karel, Olga Štěpánková PKDD 2004, Discovery Challenge Department of Cybernetics, Czech Technical University, Prague


Page 1: Trend Analysis in Stulong Data The Gerstner laboratory for intelligent decision making and control Jiří Kléma, Lenka Nováková, Filip Karel, Olga Štěpánková

Trend Analysis in Stulong Data

The Gerstner laboratory for intelligent decision making and control

Jiří Kléma, Lenka Nováková, Filip Karel, Olga Štěpánková

PKDD 2004, Discovery Challenge

Department of Cybernetics, Czech Technical University, Prague

Page 2: Outline

Previous CTU entry
– subgroup discovery (ENTRY), general CVD model
– trend analysis: global approach vs. windowing
Role of windowing in mining trends
– Kaplan-Meier (KM) and Cox models in medicine
– (symbolic) temporal trends in data mining
Development of windowing approach
– temporal CVD definition
– role of the window length
– multi-feature interactions
Ordinal association rules
– processing of the windowed features

Page 3: STULONG Data

Four tables: Entry, Control, Letter, Death
Dependent variable: (static) CVD
– CardioVascular Diseases
– Boolean attribute derived from the A2 questionnaire (Control table)
CVD = false: the patient has no coronary disease.
CVD = true: at least one of the attributes Hodn1, Hodn2, Hodn3, Hodn11, Hodn13, Hodn14 is true.
We remove patients who have only diabetes (Hodn4) or cancer (Hodn15).
Diagnoses covered by these attributes include: positive angina pectoris, (silent) myocardial infarction, cerebrovascular accident, ischemic heart disease
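The static CVD definition on this slide can be sketched in a few lines. This is only an illustration: the record layout (a dict mapping attribute names to booleans) and the function name `derive_cvd` are assumptions, and "diabetes or cancer only" is read as "Hodn4 or Hodn15 is true while none of the six CVD attributes is".

```python
# Attribute names are taken from the slide; the dict-based record is hypothetical.
CVD_ATTRS = ("Hodn1", "Hodn2", "Hodn3", "Hodn11", "Hodn13", "Hodn14")
EXCLUDE_ATTRS = ("Hodn4", "Hodn15")  # diabetes, cancer


def derive_cvd(record):
    """Return True/False for the static CVD label, or None if the patient is removed."""
    if any(record.get(attr, False) for attr in CVD_ATTRS):
        return True
    if any(record.get(attr, False) for attr in EXCLUDE_ATTRS):
        return None  # diabetes or cancer only: patient is dropped
    return False


if __name__ == "__main__":
    print(derive_cvd({"Hodn2": True}))   # a CVD patient
    print(derive_cvd({"Hodn4": True}))   # removed from the data
    print(derive_cvd({}))                # no coronary disease
```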

Page 4: ENTRY - subgroup discovery

AQ no.6: Are there any differences in the ENTRY examination for different CVD groups?
Statistica 6.0
– module for interactive decision tree induction
– two-tailed t-test or chi-square test to assess significance of subgroups
Dependencies are relatively weak
Interesting dependencies found
– social characteristics: derived attribute AGE_of_ENTRY
– alcohol: “positive effect” of beer, no effect of wine
– sugar consumption increases CVD risk
– well-known dependencies are not mentioned (smoking, BMI, cholesterol)
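As a rough illustration of how a subgroup could be checked with a chi-square test, the sketch below tests a 2x2 table (subgroup membership vs. CVD). It is not the Statistica procedure used by the authors; the function name and any counts passed to it are hypothetical.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square test of independence for the table [[a, b], [c, d]].

    Rows: inside / outside the subgroup; columns: CVD / no CVD.
    Returns (statistic, p_value) for one degree of freedom, without
    continuity correction.
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column needs at least one observation")
    statistic = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # With one degree of freedom the upper tail equals erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value


if __name__ == "__main__":
    # Made-up counts, for illustration only.
    print(chi_square_2x2(20, 10, 10, 20))
```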

Page 5: ENTRY - general model

General CVD model (in WEKA)
– feature selection + modeling (e.g., decision trees)
– tends to generate trivial models (always predicting false)
– asymmetric error-cost matrix does not help
Predict CVD risk
– identify principal variables (chi-squared test)
– Naïve Bayes + ROC evaluation
– three independent variables
• discretized AGE_of_ENTRY
• discretized BMI
• Cholrisk, derived from CHLST
– AUC = 0.66
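For readers unfamiliar with the AUC figure quoted above, here is a minimal sketch of how the area under the ROC curve can be computed from predicted scores and true labels, using the pairwise (Mann-Whitney) formulation. The function name and the toy inputs are assumptions; the slide's value of 0.66 came from the authors' WEKA experiments, not from this code.

```python
def auc(scores, labels):
    """Area under the ROC curve: the share of (positive, negative) pairs
    in which the positive case gets the higher score; ties count as 0.5."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Toy scores and labels, for illustration only.
    print(auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0]))
```

An AUC of 0.5 corresponds to random ranking, and a value below 0.5 means the score ranks positives lower than negatives.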

Page 6: CONTROL - trend analysis

AQ no.7: Are there any differences in development of risk factors for different CVD groups?
– increasing BMI makes a contribution to CVD appearance
ENTRY table: ICO (primary key), Year of birth, Year of entry, Smoking, Alcohol, Cholesterol, Body Mass Index, Blood pressure
CONTR table: ICO, risk factors followed during 20 years

Page 7: Motivation

focus on development – trend gradients
possibilities
– contemporary statistical methods used in medicine
• KM, Cox models – analyze something other than what we want
• ANOVA etc. – features have to be developed anyway, lack of data
– complex sequential data mining
• introduction of structural patterns and then, e.g., association rules
• interesting, but again needs more data
our approach
– introduction of simple aggregates
– application of windowing
– statistical evaluation for simple dependencies
– ordinal association rules for more complex relations

Page 8: Survival curves

Kaplan-Meier or Cox method
– typical example of temporal analysis in medicine
– regards survival period, BUT disregards development of risk factors (RFs)
– typical scenario
• distinguish groups of patients (ENTRY table)
• follow their “survival” periods (DEATH or CONTROL table)
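To make the contrast concrete, the following is a minimal sketch of the Kaplan-Meier product-limit estimator for one group of patients. It only uses follow-up durations and an event flag, which illustrates why the development of risk factors plays no role in it. The function name and input layout are assumptions.

```python
def kaplan_meier(durations, events):
    """Kaplan-Meier estimate for one group.

    durations: follow-up time per patient
    events:    1 if the event (e.g. CVD or death) was observed, 0 if censored
    Returns [(time, survival)] at each time where at least one event occurred.
    """
    data = sorted(zip(durations, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        n_events = 0
        n_leaving = 0
        # Group all patients whose follow-up ends at time t.
        while i < len(data) and data[i][0] == t:
            n_events += data[i][1]
            n_leaving += 1
            i += 1
        if n_events:
            survival *= 1.0 - n_events / n_at_risk
            curve.append((t, survival))
        n_at_risk -= n_leaving
    return curve


if __name__ == "__main__":
    # Toy follow-up data, for illustration only.
    print(kaplan_meier([1, 2, 3, 4], [1, 0, 1, 1]))
```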

Page 9

Page 10: Derived trend attributes

Aggregates derived for each observed series: Intercept, Gradient, Correlation coefficient, Standard deviation, Mean
[Figure: line fitted to one series; x = decimal time (~ year + month/12), y = observed variable; referential time: 1975]
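The five aggregates named on this slide can be obtained from an ordinary least-squares line through one patient's series. The sketch below is an assumption about the details: it measures time relative to the referential year 1975 (so the intercept is the fitted value at 1975) and uses the sample standard deviation; the slide does not state either choice. Function and key names are hypothetical.

```python
import math

REFERENCE_YEAR = 1975.0  # referential time shown on the slide


def decimal_time(year, month):
    """Decimal time as sketched on the slide: year + month / 12."""
    return year + month / 12.0


def trend_aggregates(times, values, reference=REFERENCE_YEAR):
    """Least-squares trend aggregates for one series (at least two distinct times)."""
    n = len(times)
    xs = [t - reference for t in times]
    mean_x = sum(xs) / n
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in values)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, values))
    if n < 2 or sxx == 0:
        raise ValueError("need at least two examinations at distinct times")
    gradient = sxy / sxx
    return {
        "intercept": mean_y - gradient * mean_x,   # fitted value at the reference year
        "gradient": gradient,                      # change per year
        "correlation": sxy / math.sqrt(sxx * syy) if syy > 0 else 0.0,
        "std": math.sqrt(syy / (n - 1)),           # sample standard deviation (assumed)
        "mean": mean_y,
    }


if __name__ == "__main__":
    # Toy systolic pressure series, for illustration only.
    print(trend_aggregates([1975.0, 1976.0, 1977.0], [130.0, 132.0, 134.0]))
```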

Page 11: Global Approach

Risk factors to be observed are selected
– SYST, DIAST, TRIGL, BMI, CHLSTMG
Selected control examinations are transformed
– pivoting
Patients with no control entries are removed
– about 60 patients
Trend aggregates are calculated

Resulting table, one row per patient:
ICO     Entry   Contr1   Contr2   ...   ContrM   Aggr1   ...   AggrN
ICO_1
ICO_2
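The pivoting step of the global approach can be sketched as follows: control examinations stored one row per visit are gathered into one series per patient, with the entry examination first, and patients without any control entry are dropped. The row layout (dicts with `ICO`, `time` and a risk-factor key) is a hypothetical stand-in for the real tables.

```python
from collections import defaultdict


def pivot_controls(entry_rows, control_rows, factor):
    """Return {ICO: [(time, value), ...]} with the entry exam first.

    Patients with no control examinations are removed, as on the slide.
    """
    by_patient = defaultdict(list)
    for row in control_rows:
        by_patient[row["ICO"]].append((row["time"], row[factor]))
    wide = {}
    for row in entry_rows:
        controls = sorted(by_patient.get(row["ICO"], []))
        if not controls:
            continue  # no control entries: patient removed
        wide[row["ICO"]] = [(row["time"], row[factor])] + controls
    return wide


if __name__ == "__main__":
    # Toy rows, for illustration only.
    entry = [{"ICO": 1, "time": 1976.0, "BMI": 25.0},
             {"ICO": 2, "time": 1976.5, "BMI": 27.0}]
    control = [{"ICO": 1, "time": 1978.0, "BMI": 26.0},
               {"ICO": 1, "time": 1977.0, "BMI": 25.5}]
    print(pivot_controls(entry, control, "BMI"))
```

Each resulting series could then be passed to whatever routine computes the trend aggregates.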

Page 12: Windowing Approach

Constant number of examinations for individuals
Issues:
– window length
• time period vs. number of checkups
• how many checkups to select? 5, 8, 10 tested
– single distinct window or sliding window?
• entry is used as the first examination
• more records per patient, so records are not independent
– temporal CVD definition
• CVDi - time from the last examination to CVD
• yes/no (yes = CVD in the next year or CVD in future)
– missing values treatment
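A sliding window over a fixed number of checkups, together with a temporal CVD label, might look like the sketch below. The labelling rule (CVD onset within `horizon` years after the window's last examination, with `horizon=1.0` standing for "CVD in the next year") is one reading of the slide, and all names are hypothetical.

```python
def sliding_windows(exams, length):
    """All runs of `length` consecutive examinations.

    exams: list of (time, value) pairs sorted by time, entry exam first.
    """
    return [exams[i:i + length] for i in range(len(exams) - length + 1)]


def label_window(window, cvd_time, horizon=1.0):
    """True if CVD onset falls within `horizon` years after the last exam
    of the window; cvd_time is None for patients without CVD."""
    if cvd_time is None:
        return False
    gap = cvd_time - window[-1][0]  # time from the last examination to CVD
    return 0.0 <= gap <= horizon


if __name__ == "__main__":
    # Toy series of six yearly checkups, for illustration only.
    exams = [(1976 + i, 130 + i) for i in range(6)]
    for window in sliding_windows(exams, 5):
        print(window[0][0], window[-1][0], label_window(window, 1981.5))
```

Because one patient contributes several overlapping windows, the resulting records are not independent, which is the caveat raised on the slide.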

Page 13: Windowing – missing values

approach 1: shift the series
approach 2: introduce a new value
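The slide gives only the names of the two approaches, so the sketch below is one possible reading rather than the authors' procedure: "shift the series" is taken to mean dropping the missing checkup so that later values move forward, and "introduce a new value" is taken to mean filling the gap, here with the mean of the nearest known neighbours (an assumption).

```python
def shift_series(values):
    """Approach 1 (as read here): drop missing checkups; later values shift forward."""
    return [v for v in values if v is not None]


def fill_missing(values):
    """Approach 2 (as read here): replace each missing checkup with a new value,
    the mean of the nearest known neighbours on either side (assumed rule)."""
    filled = list(values)
    for i, v in enumerate(values):
        if v is not None:
            continue
        before = next((values[j] for j in range(i - 1, -1, -1)
                       if values[j] is not None), None)
        after = next((values[j] for j in range(i + 1, len(values))
                      if values[j] is not None), None)
        known = [x for x in (before, after) if x is not None]
        filled[i] = sum(known) / len(known) if known else None
    return filled


if __name__ == "__main__":
    # Toy series with one missing checkup, for illustration only.
    series = [130, None, 134]
    print(shift_series(series))
    print(fill_missing(series))
```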

Page 14: Window length selection

Page 15: Window length effects

3 different lengths tested, 5 risk factors considered
compared with the global approach
test used:
– null hypothesis: independence of trends and CVD
– p-values are shown
windowing: CVD1 vs. nonCVD group
global: CVD vs. nonCVD group
global approach is completely misleading
some risk factors prefer shorter windows: down-up effect
others prefer longer windows: only long-term changes may have an effect
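The slide does not name the test behind its p-values. As a stand-in, the sketch below uses a two-sided permutation test on the difference in mean gradient between the CVD1 and nonCVD groups; the null hypothesis is the same (the trend is independent of CVD), but the technique is swapped in for illustration and all names are hypothetical.

```python
import random


def permutation_p_value(group_a, group_b, n_permutations=2000, seed=0):
    """Two-sided permutation test for a difference in means.

    Under the null hypothesis group membership is exchangeable, so the
    observed difference is compared with differences after random relabelling.
    """
    rng = random.Random(seed)

    def mean(xs):
        return sum(xs) / len(xs)

    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed - 1e-12:  # small tolerance for float rounding
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_permutations + 1)


if __name__ == "__main__":
    # Made-up gradients for two groups, for illustration only.
    print(permutation_p_value([1, 2, 3, 4, 5], [11, 12, 13, 14, 15]))
```

Running the same comparison for each window length and each risk factor would give a grid of p-values comparable in layout to the one the slide refers to.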

Page 16: ControlCount vs. CVD

ControlCount
– number of examinations
– strong relation with CVD
– AUC = 0.35
– lower ControlCount goes with higher CVD risk (AUC below 0.5)
– anachronistic attribute
– introduced by the design of the study
ControlCount has influence on the trend aggregates: with fewer examinations, gradients tend to be steeper, etc.
Conclusion: global approach cannot be applied (at least with the selected aggregates)

Page 17: Influence of SYSTGrad (W5)

122 individual CVD1 observations in total
SYSTGrad (W5) equi-depth binned in 5 groups
representation of the CVD1 group significantly increases with increasing SYSTGrad group number
[Figure: bar chart of CVD rate (0.000 to 0.040) per SYSTGrad group 1 to 5 (equi-depth binning), with the average rate marked; bar labels: 18, 17, 28, 25, 34, summing to the 122 observations]
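Equi-depth binning and the per-group CVD rate shown in the chart can be sketched as below. Ties are broken by input order, which is an arbitrary choice, and the function names are hypothetical.

```python
def equi_depth_bins(values, n_bins=5):
    """Assign each value a group number 1..n_bins so that the groups hold
    (nearly) the same number of observations, ordered by value."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    groups = [0] * len(values)
    for rank, i in enumerate(order):
        groups[i] = rank * n_bins // len(values) + 1
    return groups


def cvd_rate_per_group(groups, labels, n_bins=5):
    """Share of positive labels (e.g. CVD1) within each group."""
    totals = [0] * n_bins
    positives = [0] * n_bins
    for g, y in zip(groups, labels):
        totals[g - 1] += 1
        positives[g - 1] += 1 if y else 0
    return [p / t if t else 0.0 for p, t in zip(positives, totals)]


if __name__ == "__main__":
    # Toy gradients and labels, for illustration only.
    gradients = list(range(10))
    labels = [0, 0, 0, 1, 0, 0, 1, 1, 1, 1]
    groups = equi_depth_bins(gradients)
    print(groups)
    print(cvd_rate_per_group(groups, labels))
```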

Page 18: Averaged blood pressure

striking difference in CVD1 and nonCVD groups
– linear vs. down-up development
– can also be observed for the individuals – see the next slide
– cannot be distinguished by longer windows
[Figure, two panels: average systolic blood pressure [mm Hg] (130 to 142; series SystCVD, SystHealthy) and average diastolic blood pressure [mm Hg] (81 to 88; series DiastCVD, DiastHealthy), each plotted against time to last examination [years], from 9 down to 0]
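Curves of this kind can be produced by aligning every patient's series on the last examination and averaging per offset. The sketch below assumes roughly yearly checkups (offsets are rounded to whole years) and a hypothetical input layout; it would be run once for the CVD1 group and once for the nonCVD group.

```python
from collections import defaultdict


def average_by_time_to_last(series_by_patient, max_years=9):
    """Average a risk factor per 'time to last examination'.

    series_by_patient: {patient_id: [(time, value), ...]}
    Returns {offset_in_years: mean value}, ordered from the largest offset
    down to 0 (the last examination).
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for exams in series_by_patient.values():
        if not exams:
            continue
        last_time = max(t for t, _ in exams)
        for t, value in exams:
            offset = round(last_time - t)  # whole years before the last exam
            if 0 <= offset <= max_years:
                sums[offset] += value
                counts[offset] += 1
    return {k: sums[k] / counts[k] for k in sorted(counts, reverse=True)}


if __name__ == "__main__":
    # Toy systolic series for two patients, for illustration only.
    group = {"a": [(1980, 130), (1981, 134)],
             "b": [(1985, 140), (1986, 138)]}
    print(average_by_time_to_last(group))
```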

Page 19

Page 20

Page 21: Averaged body mass index

difference in CVD1 and nonCVD groups
– steady BMI in the nonCVD group
– increasing BMI in the CVD1 group
– longer windows express this trend better
– this graph shows that W10 may benefit from the increase between examinations 9 and 8
[Figure: average body mass index (25.5 to 28; series BMICVD, BMIHealthy) plotted against time to last examination [years], from 9 down to 0]

Page 22: Trend factors – hypothesis testing

Influence of trend aggregates on CVD
– 9 gradients considered: SYST, DIAST, CHLSTMG, TRIGLMG, BMI, HDL, LDL, POCCIG and MOC
Identified relations
– decreasing HDL cholesterol level relates to the increasing risk of CVD (p=0.001)
– decreasing POCCIG (the average number of cigarettes smoked per day) relates to the increasing risk of CVD (p=0.0001)
Again: correlation vs. causality
– statement 1 makes sense: HDL is a 'good' cholesterol
– statement 2 suggests spurious dependency
[Diagram: patient state (cause) leads to both smoking habits (effect 1) and CVD onset (effect 2)]

Page 23: Overview of AR found

Group a – relations among trend factors
– a great prevalence of rules joining together either blood pressures (DIASTGrad and SYSTGrad) or cholesterol attributes (HDLGrad, LDLGrad and CHLSTGrad)
Group b – hypotheses to be verified by experts
– insufficient target groups: 6% of transactions makes 26 individuals, i.e., instead of 10 prospective diseased patients we actually observe 19
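The numbers quoted for group b can be turned into the usual association-rule measures. The short calculation below reads "10 prospective diseased patients" as the count expected if the rule body were independent of CVD; that reading, and the variable names, are assumptions. Since the 6% figure is rounded, the implied total is approximate.

```python
# Figures quoted on the slide.
support = 0.06        # share of transactions matched by the rule body
covered = 26          # individuals matched by the rule body
expected_cvd = 10     # diseased patients expected among them (assumed: under independence)
observed_cvd = 19     # diseased patients actually observed

total_transactions = covered / support   # about 433
confidence = observed_cvd / covered      # about 0.73
lift = observed_cvd / expected_cvd       # 1.9

if __name__ == "__main__":
    print(round(total_transactions), round(confidence, 2), lift)
```

A lift of 1.9 looks notable, but with only 26 covered individuals the estimate rests on a small target group, which is the slide's point.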

Page 24: Conclusions

The main scope
– AQ no.7: Are there any differences in development of risk factors for different CVD groups?
Contributions
– Pitfalls of the global approach revealed
– Windowing enabling multivariate temporal analysis proposed, effects of various window lengths studied
– Development of the following risk factors may influence future CVD occurrence:
• DIAST, SYST, BMI, (HDL) cholesterol, (POCCIG)
– Other trends may have or intensify their influence under specific conditions (BMI trend and overweight, etc.) – we lack data to prove it