

Understanding Medical Data

Włodzisław Duch
Department of Informatics, Nicholas Copernicus University, Toruń, Poland
www.phys.uni.torun.pl/~duch

Plan

1. What's the problem?
2. What we would like to have
3. How to get that
4. Some methods to understand the data
5. Some results
6. Examples of applications
7. Expert system for psychometry
8. Conclusions

Department of Computer Methods

Computational Intelligence Methods:
• neural networks
• decision trees
• similarity-based methods
• visualization

Cognitive and Brain Science:
• theory of mind
• modeling human attention

Applications:
• data from psychometrics, medicine, astronomy, physics, chemistry ...

with help from ...

What's the problem?

44,000 to 98,000 patients die from medical errors every year in US hospitals. More people die from medical errors in hospitalization than from motor vehicle accidents, breast cancer, or AIDS.

Institute of Medicine (Dec. 1999)

Decision Support Systems (DSS)

Expert Systems: based on knowledge spoon-fed to the DSS - suitable if the domain knowledge is well understood.

Intuition, experience, implicit knowledge: discover knowledge hidden in the data and use it in the DSS.

What we would like to have

Start from good & bad examples evaluated by experts.

Understand the data:
• provide rules;
• provide prototype cases and similarity measures;
• provide visualization.

Rules should be:
• simple and/or accurate;
• reliable and/or general;
• robust and stable;
• include alternatives, eliminate the improbable.

Logical explanations

Logical rules, if simple enough, are usually preferred.

• Rules may expose limitations of black box methods: statistical, neural or computational intelligence (CI).
• Only relevant features are used in rules.
• Rules are sometimes more accurate than NN and other CI methods.
• Overfitting is easy to control; rules usually have a small number of parameters.
• Rules forever!? IF the number of rules is relatively small AND the accuracy is sufficiently high, THEN rules may be an optimal choice.

Data example

Cleveland Clinic Foundation heart-disease data.

Numerical, symbolic, logical, missing features.

44, male, atyp_angina, 120, 263, f, normal, 173, no, 0, up, 0, reversible_defect, '<50'

52, male, non_anginal, 172, 199, t, normal, 162, no, 0.5, up, 0, reversible_defect, '<50'

48, male, atyp_angina, 110, 229, f, normal, 168, no, 1, down, 0, reversible_defect, '>50_1'

54, male, asympt, 140, 239, f, normal, 160, no, 1.2, up, 0, normal, '<50'

48, female, non_anginal, 130, 275, f, normal, 139, no, 0.2, up, 0, normal, '<50'

Logical rules

Crisp logic rules: for continuous x use linguistic variables (predicate functions):

s_k(x) ≡ True[X_k ≤ x ≤ X'_k], for example:
low(x) = True{x | x < 70}
normal(x) = True{x | x ∈ [70, 120]}
high(x) = True{x | x > 120}

Linguistic variables are used in crisp (propositional, Boolean) logic rules:

IF small-height(X) AND has-hat(X) AND has-beard(X) THEN (X is a Brownie) ELSE IF ... ELSE ...
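To make the notation concrete, a minimal Python sketch of linguistic variables as predicate functions feeding a crisp rule. The 70/120 thresholds come from the example above; the Brownie height cutoff is a made-up illustration:

    # Linguistic variables as predicate functions over a continuous input x.
    def low(x):     return x < 70
    def normal(x):  return 70 <= x <= 120
    def high(x):    return x > 120

    def brownie_rule(height_cm, has_hat, has_beard):
        """Crisp propositional rule: small-height AND has-hat AND has-beard."""
        small_height = height_cm < 70   # assumed cutoff, for illustration only
        return small_height and has_hat and has_beard

    print(normal(95))                     # True
    print(brownie_rule(50, True, True))   # True -> X is a Brownie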

Crisp/fuzzy logic decisions

Crisp logic is based on rectangular membership functions: True/False values jump from 0 to 1, and step functions partition the input space.

Fuzzy logic: the binary no/yes value is replaced by a degree of membership x ∈ [0, 1]; triangular, trapezoidal, Gaussian or other membership functions are used.
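A short sketch contrasting the two kinds of membership functions (all parameter values are illustrative):

    import math

    def rect_membership(x, a, b):
        # Crisp: a step function, the value jumps from 0 to 1 inside [a, b]
        return 1.0 if a <= x <= b else 0.0

    def triangular_membership(x, a, m, b):
        # Fuzzy: degree rises linearly to 1 at the peak m, then falls to 0 at b
        if x <= a or x >= b:
            return 0.0
        return (x - a) / (m - a) if x <= m else (b - x) / (b - m)

    def gaussian_membership(x, mu, s):
        return math.exp(-((x - mu) ** 2) / (2 * s * s))

    print(rect_membership(110, 70, 120))             # 1.0 (crisp 'normal')
    print(triangular_membership(110, 70, 95, 120))   # 0.4 (a degree, not yes/no)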

Providing rules

CI approaches: neural networks, decision trees, similarity-based methods, inductive machine learning methods ...

Neural networks: many types, a large field. Most popular: the threshold logic (perceptron) unit, realizing an M-of-N rule: IF (M of N conditions are true) THEN ... Multi-layer perceptron (MLP) networks stack many such units.

Problem: for N inputs the number of subsets is 2^N - an exponentially growing number of possible conjunctive rules.
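A small sketch of the M-of-N idea: a perceptron with unit weights and threshold M computes exactly such a rule (data here is illustrative):

    # IF at least M of the N binary conditions are true THEN fire.
    def m_of_n(conditions, m):
        return sum(bool(c) for c in conditions) >= m

    # Equivalent threshold-logic form: step(w.x - theta) with w_i = 1, theta = M.
    def threshold_unit(x, w, theta):
        return sum(wi * xi for wi, xi in zip(w, x)) >= theta

    conds = [True, False, True, True]
    print(m_of_n(conds, 2))                          # True: 3 of 4 conditions hold
    print(threshold_unit([1, 0, 1, 1], [1, 1, 1, 1], 2))  # same decision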

MLP2LN

Converts MLP neural networks into a network performing logical operations (LN). Architecture (figure): input layer → linguistic units (windows, filters) → rule units (threshold logic) → output layer (one node per class); aggregation provides better features.

MLP2LN training

Constructive algorithm: add as many nodes as needed. Optimize the cost function: minimize errors + enforce zero connections + leave only +1 and −1 weights; this makes interpretation easy.

E(W) = \frac{1}{2}\sum_p \sum_i \big(F_i(X^p; W) - t_i^p\big)^2 + \frac{\lambda_1}{2}\sum_{i,j} W_{ij}^2 + \frac{\lambda_2}{2}\sum_{i,j} W_{ij}^2 (W_{ij} - 1)^2 (W_{ij} + 1)^2

The first term is the standard error measure, the second enforces zero connections, and the third pushes the remaining weights towards ±1.
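Assuming the reconstruction of the cost function above is right, the two penalty terms and their gradient might be coded as follows (the λ values are placeholders, not values from the talk):

    import numpy as np

    def mlp2ln_penalty(W, lam1=1e-4, lam2=1e-2):
        """Regularizers from the cost above: lam1 pushes weights to zero,
        lam2 pushes the surviving weights towards -1 or +1 (lambdas assumed)."""
        decay = 0.5 * lam1 * np.sum(W ** 2)
        ternary = 0.5 * lam2 * np.sum(W ** 2 * (W - 1) ** 2 * (W + 1) ** 2)
        return decay + ternary

    def mlp2ln_penalty_grad(W, lam1=1e-4, lam2=1e-2):
        # d/dW of the two penalty terms, to be added to the error gradient
        return lam1 * W + lam2 * W * (W ** 2 - 1) * (3 * W ** 2 - 1)

    W = np.array([[0.02, -0.98], [1.05, 0.40]])
    print(mlp2ln_penalty(W))   # small: weights are already close to {0, -1, +1}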

L-units

Create linguistic variables. An L-unit combines two sigmoidal functions:

L(x) = S_1 \sigma(W_1 x + b) + S_2 \sigma(W_2 x + b')

Numerical representation for R-nodes: V_{s_k} = (...) for s_k = low, V_{s_k} = (...) for s_k = normal.

L-units: 2 thresholds as adaptive parameters; logistic σ(x) or tanh(x). Soft trapezoidal functions change into rectangular filters (Parzen windows). There are 4 types, depending on the signs S_i.
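A sketch of one of the four L-unit types (taking S_1 = +1, S_2 = −1), showing how the soft trapezoid hardens into a rectangular filter as the slope grows; all numbers are illustrative:

    import math

    def sigma(x):
        return 1.0 / (1.0 + math.exp(-x))

    def l_unit(x, a, b, slope=1.0):
        """Soft trapezoidal window from the difference of two sigmoids;
        as `slope` grows it hardens into a rectangular filter on [a, b]."""
        return sigma(slope * (x - a)) - sigma(slope * (x - b))

    # Evaluate just outside the window [70, 120]: the soft membership
    # shrinks towards the crisp value 0 as the slope increases.
    for slope in (0.1, 1.0, 10.0):
        print(slope, round(l_unit(68.0, 70.0, 120.0, slope), 4))
    # 0.1 -> 0.4447,  1.0 -> 0.1192,  10.0 -> 0.0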

Iris example

Network after training (weight vectors, one per class):

iris setosa: (0,0,0; 0,0,0; +1,0,0; +1,0,0)
iris versicolor: (0,0,0; 0,0,0; 0,+1,0; 0,+1,0)
iris virginica: (0,0,0; 0,0,0; 0,0,+1; 0,0,+1)

Rules:

IF (x3 = s ∧ x4 = s) THEN setosa
IF (x3 = m ∧ x4 = m) THEN versicolor
IF (x3 = l ∧ x4 = l) THEN virginica

Makes 3 errors (98% accuracy).

Learning dynamics

Decision regions shown every 200 training epochs in x3, x4 coordinates; borders are optimally placed with wide margins.

Other rule extraction methods

Feature Space Mapping (FSM) neurofuzzy system. Neural adaptation, estimation of the probability density distribution using a single-hidden-layer network with nodes realizing separable functions:

G(X; P) = \prod_i G_i(X_i; P_i)

Separability Split Value (SSV) decision tree. Based on maximization of the number of correctly separated pairs of vectors. Uni- or multivariate tests, easy to convert to rules.

Similarity-Based Learner (SBL). A framework for many similarity-based methods. Searches the space of all models for the one best suited to given data. Gives prototype-based rules, more general than fuzzy rules.
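A rough sketch of the SSV idea as described here, counting cross-class pairs separated by a univariate test (toy data; the actual SSV criterion also handles ties and multivariate tests):

    def ssv_separated_pairs(values, labels, t):
        """For the test x < t, count pairs of vectors from different
        classes that end up on opposite sides of the split."""
        left = [y for v, y in zip(values, labels) if v < t]
        right = [y for v, y in zip(values, labels) if v >= t]
        return sum(1 for yl in left for yr in right if yl != yr)

    vals = [1.2, 1.5, 3.1, 3.4, 5.0]
    labs = [0, 0, 1, 1, 1]
    best_t = max([1.4, 3.0, 4.0], key=lambda t: ssv_separated_pairs(vals, labs, t))
    print(best_t, ssv_separated_pairs(vals, labs, best_t))
    # 3.0 separates all 2*3 = 6 cross-class pairs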

Applying rules

How to get probabilities out of rules? Assume Gaussian uncertainties of the inputs:

x → G_x = G(y; x, s_x)

i.e. x is treated as a Gaussian fuzzy number.

A set R of crisp logical rules (or any other system) applied to fuzzy inputs X gives probabilities p(Ci|X) via Monte Carlo sampling.
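The Monte Carlo route can be sketched in a few lines (the rule and the numbers are illustrative):

    import random

    def p_rule_given_fuzzy_input(rule, x, s_x, n_samples=100_000):
        """Monte Carlo estimate of p(rule true | X): sample the Gaussian
        fuzzy number G_x = N(x, s_x) and push samples through a crisp rule."""
        hits = sum(rule(random.gauss(x, s_x)) for _ in range(n_samples))
        return hits / n_samples

    # Crisp rule R_a(x) = {x > a} with a = 120, measured value x = 118 +/- 5:
    print(p_rule_given_fuzzy_input(lambda v: v > 120, 118.0, 5.0))  # ~0.34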

For a crisp rule R_a(x) = {x > a} and fuzzy input value G_x, analytical evaluation of the probability is based on the cumulant:

p(R_a(G_x)) = \int_a^\infty G(y; x, s_x)\, dy = \frac{1}{2}\left[1 - \operatorname{erf}\left(\frac{a - x}{s_x \sqrt{2}}\right)\right] \approx \sigma\left(\frac{2.4\,(x - a)}{s_x \sqrt{2}}\right)

where σ(y) = 1/(1 + e^{−y}); the accuracy of this logistic approximation is < 0.02.
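A quick numeric check of the reconstructed cumulant against its logistic approximation (assuming the reconstruction above is right):

    import math

    def p_rule_erf(a, x, s_x):
        # Exact Gaussian cumulant: 1/2 [1 - erf((a - x) / (s_x sqrt(2)))]
        return 0.5 * (1.0 - math.erf((a - x) / (s_x * math.sqrt(2.0))))

    def p_rule_logistic(a, x, s_x):
        # Logistic approximation sigma(2.4 (x - a) / (s_x sqrt(2)))
        return 1.0 / (1.0 + math.exp(-2.4 * (x - a) / (s_x * math.sqrt(2.0))))

    # Maximum deviation between the two stays below ~0.01, within the 0.02 bound:
    worst = max(abs(p_rule_erf(0.0, x, 1.0) - p_rule_logistic(0.0, x, 1.0))
                for x in (i * 0.01 for i in range(-500, 501)))
    print(round(worst, 4))   # ~0.0095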

Analytical solution

A rule R_{ab}(x) = {x ∈ [a, b]} applied to a G_x input has probability:

p(R_{ab}(G_x)) = \int_a^b G(y; x, s_x)\, dy \approx \sigma(x - a) - \sigma(x - b)

This is exact for σ(x)(1 − σ(x)) error distributions instead of Gaussian ones; in MLP neural networks this is what L-units compute.

Rules with two or more features in their conditions: add the probabilities and subtract the overlap:

p(C|x_1, x_2; R) = p(R_1(x_1)) + p(R_2(x_2)) - p(R_1(x_1))\, p(R_2(x_2))

Fuzzy rules + crisp data ⇔ crisp rules + fuzzy input.

Large receptive fields: linguistic variables; small receptive fields: smooth edges.

Optimization

Confidence-rejection tradeoff: minimize

E(M; \gamma) = \gamma \sum_{i \neq j} F(C_i, C_j|M) - \operatorname{Tr} F(C_i, C_j|M)

where the confusion matrix F(C_i, C_j|M) = N_ij/N is the frequency of assigning class C_j to class C_i by the model M.

Sensitivity: S_+ = F_{++}/(F_{++} + F_{+-}) ∈ [0, 1]
Specificity: S_- = F_{--}/(F_{--} + F_{-+}) ∈ [0, 1]

S_- = 1: class - (sick) is never assigned to class + (healthy).
S_+ = 1: class + is never assigned to class -.

Perfect sensitivity/specificity: minimize the off-diagonal elements of F(C_i, C_j|M).

Maximize the number of correctly assigned cases (the diagonal).
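For concreteness, a minimal sketch computing F, S_+ and S_- for a two-class problem (toy labels, class 1 = "+", class 0 = "-"):

    def confusion(y_true, y_pred):
        # F[(i, j)] = fraction of cases of true class i assigned to class j
        n = len(y_true)
        F = {(i, j): 0.0 for i in (0, 1) for j in (0, 1)}
        for t, p in zip(y_true, y_pred):
            F[(t, p)] += 1.0 / n
        return F

    def sensitivity(F):   # S+ = F++ / (F++ + F+-)
        return F[(1, 1)] / (F[(1, 1)] + F[(1, 0)])

    def specificity(F):   # S- = F-- / (F-- + F-+)
        return F[(0, 0)] / (F[(0, 0)] + F[(0, 1)])

    y_true = [1, 1, 1, 0, 0, 0, 0, 1]
    y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
    F = confusion(y_true, y_pred)
    print(round(sensitivity(F), 2), round(specificity(F), 2))   # 0.75 0.75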

Advantages

Generation, uncertainty, optimization:

1. Network regularization parameters allow one to discover different sets of rules: simplest, most general, less accurate vs. more complex, specialized, more accurate.

2. Continuous probabilities instead of Yes/No answers; this stabilizes the model - no sudden jumps in predictions.

3. New data always has some probability, but it may be labeled as 'unknown', or elimination instead of classification may be used if classes are strongly mixed.

4. Multivariate ROC curves may be generated by setting the output thresholds.

5. Data uncertainties s_x may be used as adaptive parameters, with large-scale gradient optimization of the cost function:

E(\{s_x\}; R) = \frac{1}{2}\sum_X \sum_i \big(p(C_i|X; M) - \delta(C(X), C_i)\big)^2

Applications

In medicine, science, technology ...

Fun and benchmarks:
• Mushrooms. Stylometry - who wrote 'The Two Noble Kinsmen'?

Medical:
• Recurrence of breast cancer (Ljubljana). Diagnosis of breast cancer (Wisconsin).
• Thyroid screening (J. Cook University, Australia). Melanoma cancer (Rzeszów, Poland). Hepatobiliary disorders (Tokyo).

Chemical:
• Antibiotic activity of pyrimidine compounds. Carcinogenicity of organic chemicals.

Psychometry: MMPI evaluations.

Mushrooms

The Mushroom Guide: no simple rule for mushrooms; no rule like 'leaflets three, let it be' as for Poisonous Oak and Ivy.

8124 cases, 51.8% edible, the rest non-edible. 22 symbolic attributes, up to 12 values each, equivalent to 118 logical features, or 2^118 ≈ 3·10^35 possible input vectors.

Odor: almond, anise, creosote, fishy, foul, musty, none, pungent, spicy. Spore print color: black, brown, buff, chocolate, green, orange, purple, white, yellow.

Safe rule for edible mushrooms:

odor = (almond ∨ anise ∨ none) ∧ spore-print-color = ¬green

48 errors, 99.41% correct.

This is why animals have such a good sense of smell! What does it tell us about odor receptors?

Mushroom rules

To eat or not to eat, that is the question! Not any more ...

A mushroom is poisonous if:
R1) odor = ¬(almond ∨ anise ∨ none); 120 errors, 98.52%
R2) spore-print-color = green; 48 errors, 99.41%
R3) odor = none ∧ stalk-surface-below-ring = scaly ∧ stalk-color-above-ring = ¬brown; 8 errors, 99.90%
R4) habitat = leaves ∧ cap-color = white; no errors!

R1 + R2 are quite stable, found even with 10% of the data; R3 and R4 may be replaced by other rules, e.g.:

R'3) gill-size = narrow ∧ stalk-surface-above-ring = (silky ∨ scaly)
R'4) gill-size = narrow ∧ population = clustered

Only 5 of 22 attributes used! Simplest possible rules? 100% in CV tests - the structure of this data is completely clear.
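A sketch applying rules R1-R4 to a single record; the record itself is made up and the dictionary keys follow UCI-style attribute names:

    def is_poisonous(m):
        if m["odor"] not in ("almond", "anise", "none"):            # R1
            return True
        if m["spore-print-color"] == "green":                       # R2
            return True
        if (m["odor"] == "none"                                     # R3
                and m["stalk-surface-below-ring"] == "scaly"
                and m["stalk-color-above-ring"] != "brown"):
            return True
        if m["habitat"] == "leaves" and m["cap-color"] == "white":  # R4
            return True
        return False

    sample = {"odor": "foul", "spore-print-color": "brown",
              "stalk-surface-below-ring": "smooth",
              "stalk-color-above-ring": "brown",
              "habitat": "grasses", "cap-color": "white"}
    print(is_poisonous(sample))   # True (fires R1: foul odor)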

Recurrence of breast cancer

Institute of Oncology, University Medical Center, Ljubljana.

286 cases: 201 no-recurrence (70.3%), 85 recurrence cases (29.7%).

9 symbolic features: age (9 bins), tumor-size (12 bins), nodes involved (13 bins), degree-malignant (1, 2, 3), area, radiation, menopause, node-caps. Example record:

no-recurrence-events, 40-49, premeno, 25-29, 0-2, ?, 2, left, right_low, yes

Many systems have been tried, with 65-78% accuracy reported. A single rule:

IF nodes-involved ∉ [0, 2] ∧ degree-malignant = 3 THEN recurrence, ELSE no-recurrence

gives 77% accuracy; there is only trivial knowledge in this data: highly malignant cancer involving many nodes is likely to strike back.
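The rule fits in two lines of Python. Note that reading the garbled condition as nodes-involved > 2 is an assumption, based on the "many nodes" interpretation given above:

    def predicts_recurrence(nodes_involved, degree_malignant):
        # Assumed reading: more than 2 nodes AND highest malignancy degree
        return nodes_involved > 2 and degree_malignant == 3

    print(predicts_recurrence(5, 3))   # True  -> recurrence
    print(predicts_recurrence(1, 3))   # False -> no-recurrence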

Breast cancer diagnosis

Data obtained from the University of Wisconsin Hospitals, Madison, collected by Dr. W.H. Wolberg.

699 cases, 9 features quantized from 1 to 10: clump thickness, uniformity of cell size, uniformity of cell shape, marginal adhesion, single epithelial cell size, bare nuclei, bland chromatin, normal nucleoli, mitoses

Distinguish benign from malignant cases.

Simplest rules from MLP2LN, with large regularization:

IF f2 ≥ 7 ∨ f7 ≥ 6 THEN malignant ELSE benign
(f2 - uniformity of cell size; f7 - bland chromatin)

Overall accuracy (including the ELSE condition) is 94.9%.

Breast cancer rules

Using lower regularization, hierarchical sets of rules with increasing accuracy are created. Optimized set of rules:

R1: IF f2 < 6 ∧ f4 < 3 ∧ f8 < 8 THEN malignant (99.8%)
R2: f2 < 9 ∧ f5 < 4 ∧ f7 < 2 ∧ f8 < 5 (100%)
R3: f2 < 10 ∧ f4 < 4 ∧ f5 < 4 ∧ f7 < 3 (100%)
R4: f2 < 7 ∧ f4 < 9 ∧ f5 < 3 ∧ f7 ∈ [4, 9] ∧ f8 < 4 (100%)
R5: f2 ∈ [3, 4] ∧ f4 < 9 ∧ f5 < 10 ∧ f7 < 6 ∧ f8 < 8 (99.8%)
ELSE benign

6 errors, overall reclassification accuracy 99.0%. R1 and R5 misclassify the same single benign vector.

A 100% reliable set of rules rejects 51 cases (7.3%); this reject mechanism is sketched in code below.

Other solutions: 3 rules from SSV decision tree
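A sketch of how such a hierarchical rule set with a reject option might be applied (only R1-R3 shown for brevity; reliabilities copied from the list above, feature indices f2..f8 on the 1-10 scale):

    RULES = [  # (condition, reliability in %)
        (lambda f: f[2] < 6 and f[4] < 3 and f[8] < 8, 99.8),                # R1
        (lambda f: f[2] < 9 and f[5] < 4 and f[7] < 2 and f[8] < 5, 100.0),  # R2
        (lambda f: f[2] < 10 and f[4] < 4 and f[5] < 4 and f[7] < 3, 100.0), # R3
    ]

    def classify(f, min_reliability=100.0):
        fired = [r for cond, r in RULES if cond(f)]
        if any(r >= min_reliability for r in fired):
            return "malignant"
        if fired:          # only rules below the reliability threshold fired
            return "rejected"
        return "benign"    # ELSE condition

    case = {2: 4, 4: 2, 5: 1, 7: 1, 8: 3}
    print(classify(case))  # 'malignant' (R2 and R3, both 100% reliable, fire)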

Breast cancer - comparison

Results from 10-fold (stratified) crossvalidation:

Method                      10×CV accuracy %
Rule-based:
  SSV, 3 crisp rules        96.3±0.2
  FSM, 14 Gaussians         96.4±0.2
Other classifiers:
  IncNet                    97.1
  k-NN, k=3, Manhattan      97.0±0.1
  Fisher LDA                96.8
  MLP + backprop            96.7
  LVQ                       96.6
  Bayes (pairwise dep.)     96.6
  Naive Bayes               96.4
  LDA                       96.0
  LFC, ASI, ASR dec. trees  94.4-95.6
  CART (dec. tree)          93.5

Melanoma skin cancer

Collected in the Outpatient Center of Dermatology in Rzeszów, Poland.

Four types of melanoma: benign, blue, suspicious, or malignant.

250 cases, with an almost equal class distribution.

Each record in the database has 13 attributes: asymmetry, border, color (6), diversity (5). TDS (Total Dermatoscopy Score) combines them into a single index.

Only 26 new test cases.

Goal: understand the data, find a simple description, build a hardware scanner for preliminary diagnosis.

Melanoma results

Method                     Rules  Training %  Test %
MLP2LN, crisp rules          4    98.0 (all)  100
SSV Tree, crisp rules        4    97.5±0.3    100
FSM, rectangular f.          7    95.5±1.0    100
knn + prototype selection   13    97.5±0.0    100
FSM, Gaussian f.            15    93.7±1.0    95±3.6
knn, k=1, Manh., 2 feat.   250    97.4±0.3    100
LERS, rough rules           21    --          96.2


Antibiotic activity of pyrimidine compounds

27 features taken into account: polarity, size, hydrogen-bond donor or acceptor, pi-donor or acceptor, polarizability, sigma effect.

Pairs of chemicals (54 features) are compared.

Two classes: either the first compound has higher activity, or vice versa.

2788 cases, 5-fold crossvalidation tests.

Pyrimidines: which compound has the stronger antibiotic activity?

Common template, substitutions added at 3 positions: R3, R4 and R5.

Mean Spearman's rank correlation coefficient used: −1 ≤ r_s ≤ +1.

Method             Rank correlation
FSM, 41 rules      0.77±0.03
Golem (ILP)        0.68
Linear regression  0.65
CART               0.50
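A small self-contained Spearman r_s implementation of the kind used for this comparison (no rank ties assumed; data is illustrative):

    def rank(xs):
        # rank 1 for the smallest value, n for the largest
        order = sorted(range(len(xs)), key=lambda i: xs[i])
        r = [0] * len(xs)
        for pos, i in enumerate(order):
            r[i] = pos + 1
        return r

    def spearman(x, y):
        # r_s = 1 - 6 sum(d_i^2) / (n (n^2 - 1)), d_i = rank difference
        n = len(x)
        rx, ry = rank(x), rank(y)
        d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
        return 1.0 - 6.0 * d2 / (n * (n * n - 1))

    print(spearman([1.0, 2.5, 3.0, 4.2], [1, 2, 4, 3]))   # 0.8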

Psychometrics

MMPI (Minnesota Multiphasic Personality Inventory) test (v. 1). Forms were scanned, or a computerized version of the test is used.

1. Raw data: 550 questions, e.g.: 'I am getting tired quickly' - Yes / Don't know / No.

2. Results are combined into 10 clinical scales and 4 validity scales using fixed coefficients.

3. Each scale measures tendencies towards hypochondria, schizophrenia, hypomania, psychopathic deviations, symptoms of depression, hysteria, paranoia, social introversion etc., but there is no simple correlation between single scale values and the final diagnosis.

4. Results are displayed in the form of a histogram called 'a psychogram'. Interpretation depends on the experience and skill of an expert and takes into account correlations between peaks.

Goal: an expert system providing evaluation and interpretation of MMPI tests at an expert level. Experts agree only about 70% of the time; alternative diagnoses and personality changes over time are important.

Psychometric data

1600 cases for women, a similar number for men. 27 classes: norm, psychopathy, schizophrenia, paranoia, neurosis, mania, simulation, alcoholism, drug addiction, criminal tendencies, abnormal behavior due to ...

Extraction of rules: 14 scales; define linguistic variables and use FSM, MLP2LN, SSV - giving about 2-3 rules per class.

Method  Data  N. rules  Accuracy %  +Gx %
C4.5     ♀      55        93.0       93.7
C4.5     ♂      61        92.5       93.1
FSM      ♀      69        95.4       97.6
FSM      ♂      98        95.9       96.9

10-fold CV gives 82-85% with FSM and 79-84% with C4.5. Assuming input uncertainty of around 1.5% improves FSM results to 90-92%.

Results

Probabilities for different classes. For greater uncertainties more classes are predicted.

Fitting the rules to the conditions: typically 3-5 conditions per rule; the Gaussian distributions around the measured values that fall into the rule intervals are shown in green.

Verbal interpretation of each case, rule and scale dependent.

Visualization

Probability of classes versus input uncertainty.

Detailed class probabilities around the measured values vs. changes in a single scale; changes over time define the patient's trajectory.

Interactive multidimensional scaling: zooming on the new case to inspect its similarity to other cases.

Bioinformatics example

Evaluation of similarity of E. coli gene sequences.

Promoters: red; non-promoters: green. Left: standard similarity measure; right: the new measure.

Summary

Understanding data: extracting rules, prototypes, visualizing. Computational intelligence methods (neural networks, decision trees, similarity-based and other) help to understand the data.

We are slowly getting there. All this and more is included in Ghostminer, data mining software developed in collaboration with Fujitsu, soon to be finished :)

Small is beautiful, simple is best! Simplest possible, but not simpler - regularization of models; accurate, but not too accurate - handling of uncertainty; high confidence, but not paranoid - rejecting some cases.

Challenges: hierarchical systems, discovery of theories rather than models, integration with image/signal analysis, reasoning in complex domains, applications in bioinformatics ...

References

IEEE Transactions on Neural Networks 12 (2001) 277-306 (March issue).
Paper archive: www.phys.uni.torun.pl/~duch
