Understanding of data using Computational Intelligence methods

Włodzisław Duch

Dept. of Informatics, Nicholas Copernicus University, Toruń, Poland

http://www.phys.uni.torun.pl/~duch

IEA/AIE Cairns, 17-20.06.2002

What am I going to say

• Data and CI
• What we hope for.
• Forms of understanding.
• Visualization.
• Prototypes.
• Logical rules.
• Some knowledge discovered.
• Expert system for psychometry.
• Conclusions, or why am I saying this?

Types of Data

• Data was precious! Now it is overwhelming ...
• Statistical data – clean, numerical, controlled experiments, vector space model.
• Relational data – marketing, finances.
• Textual data – Web, NLP, search.
• Complex structures – chemistry, economics.
• Sequence data – bioinformatics.
• Multimedia data – images, video.
• Signals – dynamic data, biosignals.
• AI data – logical problems, games, behavior …

Computational Intelligence

[Diagram: Computational Intelligence (Data => Knowledge) and its component fields: Artificial Intelligence, expert systems, fuzzy logic, pattern recognition, machine learning, probabilistic methods, multivariate statistics, visualization, evolutionary algorithms, neural networks, soft computing.]

CI & AI definition

• Computational Intelligence is concerned with solving effectively non-algorithmic problems.
  This corresponds to all cognitive processes, including low-level ones (perception).

• Artificial Intelligence is a part of CI concerned with solving effectively non-algorithmic problems requiring systematic reasoning and symbolic knowledge representation.
  Roughly this corresponds to high-level cognitive processes.

Turning data into knowledge

What should CI methods do?

• Provide descriptive and predictive non-parametric models of data.
• Allow us to classify, approximate, associate, correlate, complete patterns.
• Allow us to discover new categories and interesting patterns.
• Help to visualize multi-dimensional relationships among data samples.
• Allow us to understand the data in some way.
• Facilitate creation of expert systems (ES) and reasoning.

Forms of useful knowledge

AI/Machine Learning camp:

Neural nets are black boxes.

Unacceptable! Symbolic rules forever.

But ... knowledge accessible to humans is in:

• symbols,
• similarity to prototypes,
• images, visual representations.

What type of explanation is satisfactory?
Interesting question for cognitive scientists.

Different answers in different fields.

Data understanding

Types of explanation:

• visualization-based: maps, diagrams, relations ...
• exemplar-based: prototypes and similarity;
• logic-based: symbols and rules.

• Humans remember examples of each category and refer to such examples – as similarity-based or nearest-neighbors methods do.
• Humans create prototypes out of many examples – as Gaussian classifiers, RBF networks, neurofuzzy systems do.
• Logical rules are the highest form of summarization of knowledge.

Visualization: dendrograms

All projections (cuboids) on 2D subspaces are identical, dendrograms do not show the structure.

Normal and malignant lymphocytes.

Visualization: 2D projections

All projections (cuboids) on 2D subspaces are identical, dendrograms do not show the structure.

3-bit parity + all 5-bit combinations.

Visualization: MDS mapping

Results of pure MDS mapping + centers of hierarchical clusters connected.

3-bit parity + all 5-bit combinations.

Visualization: 3D projections

Only age is continuous, other values are binary.

Fine Needle Aspirate of Breast Lesions, red=malignant, green=benign.
A.J. Walker, S.S. Cross, R.F. Harrison, Lancet 1999, 354, 1518-1521

Visualization: MDS mappings

Try to preserve all distances in a 2D nonlinear mapping.

MDS for large sets using LVQ + relative mapping.

Prototype-based rules

C-rules (crisp) are a special case of F-rules (fuzzy rules).
F-rules (fuzzy rules) are a special case of P-rules (prototype rules).
P-rules have the form:

IF P = arg min_R D(X,R) THEN Class(X) = Class(P)

D(X,R) is a dissimilarity (distance) function, determining decision borders around prototype P.

P-rules are easy to interpret!

IF X=You are most similar to the P=Superman THEN You are in the Super-league.

IF X=You are most similar to the P=Weakling THEN You are in the Failed-league.

“Similar” may involve different features or D(X,P).
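The P-rule form can be illustrated with a minimal sketch. The prototype coordinates, the two features and the function names below are invented for illustration; only the rule "assign the class of the nearest prototype" comes from the slide.

```python
import math

def euclidean(x, r):
    """Euclidean dissimilarity D(X, R) between two feature vectors."""
    return math.sqrt(sum((xi - ri) ** 2 for xi, ri in zip(x, r)))

def p_rule(x, prototypes, distance=euclidean):
    """Apply: IF P = arg min_R D(X, R) THEN Class(X) = Class(P)."""
    _, label = min(prototypes, key=lambda p: distance(x, p[0]))
    return label

# Hypothetical prototypes described by two made-up features (strength, speed).
PROTOTYPES = [
    ((9.0, 9.0), "Super-league"),   # P = Superman
    ((1.0, 2.0), "Failed-league"),  # P = Weakling
]

print(p_rule((8.0, 7.5), PROTOTYPES))
print(p_rule((2.0, 1.0), PROTOTYPES))
```

Passing a different `distance` callable changes the shape of the decision borders without changing the form of the rule.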

P-rules

Euclidean distance leads to Gaussian fuzzy membership functions + product as T-norm:

$$D(X,P)^2 = \sum_i d(X_i,P_i)^2 = \sum_i W_i (X_i - P_i)^2$$

$$\mu_P(X) = e^{-D(X,P)^2} = e^{-\sum_i d(X_i,P_i)^2} = \prod_i e^{-W_i (X_i - P_i)^2} = \prod_i \mu_i(X_i, P_i)$$

Manhattan function => $\mu(X;P) = \exp\{-|X-P|\}$

Various distance functions lead to different MF.

Ex. data-dependent distance functions, for symbolic data:

$$D_{VDM}(X,Y) = \sum_j \sum_i \left| p(C_j|X_i) - p(C_j|Y_i) \right|$$

$$D_{PDF}(X,Y) = \sum_j \sum_i \left| p(X_i|C_j) - p(Y_i|C_j) \right|$$
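A data-dependent distance of the VDM type can be sketched as follows. This is only an illustration under stated assumptions: the probabilities p(C_j|X_i) are estimated by simple counting, the toy strings and labels are invented, and symbols never seen in the training data are not handled.

```python
from collections import Counter, defaultdict

def class_conditionals(samples, labels):
    """Estimate p(C_j | X_i = v) for every feature i and symbol v by counting."""
    counts = defaultdict(Counter)   # (feature index, symbol) -> class counts
    for x, c in zip(samples, labels):
        for i, v in enumerate(x):
            counts[(i, v)][c] += 1
    classes = sorted(set(labels))
    probs = {}
    for key, cnt in counts.items():
        total = sum(cnt.values())
        probs[key] = {c: cnt[c] / total for c in classes}
    return probs, classes

def vdm(x, y, probs, classes):
    """D_VDM(X, Y) = sum_j sum_i |p(C_j|X_i) - p(C_j|Y_i)|."""
    return sum(
        abs(probs[(i, xi)][c] - probs[(i, yi)][c])
        for i, (xi, yi) in enumerate(zip(x, y))
        for c in classes
    )

# Toy symbolic data, invented for illustration only.
SAMPLES = ["aa", "ac", "ga", "gc"]
LABELS = ["+", "+", "-", "-"]
PROBS, CLASSES = class_conditionals(SAMPLES, LABELS)

print(vdm("aa", "gc", PROBS, CLASSES))  # first position separates the classes
print(vdm("aa", "ac", PROBS, CLASSES))  # second position carries no class information
```

In this toy set the first symbol fully determines the class while the second is uninformative, so two strings differing only in the second symbol are at distance zero.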

Crisp P-rules

New distance functions from info theory => interesting MF.

Membership functions => new distance function, with local D(X,R) for each cluster.

Crisp logic rules: use the $L_\infty$ norm:

$D(X,P) = \|X-P\|_\infty = \max_i W_i |X_i - P_i|$

D(X,P) = const => rectangular contours.

$L_\infty$ (Chebyshev) distance with threshold $\theta_P$:

IF $D(X,P) \le \theta_P$ THEN C(X)=C(P)

is equivalent to a conjunctive crisp rule

IF $X_1 \in [P_1 - \theta_P/W_1,\ P_1 + \theta_P/W_1] \wedge \dots \wedge X_N \in [P_N - \theta_P/W_N,\ P_N + \theta_P/W_N]$ THEN C(X)=C(P)
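The equivalence between a thresholded Chebyshev distance and a conjunction of interval conditions can be checked numerically. The sketch below assumes positive weights; the prototype, weights and threshold are made-up values.

```python
def chebyshev(x, p, w):
    """Weighted L-infinity distance: max_i W_i * |X_i - P_i|."""
    return max(wi * abs(xi - pi) for xi, pi, wi in zip(x, p, w))

def as_intervals(p, w, theta):
    """Intervals [P_i - theta/W_i, P_i + theta/W_i] of the equivalent crisp rule."""
    return [(pi - theta / wi, pi + theta / wi) for pi, wi in zip(p, w)]

def crisp_rule(x, intervals):
    """Conjunction: every X_i must fall inside its interval."""
    return all(lo <= xi <= hi for xi, (lo, hi) in zip(x, intervals))

# Hypothetical prototype, weights and threshold.
P, W, THETA = (2.0, 5.0), (1.0, 0.5), 1.0
INTERVALS = as_intervals(P, W, THETA)

for x in [(2.5, 6.0), (3.5, 5.0), (1.0, 7.0), (2.0, 7.5)]:
    by_distance = chebyshev(x, P, W) <= THETA
    by_rule = crisp_rule(x, INTERVALS)
    print(x, by_distance, by_rule)
```

Both tests agree on every point because max_i W_i |X_i - P_i| <= theta holds exactly when each |X_i - P_i| <= theta / W_i.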

Decision borders

Euclidean distance from 3 prototypes, one per class.

Minkowski distance (exponent 20) from 3 prototypes.

Contours D(P,X)=const and decision borders D(P,X)=D(Q,X).

P-rules for Wine

Manhattan distance: 6 prototypes kept, 4 errors, f2 removed.

L∞ distance (crisp rules): 15 prototypes kept, 5 errors, f2, f8, f10 removed.

Euclidean distance: 11 prototypes kept, 7 errors.

Many other solutions. Prototypes: SV & clusters.

Complex objects

Vector space concept is not sufficient for complex objects. A common set of features is meaningless.

AI: complex objects, states, subproblems.

General approach: sufficient to evaluate similarity D(O_i,O_j).

Compare O_i, O_j: define transformation

$$\hat{T}\, O_i = \prod_k \Omega_k\, O_i = O_j$$

Elementary operators $\Omega_k$, e.g. substring substitutions.

Many transformations T connecting a pair of objects O_i and O_j exist.

Cost of transformation = sum of $\Omega_k$ costs.

Similarity: lowest transformation costs.

Bioinformatics: sophisticated similarity functions for sequences.
Dynamic programming finds similarities in reasonable time.
Use adaptive costs and general framework for SBM methods.
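For strings, the "lowest transformation cost" idea is commonly illustrated by the classic edit distance. The sketch below swaps in that textbook technique; the operator set (substitute, insert, delete one symbol) and the fixed costs are assumptions made for the example, not the adaptive costs mentioned on the slide.

```python
def transformation_cost(a, b, sub_cost=1.0, indel_cost=1.0):
    """Lowest total cost of elementary operators turning string a into b.

    Operators assumed here: substitute, insert or delete one symbol.
    Filled row by row with dynamic programming (classic edit distance).
    """
    prev = [j * indel_cost for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        cur = [i * indel_cost]
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + indel_cost,                           # delete ca
                cur[j - 1] + indel_cost,                        # insert cb
                prev[j - 1] + (0.0 if ca == cb else sub_cost),  # keep or substitute
            ))
        prev = cur
    return prev[-1]

print(transformation_cost("tactag", "tagtag"))
print(transformation_cost("kitten", "sitting"))
```

Making `sub_cost` and `indel_cost` depend on the symbols involved would be one way to move toward adaptive costs; that extension is not shown here.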

Promoters

DNA strings, 57 nucleotides, 53 + and 53 - samples.

tactagcaatacgcttgcgttcggtggttaagtatgtataatgcgcgggcttgtcgt

Euclidean distance, symbolic s = a, c, t, g replaced by x = 1, 2, 3, 4

PDF distance, symbolic s = a, c, t, g replaced by p(s|+)

Connection of CI with AI

AI/CI division is harmful for science!

GOFAI: operators, state transformations and search techniques are basic tools in AI solving problems requiring systematic reasoning.

CI methods may provide useful heuristics for AI and define metric relations between states, problems or complex objects.

Example: combinatorial productivity in AI systems and FSM.

Later: decision tree for complex structures.

Electric circuit example

Answering questions in complex domains requires reasoning.
Qualitative behavior of electric circuit:

7 variables, but Ohm’s law V=I·R, or Kirchhoff’s law Vt=V1+V2.

Train a NeuroFuzzy system on Ohm’s and Kirchhoff’s laws.
Without solving equations, answer questions of the type:

If R2 grows, R1 & Vt are constant, what will happen with the current I and voltages V1, V2?

(taken from the PDP book, McClelland, Rumelhart, Hinton)

Electric circuit search

AI: create search tree, CI: provide guiding intuition.

Any law of the form A=B*C or A=B+C, ex: V=I*R, has 13 true facts, 14 false facts and may be learned by NF system.

Geometrical representation: + increasing, - decreasing, 0 constant.

Find combinations of Vt, Rt, I, V1, V2, R1, R2 for which all 5 constraints are fulfilled:

$$F(V_t, R_t, I, V_1, V_2, R_1, R_2) = \prod_{i=1}^{5} F_i(A_i, B_i, C_i)$$

All constraints are fulfilled for 111 cases out of 3^7 = 2187.

Search and check if X can be +, 0, -; laws are not satisfied if F(Vt=0, Rt, I, V1, V2, R1=0, R2=+) = 0.

Heuristic search

If R2 grows, R1 & Vt are constant, what will happen with the current I and voltages V1, V2?

We know that: R2=+, R1=0, Vt=0, V1=?, V2=?, Rt=?, I=?

Take V1=+ and check if: F(Vt=0, Rt=?, I=?, V1=+, V2=?, R1=0, R2=+) > 0

Since F() > 0 for all of V1=+, 0 and -, take a variable that leads to a unique answer, Rt.

Single search path solves the problems.

Useful also in approximate reasoning where only some conditions are fulfilled.
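The qualitative reasoning can be sketched by brute-force enumeration instead of a trained neurofuzzy system. The set of five constraints used below (Vt = V1 + V2, Rt = R1 + R2, V1 = I·R1, V2 = I·R2, Vt = I·Rt) is an assumption, since the slides name only two of them; with this choice the enumeration reproduces the counts quoted in the slides (13 true facts per law, 111 consistent cases out of 3^7).

```python
from itertools import product

SIGNS = ("+", "0", "-")

def law_holds(a, b, c):
    """Qualitative law A = B + C (or A = B * C for positive quantities).

    Arguments are directions of change: '+', '0' or '-'.
    """
    if b == c:
        return a == b          # both change the same way
    if b == "0":
        return a == c          # only C changes
    if c == "0":
        return a == b          # only B changes
    return True                # opposite changes: any outcome is possible

TRUE_FACTS = sum(law_holds(*t) for t in product(SIGNS, repeat=3))

def consistent(vt, rt, i, v1, v2, r1, r2):
    """F > 0: all five constraints (assumed set, see above) hold."""
    return (law_holds(vt, v1, v2) and law_holds(rt, r1, r2) and
            law_holds(v1, i, r1) and law_holds(v2, i, r2) and
            law_holds(vt, i, rt))

STATES = [s for s in product(SIGNS, repeat=7) if consistent(*s)]

# Query from the slide: R2 grows, R1 and Vt are constant.
ANSWERS = [s for s in STATES if s[0] == "0" and s[5] == "0" and s[6] == "+"]

print(TRUE_FACTS, len(STATES))
print(ANSWERS)   # tuples are ordered (Vt, Rt, I, V1, V2, R1, R2)
```

Under these assumptions the query has a single consistent state: Rt grows, I decreases, V1 decreases and V2 grows.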

Logical rules

Crisp logic rules: for continuous x use linguistic variables (predicate functions).

$s_k(x) \equiv \text{True}\ [X_k \le x \le X'_k]$, for example:
small(x) = True{x | x < 1}
medium(x) = True{x | x ∈ [1,2]}
large(x) = True{x | x > 2}

Linguistic variables are used in crisp (propositional, Boolean) logic rules:

IF small-height(X) AND has-hat(X) AND has-beard(X) THEN (X is a Brownie) ELSE IF ... ELSE ...

Crisp logic decisions

Crisp logic is based on rectangular membership functions:

True/False values jump from 0 to 1.

Step functions are used for partitioning of the feature space.

Very simple hyper-rectangular decision borders.

Severe limitation on the expressive power of crisp logical rules!

DT decision borders

Decision trees lead to specific decision borders.

SSV tree on Wine data, proline + flavanoids content.

Decision tree forests: many decision trees of similar accuracy, but different selectivity and specificity.

Logical rules - advantages

Logical rules, if simple enough, are preferable.

• Rules may expose limitations of black box solutions.
• Only relevant features are used in rules.
• Rules may sometimes be more accurate than NN and other CI methods.
• Overfitting is easy to control, rules usually have a small number of parameters.
• Rules forever !?

A logical rule about logical rules is:

IF the number of rules is relatively small AND the accuracy is sufficiently high
THEN rules may be an optimal choice.

Logical rules - limitations

Logical rules are preferred but ...

• Only one class is predicted, p(C_i|X,M) = 0 or 1; a black-and-white picture may be inappropriate in many applications.
• Discontinuous cost functions allow only non-gradient optimization.
• Sets of rules are unstable: a small change in the dataset leads to a large change in the structure of complex sets of rules.
• Reliable crisp rules may reject some cases as unclassified.
• Interpretation of crisp rules may be misleading.
• Fuzzy rules are not so comprehensible.

Rules - choices

Simplicity vs. accuracy.

Confidence vs. rejection rate.

Joint probabilities p(true, predicted); rows are true classes, columns are predicted classes, r = rejected:

$$\begin{pmatrix} p_{++} & p_{+-} & p_{+r} \\ p_{-+} & p_{--} & p_{-r} \end{pmatrix}, \qquad p_{+} = p_{++} + p_{+-} + p_{+r}, \quad p_{-} = p_{-+} + p_{--} + p_{-r}$$

Accuracy (overall): $A(M) = p_{++} + p_{--}$
Error rate: $L(M) = p_{+-} + p_{-+}$
Rejection rate: $R(M) = p_{+r} + p_{-r} = 1 - L(M) - A(M)$
Sensitivity: $S_{+}(M) = p_{+|+} = p_{++}/p_{+}$
Specificity: $S_{-}(M) = p_{-|-} = p_{--}/p_{-}$

$p_{++}$ is a hit; $p_{-+}$ a false alarm; $p_{+-}$ is a miss.
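The quantities above can be computed directly from a table of joint proportions. The following is a minimal sketch; the nested-dictionary layout and the numbers in `TABLE` are invented for illustration.

```python
def rule_metrics(p):
    """Summaries of a joint table p[true][predicted]; 'r' marks rejected cases."""
    p_pos = sum(p["+"].values())   # p_+ : fraction of truly positive cases
    p_neg = sum(p["-"].values())   # p_- : fraction of truly negative cases
    return {
        "accuracy": p["+"]["+"] + p["-"]["-"],
        "error": p["+"]["-"] + p["-"]["+"],
        "rejection": p["+"]["r"] + p["-"]["r"],
        "sensitivity": p["+"]["+"] / p_pos,
        "specificity": p["-"]["-"] / p_neg,
    }

# Invented proportions summing to 1, for illustration only.
TABLE = {
    "+": {"+": 0.25, "-": 0.125, "r": 0.125},
    "-": {"+": 0.0625, "-": 0.375, "r": 0.0625},
}
METRICS = rule_metrics(TABLE)
print(METRICS)
```

Accuracy, error rate and rejection rate sum to one, which is the relation R(M) = 1 - L(M) - A(M) stated above.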

Neural networks and rules

[Figure: neural network for myocardial infarction. Inputs: sex, age, smoking, ECG: ST elevation, pain intensity, pain duration. Input weights and output weights lead to the output 0.7 ~ p(MI|X).]

Knowledge from networks

Simplify networks: force most weights to 0, quantize remaining parameters, be constructive!

• Regularization: mathematical technique improving predictive abilities of the network.
• Result: MLP2LN neural networks that are equivalent to logical rules.

MLP2LN

Converts MLP neural networks into a network performing logical operations (LN).

[Figure labels:]
• Input layer
• Aggregation: better features
• Linguistic units: windows, filters
• Rule units: threshold logic
• Output: one node per class.

Learning dynamics

Decision regions shown every 200 training epochs in x3, x4 coordinates; borders are optimally placed with wide margins.

Neurofuzzy systems

Feature Space Mapping (FSM) neurofuzzy system.
Neural adaptation, estimation of probability density distribution (PDF) using single hidden layer network (RBF-like) with nodes realizing separable functions:

$$G(X;P) = \prod_{i=1}^{N} G_i(X_i; P_i)$$

Fuzzy: x (no/yes) replaced by a degree μ(x). Triangular, trapezoidal, Gaussian ... MF.

MFs in many dimensions: [figure not preserved]
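Separability can be illustrated with Gaussian factors: a product of one-dimensional Gaussians equals a single Gaussian of a weighted Euclidean distance. The factor form exp(-W_i (X_i - P_i)^2) and all numbers below are assumptions chosen for the sketch, not FSM's actual transfer functions.

```python
import math

def gaussian_1d(x, p, w):
    """One-dimensional factor G_i(X_i; P_i) = exp(-W_i * (X_i - P_i)^2)."""
    return math.exp(-w * (x - p) ** 2)

def separable(x, p, w):
    """G(X; P) as a product of one-dimensional factors."""
    result = 1.0
    for xi, pi, wi in zip(x, p, w):
        result *= gaussian_1d(xi, pi, wi)
    return result

def from_distance(x, p, w):
    """The same value written as exp(-D(X,P)^2) with a weighted Euclidean D."""
    d2 = sum(wi * (xi - pi) ** 2 for xi, pi, wi in zip(x, p, w))
    return math.exp(-d2)

# Made-up input, prototype and weights.
X, P, W = (1.0, 2.5, 0.0), (1.5, 2.0, 0.5), (1.0, 2.0, 0.5)
print(separable(X, P, W), from_distance(X, P, W))
```

Because each factor depends on one feature only, every factor can be read as a one-dimensional membership function, which is what makes such nodes interpretable as fuzzy rules.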

Heterogeneous systems

Homogeneous systems: one type of “building blocks”, same type of decision borders.

Ex: neural networks, SVMs, decision trees, kNNs ….

Committees combine many models together, but lead to complex models that are difficult to understand.

Discovering the simplest class structures, with their inductive bias, requires heterogeneous adaptive systems (HAS).

Ockham's razor: simpler systems are better.

HAS examples:
• NN with many types of neuron transfer functions.
• k-NN with different distance functions.
• DT with different types of test criteria.

GhostMiner Philosophy

GhostMiner, data mining tools from our lab.
http://www.fqspl.com.pl/ghostminer/

• There is no free lunch – provide different types of tools for knowledge discovery: decision tree, neural, neurofuzzy, similarity-based, committees.
• Provide tools for visualization of data.
• Support the process of knowledge discovery/model building and evaluating, organizing it into projects.
• Separate the process of model building and knowledge discovery from model use => GhostMiner Developer & GhostMiner Analyzer.

Recurrence of breast cancer

Data from: Institute of Oncology, University Medical Center, Ljubljana, Yugoslavia.

286 cases, 201 no recurrence (70.3%), 85 recurrence cases (29.7%)

no-recurrence-events, 40-49, premeno, 25-29, 0-2, ?, 2, left, right_low, yes

9 nominal features: age (9 bins), menopause, tumor-size (12 bins), nodes involved (13 bins), node-caps, degree-malignant (1,2,3), breast, breast quad, radiation.

Recurrence of breast cancer

Data from: Institute of Oncology, University Medical Center, Ljubljana, Yugoslavia.

Many systems used, 65-78% accuracy reported.

Single rule:
IF (nodes-involved ∉ [0,2] ∧ degree-malignant = 3) THEN recurrence, ELSE no-recurrence

76.2% accuracy, only trivial knowledge in the data:

“Highly malignant breast cancer involving many nodes is likely to strike back.”
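The single rule can be applied to records of the kind shown earlier. This sketch reads the rule as "more than two involved nodes and degree of malignancy 3"; the field names and their order are assumptions inferred from the sample record and the feature list, and the second record is invented.

```python
def lower_bound(bin_label):
    """Lower edge of a binned value such as '0-2' or '3-5'."""
    return int(bin_label.split("-")[0])

def predict_recurrence(record):
    """IF nodes-involved not in [0,2] AND degree-malignant = 3 THEN recurrence."""
    many_nodes = lower_bound(record["nodes_involved"]) > 2
    if many_nodes and record["degree_malignant"] == 3:
        return "recurrence"
    return "no-recurrence"

# Field order assumed from the sample record and the feature list on the slides.
FIELDS = ["class", "age", "menopause", "tumor_size", "nodes_involved",
          "node_caps", "degree_malignant", "breast", "breast_quad", "radiation"]

def parse(line):
    """Turn one comma-separated record into a dictionary."""
    record = dict(zip(FIELDS, (v.strip() for v in line.split(","))))
    record["degree_malignant"] = int(record["degree_malignant"])
    return record

SAMPLE = parse("no-recurrence-events, 40-49, premeno, 25-29, 0-2, ?, 2, left, right_low, yes")
print(predict_recurrence(SAMPLE))
```

The missing value "?" in the sample record does not matter here, since the rule uses only two of the nine features.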

Recurrence - comparison.

Method                    10xCV accuracy (%)

MLP2LN, 1 rule            76.2
SSV DT, stable rules      75.7 ± 1.0
k-NN, k=10, Canberra      74.1 ± 1.2
MLP+backprop.             73.5 ± 9.4 (Zarndt)
CART DT                   71.4 ± 5.0 (Zarndt)
FSM, Gaussian nodes       71.7 ± 6.8
Naive Bayes               69.3 ± 10.0 (Zarndt)
Other decision trees      < 70.0

Breast cancer diagnosis.

Data from University of Wisconsin Hospital, Madison, collected by Dr. W.H. Wolberg.

699 cases, 9 cell features quantized from 1 to 10:

clump thickness, uniformity of cell size, uniformity of cell shape, marginal adhesion, single epithelial cell size, bare nuclei, bland chromatin, normal nucleoli, mitoses.

Task: distinguish benign from malignant cases.

Breast cancer rules.

Data from University of Wisconsin Hospital, Madison, collected by Dr. W.H. Wolberg.

Simplest rule from MLP2LN, large regularization:

IF uniformity of cell size < 3 THEN benign ELSE malignant

Sensitivity=0.97, Specificity=0.85

More complex solutions (3 rules) give in 10CV: Sensitivity=0.95, Specificity=0.96, Accuracy=0.96

Breast cancer comparison.

Method                       10xCV accuracy (%)

k-NN, k=3, Manh              97.0 ± 2.1 (GM)
FSM, neurofuzzy              96.9 ± 1.4 (GM)
Fisher LDA                   96.8
MLP+backprop.                96.7 (Ster, Dobnikar)
LVQ                          96.6 (Ster, Dobnikar)
IncNet (neural)              96.4 ± 2.1 (GM)
Naive Bayes                  96.4
SSV DT, 3 crisp rules        96.0 ± 2.9 (GM)
LDA (linear discriminant)    96.0
Various decision trees       93.5-95.6

SSV HAS Wisconsin

Heterogeneous decision tree that searches not only for logical rules but also for prototype-based rules.

Single P-rule gives simplest known description of this data:

IF ||X-R303|| < 20.27 THEN malignant ELSE benign

18 errors, 97.4% accuracy. Good prototype for malignant!

Simple thresholds, that’s what MDs like the most!

Best L1O (leave-one-out) result: 98.3% (FSM),
best 10CV around 97.5% (Naïve Bayes + kernel, SVM).

C 4.5 gives 94.7±2.0%

SSV without distances: 96.4±2.1%

Several simple rules of similar accuracy in CV tests exist.

Melanoma skin cancer

Collected in the Outpatient Center of Dermatology in Rzeszów, Poland.

Four types of melanoma: benign, blue, suspicious, or malignant.

250 cases, with almost equal class distribution.

Each record in the database has 13 attributes: asymmetry, border, color (6), diversity (5).

TDS (Total Dermatoscopy Score) - single index

Goal: hardware scanner for preliminary diagnosis.

Melanoma results

Method                       Rules   Training %   Test %

MLP2LN, crisp rules          4       98.0 all     100
SSV Tree, crisp rules        4       97.5±0.3     100
FSM, rectangular f.          7       95.5±1.0     100
knn + prototype selection    13      97.5±0.0     100
FSM, Gaussian f.             15      93.7±1.0     95±3.6
knn k=1, Manh, 2 features    --      97.4±0.3     100
LERS, rough rules            21      --           96.2

Antibiotic activity of pyrimidine compounds.

Pyrimidines: which compound has stronger antibiotic activity?

Common template, substitutions added at 3 positions, R3, R4 and R5.

27 features taken into account: polarity, size, hydrogen-bond donor or acceptor, pi-donor or acceptor, polarizability, sigma effect.

Pairs of chemicals, 54 features, are compared: which one has higher activity?

2788 cases, 5-fold crossvalidation tests.

Antibiotic activity - results.

Pyrimidines: which compound has stronger antibiotic activity?

Mean Spearman's rank correlation coefficient used: r_s

Method                    Rank correlation

FSM, 41 Gaussian rules    0.77±0.03
Golem (ILP)               0.68
Linear regression         0.65
CART (decision tree)      0.50
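For readers unfamiliar with the measure, Spearman's rank correlation is the Pearson correlation of the ranks. The sketch below is a generic pure-Python version with average ranks for ties; the input lists are made up and are unrelated to the pyrimidine data.

```python
def ranks(values):
    """Ranks 1..n, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    result = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        average = (start + end) / 2 + 1
        for k in order[start:end + 1]:
            result[k] = average
        start = end + 1
    return result

def spearman(a, b):
    """Spearman's r_s: Pearson correlation computed on the ranks."""
    ra, rb = ranks(a), ranks(b)
    n = len(a)
    ma, mb = sum(ra) / n, sum(rb) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    var_a = sum((x - ma) ** 2 for x in ra)
    var_b = sum((y - mb) ** 2 for y in rb)
    return cov / (var_a * var_b) ** 0.5

# Hypothetical true and predicted activity orderings.
print(spearman([1, 2, 3, 4, 5], [2, 1, 4, 3, 5]))
```

A value of 1 means the predicted ordering matches the true one exactly, and -1 means it is fully reversed.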

Thyroid screening.

Garavan Institute, Sydney, Australia

15 binary, 6 continuous features

Training: 93+191+3488 Validate: 73+177+3178

Determine important clinical factors

Calculate prob. of each diagnosis.

[Figure: neural network with clinical findings as inputs (age, sex, …, TSH, T4U, T3, TT4, TBG), hidden units, and final diagnoses as outputs (normal, hyperthyroid, hypothyroid).]

Thyroid – some results.

Accuracy of diagnoses obtained with different systems.

Method Rules/Features Training % Test %

MLP2LN optimized 4/6 99.9 99.36

CART/SSV Decision Trees 3/5 99.8 99.33

Best Backprop MLP -/21 100 98.5

Naïve Bayes -/- 97.0 96.1

k-nearest neighbors -/- - 93.8

Psychometry

MMPI (Minnesota Multiphasic Personality Inventory) psychometric test.

Printed forms are scanned or a computerized version of the test is used.

• Raw data: 550 questions, ex: I am getting tired quickly: Yes - Don’t know - No

• Results are combined into 10 clinical scales and 4 validity scales using fixed coefficients.

• Each scale measures tendencies towards hypochondria, schizophrenia, psychopathic deviations, depression, hysteria, paranoia etc.

Psychometry

• There is no simple correlation between single values and final diagnosis.

• Results are displayed in form of a histogram, called ‘a psychogram’. Interpretation depends on the experience and skill of an expert, takes into account correlations between peaks.

Goal: an expert system providing evaluation and interpretation of MMPI tests at an expert level.

Problem: agreement between experts only 70% of the time; alternative diagnosis and personality changes over time are important.

Psychometric data

1600 cases for women, same number for men.

27 classes: norm, psychopathic, schizophrenia, paranoia, neurosis, mania, simulation, alcoholism, drug addiction, criminal tendencies, abnormal behavior due to ...

Extraction of logical rules: 14 scales = features.

Define linguistic variables and use FSM, MLP2LN, SSV - giving about 2-3 rules/class.

Psychometric data

10-CV for FSM is 82-85%, for C4.5 is 79-84%. Input uncertainty +G_x around 1.5% (best ROC) improves FSM results to 90-92%.

Method   Data   N. rules   Accuracy   +G_x %

C 4.5    ♀      55         93.0       93.7
C 4.5    ♂      61         92.5       93.1
FSM      ♀      69         95.4       97.6
FSM      ♂      98         95.9       96.9

Psychometric Expert

Probabilities for different classes. For greater uncertainties more classes are predicted.

Fitting the rules to the conditions: typically 3-5 conditions per rule, Gaussian distributions around measured values that fall into the rule interval are shown in green.

Verbal interpretation of each case, rule and scale dependent.
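One way to picture how Gaussian input uncertainty turns crisp conditions into probabilities is sketched below. It assumes each measured value carries an independent Gaussian uncertainty of width `sigma` and that a rule is a conjunction of interval conditions; the scale values and intervals are invented, and this is not claimed to be the formula used in the actual expert system.

```python
import math

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def condition_probability(x, low, high, sigma):
    """Probability that a value measured as x, with Gaussian uncertainty sigma,
    truly lies in the rule interval [low, high]."""
    return normal_cdf((high - x) / sigma) - normal_cdf((low - x) / sigma)

def rule_probability(values, intervals, sigma):
    """Product over conditions, assuming independent uncertainties per scale."""
    prob = 1.0
    for x, (low, high) in zip(values, intervals):
        prob *= condition_probability(x, low, high, sigma)
    return prob

# Hypothetical rule with two conditions on two scales.
INTERVALS = [(60.0, 80.0), (60.0, 1e9)]
print(rule_probability([70.0, 60.0], INTERVALS, sigma=2.0))
```

A value sitting exactly on an interval edge satisfies that condition with probability one half, so a case near the border of a rule is assigned to several classes with intermediate probabilities rather than to one class with certainty.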

Visualization

Probability of classes versus input uncertainty.

Detailed input probabilities around the measured values vs. change in the single scale; changes over time define the ‘patient's trajectory’.

Interactive multidimensional scaling: zooming on the new case to inspect its similarity to other cases.

Conclusions

Data understanding is a challenging problem.

• Classification rules are frequently only the first step and may not be the best solution.
• Visualization is always helpful.
• P-rules may be competitive if complex decision borders are required, providing different types of rules.
• Understanding of complex objects is possible, although difficult, using adaptive costs and distance as least expensive transformations (action principles in physics).
• Great applications are coming!

Challenges

• Discovery of theories rather than data models
• Integration with image/signal analysis
• Integration with reasoning in complex domains
• Combining expert systems with neural networks
• …

Fully automatic universal data analysis systems: press the button and wait for the truth …

We are slowly getting there.

More & more computational intelligence tools (including our own) are available.