Interestingness Measures

Upload: clayton-hawley

Post on 14-Dec-2015


Quality in KDD

Levels of Quality

Quality of discovered knowledge: f(D, M, U)
  Data quality (D): noise, accuracy, missing values, bad values, … [Berti-Equille 2004]
  Model quality (M): accuracy, generalization, relevance, …
  User-based quality (U): relevance for decision making

User-based Quality

2 categories:
  Objective (D, M): computed from data only
  Subjective (U):
    Hypothesis: goal, domain knowledge
    Hard to formalize (novelty)

Examples of Quality Criteria

7 criteria of interest (interestingness) [Hussein 2000]:

Objective:
  Generality (ex: support)
  Validity (ex: confidence)
  Reliability (ex: high generality and validity)

Subjective:
  Common sense: reliable and already known
  Actionability: utility for decision making
  Novelty: previously unknown
  Surprise (unexpectedness): contradiction?

Objective Measures

Quality and Association Rules

Association Rules

Association rules [Agrawal et al. 1993]:
  Market-basket analysis
  Unsupervised learning
  Algorithms + 2 measures (support and confidence)

Problems:
  Enormous number of rules (raw rules)
  Support and confidence carry little semantics
  Need to help the user select the best rules

AR Quality

Association Rules

Solutions:
  Redundancy reduction
  Structuring (classes, close rules)
  Improve quality measures
  Interactive decision aid (rule mining)

Association Rules

Input: data
  p Boolean attributes (V0, V1, …, Vp) (columns)
  n transactions (rows)

Output: association rules
  Implicative tendencies: X → Y
  X and Y are itemsets, ex: V0 ∧ V4 ∧ V8 → V1
  Negative examples

2 measures:
  Support: supp(X → Y) = freq(X ∪ Y)
  Confidence: conf(X → Y) = P(Y|X) = freq(X ∪ Y) / freq(X)
  Algorithm properties (monotony)
  Ex: Diapers → beer (supp = 20%, conf = 90%)
  (NB: max number of rules: 3^p)
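The two measures above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: transactions are modelled as sets of item names, and the toy data and the function names `support` and `confidence` are hypothetical, not taken from the slides.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(transactions, premise, conclusion):
    """conf(X -> Y) = freq(X u Y) / freq(X); undefined if X never occurs."""
    joint = support(transactions, set(premise) | set(conclusion))
    return joint / support(transactions, premise)


# Hypothetical toy data: 10 transactions, each a set of items.
transactions = [frozenset(t) for t in (
    {"diapers", "beer"}, {"diapers", "beer"}, {"beer"}, {"beer", "chips"},
    {"milk"}, {"beer"}, {"diapers"}, {"beer", "milk"}, {"chips"}, {"beer"},
)]

supp = support(transactions, {"diapers", "beer"})      # 2 of 10 transactions
conf = confidence(transactions, {"diapers"}, {"beer"})  # 2 of the 3 with diapers
```

Note that the support of the rule is computed on the union of premise and conclusion, which is why it is symmetric in X and Y while confidence is not.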

Limits of Support

Support: supp(X → Y) = freq(X ∪ Y)
  Generality of the rule
  Minimum support threshold (ex: 10%)
  Reduces the complexity
  Loses nuggets (support pruning)

Nugget:
  Specific rule (low support)
  Valid rule (high confidence)
  High potential of novelty/surprise

Limits of Confidence

Confidence: conf(X → Y) = P(Y|X) = freq(X ∪ Y) / freq(X)
  Validity/logical aspect of the rule (inclusion)
  Minimum confidence threshold (ex: 90%)
  Reduces the number of extracted rules
  Interestingness ≠ validity
  No detection of independence

Independence:
  X and Y are independent: P(Y|X) = P(Y)
  If P(Y) is high => nonsense rule with high confidence
  Ex: Diapers → beer (supp = 20%, conf = 90%) if supp(beer) = 90%

[Guillaume et al. 1998], [Lallich et al. 2004]
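The independence problem can be checked numerically with the figures of the example. The sketch below compares the confidence of the rule with the frequency of its conclusion; the ratio is the quantity called lift later in the deck. The variable names are illustrative assumptions.

```python
def lift(conf_xy, p_y):
    """Ratio P(Y|X) / P(Y); equals 1 when X and Y are independent."""
    return conf_xy / p_y


# Figures from the example: supp = 20%, conf = 90%, supp(beer) = 90%.
supp_rule = 0.20
conf_rule = 0.90
p_beer = 0.90

ratio = lift(conf_rule, p_beer)

# Frequency of the premise implied by supp and conf: P(X) = supp / conf.
p_premise = supp_rule / conf_rule
# Joint frequency expected under independence: P(X) * P(Y).
expected_joint = p_premise * p_beer
```

Here `ratio` is 1 and `expected_joint` matches the observed support of 20%: the rule passes both thresholds although the premise tells nothing about the conclusion.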

Limits of the Pair Support-Confidence

In practice:
  High support threshold (10%)
  High confidence threshold (90%)
  Valid and general rules
  Common sense but not novelty

Efficient measures, but insufficient to capture quality

Subjective Measures

Criteria

User-oriented measures (U)

Quality: interestingness
  Unexpectedness [Silberschatz 1996]: unknown or contradictory rule
  Actionability (usefulness) [Piatetsky-Shapiro 1994]: usefulness for decision making, gain
  Anticipation [Roddick 2001]: prediction on the temporal dimension

AR Quality : Subjective Measures

Criteria

Unexpectedness and actionability:
  Unexpected + useful = high interestingness
  Expected + non-useful = ?
  Expected + useful = reinforcement
  Unexpected + non-useful = ?

Principle

Algorithm principle:
  1. Extraction of the decision maker's knowledge
  2. Formalization of the knowledge (K) (expected and actionable)
  3. KDD (K’)
  4. Compare K and K’
  5. Select (using subjective measures) the rules Δ(K, K’) of K’ which:
       differ the most from K (unexpectedness)
       or are the most similar to K (actionability)

Rule Templates

User knowledge (K): syntactic constraints
  Patterns/forms of rules: A1, A2, …, Ak → Ak+1
  Ai: constraint on attribute Vi (interval of values)
  K = K1 + K2
    K1: interesting patterns (select)
    K2: uninteresting patterns (reject)

Goal: select the interesting rules inside K’

Boolean criterion:
  Rules X → Y of K’ satisfying the K1 patterns but not the K2 ones
  + constraints (thresholds) on support, confidence, rule size (|X ∪ Y|)

[Klemettinen et al. 1994]
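The Boolean criterion can be sketched as a filter over mined rules. This is a simplified illustration, not the formalism of [Klemettinen et al. 1994]: a template is reduced here to a set of attributes allowed in the premise plus a required conclusion, and the names `Rule`, `Template`, `matches` and `select` are hypothetical.

```python
from typing import FrozenSet, NamedTuple


class Rule(NamedTuple):
    premise: FrozenSet[str]
    conclusion: str
    support: float
    confidence: float


class Template(NamedTuple):
    allowed_premise: FrozenSet[str]  # attributes permitted on the left-hand side
    conclusion: str                  # required right-hand side


def matches(rule, template):
    """A rule matches when its conclusion and all premise items fit the template."""
    return (rule.conclusion == template.conclusion
            and rule.premise <= template.allowed_premise)


def select(rules, k1, k2, min_supp=0.0, min_conf=0.0, max_size=None):
    """Keep rules matching some K1 template, no K2 template, and the thresholds."""
    kept = []
    for r in rules:
        size = len(r.premise) + 1  # |X u Y| with a single-item conclusion
        if r.support < min_supp or r.confidence < min_conf:
            continue
        if max_size is not None and size > max_size:
            continue
        if any(matches(r, t) for t in k1) and not any(matches(r, t) for t in k2):
            kept.append(r)
    return kept


# Hypothetical mined rules (K') and user templates (K1 select, K2 reject).
rules = [
    Rule(frozenset({"bread", "butter"}), "milk", 0.12, 0.85),
    Rule(frozenset({"beer"}), "milk", 0.30, 0.95),
    Rule(frozenset({"bread"}), "milk", 0.02, 0.90),
]
k1 = [Template(frozenset({"bread", "butter", "jam"}), "milk")]
k2 = [Template(frozenset({"bread"}), "milk")]

selected = select(rules, k1, k2, min_supp=0.05, min_conf=0.8)
```

Only the first rule survives: the second does not fit any K1 template, and the third fails the support threshold (and would also be rejected by K2).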

Interestingness

User knowledge (K): beliefs
  A set K of beliefs (Bayes rules)
  A belief (rule) α ∈ K weighted by p(α)
  K = K1 + K2
    K1: hard beliefs (p(α) constant)
    K2: soft beliefs (p(α) can vary)

Goal: make the soft beliefs K2 vary as a function of the part of K’ which satisfies K1

Interest criterion of R = X → Y of K’: [formula not preserved]

Change of the weight of α: [formula not preserved]

[Silberschatz & Tuzhilin 1995]

Logical Contradiction

User knowledge (K): a set K of rules

Goal: select unexpected rules in K’

Unexpectedness criterion: given a rule A → B of K’ and a rule X → Y of K, A → B is unexpected if:
  B and Y are contradictory (p(B and Y) = 0)
  (A and X) is frequent (p(A and X) high)
  (A and X) → B is true (hence (A and X) → not Y also holds: exception!)

[Padmanabhan and Tuzhilin 1998]

Attribute Costs

User knowledge (K): costs
  Cost of each attribute/item Ai: Cost(Ai)

Goal: select the low-cost rules in K’

Cost of a rule:
  Rule A1, A2, …, Ak → B
  Low mean cost: [formula not preserved]

[Freitas 1999]
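One plausible reading of "low mean cost" is the arithmetic mean of Cost(Ai) over the k premise attributes. The sketch below ranks rules on that basis; it is an assumption, since the slide's formula was not preserved and the normalisation used in [Freitas 1999] may differ. The cost table and rules are hypothetical.

```python
def mean_cost(premise, cost):
    """Arithmetic mean of the attribute costs of a rule premise A1, ..., Ak."""
    return sum(cost[a] for a in premise) / len(premise)


# Hypothetical per-attribute costs supplied by the user (K).
cost = {"age": 0.0, "blood_test": 5.0, "x_ray": 40.0, "mri": 400.0}

# Hypothetical mined rules (K'): (premise attributes, conclusion).
rules = [
    (("age", "blood_test"), "disease"),
    (("mri",), "disease"),
    (("x_ray", "age"), "disease"),
]

# Cheapest rules first.
ranked = sorted(rules, key=lambda r: mean_cost(r[0], cost))
costs = [mean_cost(premise, cost) for premise, _ in ranked]
```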

Other Subjective Measures
  Projected Savings (KEFIR system's interestingness) [Matheus & Piatetsky-Shapiro 1994]
  Fuzzy Matching Interestingness Measure [Liu et al. 1996]
  General Impression [Liu et al. 1997]
  Logical Contradiction [Padmanabhan & Tuzhilin 1997]
  Misclassification Costs [Freitas 1999]
  Vague Feelings (Fuzzy General Impressions) [Liu et al. 2000]
  Anticipation [Roddick & Rice 2001]
  Interestingness [Shekar & Natarajan 2001]

Classification

# | Interestingness Measure | Year | Application | Foundation | Scope | Subjective Aspects | User's Knowledge Representation
1 | Matheus and Piatetsky-Shapiro's Projected Savings | 1994 | Summaries | Utilitarian | Single Rule | Unexpectedness | Pattern Deviation
2 | Klemettinen et al. Rule Templates | 1994 | Association Rules | Syntactic | Single Rule | Unexpectedness & Actionability | Rule Templates
3 | Silberschatz and Tuzhilin's Interestingness | 1995 | Format Independent | Probabilistic | Rule Set | Unexpectedness | Hard & Soft Beliefs
4 | Liu et al. Fuzzy Matching Interestingness Measure | 1996 | Classification Rules | Syntactic Distance | Single Rule | Unexpectedness | Fuzzy Rules
5 | Liu et al. General Impressions | 1997 | Classification Rules | Syntactic | Single Rule | Unexpectedness | GI, RPK
6 | Padmanabhan and Tuzhilin Logical Contradiction | 1997 | Association Rules | Logical, Statistic | Single Rule | Unexpectedness | Beliefs X → Y
7 | Freitas' Attributes Costs | 1999 | Association Rules | Utilitarian | Single Rule | Actionability | Costs Values
8 | Freitas' Misclassification Costs | 1999 | Association Rules | Utilitarian | Single Rule | Actionability | Costs Values
9 | Liu et al. Vague Feelings (Fuzzy General Impressions) | 2000 | Generalized Association Rules | Syntactic | Single Rule | Unexpectedness | GI, RPK, PK
10 | Roddick and Rice's Anticipation | 2001 | Format Independent | Probabilistic | Single Rule | Temporal Dimension | Probability Graph
11 | Shekar and Natarajan's Interestingness | 2002 | Association Rules | Distance | Single Rule | Unexpectedness | Fuzzy-graph based taxonomy

Conclusion

Algorithm + measures to compare K and K’
  Focus on interesting rules
  Knowledge is domain specific

Acquisition of K?
  Representing the knowledge and goals of the decision maker is a hard task

Many improvements to make

Objective Measures

Principles and Classification

Principle

Statistics on data D (transactions) for each rule R = X → Y

Interestingness measure = i(R, D, H)
  Degree of satisfaction of the hypothesis H in D, independently of U

AR Quality : Objective Measures

Contingency

Rule X → Y with X and Y disjoint itemsets
Inclusion of E(X) in E(Y)

5 observable parameters in E:
  n = |E|: number of transactions
  n_x = |E(X)|: cardinality of the premise (left-hand side)
  n_y = |E(Y)|: cardinality of the conclusion (right-hand side)
  n_xy = |E(X and Y)|: number of positive examples
  n_x¬y = |E(X and ¬Y)|: number of negative examples
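The five parameters can be counted directly from the transactions. The sketch below assumes transactions are sets of item names; the function name `contingency`, the dictionary keys and the toy data are illustrative choices, not part of the slides.

```python
def contingency(transactions, x, y):
    """Count n, n_x, n_y, n_xy and n_x_not_y for the rule X -> Y."""
    x, y = frozenset(x), frozenset(y)
    n = len(transactions)
    n_x = sum(1 for t in transactions if x <= t)
    n_y = sum(1 for t in transactions if y <= t)
    n_xy = sum(1 for t in transactions if x <= t and y <= t)
    return {
        "n": n,
        "n_x": n_x,
        "n_y": n_y,
        "n_xy": n_xy,               # positive examples
        "n_x_not_y": n_x - n_xy,    # negative examples (counter-examples)
    }


# Hypothetical toy data: 10 transactions.
transactions = [frozenset(t) for t in (
    {"diapers", "beer"}, {"diapers", "beer"}, {"beer"}, {"beer", "chips"},
    {"milk"}, {"beer"}, {"diapers"}, {"beer", "milk"}, {"chips"}, {"beer"},
)]

counts = contingency(transactions, {"diapers"}, {"beer"})
```

Only four of the five values are independent, since n_x = n_xy + n_x¬y; this is why the measures that follow are written as functions of n, n_x, n_y and n_xy.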

Independence

p(X) is estimated by the frequency n_x / n
Hypothesis of independence of X and Y: p(X and Y) = p(X) p(Y), i.e. n_xy = n_x n_y / n

Inclusion ≠ dependence

Equiprobability (Equilibrium)

Rule X → Y
Same number of negative examples (e-) and positive examples (e+): n_xy = n_x¬y, hence when n_xy = n_x / 2 (P(Y|X) = 0.5)

2 situations:
  n_xy > n_x¬y (or P(Y|X) > 0.5): e+ higher: rule X → Y
  n_xy < n_x¬y (or P(Y|X) < 0.5): e- higher: rule X → ¬Y

Contra-positive: ¬Y → ¬X

Interestingness Measure - Definition

i(X → Y) = f(n, n_x, n_y, n_xy)

General principles:
  Semantics and readability for the user
  Value increasing with quality
  Sensitivity to equiprobability (inclusion)
  Statistical likelihood (confidence in the measure itself)
  Noise resistance, time stability
  Surprisingness, nuggets?

Properties in the Literature

Properties of i(X → Y) = f(n, n_x, n_y, n_xy)

[Piatetsky-Shapiro 1991] (strong rules):
  (P1) = 0 if X and Y are independent
  (P2) increases with the examples n_xy
  (P3) decreases with the premise n_x (or conclusion n_y) (?)

[Major & Mangano 1993]:
  (P4) increases with n_xy when confidence (n_xy / n_x) is constant

[Freitas 1999]:
  (P5) asymmetry (i(X → Y) ≠ i(Y → X))
  Small disjunctions (nuggets)

[Tan et al. 2002], [Hilderman & Hamilton 2001] and [Gras et al. 2004]
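Properties P1 to P3 can be checked numerically on a concrete measure. The sketch below uses the rule-interest measure of [Piatetsky-Shapiro 1991] in its commonly cited count form, n_xy - n_x n_y / n; this form is an assumption here, since the deck does not preserve the formula, and the counts are made up for illustration.

```python
def rule_interest(n, n_x, n_y, n_xy):
    """Observed positive examples minus those expected under independence."""
    return n_xy - n_x * n_y / n


# P1: zero at independence (n_xy = n_x * n_y / n = 20 * 50 / 100 = 10).
independent = rule_interest(n=100, n_x=20, n_y=50, n_xy=10)

# P2: grows with n_xy when n, n_x and n_y are fixed.
more_examples = rule_interest(n=100, n_x=20, n_y=50, n_xy=15)

# P3: shrinks when the premise n_x grows, other counts fixed.
larger_premise = rule_interest(n=100, n_x=30, n_y=50, n_xy=15)
```

A few points do not prove monotonicity, but the same pattern follows from the formula: it is linear and increasing in n_xy, and decreasing in n_x and n_y.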

Selected Properties
  Inclusion and equiprobability: 0, interval of security
  Independence: 0, interval of security
  Bounded maximum value: comparability, global threshold, inclusion
  Non-linearity: noise resistance, interval of security for independence and equiprobability
  Sensitivity: N (nuggets), dilation (likelihood)
  Frequency p(X), cardinality n_x
  Reinforcement by similar rules (contra-positive, negative rule, …)

[Smyth & Goodman 1991], [Kodratoff 2001], [Gras et al. 2001], [Gras et al. 2004]

What Could Be a Good Measure?

Negative examples n_x¬y
I_max + independence + equiprobability: constraints upon other dimensions

Consequences on Other Dimensions

Conclusion n_y
  Decrease with n_y (n_y → n: Ind ↓)

Size of data n
  Increase with dilation (Ind ↑)
  Increase with n (Ind ↑)

List

Classification

Classification according to three criteria:
  Object of the index: concept measured by the index
  Range of the index: entity concerned by the measurement
  Nature of the index: statistical or descriptive character of the index

Classification

The Object:
  Certain indices take a fixed value at independence, P(a ∩ b) = P(a) × P(b): they evaluate a deviation from independence
  Certain indices take a fixed value at equilibrium, P(a ∩ b) = P(a) / 2: they evaluate a deviation from equilibrium
  Others do not take a fixed value at independence or at equilibrium (statistical indices)

Classification

The Range: certain indices evaluate more than a single rule
  They relate simultaneously to a rule and its contra-positive: I(a → b) = I(¬b → ¬a)
    Indices of quasi-implication
  They relate simultaneously to a rule and its reciprocal: I(a → b) = I(b → a)
    Indices of quasi-conjunction
  They relate simultaneously to all three: I(a → b) = I(b → a) = I(¬b → ¬a)
    Indices of quasi-equivalence

Classification

The Nature:
  If variation: statistical index
  If not: descriptive index

List of Quality Measures

Monodimensional e+, e-
  Support [Agrawal et al. 1996]
  Ralambrodrainy [Ralambrodrainy, 1991]

Bidimensional - Inclusion
  Descriptive-Confirm [Kodratoff, 1999]
  Sebag and Schoenauer [Sebag, Schoenauer, 1991]
  Example and counterexample rate (*)

Bidimensional – Inclusion – Conditional Probability
  Confidence [Agrawal et al. 1996]
  Wang index [Wang et al., 1988]
  Laplace (*)

Bidimensional – Analogous Rules
  Descriptive Confirmed-Confidence [Kodratoff, 1999] (*)

Tridimensional – Analogous Rules
  Causal Support [Kodratoff, 1999]
  Causal Confidence [Kodratoff, 1999] (*)
  Causal Confirmed-Confidence [Kodratoff, 1999]
  Least Contradiction [Aze & Kodratoff 2004] (*)

Tridimensional – Linear – Independent
  Pavillon index [Pavillon, 1991]
  Rule Interest [Piatetsky-Shapiro, 1991] (*)
  Pearl index [Pearl, 1988], [Acid et al., 1991], [Gammerman, Luo, 1991]
  Correlation [Pearson 1996] (*)
  Loevinger index [Loevinger, 1947] (*)
  Certainty factor [Tan & Kumar 2000]
  Rate of connection [Bernard & Charron 1996]
  Interest factor [Brin et al., 1997]
  Lift (*)
  Cosine [Tan & Kumar 2000] (*)
  Kappa [Tan & Kumar 2000]

Tridimensional – Nonlinear – Independent
  Chi-squared distance
  Logarithmic lift [Church & Hanks, 1990] (*)
  Predictive association [Tan & Kumar 2000] (Goodman & Kruskal)
  Conviction [Brin et al., 1997b]
  Odds ratio [Tan & Kumar 2000]
  Yule's Q [Tan & Kumar 2000]
  Yule's Y [Tan & Kumar 2000]
  Jaccard [Tan & Kumar 2000]
  Klosgen [Tan & Kumar 2000]
  Interestingness [Gray & Orlowska, 1998]
  Mutual information ratio (Uncertainty) [Tan et al., 2002]
  J-measure [Smyth & Goodman 1991] [Goodman & Kruskal 1959] (*)
  Gini [Tan et al., 2002]
  General measure of rule interestingness [Jaroszewicz & Simovici, 2001] (*)

Quadridimensional – Linear – Independent
  Lerman index of similarity [Lerman, 1981]
  Index of implication [Gras, 1996]

Quadridimensional – Likelihood (conditional probability?) of dependence
  Probability of error of Chi2 (*)
  Intensity of implication [Gras, 1996] (*)

Quadridimensional – Inclusion – Dependent – Analogous Rules
  Entropic intensity of implication [Gras, 1996] (*)
  TIC [Blanchard et al., 2004] (*)

Others
  Surprisingness (*) [Freitas, 1998]
  + rules of exception [Duval et al. 2004]
  + rule distance, similarity [Dong & Li 1998]

Objective Measures

Simulations and Properties

Support [Agrawal et al. 1996]

Monodimensional Measures e+ e-

Definition: supp(X → Y) = n_xy / n
Semantics: degree of generality
Sensitivity:
  1 parameter
  Measures frequency
  Linear
  Insensitive to independence
  Value at equilibrium: ?
  Symmetrical

Quality of Rules : Objective Measures

Ralambrodrainy Measure [Ralambrodrainy 1991]

Monodimensional Measures e+ e-

Definition: [formula not preserved]
Semantics: scarcity of the negative examples (e-)
Sensitivity:
  1 parameter
  Measures frequency
  Linear
  Insensitive to independence
  Value at equilibrium: ?
  Increasing

Descriptive-Confirm [Kodratoff 1999]

Bidimensional Measures - Inclusion

Definition: [formula not preserved]
Semantics: variation between e+ and e- (improved support)
Sensitivity:
  2 parameters
  Measures frequency
  Linear
  Insensitive to independence
  0 at equilibrium

Sebag and Schoenauer [Sebag & Schoenauer, 1991]

Bidimensional Measures - Inclusion

Definition: [formula not preserved]
Semantics: ratio e+/e-
Sensitivity:
  2 parameters
  Measures frequency
  Non-linear (very selective)
  Insensitive to independence
  1 at equilibrium
  Max value not bounded

Example and Counterexample Rate (*)

Bidimensional Measures - Inclusion

Definition: [formula not preserved]
Semantics: ratio e+/e-
Sensitivity:
  2 parameters
  Measures frequency
  Non-linear (tolerance)
  Insensitive to independence
  0 at equilibrium
  Max value bounded
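The three inclusion measures built on e+ and e- can be sketched as follows. The formulas are the forms commonly given in the literature for these names and are an assumption here, because the deck's own definitions were not preserved; they are at least consistent with the equilibrium values stated on the slides (0, 1 and 0). Function and parameter names are illustrative.

```python
def descriptive_confirm(n, n_xy, n_x_not_y):
    """(e+ - e-) / n: difference between positive and negative examples."""
    return (n_xy - n_x_not_y) / n


def sebag_schoenauer(n_xy, n_x_not_y):
    """e+ / e-: unbounded, undefined when there are no negative examples."""
    return n_xy / n_x_not_y


def example_counterexample_rate(n_xy, n_x_not_y):
    """1 - e- / e+: at most 1, undefined when there are no positive examples."""
    return 1 - n_x_not_y / n_xy


# Hypothetical rule with 30 positive and 10 negative examples out of 100.
dc = descriptive_confirm(100, 30, 10)
ss = sebag_schoenauer(30, 10)
ecr = example_counterexample_rate(30, 10)

# At equilibrium (e+ = e-) the three measures take their reference values.
at_equilibrium = (
    descriptive_confirm(100, 20, 20),
    sebag_schoenauer(20, 20),
    example_counterexample_rate(20, 20),
)
```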

Confidence [Agrawal et al. 1996]

Bidimensional Measures - Inclusion

Definition: conf(X → Y) = P(Y|X) = n_xy / n_x
Semantics: inclusion, validity
Sensitivity:
  2 parameters
  Measures frequency
  Linear
  Insensitive to independence
  0.5 at equilibrium
  Max value bounded
Variations:
  [Ganascia, 1991]: Charade
  or Descriptive Confirmed-Confidence [Kodratoff, 1999]

Wang [Wang et al. 1988]

Bidimensional Measures - Inclusion

Definition: [formula not preserved]
Semantics: improved support (confidence threshold integrated)
Sensitivity:
  2 parameters
  Measures frequency
  Linear
  Insensitive to independence
  Value at equilibrium: ?

Laplace [Clark & Robin 1991], [Tan & Kumar 2000]

Bidimensional Measures - Inclusion

Definition: [formula not preserved]
Semantics: estimates confidence (decreases as support decreases)
Sensitivity:
  2 parameters
  Does not measure frequency when numbers are small
  Linear
  Insensitive to independence
  Max value bounded
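A common two-class form of the Laplace estimate is (n_xy + 1) / (n_x + 2). It is used below as an assumption, since the slide's formula was not preserved, to illustrate the stated behaviour: the estimate shrinks confidence more strongly when the premise is rare.

```python
def confidence(n_x, n_xy):
    """Plain confidence n_xy / n_x."""
    return n_xy / n_x


def laplace(n_x, n_xy):
    """Laplace-corrected confidence for two classes (Y and not Y)."""
    return (n_xy + 1) / (n_x + 2)


# Two hypothetical rules with the same confidence (90%) but different coverage.
small = (confidence(10, 9), laplace(10, 9))       # rare premise
large = (confidence(100, 90), laplace(100, 90))   # frequent premise
```

Both rules have confidence 0.9, yet the Laplace value is lower for the rule supported by only 10 transactions, which matches "decreases as support decreases".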

Descriptive Confirmed-Confidence [Kodratoff 1999]

Bidimensional Measures – Similar Rules

Definition: [formula not preserved]
Semantics: confidence confirmed by its negative (X → ¬Y)
Sensitivity:
  2 parameters
  Measures frequency
  Linear
  Insensitive to independence
  0 at equilibrium
  Max value bounded
  Reinforcement by the negative rule

Causal Support [Kodratoff 1999]

Bidimensional Measures – Similar Rules

Definition: [formula not preserved]
Semantics: support improved by the use of the contra-positive
Sensitivity:
  3 parameters
  Measures frequency
  Linear
  Insensitive to independence
  Value at equilibrium: ?
  Reinforcement by the contra-positive rule

Causal Confidence [Kodratoff 1999]

Bidimensional Measures – Similar Rules

Definition: [formula not preserved]
Semantics: confidence reinforced by the contra-positive
Sensitivity:
  3 parameters
  Measures frequency
  Linear
  Insensitive to independence
  Value at equilibrium: ?
  Max value bounded
  Reinforcement by the contra-positive rule
Evolution: Causal Confirmed-Confidence (contra-positive + negative)

Least Contradiction [Aze & Kodratoff 2004]

Bidimensional Measures – Similar Rules

Definition: [formula not preserved]
Semantics: least contradiction
Sensitivity:
  3 parameters
  Measures frequency
  Linear
  0 at equilibrium
  Supports inclusive measurement
  Reinforcement by the negative rule
  Coupled with an algorithm

Centered Confidence (Pavillon Index) [Pavillon 1991]

Tridimensional Measures - Independence

Definition: [formula not preserved]
Semantics: deviation from independence, corrected for the size of the conclusion
Sensitivity:
  3 parameters
  Measures frequency
  Linear
  0 when independent
  Value at equilibrium: ?
Called Added Value in [Tan et al. 2002]

Rule Interest [Piatetsky-Shapiro 1991]

Tridimensional Measures - Independence

Definition: [formula not preserved]
Semantics: gap to independence (strong rules)
Sensitivity:
  3 parameters
  Measures frequency
  Linear
  0 when independent
  Value at equilibrium: ?
Alternative symmetric measure: Pearl [Pearl, 1988], [Acid et al., 1991], [Gammerman, Luo, 1991]

Coefficient of Correlation [Pearson 1996]

Tridimensional Measures - Independence

Definition: [formula not preserved]
Semantics: correlation
Sensitivity:
  3 parameters
  Measures frequency
  Linear
  0 when independent
  Value at equilibrium: ?

Loevinger (*) [Loevinger 1947]
Certainty Factor [Tan & Kumar 2000]

Tridimensional Measures - Independence

Definition: [formula not preserved]
Semantics: implicative dependence
Sensitivity:
  3 parameters
  Measures frequency
  Linear
  0 when independent
  Maximum value bounded (inclusion)
  Value at equilibrium: ?
Equivalent measure: Certainty factor [Tan & Kumar 2000]: [formula not preserved]
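The Loevinger index is usually written as (P(Y|X) - P(Y)) / (1 - P(Y)), which is also the positive branch of the certainty factor. That form is assumed in the sketch below, since the slide's formula was not preserved; the counts are made up to show the two reference values named on the slide.

```python
def loevinger(n, n_x, n_y, n_xy):
    """(conf - P(Y)) / (1 - P(Y)); undefined when Y holds in every transaction."""
    conf = n_xy / n_x
    p_y = n_y / n
    return (conf - p_y) / (1 - p_y)


# Independence: n_xy = n_x * n_y / n, so the index is 0.
at_independence = loevinger(n=100, n_x=20, n_y=50, n_xy=10)

# Inclusion: every transaction with X also has Y, so the index reaches its maximum 1.
at_inclusion = loevinger(n=100, n_x=20, n_y=50, n_xy=20)
```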

Tridimensional Measures-Independence

Definition:

Semantics: dependence

Sensitivity: 3 parameters
Measuring frequency
Linear
0 when independent
Inclusion?
Disequilibrium?

Variations:
Measure of interest (interest factor) [Brin et al., 1997]
Equivalent to Lift
Alternative: Logarithmic Measure of Lift [Church & Hanks, 1990]

Quality of Rules : Objective Measures

Varying Rates Liaison [Bernard & Charron 1996]

Tridimensional Measures-Independence

Definitions:

Measure of Interest (Interest Factor):

Lift:

Logarithmic Measure of Lift:

Cosine:

Quality of Rules : Objective Measures

Measure of Interest (Interest Factor) [Brin et al. 1997]
Lift (*)
Logarithmic Measure of Lift (*) [Church & Hanks 1990]
Cosine (*) [Tan & Kumar 2000]

Semantics: dependence

Sensitivity: 3 parameters
Measuring frequency
Linear
Inclusion?
Disequilibrium?
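The three measures can be sketched as follows, assuming the standard definitions lift = P(XY) / (P(X)P(Y)), its base-2 logarithm, and cosine = P(XY) / sqrt(P(X)P(Y)). The choice of base 2 and all names are assumptions made for the illustration.

```python
import math


def lift(n, n_x, n_y, n_xy):
    """Lift / interest factor (assumed definition): P(XY) / (P(X) P(Y))."""
    return (n_xy / n) / ((n_x / n) * (n_y / n))


def log_lift(n, n_x, n_y, n_xy):
    """Logarithmic lift, here in base 2 (the base is an assumption)."""
    return math.log2(lift(n, n_x, n_y, n_xy))


def cosine(n, n_x, n_y, n_xy):
    """Cosine (assumed definition): P(XY) / sqrt(P(X) P(Y))."""
    return (n_xy / n) / math.sqrt((n_x / n) * (n_y / n))


# Hypothetical counts: 1000 transactions, 400 with X, 500 with Y, 300 with both.
counts = (1000, 400, 500, 300)
print(lift(*counts), log_lift(*counts), cosine(*counts))
```

Lift equals 1 at independence, so its logarithm is 0 there; cosine has no fixed value at independence.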

Tridimensional Measures-Independence

Definition:

Semantics:

Sensitivity: 3 parameters
Measuring frequency
Linear
0 when independent
Disequilibrium?
Maximum value
Strengthened by the contrapositive

Quality of Rules : Objective Measures

Kappa [Tan & Kumar 2000]
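A hedged sketch of Kappa, assuming Cohen's form applied to the 2x2 table of X and Y: observed agreement P(XY) + P(¬X¬Y) compared with the agreement expected under independence. Names and counts are hypothetical.

```python
def kappa(n, n_x, n_y, n_xy):
    """Kappa on a 2x2 contingency table (assumed to be Cohen's kappa)."""
    p_x, p_y, p_xy = n_x / n, n_y / n, n_xy / n
    p_nx_ny = 1 - p_x - p_y + p_xy                # frequency of transactions with neither X nor Y
    observed = p_xy + p_nx_ny                     # X and Y agree (both present or both absent)
    expected = p_x * p_y + (1 - p_x) * (1 - p_y)  # same agreement under independence
    return (observed - expected) / (1 - expected)


# Hypothetical counts: 1000 transactions, 400 with X, 500 with Y, 300 with both.
print(round(kappa(1000, 400, 500, 300), 4))
```

Because the term P(¬X¬Y) enters the numerator, the value is reinforced by the examples of the contrapositive ¬Y -> ¬X.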

Tridimensional Measures-Independence

Definition: with

Semantics: X is a good predictor of Y

Sensitivity: 3 parameters
Measuring frequency
Piecewise linear
0 when independent?
Maximum value?
Disequilibrium?

Quality of Rules : Objective Measures

Predictive Association (*) [Tan & Kumar 2000] (Goodman & Kruskal)

Tridimensional Measures-Independence

Definition:

Semantics: conviction

Sensitivity: 3 parameters
Measuring frequency
Non linear (very selective)
1 when independent
Maximum value not bounded
Disequilibrium?

Quality of Rules : Objective Measures

Conviction [Brin et al. 1997b]

(shape similar to Sebag and Schoenauer [Sebag & Schoenauer 1991] except for independence)
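A minimal sketch of conviction, assuming the usual definition P(X)P(¬Y) / P(X¬Y). Returning infinity for a rule without counter-examples is a convention chosen for this example.

```python
import math


def conviction(n, n_x, n_y, n_xy):
    """Conviction (assumed definition): P(X) P(not Y) / P(X and not Y)."""
    n_x_not_y = n_x - n_xy  # counter-examples of the rule X -> Y
    if n_x_not_y == 0:
        return math.inf     # no counter-example: the value is unbounded
    return (n_x / n) * (1 - n_y / n) / (n_x_not_y / n)


# Hypothetical counts: 1000 transactions, 400 with X, 500 with Y.
print(conviction(1000, 400, 500, 300))
print(conviction(1000, 400, 500, 400))
```

The ratio compares the counter-examples expected under independence with those observed, so it equals 1 at independence and increases sharply as counter-examples become rare.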

Tridimensional Measures-Independence

Definitions:

Odds Ratio:

Yule’s Q:

Yule’s Y:

Semantics: correlation

Sensitivity: 3 parameters
Measuring frequency
Non linear (resistance to noise?)
1 or 0 when independent
Bounded max value (1 or not)
Disequilibrium?
Strengthened by similar rules

Quality of Rules : Objective Measures

Odds Ratio, Yule’s Q, Yule’s Y [Tan & Kumar 2000]

(Close to Conviction)
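The three measures are closely related and can be sketched together, assuming the standard definitions: the odds ratio is the cross-product ratio of the 2x2 table, Q = (OR - 1) / (OR + 1) and Y = (sqrt(OR) - 1) / (sqrt(OR) + 1). Function names and counts are hypothetical.

```python
import math


def odds_ratio(n, n_x, n_y, n_xy):
    """Odds ratio (assumed definition): cross-product ratio of the 2x2 table."""
    n_x_ny = n_x - n_xy             # X without Y
    n_nx_y = n_y - n_xy             # Y without X
    n_nx_ny = n - n_x - n_y + n_xy  # neither X nor Y
    # Raises ZeroDivisionError when an off-diagonal cell is empty.
    return (n_xy * n_nx_ny) / (n_x_ny * n_nx_y)


def yule_q(n, n_x, n_y, n_xy):
    """Yule's Q: the odds ratio rescaled into [-1, 1]."""
    ratio = odds_ratio(n, n_x, n_y, n_xy)
    return (ratio - 1) / (ratio + 1)


def yule_y(n, n_x, n_y, n_xy):
    """Yule's Y: same rescaling applied to the square root of the odds ratio."""
    root = math.sqrt(odds_ratio(n, n_x, n_y, n_xy))
    return (root - 1) / (root + 1)


# Hypothetical counts: 1000 transactions, 400 with X, 500 with Y, 300 with both.
counts = (1000, 400, 500, 300)
print(odds_ratio(*counts), yule_q(*counts), yule_y(*counts))
```

This makes the "1 or 0 when independent" remark concrete: the odds ratio is 1 at independence, while Q and Y are 0.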

Tridimensional Measures-Independence

Definitions:

Jaccard:

Klosgen:

Semantics: correlation

Sensitivity: 3 parameters
Measuring frequency
Non linear
0 when independent
Bounded max value (0 or 1)
Disequilibrium?
Strengthened by similar rules

Quality of Rules : Objective Measures

Jaccard, Klosgen [Tan & Kumar 2000]
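A short sketch under two assumptions: Jaccard is taken as P(XY) / (P(X) + P(Y) - P(XY)), and Klosgen in its rule-oriented form sqrt(P(XY)) * (P(Y|X) - P(Y)). Symmetric variants of Klosgen exist and may be what the original definition used.

```python
import math


def jaccard(n, n_x, n_y, n_xy):
    """Jaccard (assumed definition): transactions with both items over those with at least one."""
    return n_xy / (n_x + n_y - n_xy)


def klosgen(n, n_x, n_y, n_xy):
    """Klosgen, rule-oriented form (an assumption): sqrt(P(XY)) * (P(Y|X) - P(Y))."""
    return math.sqrt(n_xy / n) * (n_xy / n_x - n_y / n)


# Hypothetical counts: 1000 transactions, 400 with X, 500 with Y, 300 with both.
counts = (1000, 400, 500, 300)
print(jaccard(*counts), round(klosgen(*counts), 4))
```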

Tridimensional Measures-Independence

Definition:

Semantics: interest?

Sensitivity: 3 parameters
Measuring frequency
Non linear
0 when independent
Inclusion?
Disequilibrium?

Quality of Rules : Objective Measures

Interestingness Weighting Dependency [Gray & Orlowska 1998]

Tridimensional Measures-Independence

Definition:

Semantics: information gain provided by X for Y

Sensitivity: 3 parameters
Measuring frequency
Non linear, entropic
0 when independent
Inclusion? Disequilibrium?
Strongly symmetric
Low value

Quality of Rules : Objective Measures

Mutual Information (Uncertainty) [Tan et al 2002]
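As an illustration, the sketch below computes the plain mutual information of the two binary variables in bits. This is an assumption: normalized variants (dividing by an entropy term) also appear in the literature under the same name.

```python
import math


def mutual_information(n, n_x, n_y, n_xy):
    """Mutual information of binary X and Y in bits (unnormalized form, an assumption)."""
    # Each cell: (joint count, row margin, column margin).
    cells = [
        (n_xy, n_x, n_y),                          # X and Y
        (n_x - n_xy, n_x, n - n_y),                # X without Y
        (n_y - n_xy, n - n_x, n_y),                # Y without X
        (n - n_x - n_y + n_xy, n - n_x, n - n_y),  # neither
    ]
    total = 0.0
    for joint, row, col in cells:
        if joint > 0:  # an empty cell contributes nothing
            total += (joint / n) * math.log2(joint * n / (row * col))
    return total


# Hypothetical counts: 1000 transactions, 400 with X, 500 with Y, 300 with both.
print(round(mutual_information(1000, 400, 500, 300), 4))
```

All four cells contribute in the same way, which is why the measure is symmetric in X and Y and typically takes small values.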

Tridimensional Measures-Independence

Definition:

Semantics: cross entropy (by mutual information)

Sensitivity: 3 parameters
Measuring frequency
Non linear, entropic
0 when independent + concave
Inclusion? Disequilibrium?
Symmetric
Low value
Strengthened by the negative rule (X → ¬Y)

Quality of Rules : Objective Measures

J-Measure (*) [Smyth & Goodman 1991] [Goodman & Kruskal 1959]
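A sketch of the J-measure, assuming the usual form J = P(XY) log(P(XY) / (P(X)P(Y))) + P(X¬Y) log(P(X¬Y) / (P(X)P(¬Y))), here with base-2 logarithms. Names and counts are hypothetical.

```python
import math


def j_measure(n, n_x, n_y, n_xy):
    """J-measure in bits (assumed definition), summing over Y and not-Y within X."""
    total = 0.0
    # (count inside X, overall margin) for the outcome Y, then for the outcome not-Y.
    for joint, margin in ((n_xy, n_y), (n_x - n_xy, n - n_y)):
        if joint > 0:  # an empty cell contributes nothing
            total += (joint / n) * math.log2(joint * n / (n_x * margin))
    return total


# Hypothetical counts: 1000 transactions, 400 with X, 500 with Y, 300 with both.
print(round(j_measure(1000, 400, 500, 300), 4))
```

The second term measures the information carried by the counter-examples, so a rule whose negative form X → ¬Y is strong also scores high, as the slide notes.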

Tridimensional Measures-Independence

Definition:

Semantics: quadratic entropy

Sensitivity: 3 parameters
Measuring frequency
Non linear, entropic
0 when independent + concave
Inclusion?
Disequilibrium?
Very symmetric
Low value

Quality of Rules : Objective Measures

Gini Index
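A hedged sketch of the Gini index as a rule measure, assuming the form used in decision-tree splitting: the reduction in quadratic entropy of Y obtained by conditioning on X. Names and counts are hypothetical.

```python
def gini_index(n, n_x, n_y, n_xy):
    """Gini gain of Y given X (assumed definition); requires 0 < n_x < n."""
    p_x, p_y = n_x / n, n_y / n
    p_y_given_x = n_xy / n_x
    p_y_given_not_x = (n_y - n_xy) / (n - n_x)
    # Sum of squared class frequencies within X, within not-X, and overall.
    within_x = p_y_given_x ** 2 + (1 - p_y_given_x) ** 2
    within_not_x = p_y_given_not_x ** 2 + (1 - p_y_given_not_x) ** 2
    overall = p_y ** 2 + (1 - p_y) ** 2
    return p_x * within_x + (1 - p_x) * within_not_x - overall


# Hypothetical counts: 1000 transactions, 400 with X, 500 with Y, 300 with both.
print(round(gini_index(1000, 400, 500, 300), 4))
```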

Tridimensional Measures-Independence

Definition:

(continuum of measures between Gini and Chi2)

Semantics: ?

Sensitivity: 3 parameters
Measuring frequency
Non linear (Gini -> distance from Chi2)
0 when independent
Inclusion? Disequilibrium?
Not symmetric -> symmetric

Δα: family of divergence measures parameterized by a real factor α (Gini -> distance from Chi2)

ΔX (resp. ΔY): distribution vector of X (resp. Y)

Δxy: joint distribution vector of X and Y

ΔX x ΔY: joint distribution vector of X and Y under the hypothesis of independence

θ: a priori distribution vector of Y

Quality of Rules : Objective Measures

General Measure of Rule Interestingness (*) [Jaroszewicz & Simovici 2001]

Quadridimensional Measures-Independence

Definition:

Semantics: centered, normalized number of examples

Sensitivity: 4 parameters
Statistical measure (counts)
Linear
0 when independent
Inclusion?
Disequilibrium?

Quality of Rules : Objective Measures

Lerman Similarity [Lerman 1981]

Quadridimensional Measures-Independence

Definition:

Semantics: normalized number of counter-examples

Sensitivity: 4 parameters
Statistical measure (counts)
Linear
0 when independent
Inclusion?
Disequilibrium?

Quality of Rules : Objective Measures

Variation: Implication Index [Gras 1996]
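Lerman similarity and the implication index can be sketched side by side, assuming the usual centered and normalized forms: the number of examples (resp. counter-examples) minus its expectation under independence, divided by the square root of that expectation. Names and counts are hypothetical.

```python
import math


def lerman_similarity(n, n_x, n_y, n_xy):
    """Centered, normalized number of examples (assumed definition)."""
    expected = n_x * n_y / n  # examples expected under independence
    return (n_xy - expected) / math.sqrt(expected)


def implication_index(n, n_x, n_y, n_xy):
    """Centered, normalized number of counter-examples (assumed definition)."""
    n_x_not_y = n_x - n_xy          # observed counter-examples of X -> Y
    expected = n_x * (n - n_y) / n  # counter-examples expected under independence
    return (n_x_not_y - expected) / math.sqrt(expected)


# Hypothetical counts: 1000 transactions, 400 with X, 500 with Y, 300 with both.
counts = (1000, 400, 500, 300)
print(round(lerman_similarity(*counts), 3), round(implication_index(*counts), 3))
```

Both depend on the four counts n, n_x, n_y and n_xy rather than on frequencies alone, which is what "4 parameters" refers to: multiplying every count by 10 changes the value.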

Quadridimensional Measures-Independence

Definition:

(probabilistic modeling, Chi2 law)

Semantics: probability of a dependence between X and Y

Sensitivity: 4 parameters
Measuring probability, not frequency
Non linear + e-tolerance
0 when independent + real
Maximum value bounded
Inclusion? Disequilibrium?
Strongly symmetric => coupling with the measure of interest [Brin et al., 1997]

Alternative: likelihood ratio [Ritschard & al., 1998]

Quality of Rules : Objective Measures

Lerman Similarity [Lerman 1981]

Quadridimensional Measures-Independence

Definition:

(probabilistic modeling, law of counter-examples)

Semantics: how unlikely the scarcity of counter-examples is (statistical surprise)

Sensitivity: 4 parameters
Measuring probability, not frequency
Non linear + e-tolerance
0.5 when independent + likelihood
Maximum value bounded
Inclusion? Disequilibrium?

Logic rules: can be 0

Inspired by Likelihood Linkage [Lerman et al. 1981]

Quality of Rules : Objective Measures

Intensity of Implication (*) [Gras 1996] (Statistical Implicative Analysis)
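A rough sketch of the intensity of implication, assuming a normal approximation of the law of counter-examples: the intensity is 1 minus the probability that a random draw produces no more counter-examples than observed. The Gaussian approximation and all names are assumptions; Poisson or binomial models are also used.

```python
import math


def intensity_of_implication(n, n_x, n_y, n_xy):
    """Intensity of implication under a normal approximation (an assumption)."""
    n_x_not_y = n_x - n_xy          # observed counter-examples of X -> Y
    expected = n_x * (n - n_y) / n  # counter-examples expected under independence
    q = (n_x_not_y - expected) / math.sqrt(expected)  # implication index
    # 1 - Phi(q), with Phi the standard normal cumulative distribution function.
    return 0.5 * math.erfc(q / math.sqrt(2))


# Hypothetical counts: 1000 transactions, 400 with X, 500 with Y.
print(intensity_of_implication(1000, 400, 500, 200))  # independence
print(intensity_of_implication(1000, 400, 500, 300))  # few counter-examples
```

Unlike the frequency-based measures, the result is a probability: it sits at 0.5 at independence and approaches 1 when counter-examples are much rarer than chance would produce.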

Intensity of Implication and Statistical Implicative Analysis

Extensions

Modeling:
Binary variables => numerical, ordinal, interval, fuzzy variables [Bernadet 2000, Guillaume 2002, ...]
Large data sets: entropic intensity of implication [Gras et al. 2001]
Sequences: prediction rules [Blanchard et al. 2002]

Structuring:
Implicative hierarchy (cohesion) [Gras et al. 2001]
Typicality, reduction of variables (inertia of implication) [Gras et al. 2002]

Applications

CHIC (http://www.ardm.asso.fr/CHIC.html)
SIPINA (University of Lyon 2)
FELIX (PerformanSE SA)

AR Quality : Objective Measures

Quadridimensional Rules

Definition:

Inclusion Rate:

Information: (increases with )

Asymmetric entropy: the entropy H’(Y|X) decreases with p(Y|X)

Semantics: statistical surprise + inclusion (removal of disequilibrium)

Sensitivity: 4 parameters
Measuring frequency, non-probabilistic
Non linear + e-tolerance (selectivity adjusted with α, e.g. α=2)
Max 0.5 when independent + real
0 when in disequilibrium
Strengthened by the contrapositive
Maximum value bounded (1)

AR Quality : Objective Measures

Entropy (*) [Gras et al 2001] (Statistical Implicative Analysis)

Tridimensional Measures-Independence

Definition:

Information Rate:

Asymmetric entropy: the entropy Ê(X) with p(X)

Semantics: statistical surprise + inclusion (removal of disequilibrium)

Sensitivity: 4 parameters
Measuring frequency
Non linear, entropic
0 at independence
0 at disequilibrium
Strengthened by the contrapositive
Maximum value bounded (1)

Quality of Rules : Objective Measures

TIC (*) [Blanchard et al. 2004] (Statistical Implicative Analysis)

Tridimensional Measures-Independence

Definition:

Rule: X1 ∧ X2 ∧ X3 ∧ … ∧ Xp-1 ∧ Xp → Y

Information gain provided by the attribute Xi:

Conditional entropy:

Semantics: surprise = information gain provided by the premise

Measuring frequency
Non linear: entropic
Can be used to assess the individual contribution of each attribute ...

Quality of Rules : Objective Measures

Surprisingness (*) [Freitas 1998]

Comparative Theory

[Figures: Comparison by Simulation, Intensity of Implication. Measures shown: Confidence, J-Measure, Coverage Rate; Confidence, PS, Intensity of Implication]

[Figures: Comparison by Simulation, TIC. Measures shown: Confidence, TIM, J-Measure, Gini Index; Confidence, J-Measure, Coverage Rate]

Synthesis & Comparative Studies

[Bayardo and Agrawal, 1999]: influence of support
9 measures, monotone / antitone functions of support, optimization

[Hilderman and Hamilton, 2001]: interestingness of summaries
16 measures, 5 principles of independence, correlation study

[Azé and Kodratoff, 2001]: resistance to noise in the data

[Tan & Kumar 2000]: interestingness of association rules
9 symmetric measures, study of the relationship observed between 2 measures, influence of support

[Tan et al., 2002]: interestingness of association rules
21 symmetric measures, 8 principles, correlation study, influence of support

[Gras et al. 04]: interestingness of association rules
10 criteria

[Lenca et al., 2004]: interestingness of association rules
20 measures, 8 criteria, multi-criteria decision support

[Lallich & Teytaud 2004]: interestingness of association rules
15 measures, 10 principles, learning and use of the VC-dimension

Quality of Rules : Subjective Measures

Experimental Comparative Study

Project AR-QAT : Quality Measures Analysis Tool

Experimental Results

30 Objective Measures

Input Data Sets

Experimental Results: Positive Correlations

Experimental Results: Stable Strong Positive Correlations

Average Correlation

ARVAL

A workbench for computing quality measures, for the scientific community

http://www.univ-nantes.fr/arval

Conclusion

Conclusion and Outlook

Quality = multidimensional concept:

Subjective (decision-maker)
Interest = changes with the knowledge of the decision-maker
PB1: extract knowledge / objectives of the decision-maker

Objective (data and rules)
Interest = hypotheses on the data: inclusion, independence, disequilibrium, nuggets, robustness ...
Antagonism independence / disequilibrium
Many indices (~ 50!) =>
PB2: practice restricted to support / confidence => workbench for calculating indices
PB3: comparative study (properties, simulations) and experimental study (behavior on data): a platform?
PB4: combining indices, choosing the right index => decision support
PB5: new indices?
PB6: What is a good index? (ingredients of quality)

Perspective (PB1)

Search for knowledge

Anthropocentric approach

Adaptive extraction

FELIX [Lehn et al. 1999]

AR-VIS [Blanchard et al. 2003]

Ax: Quality Assessment of Knowledge

Combining Subjective and Objective Aspects of Quality

Perspective (PB 2 3 4 5)

Calculation: ARVAL? (www.polytech.univ-nantes.fr/arval)

Analysis: AR-QAT? [Popovici 2003]

Decision Support: HERBS? [Lenca et al. 2003] (www-iasc.enst-bretagne.fr/ecd-ind/HERBS)

Ax: Quality Assessment of Knowledge

Platform for experimentation and decision support

Bibliography

[Agrawal et al., 1993] R. Agrawal, T. Imielinski and A. Swami. Mining association rules between sets of items in large databases. Proc. of ACM SIGMOD'93, 1993, p. 207-216.

[Azé & Kodratoff, 2001] J. Azé and Y. Kodratoff. Evaluation de la résistance au bruit de quelques mesures d'extraction de règles d'association. Extraction des connaissances et apprentissage 1(4), 2001, p. 143-154.

[Azé & Kodratoff, 2001] J. Azé and Y. Kodratoff. Extraction de « pépites » de connaissances dans les données : une nouvelle approche et une étude de sensibilité au bruit. Rapport d’activité du groupe gafoQualité de l’AS GafoDonnées. To appear in [Briand et al. 2004].

[Bayardo & Agrawal, 1999] R.J. Bayardo and R. Agrawal. Mining the most interesting rules. Proc. of the 5th Int. Conf. on Knowledge Discovery and Data Mining, 1999, p. 145-154.

[Bernadet 2000] M. Bernadet. Basis of a fuzzy knowledge discovery system. Proc. of Principles of Data Mining and Knowledge Discovery, LNAI 1510, pages 24-33. Springer, 2000.

[Bernard & Charron 1996] J.-M. Bernard and C. Charron. L’analyse implicative bayésienne, une méthode pour l’étude des dépendances orientées. I. Données binaires. Revue Mathématique Informatique et Sciences Humaines (MISH), vol. 134, 1996, p. 5-38.

[Berti-Equille 2004] L. Berti-Équille. Etat de l'art sur la qualité des données : un premier pas vers la qualité des connaissances. Rapport d’activité du groupe gafoQualité de l’AS GafoDonnées. To appear in [Briand et al. 2004].

[Blanchard et al. 2002] J. Blanchard, F. Guillet and H. Briand. L'intensité d'implication entropique pour la recherche de règles de prédiction intéressantes dans les séquences de pannes d'ascenseurs. Extraction des Connaissances et Apprentissage (ECA), Hermès Science Publication, 1(4):77-88, 2002.

[Blanchard et al. 2003] J. Blanchard, F. Guillet, F. Rantière and H. Briand. Vers une Représentation Graphique en Réalité Virtuelle pour la Fouille Interactive de Règles d’Association. Extraction des Connaissances et Apprentissage (ECA), vol. 17, n°1-2-3, 105-118, 2003. Hermès Science Publication. ISSN 0992-499X, ISBN 2-7462-0631-5.

[Blanchard et al. 2003a] J. Blanchard, F. Guillet and H. Briand. Une visualisation orientée qualité pour la fouille anthropocentrée de règles d’association. In Cognito - Cahiers Romans de Sciences Cognitives. To appear. ISSN 1267-8015.

[Blanchard et al. 2003b] J. Blanchard, F. Guillet and H. Briand. A User-driven and Quality oriented Visualization for Mining Association Rules. In Proc. of the Third IEEE International Conference on Data Mining, ICDM’2003, Melbourne, Florida, USA, November 19-22, 2003.

[Blanchard et al., 2004] J. Blanchard, F. Guillet, R. Gras and H. Briand. Mesurer la qualité des règles et de leurs contraposées avec le taux informationnel TIC. EGC2004, RNTI, Cépaduès, 2004. To appear.

[Blanchard et al., 2004a] J. Blanchard, F. Guillet, R. Gras and H. Briand. Mesure de la qualité des règles d'association par l'intensité d'implication entropique. Rapport d’activité du groupe gafoQualité de l’AS GafoDonnées. To appear in [Briand et al. 2004].

[Breiman & al. 1984] L. Breiman, J. Friedman, R. Olshen and C. Stone. Classification and Regression Trees. Chapman & Hall, 1984.

[Briand et al. 2004] H. Briand, M. Sebag, G. Gras and F. Guillet (eds). Mesures de Qualité pour la fouille de données. Revue des Nouvelles Technologies de l’Information, RNTI, Cépaduès, 2004. To appear.

[Brin et al., 1997] S. Brin, R. Motwani and C. Silverstein. Beyond Market Baskets: Generalizing Association Rules to Correlations. In Proceedings of SIGMOD’97, pages 265-276, AZ, USA, 1997.

[Brin et al., 1997b] S. Brin, R. Motwani, J. Ullman and S. Tsur. Dynamic itemset counting and implication rules for market basket data. Proc. of the Int. Conf. on Management of Data, ACM Press, 1997, p. 255-264.

[Church & Hanks, 1990] K. W. Church and P. Hanks. Word association norms, mutual information and lexicography. Computational Linguistics, 16(1), 22-29, 1990.

[Clark & Robin 1991] Peter Clark and Robin Boswell. Rule Induction with CN2: Some Recent Improvements. In Proceedings of the European Working Session on Learning EWSL-91, 1991.

[Dong & Li, 1998] G. Dong and J. Li. Interestingness of Discovered Association Rules in terms of Neighborhood-Based Unexpectedness. In X. Wu, R. Kotagiri and K. Korb, editors, Proc. of 2nd Pacific-Asia Conf. on Knowledge Discovery and Data Mining (PAKDD `98), Melbourne, Australia, April 1998.

[Duval et al. 2004] B. Duval, A. Salleb and C. Vrain. Méthodes et mesures d’intérêt pour l’extraction de règles d’exception. Rapport d’activité du groupe gafoQualité de l’AS GafoDonnées. To appear in [Briand et al. 2004].

[Fleury 1996] L. Fleury. Découverte de connaissances pour la gestion des ressources humaines. PhD thesis, Université de Nantes, 1996.

[Frawley & Piatetsky-Shapiro 1992] W. Frawley, G. Piatetsky-Shapiro and C. Matheus. Knowledge discovery in databases: an overview. AI Magazine, 14(3), 1992, pages 57-70.

[Freitas, 1998] A. A. Freitas. On Objective Measures of Rule Surprisingness. In J. Zytkow and M. Quafafou, editors, Proceedings of the Second European Conference on the Principles of Data Mining and Knowledge Discovery (PKDD `98), pages 1-9, Nantes, France, September 1998.

[Freitas, 1999] A. Freitas. On rule interestingness measures. Knowledge-Based Systems Journal 12(5-6), 1999, p. 309-315.

[Gago & Bento, 1998] P. Gago and C. Bento. A Metric for Selection of the Most Promising Rules. PKDD’98, 1998.

[Gray & Orlowska, 1998] B. Gray and M. E. Orlowska. Ccaiia: Clustering Categorical Attributes into Interesting Association Rules. In X. Wu, R. Kotagiri and K. Korb, editors, Proc. of 2nd Pacific-Asia Conf. on Knowledge Discovery and Data Mining (PAKDD `98), pages 132 43, Melbourne, Australia, April 1998.

[Goodman & Kruskal 1959] L. A. Goodman and W. H. Kruskal. Measures of Association for Cross Classification, ii: Further discussion and references. Journal of the American Statistical Association, ??? 1959.

[Gras et al. 1995] R. Gras, H. Briand and P. Peter. Structuration sets with implication intensity. Proc. of the Int. Conf. On Ordinal and Symbolic Data Analysis - OSDA 95. Springer, 1995.

[Gras, 1996] R. Gras et al. L'implication statistique - Nouvelle méthode exploratoire de données. La pensée sauvage éditions, 1996.

[Gras et al. 2001] R. Gras, P. Kuntz and H. Briand. Les fondements de l'analyse statistique implicative et quelques prolongements pour la fouille de données. Mathématiques et Sciences Humaines : Numéro spécial Analyse statistique implicative, 1(154-155):9-29, 2001.

[Gras et al. 2001b] R. Gras, P. Kuntz, R. Couturier and F. Guillet. Une version entropique de l'intensité d'implication pour les corpus volumineux. Extraction des Connaissances et Apprentissage (ECA), Hermès Science Publication, 1(1-2):69-80, 2001.

[Gras et al. 2002] R. Gras, F. Guillet and J. Philippe. Réduction des colonnes d'un tableau de données par quasi-équivalence entre variables. Extraction des Connaissances et Apprentissage (ECA), Hermès Science Publication, 1(4):197-202, 2002.

[Gras et al. 2004] R. Gras, R. Couturier, J. Blanchard, H. Briand, P. Kuntz and P. Peter. Quelques critères pour une mesure de la qualité des règles d’association. Rapport d’activité du groupe gafoQualité de l’AS GafoDonnées. To appear in [Briand et al. 2004].

[Guillaume et al. 1998] S. Guillaume, F. Guillet and J. Philippé. Improving the discovery of association rules with intensity of implication. Proc. of 2nd European Symposium Principles of Data Mining and Knowledge Discovery, LNAI 1510, p. 318-327. Springer, 1998.

[Guillaume 2002] S. Guillaume. Discovery of Ordinal Association Rules. M.-S. Cheng, P. S. Yu, B. Liu (Eds.), Proc. of the 6th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining, PAKDD 2002, LNCS 2336, pages 322-327. Springer, 2002.

BibliographyBibliography [Guillet et al. 1999] F. Guillet, P. Kuntz, et R. Lehn. A genetic algorithm for visualizing networks of association rules. Proc. the 12th Int.

Conf. On Industrial and Engineering Appl. of AI and Expert Systems, LNCS 1611, pages 145-154. Springer 1999 [Guillet 2000] F. Guillet. Mesures de qualité de règles d’association. Cours DEA-ECD. Ecole polytechnique de l’université de Nantes.

2000. [Hilderman & Hamilton, 1998] R. J. Hilderman and H. J. Hamilton. Knowledge Discovery and Interestingness Measures: A Survey.

(KDD `98), ??? New-York 1998. [Hilderman et Hamilton, 2001] R. Hilderman et H. Hamilton. Knowledge discovery and measures of interest. Kluwer Academic

publishers, 2001. [Hussain et al. 2001] F. Hussain, H. Liu, E. Suzuki and H. Lu. Exception Rule Mining with a Relative Interestingness Measure. ??? [Jaroszewicz & Simovici, 2001] S. Jaroszewicz et D.A. Simovici. A general measure of rule interestingness. Proc. of the 7th Int. Conf.

on Knowledge Discovery and Data Mining, L.N.C.S. 2168, Springer, 2001, p. 253-265 [Klemettinen et al. 1994] M. Klemettinen, H. Mannila, P. Ronkainen, H. Toivonen and A. I. Verkamo. Finding Interesting Rules from

Large Sets of Discovered Association Rules. In N. R. Adam, B. K. Bhargava and Y. Yesha, editors, Proc. of the Third International Conf. on Information and Knowledge Management``, pages 401-407, Gaitersburg, Maryland, 1994.

[Kodratoff, 1999] Y. Kodratoff. Comparing Machine Learning and Knowledge Discovery in Databases:An Application to Knowledge Discovery in Texts. Lecture Notes on AI (LNAI)-Tutorial series. 2000.

[Kuntz et al. 2000] P.Kuntz, F.Guillet, R.Lehn and H.Briand. A User-Driven Process for Mining Association Rules. In D. Zighed, J. Komorowski and J.M. Zytkow (Eds.), Principles of Data Mining and Knowledge Discovery (PKDD2000), Lecture Notes in Computer Science, vol. 1910, pages 483-489, 2000. Springer.

[Kodratoff, 2001] Y. Kodratoff. Comparing machine learning and knowledge discovery in databases: an application to knowledge discovery in texts. Machine Learning and Its Applications, Paliouras G., Karkaletsis V., Spyropoulos C.D. (eds.), L.N.C.S. 2049, Springer, 2001, p. 1-21.

[Kuntz et al. 2001] P. Kuntz, F. Guillet, R. Lehn and H. Briand. A user-driven process for mining association rules. Proc. of Principles of Data Mining and Knowledge Discovery, LNAI 1510, pages 483-489. Springer, 2000.

[Kuntz et al. 2001b] P. Kuntz, F. Guillet, R. Lehn and H. Briand. Vers un processus d'extraction de règles d'association centré sur l'utilisateur. In Cognito, Revue francophone internationale en sciences cognitives, 1(20):13-26, 2001.

[Lallich et al. 2004] S. Lallich and O. Teytaud. Évaluation et validation de l’intérêt des règles d’association. Activity report of the gafoQualité group, AS GafoDonnées. To appear in [Briand et al. 2004].

[Lehn et al. 1999] R. Lehn, F. Guillet, P. Kuntz, H. Briand and J. Philippé. Felix: An interactive rule mining interface in a KDD process. In P. Lenca (editor), Proc. of the 10th Mini-Euro Conference, Human Centered Processes, HCP’99, pages 169-174, Brest, France, September 22-24, 1999.

[Lenca et al. 2004] P. Lenca, P. Meyer, B. Vaillant, P. Picouet and S. Lallich. Evaluation et analyse multi-critères des mesures de qualité des règles d’association. Activity report of the gafoQualité group, AS GafoDonnées. To appear in [Briand et al. 2004].

[Lerman et al. 1981] I. C. Lerman, R. Gras and H. Rostam. Elaboration et évaluation d’un indice d’implication pour les données binaires. Revue Mathématiques et Sciences Humaines, 75, p. 5-35, 1981.

[Lerman, 1981] I. C. Lerman. Classification et analyse ordinale des données. Paris, Dunod, 1981.

[Lerman, 1993] I. C. Lerman. Likelihood linkage analysis classification method. Biochimie 75, p. 379-397, 1993.

[Lerman & Azé 2004] I. C. Lerman and J. Azé. Indice probabiliste discriminant de vraisemblance du lien pour des données volumineuses. Activity report of the gafoQualité group, AS GafoDonnées. To appear in [Briand et al. 2004].

[Liu et al., 1999] B. Liu, W. Hsu, L. Mun and H. Lee. Finding interesting patterns using user expectations. IEEE Transactions on Knowledge and Data Engineering 11, 1999, p. 817-832.

[Loevinger, 1947] J. Loevinger. A systematic approach to the construction and evaluation of tests of ability. Psychological Monographs, 61(4), 1947.

[Mannila & Pavlov, 1999] H. Mannila and D. Pavlov. Prediction with Local Patterns using Cross-Entropy. Technical Report, Information and Computer Science, University of California, Irvine, 1999.

[Matheus & Piatetsky-Shapiro, 1996] C. J. Matheus and G. Piatetsky-Shapiro. Selecting and Reporting what is Interesting: The KEFIR Application to Healthcare data. In U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth and R. Uthurusamy (eds), Advances in Knowledge Discovery and Data Mining, p. 401-419, 1996. AAAI Press/MIT Press.

[Meo 2000] R. Meo. Theory of dependence values. ACM Transactions on Database Systems 25(3), p. 380-406, 2000.

[Padmanabhan et Tuzhilin, 1998] B. Padmanabhan and A. Tuzhilin. A belief-driven method for discovering unexpected patterns. Proc. of the 4th Int. Conf. on Knowledge Discovery and Data Mining, 1998, p. 94-100.

[Pearson, 1896] K. Pearson. Mathematical contributions to the theory of evolution. III. Regression, heredity and panmixia. Philosophical Transactions of the Royal Society, vol. A, 1896.

[Piatestsky-Shapiro, 1991] G. Piatetsky-Shapiro. Discovery, analysis, and presentation of strong rules. Knowledge Discovery in Databases, Piatetsky-Shapiro G., Frawley W.J. (eds.), AAAI/MIT Press, 1991, p. 229-248.

[Popovici, 2003] E. Popovici. Un atelier pour l'évaluation des indices de qualité. D.E.A. E.C.D. thesis, IRIN/Université Lyon2/RACAI Bucarest, June 2003.

[Ritschard & al., 1998] G. Ritschard, D. A. Zighed and N. Nicoloyannis. Maximiser l'association par agrégation dans un tableau croisé. In J. Zytkow and M. Quafafou, editors, Proc. of the Second European Conf. on the Principles of Data Mining and Knowledge Discovery (PKDD '98), Nantes, France, September 1998.

[Sebag et Schoenauer, 1988] M. Sebag and M. Schoenauer. Generation of rules with certainty and confidence factors from incomplete and incoherent learning bases. Proc. of the European Knowledge Acquisition Workshop (EKAW'88), Boose J., Gaines B., Linster M. (eds.), Gesellschaft für Mathematik und Datenverarbeitung mbH, 1988, p. 28.1-28.20.

[Shannon & Weaver, 1949] C. E. Shannon and W. Weaver. The mathematical theory of communication. University of Illinois Press, 1949.

[Silbershatz & Tuzhilin, 1995] A. Silberschatz and A. Tuzhilin. On Subjective Measures of Interestingness in Knowledge Discovery. (KD. & DM. '95) ???, 1995.

[Smyth & Goodman, 1991] P. Smyth and R. M. Goodman. Rule induction using information theory. Knowledge Discovery in Databases, Piatetsky-Shapiro G., Frawley W.J. (eds.), AAAI/MIT Press, 1991, p. 159-176.

[Tan & Kumar 2000] P. Tan and V. Kumar. Interestingness Measures for Association Patterns: A Perspective. Workshop tutorial (KDD 2000).

[Tan et al., 2002] P. Tan, V. Kumar and J. Srivastava. Selecting the right interestingness measure for association patterns. Proc. of the 8th Int. Conf. on Knowledge Discovery and Data Mining, 2002, p. 32-41.