Interestingness Measures: Quality in KDD
Levels of Quality
  Quality of discovered knowledge: f(D, M, U)
  Data Quality (D)
    Noise, accuracy, missing values, bad values, … [Berti-Equille 2004]
  Model Quality (M)
    Accuracy, generalization, relevance, …
  User-based Quality (U)
    Relevance for decision making
User-based Quality
  2 categories:
    Objective (D, M)
      Computed from data only
    Subjective (U)
      Hypothesis: goal, domain knowledge
      Hard to formalize (novelty)
Examples of Quality Criteria
  7 criteria of interest (interestingness) [Hussein 2000]:
  Objective:
    Generality (ex: support)
    Validity (ex: confidence)
    Reliability (ex: high generality and validity)
  Subjective:
    Common sense: reliable and already known
    Actionability: utility for decision
    Novelty: previously unknown
    Surprise (unexpectedness): contradiction?
Objective Measures
Association Rules
  Association rules [Agrawal et al. 1993]:
    Market-basket analysis
    Unsupervised learning
    Algorithms + 2 measures (support and confidence)
  Problems:
    Enormous number of rules (raw rules)
    Support and confidence carry little semantics
    Need to help the user select the best rules
AR Quality
Association Rules
  Solutions:
    Redundancy reduction
    Structuring (classes, close rules)
    Improve quality measures
    Interactive decision aid (rule mining)
Association Rules
  Input: data
    p Boolean attributes (V0, V1, …, Vp) (columns)
    n transactions (rows)
  Output: association rules
    Implicative tendencies: X → Y
    X and Y are itemsets, ex: V0 ∧ V4 ∧ V8 → V1
    Negative examples
  2 measures:
    Support: supp(X → Y) = freq(X ∪ Y)
    Confidence: conf(X → Y) = P(Y|X) = freq(X ∪ Y) / freq(X)
    Algorithm properties (monotony)
    Ex: diapers → beer (supp = 20%, conf = 90%)
    (NB: max number of rules: 3^p)
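As a minimal sketch of the two measures above, the following Python computes support and confidence over a toy transaction list. The transactions and the helper names (`freq`, `support`, `confidence`) are illustrative assumptions, not part of the original slides.

```python
# Toy market-basket data (hypothetical): each transaction is a set of items.
transactions = [
    {"diapers", "beer", "chips"},
    {"diapers", "beer"},
    {"beer", "bread"},
    {"diapers", "milk"},
    {"bread", "milk"},
]

def freq(itemset, transactions):
    """Relative frequency of the transactions containing every item of itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def support(x, y, transactions):
    """supp(X -> Y) = freq(X u Y)."""
    return freq(set(x) | set(y), transactions)

def confidence(x, y, transactions):
    """conf(X -> Y) = P(Y|X) = freq(X u Y) / freq(X)."""
    return support(x, y, transactions) / freq(x, transactions)

supp = support({"diapers"}, {"beer"}, transactions)
conf = confidence({"diapers"}, {"beer"}, transactions)
print(f"supp={supp:.2f} conf={conf:.2f}")
```

Here two of the five transactions contain both items and three contain diapers, so the rule diapers → beer has support 2/5 and confidence 2/3.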
Limits of Support
  Support: supp(X → Y) = freq(X ∪ Y)
    Generality of the rule
    Minimum support threshold (ex: 10%)
    Reduces the complexity
    Loses nuggets (support pruning)
  Nugget:
    Specific rule (low support)
    Valid rule (high confidence)
    High potential of novelty/surprise
Limits of Confidence
  Confidence: conf(X → Y) = P(Y|X) = freq(X ∪ Y) / freq(X)
    Validity/logical aspect of the rule (inclusion)
    Minimum confidence threshold (ex: 90%)
    Reduces the number of extracted rules
    Interestingness ≠ validity
    No detection of independence
  Independence:
    X and Y are independent: P(Y|X) = P(Y)
    If P(Y) is high => nonsense rule with high confidence
    Ex: diapers → beer (supp = 20%, conf = 90%) if supp(beer) = 90%
  [Guillaume et al. 1998], [Lallich et al. 2004]
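The independence problem can be made concrete with a small check. This is only a sketch: it compares the confidence with the frequency of the conclusion through their ratio (the quantity usually called lift); the function name and the second, contrasting figure of 30% are assumptions added for illustration.

```python
def lift(conf_xy, supp_y):
    """Ratio P(Y|X) / P(Y); it equals 1 when X and Y are independent."""
    return conf_xy / supp_y

# Slide example: diapers -> beer with conf = 90% while supp(beer) = 90%.
ratio = lift(0.90, 0.90)

# Hypothetical contrast: same confidence, but beer appears in only 30% of baskets.
informative = lift(0.90, 0.30)

print(ratio, informative)
```

With supp(beer) = 90% the ratio is exactly 1: knowing that a basket contains diapers says nothing about beer, even though the rule passes a 90% confidence threshold.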
Limits of the Pair Support-Confidence
  In practice:
    High support threshold (10%)
    High confidence threshold (90%)
    Valid and general rules
    Common sense but no novelty
  Efficient measures but insufficient to capture quality
Criteria
  User-oriented measures (U)
  Quality: interestingness:
    Unexpectedness [Silberschatz 1996]
      Unknown or contradictory rule
    Actionability (usefulness) [Piatetsky-Shapiro 1994]
      Usefulness for decision making, gain
    Anticipation [Roddick 2001]
      Prediction on the temporal dimension
AR Quality : Subjective Measures
Criteria
  Unexpectedness and actionability:
    Unexpected + useful = high interestingness
    Expected + non-useful = ?
    Expected + useful = reinforcement
    Unexpected + non-useful = ?
Principle
  Algorithm principle:
    1. Extraction of the decision maker's knowledge
    2. Formalization of the knowledge (K) (expected and actionable)
    3. KDD (K’)
    4. Compare K and K’
    5. Select (with subjective measures) the rules Δ(K, K’) of K’ which:
         differ the most from K (unexpectedness)
         or are the most similar to K (actionability)
Rule Templates [Klemettinen et al. 1994]
  User knowledge (K): syntactic constraints
    Patterns/forms of rules: A1, A2, …, Ak → Ak+1
    Ai: constraints on attribute Vi (interval of values)
    K: K1 + K2
      K1: interesting patterns (select)
      K2: uninteresting patterns (reject)
  Goal: select the interesting rules inside K’
  Boolean criterion:
    Rules X → Y of K’ satisfying the K1 patterns but not the K2 ones
    + constraints (thresholds) on support, confidence, rule size (|X ∪ Y|)
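The Boolean criterion can be sketched as a filter over the mined rules. This is a simplified illustration, not the template language of Klemettinen et al.: here a template is assumed to be a set of allowed premise items plus one required conclusion item, and the names `Rule`, `Template`, `matches` and `select` are hypothetical.

```python
from typing import FrozenSet, List, NamedTuple, Optional

class Rule(NamedTuple):
    antecedent: FrozenSet[str]
    consequent: str
    support: float
    confidence: float

class Template(NamedTuple):
    # Simplified template (assumption): allowed premise items, required conclusion.
    allowed_antecedent: FrozenSet[str]
    consequent: str

def matches(rule: Rule, template: Template) -> bool:
    """A rule matches when its conclusion is the template's and its premise fits."""
    return (rule.consequent == template.consequent
            and rule.antecedent <= template.allowed_antecedent)

def select(rules: List[Rule], k1: List[Template], k2: List[Template],
           min_supp: float = 0.0, min_conf: float = 0.0,
           max_size: Optional[int] = None) -> List[Rule]:
    """Keep rules matching some K1 template, no K2 template, and the thresholds."""
    kept = []
    for r in rules:
        size = len(r.antecedent) + 1  # |X u Y| for a single-item conclusion
        if r.support < min_supp or r.confidence < min_conf:
            continue
        if max_size is not None and size > max_size:
            continue
        if any(matches(r, t) for t in k1) and not any(matches(r, t) for t in k2):
            kept.append(r)
    return kept

rules = [
    Rule(frozenset({"diapers"}), "beer", 0.20, 0.90),
    Rule(frozenset({"bread"}), "beer", 0.15, 0.85),
    Rule(frozenset({"diapers"}), "milk", 0.30, 0.95),
]
k1 = [Template(frozenset({"diapers", "chips", "bread"}), "beer")]  # select
k2 = [Template(frozenset({"bread"}), "beer")]                      # reject
kept = select(rules, k1, k2, min_supp=0.10, min_conf=0.80)
print([(sorted(r.antecedent), r.consequent) for r in kept])
```

Only diapers → beer survives: bread → beer matches a K2 template, and diapers → milk matches no K1 template.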
Interestingness [Silberschatz & Tuzhilin 1995]
  User knowledge (K): beliefs
    A set K of beliefs (Bayes rules)
    A belief (rule) α ∈ K weighted by p(α)
    K: K1 + K2
      K1: hard beliefs (p(α) constant)
      K2: soft beliefs (p(α) can vary)
  Goal: revise the soft beliefs K2 according to the part of K’ which satisfies K1
    + Interest criterion of R = X → Y of K’:
    Change of the weight of α:
Logical Contradiction [Padmanabhan and Tuzhilin 1998]
  User knowledge (K):
    A set K of rules
  Goal: select unexpected rules in K’
  Unexpectedness criterion:
    Rule A → B of K’ and rule X → Y of K
    A → B is unexpected if:
      B and Y are contradictory (p(B and Y) = 0)
      (A and X) is frequent (p(A and X) high)
      (A and X) → B is true (hence (A and X) → not Y also holds: exception!)
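The three conditions can be checked directly on the data. The sketch below is an illustration under stated assumptions: contradiction is tested empirically (B and Y never co-occur), "frequent" and "true" are turned into the hypothetical thresholds `min_supp` and `min_conf`, and the shopping data is invented.

```python
def freq(itemset, transactions):
    """Relative frequency of the transactions containing every item of itemset."""
    return sum(set(itemset) <= t for t in transactions) / len(transactions)

def is_unexpected(rule, belief, transactions, min_supp=0.1, min_conf=0.8):
    a, b = rule      # discovered rule A -> B (from K')
    x, y = belief    # user belief X -> Y (from K)
    # 1. B and Y are contradictory: they never occur together.
    if freq(b | y, transactions) > 0:
        return False
    # 2. (A and X) is frequent.
    f_ax = freq(a | x, transactions)
    if f_ax < min_supp:
        return False
    # 3. (A and X) -> B holds with high confidence.
    return freq(a | x | b, transactions) / f_ax >= min_conf

# Hypothetical data: one set of attribute values per customer visit.
visits = [
    {"professional", "december", "weekday"},
    {"professional", "december", "weekday"},
    {"professional", "march", "weekend"},
    {"professional", "june", "weekend"},
    {"student", "december", "weekend"},
    {"professional", "may", "weekend"},
]
belief = (frozenset({"professional"}), frozenset({"weekend"}))
exception = (frozenset({"december"}), frozenset({"weekday"}))
ordinary = (frozenset({"march"}), frozenset({"weekend"}))

print(is_unexpected(exception, belief, visits))
print(is_unexpected(ordinary, belief, visits))
```

The rule december → weekday contradicts the belief professional → weekend and is flagged, while march → weekend agrees with the belief and is not.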
Attribute Costs [Freitas 1999]
  User knowledge (K): costs
    Cost of each attribute/item Ai: Cost(Ai)
  Goal: select the low-cost rules in K’
  Cost of a rule:
    Rule A1, A2, …, Ak → B
    Low mean cost:
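Since the slide's formula did not survive, the following is only a plausible reading of "low mean cost": the arithmetic mean of Cost(Ai) over the k attributes of the premise, with cheaper rules ranked first. The cost table and the function name are hypothetical.

```python
# Hypothetical attribute costs, e.g. the price of obtaining each measurement.
cost = {"temperature": 1.0, "blood_test": 20.0, "mri": 500.0}

def mean_attribute_cost(antecedent, cost):
    """Mean of Cost(Ai) over the attributes A1..Ak of the rule premise."""
    return sum(cost[a] for a in antecedent) / len(antecedent)

# Two candidate rules with the same conclusion B (premises only).
premises = [("mri", "blood_test"), ("temperature", "blood_test")]

# Rank rules from cheapest to most expensive premise.
ranked = sorted(premises, key=lambda p: mean_attribute_cost(p, cost))
print(ranked[0], mean_attribute_cost(ranked[0], cost))
```

Under this reading, a rule relying on inexpensive attributes is preferred when two rules are otherwise comparable.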
Other Subjective Measures
  Projected Savings (KEFIR system's interestingness) [Matheus & Piatetsky-Shapiro 1994]
  Fuzzy Matching Interestingness Measure [Liu et al. 1996]
  General Impression [Liu et al. 1997]
  Logical Contradiction [Padmanabhan & Tuzhilin 1997]
  Misclassification Costs [Freitas 1999]
  Vague Feelings (Fuzzy General Impressions) [Liu et al. 2000]
  Anticipation [Roddick & Rice 2001]
  Interestingness [Shekar & Natarajan 2001]
Classification
  #  | Interestingness Measure | Year | Application | Foundation | Scope | Subjective Aspects | User's Knowledge Representation
  1  | Matheus and Piatetsky-Shapiro's Projected Savings | 1994 | Summaries | Utilitarian | Single Rule | Unexpectedness | Pattern Deviation
  2  | Klemettinen et al. Rule Templates | 1994 | Association Rules | Syntactic | Single Rule | Unexpectedness & Actionability | Rule Templates
  3  | Silberschatz and Tuzhilin's Interestingness | 1995 | Format Independent | Probabilistic | Rule Set | Unexpectedness | Hard & Soft Beliefs
  4  | Liu et al. Fuzzy Matching Interestingness Measure | 1996 | Classification Rules | Syntactic Distance | Single Rule | Unexpectedness | Fuzzy Rules
  5  | Liu et al. General Impressions | 1997 | Classification Rules | Syntactic | Single Rule | Unexpectedness | GI, RPK
  6  | Padmanabhan and Tuzhilin Logical Contradiction | 1997 | Association Rules | Logical, Statistic | Single Rule | Unexpectedness | Beliefs X → Y
  7  | Freitas' Attributes Costs | 1999 | Association Rules | Utilitarian | Single Rule | Actionability | Costs Values
  8  | Freitas' Misclassification Costs | 1999 | Association Rules | Utilitarian | Single Rule | Actionability | Costs Values
  9  | Liu et al. Vague Feelings (Fuzzy General Impressions) | 2000 | Generalized Association Rules | Syntactic | Single Rule | Unexpectedness | GI, RPK, PK
  10 | Roddick and Rice's Anticipation | 2001 | Format Independent | Probabilistic | Single Rule | Temporal Dimension | Probability Graph
  11 | Shekar and Natarajan's Interestingness | 2002 | Association Rules | Distance | Single Rule | Unexpectedness | Fuzzy-graph based taxonomy
Conclusion
  Algorithm + measures to compare K and K’
    Focus on interesting rules
    Knowledge is domain specific
  Acquisition of K?
    Hard task to represent the knowledge and goals of the decision maker
  Many improvements to make
Principle
  Statistics on data D (transactions) for each rule R = X → Y
  Interestingness measure = i(R, D, H)
  Degree of satisfaction of the hypothesis H in D, independently of U
AR Quality : Objective Measures
Contingency
  Rule X → Y with X and Y disjoint itemsets
  Inclusion of E(X) in E(Y)
  5 observable parameters in E:
    n = |E|                   number of transactions
    n_x = |E(X)|              cardinality of the premise (left-hand side)
    n_y = |E(Y)|              cardinality of the conclusion (right-hand side)
    n_xy = |E(X and Y)|       number of positive examples
    n_x¬y = |E(X and ¬Y)|     number of negative examples
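A short sketch of how the five parameters can be counted from transactions. The toy data and the function name `contingency` are assumptions for illustration; E(X) is taken to be the set of transactions containing every item of X.

```python
def contingency(x, y, transactions):
    """Return (n, n_x, n_y, n_xy, n_x_not_y) for the rule X -> Y."""
    x, y = set(x), set(y)
    n = len(transactions)
    n_x = sum(x <= t for t in transactions)         # |E(X)|
    n_y = sum(y <= t for t in transactions)         # |E(Y)|
    n_xy = sum((x | y) <= t for t in transactions)  # positive examples
    n_x_not_y = n_x - n_xy                          # negative examples
    return n, n_x, n_y, n_xy, n_x_not_y

# Hypothetical transactions.
transactions = [
    {"diapers", "beer", "chips"},
    {"diapers", "beer"},
    {"beer", "bread"},
    {"diapers", "milk"},
    {"bread", "milk"},
]
counts = contingency({"diapers"}, {"beer"}, transactions)
print(counts)
```

The negative examples follow from n_x¬y = n_x − n_xy, so four of the five parameters are enough to recover the fifth.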
Independence
  p(X) estimated by n_x / n (frequency)
  Hypothesis of independence of X and Y: p(X and Y) = p(X) × p(Y), i.e. n_xy = n_x n_y / n
  Inclusion ≠ dependence
Equiprobability (Equilibrium)
  Rule X → Y
  Same number of negative examples (e-) and positive examples (e+): n_xy = n_x¬y
    hence when: n_xy = n_x / 2
  2 situations:
    n_xy > n_x / 2 (or P(Y|X) > 0.5): e+ higher: rule X → Y
    n_xy < n_x / 2 (or P(Y|X) < 0.5): e- higher: rule X → ¬Y
  Contra-positive: ¬Y → ¬X
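The two reference situations, independence and equilibrium, each fix a value of n_xy against which a rule can be compared. The sketch below uses hypothetical counts and function names; it only restates n_xy = n_x n_y / n and n_xy = n_x / 2 in code.

```python
def n_xy_at_independence(n, n_x, n_y):
    """Expected number of positive examples if X and Y were independent."""
    return n_x * n_y / n

def n_xy_at_equilibrium(n_x):
    """Number of positive examples when e+ and e- are equally numerous."""
    return n_x / 2

def situate(n, n_x, n_y, n_xy):
    """Deviation of the observed n_xy from both reference values."""
    return {
        "vs_independence": n_xy - n_xy_at_independence(n, n_x, n_y),
        "vs_equilibrium": n_xy - n_xy_at_equilibrium(n_x),
    }

# Hypothetical counts: conf = 180/200 = 90% while p(Y) = 900/1000 = 90%.
result = situate(n=1000, n_x=200, n_y=900, n_xy=180)
print(result)
```

With these counts the rule is far above equilibrium (e+ clearly dominates) yet sits exactly at independence, which is the situation the confidence threshold alone cannot detect.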
Interestingness Measure - Definition
  i(X → Y) = f(n, n_x, n_y, n_xy)
  General principles:
    Semantics and readability for the user
    Value increasing with the quality
    Sensitivity to equiprobability (inclusion)
    Statistical likelihood (confidence in the measure itself)
    Noise resistance, time stability
    Surprisingness, nuggets?
Properties in the Literature
  Properties of i(X → Y) = f(n, n_x, n_y, n_xy)
  [Piatetsky-Shapiro 1991] (strong rules):
    (P1) = 0 if X and Y are independent
    (P2) increases with the number of examples n_xy
    (P3) decreases with the premise n_x (or the conclusion n_y) (?)
  [Major & Mangano 1993]:
    (P4) increases with n_xy when the confidence (n_xy / n_x) is constant
  [Freitas 1999]:
    (P5) asymmetry (i(X → Y) ≠ i(Y → X))
    Small disjuncts (nuggets)
  [Tan et al. 2002], [Hilderman & Hamilton 2001] and [Gras et al. 2004]
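Properties P1 to P3 can be checked numerically on a concrete measure. The sketch below assumes the commonly cited form of Piatetsky-Shapiro's rule interest, n_xy − n_x n_y / n; the formula and the counts are assumptions, since the slides do not spell them out.

```python
def rule_interest(n, n_x, n_y, n_xy):
    """Commonly cited form of rule interest: n_xy - n_x * n_y / n (assumed)."""
    return n_xy - n_x * n_y / n

n, n_x, n_y = 1000, 200, 900

# (P1) zero when X and Y are independent (n_xy = n_x * n_y / n = 180).
p1 = rule_interest(n, n_x, n_y, 180)

# (P2) increases with n_xy, other parameters fixed.
p2 = rule_interest(n, n_x, n_y, 190) > rule_interest(n, n_x, n_y, 180)

# (P3) decreases with n_x, other parameters fixed.
p3 = rule_interest(n, 300, n_y, 180) < rule_interest(n, 200, n_y, 180)

print(p1, p2, p3)
```

A few numeric probes like these do not prove the properties, but they are a quick sanity check when implementing a new measure.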
Selected Properties
  Inclusion and equiprobability
    0, interval of security
  Independence
    0, interval of security
  Bounded maximum value
    Comparability, global threshold, inclusion
  Non-linearity
    Noise resistance, interval of security for independence and equiprobability
  Sensitivity
    N (nuggets), dilation (likelihood)
  Frequency p(X) vs. cardinal n_x
  Reinforcement by similar rules (contra-positive, negative rule, …)
  [Smyth & Goodman 1991], [Kodratoff 2001], [Gras et al. 2001], [Gras et al. 2004]
What Could Be a Good Measure?
  Negative examples n_x¬y
  I_max + independence + equiprobability
  Constraints upon other dimensions
Consequences on Other Dimensions
  Conclusion n_y
    Decrease with n_y (n_y → n: Ind ↓)
  Size of data n
    Increase with dilation (Ind ↑)
    Increase with n (Ind ↑)
Classification
  Classification according to three criteria:
    Object of the index
      Concept measured by the index
    Range of the index
      Entity concerned with the measurement
    Nature of the index
      Statistical or descriptive character of the index
Classification
  The Object:
    Certain indices take a fixed value at independence: P(a ∩ b) = P(a) × P(b)
      They evaluate a deviation from independence
    Certain indices take a fixed value at equilibrium: P(a ∩ b) = P(a) / 2
      They evaluate a deviation from equilibrium
    Others do not take a fixed value at independence or at equilibrium
      Statistical indices
Classification
  The Range:
    Certain indices evaluate more than a single rule:
      They relate simultaneously to a rule and its contra-positive: I(a → b) = I(¬b → ¬a)
        Indices of quasi-implication
      They relate simultaneously to a rule and its converse: I(a → b) = I(b → a)
        Indices of quasi-conjunction
      They relate simultaneously to all three: I(a → b) = I(b → a) = I(¬b → ¬a)
        Indices of quasi-equivalence
Classification
  The Nature:
    If variation: statistical index
    If not: descriptive index
List of Quality Measures
  Monodimensional e+, e-
    Support [Agrawal et al. 1996]
    Ralambondrainy [Ralambondrainy, 1991]
  Bidimensional - Inclusion
    Descriptive-Confirm [Kodratoff, 1999]
    Sebag and Schoenauer [Sebag & Schoenauer, 1991]
    Example and counterexample rate (*)
  Bidimensional – Inclusion – Conditional Probability
    Confidence [Agrawal et al. 1996]
    Wang index [Wang et al., 1988]
    Laplace (*)
  Bidimensional – Analogous Rules
    Descriptive Confirmed-Confidence [Kodratoff, 1999] (*)
List of Quality Measures
  Tridimensional – Analogous Rules
    Causal Support [Kodratoff, 1999]
    Causal Confidence [Kodratoff, 1999] (*)
    Causal Confirmed-Confidence [Kodratoff, 1999]
    Least contradiction [Aze & Kodratoff 2004] (*)
  Tridimensional – Linear – Independent
    Pavillon index [Pavillon, 1991]
    Rule Interest [Piatetsky-Shapiro, 1991] (*)
    Pearl index [Pearl, 1988], [Acid et al., 1991], [Gammerman & Luo, 1991]
    Correlation [Pearson 1996] (*)
    Loevinger index [Loevinger, 1947] (*)
    Certainty factor [Tan & Kumar 2000]
    Rate of connection [Bernard & Charron 1996]
    Interest factor [Brin et al., 1997]
    Lift (*)
    Cosine [Tan & Kumar 2000] (*)
    Kappa [Tan & Kumar 2000]
List of Quality Measures
  Tridimensional – Nonlinear – Independent
    Chi-squared distance
    Logarithmic lift [Church & Hanks, 1990] (*)
    Predictive association [Tan & Kumar 2000] (Goodman & Kruskal)
    Conviction [Brin et al., 1997b]
    Odds ratio [Tan & Kumar 2000]
    Yule's Q [Tan & Kumar 2000]
    Yule's Y [Tan & Kumar 2000]
    Jaccard [Tan & Kumar 2000]
    Klosgen [Tan & Kumar 2000]
    Interestingness [Gray & Orlowska, 1998]
    Mutual information ratio (Uncertainty) [Tan et al., 2002]
    J-measure [Smyth & Goodman 1991], [Goodman & Kruskal 1959] (*)
    Gini [Tan et al., 2002]
    General measure of rule interestingness [Jaroszewicz & Simovici, 2001] (*)
List of Quality Measures
  Quadridimensional – Linear – Independent
    Lerman index of similarity [Lerman, 1981]
    Implication index [Gras, 1996]
  Quadridimensional – Likeliness (conditional probability?) of dependence
    Probability of error of Chi2 (*)
    Implication intensity [Gras, 1996] (*)
  Quadridimensional – Inclusion – Dependent – Analogous rules
    Entropic implication intensity [Gras, 1996] (*)
    TIC [Blanchard et al., 2004] (*)
  Others
    Surprisingness (*) [Freitas, 1998]
    + rules of exception [Duval et al. 2004]
    + rule distance, similarity [Dong & Li 1998]
Objective Measures
  Simulations and Properties
Monodimensional Measures e+ e-
Support [Agrawal et al. 1996]
  Definition: supp(X → Y) = n_xy / n
  Semantics: degree of general information
  Sensitivity:
    1 parameter
    Measuring frequency
    Linear
    Insensitive to independence
    Equilibrium?
    Symmetrical
Quality of Rules : Objective Measures
Monodimensional Measures e+ e-
Ralambondrainy Measure [Ralambondrainy 1991]
  Definition:
  Semantics: scarcity of the e-
  Sensitivity:
    1 parameter
    Measuring frequency
    Linear
    Insensitive to independence
    Equilibrium?
    Increasing
Bidimensional Measures - Inclusion
Descriptive-Confirm [Kodratoff 1999]
  Definition:
  Semantics: variation e+ e- (improved support)
  Sensitivity:
    2 parameters
    Measuring frequency
    Linear
    Insensitive to independence
    0 at equilibrium
Bidimensional Measures - Inclusion
Sebag and Schoenauer [Sebag & Schoenauer, 1991]
  Definition:
  Semantics: ratio e+/e-
  Sensitivity:
    2 parameters
    Measuring frequency
    Non-linear (very selective)
    Insensitive to independence
    1 at equilibrium
    Max value not limited
Bidimensional Measures - Inclusion
Example and Counterexample Rate (*)
  Definition:
  Semantics: ratio e+/e-
  Sensitivity:
    2 parameters
    Measuring frequency
    Non-linear (tolerance)
    Insensitive to independence
    0 at equilibrium
    Max value limited
Bidimensional Measures - Inclusion
Confidence [Agrawal et al. 1996]
  Definition: conf(X → Y) = n_xy / n_x
  Semantics: inclusion, validity
  Sensitivity:
    2 parameters
    Measuring frequency
    Linear
    Insensitive to independence
    0.5 at equilibrium
    Max value limited
  Variations:
    [Ganascia, 1991]: Charade
    Or Descriptive Confirmed-Confidence [Kodratoff, 1999]
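The slide formulas for the mono- and bidimensional inclusion measures did not survive in the transcript. The sketch below uses formulations that are common in the interestingness literature and are assumed here, not taken from the slides; the function name and the counts are hypothetical. Their values at equilibrium agree with the sensitivity notes of the corresponding slides (0, 1, 0 and 0.5).

```python
def inclusion_measures(n, n_x, n_xy):
    """Inclusion-based measures computed from counts (assumed formulations)."""
    n_x_not_y = n_x - n_xy  # negative examples e-
    return {
        "support": n_xy / n,
        "ralambondrainy": n_x_not_y / n,
        "descriptive_confirm": (n_xy - n_x_not_y) / n,
        # Unbounded when there is no negative example.
        "sebag_schoenauer": n_xy / n_x_not_y if n_x_not_y else float("inf"),
        # Bounded above by 1, reached when there is no negative example.
        "example_counterexample_rate": 1 - n_x_not_y / n_xy if n_xy else float("-inf"),
        "confidence": n_xy / n_x,
    }

# Equilibrium: as many positive as negative examples (n_xy = n_x / 2).
at_equilibrium = inclusion_measures(n=1000, n_x=100, n_xy=50)

# Near inclusion: a single negative example.
near_inclusion = inclusion_measures(n=1000, n_x=100, n_xy=99)

for name, value in at_equilibrium.items():
    print(f"{name:30s} {value:6.3f} {near_inclusion[name]:8.3f}")
```

None of these measures uses n_y, which is why all of them are insensitive to independence.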
Bidimensional Measures - Inclusion
Wang [Wang et al. 1988]
  Definition:
  Semantics: improved support (threshold of confidence integrated)
  Sensitivity:
    2 parameters
    Measuring frequency
    Linear
    Insensitive to independence
    Equilibrium?
Bidimensional Measures - Inclusion
Laplace [Clark & Robin 1991], [Tan & Kumar 2000]
  Definition:
  Semantics: estimates confidence (decreases with lowering support)
  Sensitivity:
    2 parameters
    Does not measure frequency when numbers are small
    Linear
    Insensitive to independence
    Max value limited
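The note "decreases with lowering support" can be illustrated with the usual two-class Laplace estimate, (n_xy + 1) / (n_x + 2). This form is an assumption, since the slide's own definition is missing; the counts are hypothetical.

```python
def laplace(n_x, n_xy):
    """Two-class Laplace estimate of the confidence (assumed form)."""
    return (n_xy + 1) / (n_x + 2)

# Two rules with the same confidence (90%) but very different coverage.
small = laplace(n_x=10, n_xy=9)       # few transactions support the rule
large = laplace(n_x=1000, n_xy=900)   # many transactions support the rule

print(f"{small:.4f} {large:.4f}")
```

The estimate stays close to the confidence for well-supported rules and is pulled toward 0.5 when the counts are small.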
Bidimensional Measures–Similar Rules
Descriptive Confirmed-Confidence [Kodratoff 1999]
  Definition:
  Semantics: confidence confirmed by its negative (X → ¬Y)
  Sensitivity:
    2 parameters
    Measuring frequency
    Linear
    Insensitive to independence
    0 at equilibrium
    Max value limited
    Reinforcement by the negative rule
Tridimensional Measures–Similar Rules
Causal Support [Kodratoff 1999]
  Definition:
  Semantics: support improved by the use of the contra-positive
  Sensitivity:
    3 parameters
    Measuring frequency
    Linear
    Insensitive to independence
    Equilibrium?
    Reinforcement by the contra-positive rule
Tridimensional Measures–Similar Rules
Causal Confidence [Kodratoff 1999]
  Definition:
  Semantics: confidence reinforced by the contra-positive
  Sensitivity:
    3 parameters
    Measuring frequency
    Linear
    Insensitive to independence
    Equilibrium?
    Max value limited
    Reinforcement by the contra-positive rule
  Evolution: Causal Confirmed-Confidence: contra-positive + negative
Tridimensional Measures–Similar Rules
Least Contradiction [Aze & Kodratoff 2004]
  Definition:
  Semantics: least contradiction
  Sensitivity:
    3 parameters
    Measuring frequency
    Linear
    0 at equilibrium
    Supports inclusive measurement
    Reinforcement by the negative rule
    Coupled with an algorithm
Tridimensional Measures-Independence
Centered Confidence (Pavillon Index) [Pavillon 1991]
  Definition:
  Semantics: variation with independence, correction of the size of the conclusion
  Sensitivity:
    3 parameters
    Measuring frequency
    Linear
    0 when independent
    Equilibrium?
  Called Added Value in [Tan et al. 2002]
Tridimensional Measures-Independence
Rule Interest [Piatetsky-Shapiro 1991]
  Definition:
  Semantics: gap to independence (strong rules)
  Sensitivity:
    3 parameters
    Measuring frequency
    Linear
    0 when independent
    Equilibrium?
  Alternative symmetric measure:
    Pearl [Pearl, 1988], [Acid et al., 1991], [Gammerman & Luo, 1991]
Tridimensional Measures-Independence
Coefficient of Correlation [Pearson 1996]
  Definition:
  Semantics: correlation
  Sensitivity:
    3 parameters
    Measuring frequency
    Linear
    0 when independent
    Equilibrium?
Tridimensional Measures-Independence
Loevinger (*) [Loevinger 1947]
Certainty Factor [Tan & Kumar 2000]
  Definition:
  Semantics: implicative dependence
  Sensitivity:
    3 parameters
    Measuring frequency
    Linear
    0 when independent
    Maximum value bounded (inclusion)
    Equilibrium?
  Equivalent measure: Certainty factor [Tan & Kumar 2000]:
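The behaviour noted above ("0 when independent", maximum reached at inclusion) can be checked on two linear measures of deviation from independence. The formulas are the commonly cited ones and are assumptions here: centered confidence as P(Y|X) − P(Y), and Loevinger as (P(Y|X) − P(Y)) / (1 − P(Y)). The counts are hypothetical.

```python
def centered_confidence(n, n_x, n_y, n_xy):
    """P(Y|X) - P(Y): confidence corrected by the size of the conclusion."""
    return n_xy / n_x - n_y / n

def loevinger(n, n_x, n_y, n_xy):
    """(P(Y|X) - P(Y)) / (1 - P(Y)): centered confidence rescaled to max 1."""
    p_y = n_y / n
    return (n_xy / n_x - p_y) / (1 - p_y)

n, n_x, n_y = 1000, 200, 900

# Independence: n_xy = n_x * n_y / n = 180.
independent = (centered_confidence(n, n_x, n_y, 180), loevinger(n, n_x, n_y, 180))

# Inclusion: every transaction of E(X) is in E(Y), so n_xy = n_x.
included = (centered_confidence(n, n_x, n_y, 200), loevinger(n, n_x, n_y, 200))

print(independent, included)
```

At inclusion the centered confidence only reaches 1 − P(Y), which depends on the conclusion, whereas the rescaled index always reaches 1; this is what makes a global threshold usable.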
Tridimensional Measures-Independence
Rate of Connection [Bernard & Charron 1996]
  Definition:
  Semantics: dependence
  Sensitivity:
    3 parameters
    Measuring frequency
    Linear
    0 when independent
    Inclusion?
    Equilibrium?
  Variations:
    Measure of interest (interest factor) [Brin et al., 1997]
    Equivalent to lift
    Alternative: logarithmic measure of lift [Church & Hanks, 1990]
Tridimensional Measures-Independence
Definitions:
Measure of Interest (Interest Factor):
Lift:
Logarithmic Measure of Lift:
Cosine:
Quality of Rules : Objective Measures
Measure of Interest (Interest Factor) [Brin et al. 1997]
Lift (*)
Logarithmic Measure of Lift (*) [Church & Hanks 1990]
Cosine (*) [Tan & Kumar 2000]
Semantics: dependence
Sensitivity: 3 parameters
Measuring frequency
Linear
Inclusion?
Disequilibrium?
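The three definitions on this slide were images and are lost. The sketch below assumes the standard forms: lift (the same quantity as the interest factor) is p(XY) / (p(X) p(Y)), the logarithmic measure is its logarithm, and cosine is p(XY) / sqrt(p(X) p(Y)). The base-2 logarithm is an assumption made here; function names and values are illustrative.

```python
import math


def lift(p_x, p_y, p_xy):
    """Lift / interest factor: ratio of observed to expected co-occurrence."""
    return p_xy / (p_x * p_y)


def log_lift(p_x, p_y, p_xy):
    """Logarithmic measure of lift (base 2 assumed); requires p_xy > 0."""
    return math.log2(lift(p_x, p_y, p_xy))


def cosine(p_x, p_y, p_xy):
    """Cosine: co-occurrence normalized by the geometric mean of p(X), p(Y)."""
    return p_xy / math.sqrt(p_x * p_y)


lift_value = lift(0.4, 0.5, 0.3)        # above 1: positive dependence
log_value = log_lift(0.4, 0.5, 0.3)     # above 0: positive dependence
cosine_value = cosine(0.4, 0.5, 0.3)
```

At independence lift equals 1 and its logarithm equals 0; cosine has no fixed value at independence.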
Tridimensional Measures-Independence
Definition:
Semantics:
Sensitivity: 3 parameters
Measuring frequency
Linear
0 when independent
Disequilibrium?
Maximum value
Strengthened by the contrapositive
Quality of Rules : Objective Measures
Kappa [Tan & Kumar 2000]
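As a hedged illustration (the slide's own formula is missing), the sketch below assumes the usual Cohen's kappa applied to the 2x2 table of X and Y: observed agreement p(XY) + p(¬X¬Y) compared with the agreement expected under independence. Names and values are illustrative.

```python
def kappa(p_x, p_y, p_xy):
    """Kappa of X and Y from p(X), p(Y) and p(XY)."""
    p_notx_noty = 1 - p_x - p_y + p_xy
    observed = p_xy + p_notx_noty                 # X and Y agree
    expected = p_x * p_y + (1 - p_x) * (1 - p_y)  # agreement if independent
    return (observed - expected) / (1 - expected)


at_independence = kappa(0.4, 0.5, 0.2)
perfect_agreement = kappa(0.5, 0.5, 0.5)
```

Because p(¬X¬Y) enters the observed agreement, examples of the contrapositive ¬Y -> ¬X raise the value, which matches the "strengthened by the contrapositive" remark.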
Tridimensional Measures-Independence
Definition: with
Semantics: X is a good predictor of Y
Sensitivity: 3 parameters
Measuring frequency
Piecewise linear
0 when independent?
Maximum value?
Disequilibrium?
Quality of Rules : Objective Measures
Predictive Association (*) [Tan & Kumar 2000] (Goodman & Kruskal)
Tridimensional Measures-Independence
Definition:
Semantics: conviction
Sensitivity: 3 parameters
Measuring frequency
Non-linear (very selective)
1 when independent
Maximum value not bounded
Disequilibrium?
Quality of Rules : Objective Measures
Conviction [Brin et al. 1997b]
(shape similar to Sebag and Schoenauer [Sebag & Schoenauer 1991], except at independence)
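A minimal sketch of conviction, assuming the standard definition p(X) p(¬Y) / p(X¬Y), since the formula itself is absent from the transcript. Returning infinity for a rule without counter-examples is a choice made here for illustration.

```python
import math


def conviction(p_x, p_y, p_xy):
    """Conviction of X -> Y from p(X), p(Y) and p(XY)."""
    p_x_noty = p_x - p_xy  # frequency of counter-examples
    if p_x_noty == 0:
        return math.inf    # logical rule: no counter-example at all
    return p_x * (1 - p_y) / p_x_noty


at_independence = conviction(0.4, 0.5, 0.2)  # expected to be 1
logical_rule = conviction(0.3, 0.5, 0.3)     # unbounded
```

The division by the counter-example frequency is what makes the measure non-linear and very selective as counter-examples become rare.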
Tridimensional Measures-Independence
Definitions:
Odds Ratio:
Yule’s Q:
Yule’s Y:
Semantics: correlation
Sensitivity: 3 parameters
Measuring frequency
Non-linear (resistance to noise?)
1 or 0 when independent
Bounded max value (1 or not)
Disequilibrium?
Strengthened by similar rules
Quality of Rules : Objective Measures
Odds Ratio, Yule’s Q, Yule’s Y [Tan & Kumar 2000]
(close to Conviction)
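The three definitions are missing from the transcript. The sketch below assumes the textbook forms: the odds ratio is p(XY) p(¬X¬Y) / (p(X¬Y) p(¬XY)), Yule's Q is (OR - 1) / (OR + 1) and Yule's Y is (sqrt(OR) - 1) / (sqrt(OR) + 1). It also assumes both off-diagonal cells are non-zero; names and values are illustrative.

```python
import math


def odds_ratio(p_x, p_y, p_xy):
    """Odds ratio of the 2x2 table built from p(X), p(Y) and p(XY)."""
    p_x_noty = p_x - p_xy
    p_notx_y = p_y - p_xy
    p_notx_noty = 1 - p_x - p_y + p_xy
    return (p_xy * p_notx_noty) / (p_x_noty * p_notx_y)


def yule_q(p_x, p_y, p_xy):
    """Yule's Q: the odds ratio rescaled to [-1, 1]."""
    ratio = odds_ratio(p_x, p_y, p_xy)
    return (ratio - 1) / (ratio + 1)


def yule_y(p_x, p_y, p_xy):
    """Yule's Y: same rescaling applied to the square root of the odds ratio."""
    root = math.sqrt(odds_ratio(p_x, p_y, p_xy))
    return (root - 1) / (root + 1)
```

This makes the slide's "1 or 0 when independent" concrete: the odds ratio is 1 at independence, while Q and Y are 0.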
Tridimensional Measures-Independence
Definitions:
Jaccard:
Klosgen:
Semantics: correlation
Sensitivity: 3 parameters
Measuring frequency
Non-linear
0 when independent
Bounded max value (0 or 1)
Disequilibrium?
Strengthened by similar rules
Quality of Rules : Objective Measures
Jaccard, Klosgen [Tan & Kumar 2000]
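Both formulas are lost in the transcript. As a sketch under stated assumptions, Jaccard is taken as p(XY) / (p(X) + p(Y) - p(XY)) and Klosgen as sqrt(p(XY)) (p(Y|X) - p(Y)), the forms usually given in surveys of interestingness measures. Names and values are illustrative.

```python
import math


def jaccard(p_x, p_y, p_xy):
    """Jaccard: share of the union of X and Y covered by their intersection."""
    return p_xy / (p_x + p_y - p_xy)


def klosgen(p_x, p_y, p_xy):
    """Klosgen: confidence gain over p(Y), weighted by sqrt of the support."""
    return math.sqrt(p_xy) * (p_xy / p_x - p_y)


jaccard_value = jaccard(0.4, 0.5, 0.3)
klosgen_value = klosgen(0.4, 0.5, 0.3)
```

Under these assumed forms Klosgen is 0 at independence, whereas Jaccard has no fixed value there.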
Tridimensional Measures-Independence
Definition:
Semantics: interest?
Sensitivity: 3 parameters
Measuring frequency
Non-linear
0 when independent
Inclusion?
Disequilibrium?
Quality of Rules : Objective Measures
Interestingness Weighting Dependency [Gray & Orlowska 1998]
Tridimensional Measures-Independence
Definition:
Semantics: information gain provided by X for Y
Sensitivity: 3 parameters
Measuring frequency
Non-linear, entropic
0 when independent
Inclusion? Disequilibrium?
Strongly symmetric
Low values
Quality of Rules : Objective Measures
Mutual Information (Uncertainty) [Tan et al 2002]
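The sketch below computes the plain mutual information I(X;Y) of two binary variables from p(X), p(Y) and p(XY). This is an assumption: the slide's definition is missing, and the cited survey may use a normalized variant (for instance divided by an entropy), which is not reproduced here. Names and values are illustrative.

```python
import math


def mutual_information(p_x, p_y, p_xy):
    """Mutual information (in bits) of binary X and Y."""
    joint = {
        (1, 1): p_xy,
        (1, 0): p_x - p_xy,
        (0, 1): p_y - p_xy,
        (0, 0): 1 - p_x - p_y + p_xy,
    }
    marginal_x = {1: p_x, 0: 1 - p_x}
    marginal_y = {1: p_y, 0: 1 - p_y}
    total = 0.0
    for (x, y), p in joint.items():
        if p > 0:  # empty cells contribute nothing
            total += p * math.log2(p / (marginal_x[x] * marginal_y[y]))
    return total


at_independence = mutual_information(0.4, 0.5, 0.2)
identical_variables = mutual_information(0.5, 0.5, 0.5)
```

The sum runs over all four cells of the table, so swapping X and Y or negating both leaves the value unchanged, in line with the "strongly symmetric" remark.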
Tridimensional Measures-Independence
Definition:
Semantics: cross entropy (by mutual information)
Sensitivity: 3 parameters
Measuring frequency
Non-linear, entropic
0 when independent + concave
Inclusion? Disequilibrium?
Symmetric
Low values
Strengthened by the negative rule (X → ¬Y)
Quality of Rules : Objective Measures
J-Measure (*) [Smyth & Goodman 1991] [Goodman & Kruskal 1959]
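A minimal sketch of the J-measure, assuming the usual two-term form p(XY) log(p(Y|X) / p(Y)) + p(X¬Y) log(p(¬Y|X) / p(¬Y)); the slide's formula is missing and the base-2 logarithm is a choice made here.

```python
import math


def j_measure(p_x, p_y, p_xy):
    """J-measure (in bits) of X -> Y from p(X), p(Y) and p(XY)."""
    total = 0.0
    # One term for the rule X -> Y, one for the negative rule X -> not Y.
    for p_joint, p_consequent in ((p_xy, p_y), (p_x - p_xy, 1 - p_y)):
        if p_joint > 0:
            conditional = p_joint / p_x
            total += p_joint * math.log2(conditional / p_consequent)
    return total


at_independence = j_measure(0.4, 0.5, 0.2)
dependent_rule = j_measure(0.4, 0.5, 0.3)
```

The second term is the reason a strong negative rule X -> ¬Y also yields a high value.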
Tridimensional Measures-Independence
Definition:
Semantics: quadratic entropy
Sensitivity: 3 parameters
Measuring frequency
Non-linear, entropic
0 when independent + concave
Inclusion?
Disequilibrium?
Very symmetric
Low values
Quality of Rules : Objective Measures
Gini Index
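The slide gives neither the formula nor a citation. The sketch below assumes the Gini gain commonly used as a rule measure: the drop in the quadratic impurity of Y once the population is split on X. Names and values are illustrative, and 0 < p(X) < 1 is assumed.

```python
def gini_index(p_x, p_y, p_xy):
    """Gini gain of Y obtained by splitting on X."""
    p_y_given_x = p_xy / p_x
    p_y_given_notx = (p_y - p_xy) / (1 - p_x)
    purity_x = p_y_given_x ** 2 + (1 - p_y_given_x) ** 2
    purity_notx = p_y_given_notx ** 2 + (1 - p_y_given_notx) ** 2
    purity_before = p_y ** 2 + (1 - p_y) ** 2
    return p_x * purity_x + (1 - p_x) * purity_notx - purity_before


at_independence = gini_index(0.4, 0.5, 0.2)
dependent_rule = gini_index(0.4, 0.5, 0.3)
```

Squared probabilities play the role that logarithms play in the entropic measures, hence the "quadratic entropy" semantics.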
Tridimensional Measures-Independence
Definition:
(continuum of measures between Gini and Chi2)
Semantics: ?
Sensitivity: 3 parameters
Measuring frequency
Non-linear (Gini -> Chi2 distance)
0 when independent
Inclusion? Disequilibrium?
Not symmetric -> symmetric
Δα: family of divergence measures parameterized by a real factor α (Gini -> Chi2 distance)
ΔX (resp. ΔY): distribution vector of X (resp. Y)
ΔXY: joint distribution vector of X and Y
ΔX x ΔY: joint distribution vector of X and Y under the hypothesis of independence
θ: a priori distribution vector of Y
Quality of Rules : Objective Measures
General Measure of Rule Interestingness (*) [Jaroszewicz & Simovici 2001]
Quadridimensional Measures-Independence
Definition:
Semantics: centered and normalized number of examples
Sensitivity: 4 parameters
Statistical measure (counts)
Linear
0 when independent
Inclusion?
Disequilibrium?
Quality of Rules : Objective Measures
Lerman Similarity [Lerman 1981]
Quadridimensional Measures-Independence
Definition:
Semantics: normalized number of counter-examples
Sensitivity: 4 parameters
Statistical measure (counts)
Linear
0 when independent
Inclusion?
Disequilibrium?
Quality of Rules : Objective Measures
Variation: Implication Index [Gras 1996]
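Neither formula survived extraction. The sketch below assumes the usual centered and reduced forms built on the four counts n, n_X, n_Y and n_XY: the Lerman similarity standardizes the number of examples, the implication index standardizes the number of counter-examples. Names and values are illustrative.

```python
import math


def lerman_similarity(n, n_x, n_y, n_xy):
    """Number of examples, centered and scaled by its expectation."""
    expected = n_x * n_y / n
    return (n_xy - expected) / math.sqrt(expected)


def implication_index(n, n_x, n_y, n_xy):
    """Number of counter-examples, centered and scaled the same way."""
    counter_examples = n_x - n_xy
    expected = n_x * (n - n_y) / n
    return (counter_examples - expected) / math.sqrt(expected)


similarity = lerman_similarity(100, 40, 50, 30)  # more examples than expected
index = implication_index(100, 40, 50, 30)       # fewer counter-examples
```

Both values are 0 at independence. Unlike the frequency-based measures, they grow with the sample size n for fixed proportions, which is why n counts as a fourth parameter.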
Quadridimensional Measures-Independence
Definition:
(probabilistic modeling, Chi2 distribution)
Semantics: probability of a dependence between X and Y
Sensitivity: 4 parameters
Measuring probability, not frequency
Non-linear + e- tolerance
0 when independent + real
Maximum value bounded
Inclusion? Disequilibrium?
Strongly symmetric => couple with the measure of interest [Brin et al., 1997]
Alternative: likelihood ratio [Ritschard & al., 1998]
Quality of Rules : Objective Measures
Lerman Similarity [Lerman 1981]
Quadridimensional Measures-Independence
Definition:
(probabilistic modeling, distribution of counter-examples)
Semantics: likelihood of the scarcity of counter-examples (statistical surprise)
Sensitivity: 4 parameters
Measuring probability, not frequency
Non-linear + e- tolerance
0.5 when independent + likelihood
Maximum value bounded
Inclusion? Disequilibrium?
Logical rules: can be 0
Inspired by the likelihood of the link [Lerman et al. 1981]
Quality of Rules : Objective Measures
Intensity of Implication (*) [Gras 1996] (Statistical Implicative Analysis)
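The definition is missing from the transcript. As a rough sketch, the intensity of implication can be read as the probability that a random model produces more counter-examples than observed. The version below assumes a standard normal approximation of the centered and reduced number of counter-examples; other probabilistic models of the counter-examples are possible and are not shown. Names and values are illustrative.

```python
import math


def intensity_of_implication(n, n_x, n_y, n_xy):
    """Normal-approximation intensity of implication of X -> Y."""
    counter_examples = n_x - n_xy
    expected = n_x * (n - n_y) / n
    index = (counter_examples - expected) / math.sqrt(expected)
    # P(Z > index) for a standard normal variable Z.
    return 0.5 * math.erfc(index / math.sqrt(2))


at_independence = intensity_of_implication(100, 40, 50, 20)
few_counter_examples = intensity_of_implication(100, 40, 50, 30)
```

Under this approximation the value is 0.5 at independence, matching the slide, and approaches 1 as counter-examples become rarer than chance would predict.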
AR Quality : Objective Measures
Intensity of Implication and Statistical Implicative Analysis
Extensions
Modeling:
Binary variables => numerical, ordinal, interval, fuzzy [Bernadet 2000, Guillaume 2002, ...]
Large data sets: entropic intensity of implication [Gras et al. 2001]
Sequences: prediction rules [Blanchard et al. 2002]
Structuring:
Implicative hierarchy (cohesion) [Gras et al. 2001]
Typicality, reduction of variables (inertia of implication) [Gras et al. 2002]
Applications
CHIC (http://www.ardm.asso.fr/CHIC.html)
SIPINA (University of Lyon 2)
FELIX (PerformanSE SA)
Quadridimensional Rules
Definition:
Inclusion Rate:
Information: (increases with )
Asymmetric entropy: the entropy H’(Y|X) decreases with p(Y|X)
Semantics: statistical surprise + inclusion (removal of disequilibrium)
Sensitivity: 4 parameters
Measuring frequency, non-probabilistic
Non-linear + e- tolerance (selectivity adjusted with α, e.g. α=2)
Max 0.5 when independent + real
0 when in disequilibrium
Strengthened by the contrapositive
Maximum value bounded (1)
AR Quality : Objective Measures
Entropy (*) [Gras et al 2001] (Statistical Implicative Analysis)
Tridimensional Measures-Independence
Definition:
Information Rate:
Asymmetric entropy: the entropy Ê(X) with p(X)
Semantics: statistical surprise + inclusion (removal of disequilibrium)
Sensitivity: 4 parameters
Measuring frequency
Non-linear, entropic
0 at independence
0 at disequilibrium
Strengthened by the contrapositive
Maximum value bounded (1)
Quality of Rules : Objective Measures
TIC (*) [Blanchard et al. 2004] (Statistical Implicative Analysis)
Tridimensional Measures-Independence
Definition:
Rule: X1 ∧ X2 ∧ X3 ∧ … ∧ Xp-1 ∧ Xp → Y
Information gain provided by the attribute Xi:
Conditional entropy:
Semantics: surprise = information gain provided by the premise
Measuring frequency
Non-linear: entropic
Can be used to assess the individual contribution of each attribute ...
Quality of Rules : Objective Measures
Surprisingness (*) [Freitas 1998]
[Figures: Intensity of Implication, comparison by simulation. Measures: Confidence, J-Measure, Coverage Rate (three slides); Confidence, PS, Intensity of Implication (one slide)]
[Figures: TIC, comparison by simulation. Measures: Confidence, TIM, J-Measure, Gini Index (two slides); Confidence, J-Measure, Coverage Rate (one slide)]
[Figure: Intensity of Implication, comparison by simulation. Measures: Confidence, J-Measure, Coverage Rate]
Synthesis & Comparative Studies
[Bayardo and Agrawal, 1999]: influence of support
9 measures, monotone / antitone functions of support, optimization
[Hilderman and Hamilton, 2001]: interestingness of summaries
16 measures, 5 principles of independence, correlation study
[Azé and Kodratoff, 2001]: resistance to noise in the data
[Tan & Kumar 2000]: interestingness of association rules
9 symmetric measures, study of the relationship observed between 2 measures, influence of support
[Tan et al., 2002]: interestingness of association rules
21 symmetric measures, 8 principles, correlation study, influence of support
[Gras et al. 2004]: interestingness of association rules
10 criteria
[Lenca et al., 2004]: interestingness of association rules
20 measures, 8 criteria, multi-criteria decision support
[Lallich & Teytaud 2004]: interestingness of association rules
15 measures, 10 principles, learning and use of the VC-dimension
Quality of Rules : Subjective Measures
Experimental Results
30 Objective Measures
Input Data Sets
Experimental Results
Stable Strong Positive Correlations
Average Correlation
ARVAL
A workbench for computing quality measures, for the scientific community
http://www.univ-nantes.fr/arval
Conclusion and Outlook
Quality = multidimensional concept:
Subjective (decision-maker)
Interest = changes with the knowledge of the decision-maker
PB1: extracting the knowledge / objectives of the decision-maker
Objective (data and rules)
Interest = hypotheses on the data: inclusion, independence, disequilibrium, nuggets, robustness ...
Antagonism between independence and disequilibrium
Many indices (~ 50!) =>
PB2: tools restricted to support / confidence => a workbench for computing indices
PB3: comparative study (properties, simulations) and experimental study (behavior on data): a platform?
PB4: combining the indices, choosing the right index => decision support
PB5: new indices?
PB6: What is a good index? (ingredients of quality)
Perspective (PB1)
Search for knowledge
Anthropocentric approach
Adaptive extraction
FELIX [Lehn et al. 1999]
AR-VIS [Blanchard et al. 2003]
Axis: Quality Assessment of Knowledge
Combining Subjective and Objective Aspects of Quality
Perspective (PB 2 3 4 5)
Calculation: ARVAL? (www.polytech.univ-nantes.fr/arval)
Analysis: AR-QAT? [Popovici 2003]
Decision support: HERBS? [Lenca et al. 2003] (www-iasc.enst-bretagne.fr/ecd-ind/HERBS)
Axis: Quality Assessment of Knowledge
Platform for experimentation and decision support
Bibliography
[Agrawal et al., 1993] R. Agrawal, T. Imielinski and A. Swami. Mining association rules between sets of items in large databases. Proc. of ACM SIGMOD'93, 1993, p. 207-216.
[Azé & Kodratoff, 2001] J. Azé and Y. Kodratoff. Evaluation de la résistance au bruit de quelques mesures d'extraction de règles d'association. Extraction des connaissances et apprentissage 1(4), 2001, p. 143-154.
[Azé & Kodratoff, 2001] J. Azé and Y. Kodratoff. Extraction de « pépites » de connaissances dans les données : une nouvelle approche et une étude de sensibilité au bruit. Rapport d’activité du groupe gafoQualité de l’AS GafoDonnées. To appear in [Briand et al. 2004].
[Bayardo & Agrawal, 1999] R.J. Bayardo and R. Agrawal. Mining the most interesting rules. Proc. of the 5th Int. Conf. on Knowledge Discovery and Data Mining, 1999, p. 145-154.
[Bernadet 2000] M. Bernadet. Basis of a fuzzy knowledge discovery system. Proc. of Principles of Data Mining and Knowledge Discovery, LNAI 1510, pages 24-33. Springer, 2000.
[Bernard & Charron 1996] J.-M. Bernard and C. Charron. L’analyse implicative bayésienne, une méthode pour l’étude des dépendances orientées. I. Données binaires, Revue Mathématique Informatique et Sciences Humaines (MISH), vol. 134, 1996, p. 5-38.
[Berti-Equille 2004] L. Berti-Équille. Etat de l'art sur la qualité des données : un premier pas vers la qualité des connaissances. Rapport d’activité du groupe gafoQualité de l’AS GafoDonnées. To appear in [Briand et al. 2004].
[Blanchard et al. 2001] J. Blanchard, F. Guillet and H. Briand. L'intensité d'implication entropique pour la recherche de règles de prédiction intéressantes dans les séquences de pannes d'ascenseurs. Extraction des Connaissances et Apprentissage (ECA), Hermès Science Publication, 1(4):77-88, 2002.
[Blanchard et al. 2003] J. Blanchard, F. Guillet, F. Rantière, H. Briand. Vers une Représentation Graphique en Réalité Virtuelle pour la Fouille Interactive de Règles d’Association. Extraction des Connaissances et Apprentissage (ECA), vol. 17, n°1-2-3, 105-118, 2003. Hermès Science Publication. ISSN 0992-499X, ISBN 2-7462-0631-5.
[Blanchard et al. 2003a] J. Blanchard, F. Guillet, H. Briand. Une visualisation orientée qualité pour la fouille anthropocentrée de règles d’association. In Cognito - Cahiers Romans de Sciences Cognitives. To appear. ISSN 1267-8015.
[Blanchard et al. 2003b] J. Blanchard, F. Guillet, H. Briand. A User-driven and Quality oriented Visualization for Mining Association Rules. In Proc. of the Third IEEE International Conference on Data Mining, ICDM’2003, Melbourne, Florida, USA, November 19-22, 2003.
[Blanchard et al., 2004] J. Blanchard, F. Guillet, R. Gras, H. Briand. Mesurer la qualité des règles et de leurs contraposées avec le taux informationnel TIC. EGC2004, RNTI, Cépaduès, 2004. To appear.
[Blanchard et al., 2004a] J. Blanchard, F. Guillet, R. Gras, H. Briand. Mesure de la qualité des règles d'association par l'intensité d'implication entropique. Rapport d’activité du groupe gafoQualité de l’AS GafoDonnées. To appear in [Briand et al. 2004].
[Breiman & al. 1984] L. Breiman, J. Friedman, R. Olshen and C. Stone. Classification and Regression Trees. Chapman & Hall, 1984.
[Briand et al. 2004] H. Briand, M. Sebag, G. Gras and F. Guillet (eds). Mesures de Qualité pour la fouille de données. Revue des Nouvelles Technologies de l’Information, RNTI, Cépaduès, 2004. To appear.
[Brin et al., 1997] S. Brin, R. Motwani and C. Silverstein. Beyond Market Baskets: Generalizing Association Rules to Correlations. In Proceedings of SIGMOD’97, pages 265-276, AZ, USA, 1997.
[Brin et al., 1997b] S. Brin, R. Motwani, J. Ullman and S. Tsur. Dynamic itemset counting and implication rules for market basket data. Proc. of the Int. Conf. on Management of Data, ACM Press, 1997, p. 255-264.
[Church & Hanks, 1990] K. W. Church and P. Hanks. Word association norms, mutual information and lexicography. Computational Linguistics, 16(1), 22-29, 1990.
[Clark & Robin 1991] Peter Clark and Robin Boswell. Rule Induction with CN2: Some Recent Improvements. In Proceedings of the European Working Session on Learning EWSL-91, 1991.
[Dong & Li, 1998] G. Dong and J. Li. Interestingness of Discovered Association Rules in terms of Neighborhood-Based Unexpectedness. In X. Wu, R. Kotagiri and K. Korb, editors, Proc. of 2nd Pacific-Asia Conf. on Knowledge Discovery and Data Mining (PAKDD '98), Melbourne, Australia, April 1998.
[Duval et al. 2004] B. Duval, A. Salleb, C. Vrain. Méthodes et mesures d’intérêt pour l’extraction de règles d’exception. Rapport d’activité du groupe gafoQualité de l’AS GafoDonnées. To appear in [Briand et al. 2004].
[Fleury 1996] L. Fleury. Découverte de connaissances pour la gestion des ressources humaines. Thèse de doctorat, Université de Nantes, 1996.
[Frawley & Piatetsky-Shapiro 1992] W. Frawley, G. Piatetsky-Shapiro and C. Matheus. Knowledge discovery in databases: an overview. AI Magazine, 14(3), 1992, pages 57-70.
[Freitas, 1998] A. A. Freitas. On Objective Measures of Rule Surprisingness. In J. Zytkow and M. Quafafou, editors, Proceedings of the Second European Conference on the Principles of Data Mining and Knowledge Discovery (PKDD '98), pages 1-9, Nantes, France, September 1998.
[Freitas, 1999] A. Freitas. On rule interestingness measures. Knowledge-Based Systems Journal 12(5-6), 1999, p. 309-315.
[Gago & Bento, 1998] P. Gago and C. Bento. A Metric for Selection of the Most Promising Rules. PKDD’98, 1998.
[Gray & Orlowska, 1998] B. Gray and M. E. Orlowska. Ccaiia: Clustering Categorical Attributes into Interesting Association Rules. In X. Wu, R. Kotagiri and K. Korb, editors, Proc. of 2nd Pacific-Asia Conf. on Knowledge Discovery and Data Mining (PAKDD '98), pages 132 43, Melbourne, Australia, April 1998.
[Goodman & Kruskal 1959] L. A. Goodman and W. H. Kruskal. Measures of Association for Cross Classification, ii: Further discussion and references. Journal of the American Statistical Association, ??? 1959.
[Gras et al. 1995] R. Gras, H. Briand and P. Peter. Structuration sets with implication intensity. Proc. of the Int. Conf. on Ordinal and Symbolic Data Analysis - OSDA 95. Springer, 1995.
[Gras, 1996] R. Gras et coll. L'implication statistique - Nouvelle méthode exploratoire de données. La pensée sauvage éditions, 1996.
[Gras et al. 2001] R. Gras, P. Kuntz and H. Briand. Les fondements de l'analyse statistique implicative et quelques prolongements pour la fouille de données. Mathématiques et Sciences Humaines : Numéro spécial Analyse statistique implicative, 1(154-155):9-29, 2001.
[Gras et al. 2001b] R. Gras, P. Kuntz, R. Couturier and F. Guillet. Une version entropique de l'intensité d'implication pour les corpus volumineux. Extraction des Connaissances et Apprentissage (ECA), Hermès Science Publication, 1(1-2):69-80, 2001.
[Gras et al. 2002] R. Gras, F. Guillet and J. Philippe. Réduction des colonnes d'un tableau de données par quasi-équivalence entre variables. Extraction des Connaissances et Apprentissage (ECA), Hermès Science Publication, 1(4):197-202, 2002.
[Gras et al. 2004] R. Gras, R. Couturier, J. Blanchard, H. Briand, P. Kuntz, P. Peter. Quelques critères pour une mesure de la qualité des règles d’association. Rapport d’activité du groupe gafoQualité de l’AS GafoDonnées. To appear in [Briand et al. 2004].
[Guillaume et al. 1998] S. Guillaume, F. Guillet, J. Philippé. Improving the discovery of associations Rules with Intensity of implication. Proc. of 2nd European Symposium Principles of data Mining and Knowledge Discovery, LNAI 1510, p. 318-327. Springer, 1998.
[Guillaume 2002] S. Guillaume. Discovery of Ordinal Association Rules. M.-S. Cheng, P. S. Yu, B. Liu (Eds.), Proc. of the 6th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining, PAKDD 2002, LNCS 2336, pages 322-327. Springer, 2002.
[Guillet et al. 1999] F. Guillet, P. Kuntz and R. Lehn. A genetic algorithm for visualizing networks of association rules. Proc. of the 12th Int. Conf. on Industrial and Engineering Appl. of AI and Expert Systems, LNCS 1611, pages 145-154. Springer, 1999.
[Guillet 2000] F. Guillet. Mesures de qualité de règles d’association. Cours DEA-ECD. Ecole polytechnique de l’université de Nantes. 2000.
[Hilderman & Hamilton, 1998] R. J. Hilderman and H. J. Hamilton. Knowledge Discovery and Interestingness Measures: A Survey. (KDD '98), ??? New-York 1998.
[Hilderman & Hamilton, 2001] R. Hilderman and H. Hamilton. Knowledge discovery and measures of interest. Kluwer Academic publishers, 2001.
[Hussain et al. 2001] F. Hussain, H. Liu, E. Suzuki and H. Lu. Exception Rule Mining with a Relative Interestingness Measure. ???
[Jaroszewicz & Simovici, 2001] S. Jaroszewicz and D.A. Simovici. A general measure of rule interestingness. Proc. of the 7th Int. Conf. on Knowledge Discovery and Data Mining, L.N.C.S. 2168, Springer, 2001, p. 253-265.
[Klemettinen et al. 1994] M. Klemettinen, H. Mannila, P. Ronkainen, H. Toivonen and A. I. Verkamo. Finding Interesting Rules from Large Sets of Discovered Association Rules. In N. R. Adam, B. K. Bhargava and Y. Yesha, editors, Proc. of the Third International Conf. on Information and Knowledge Management, pages 401-407, Gaithersburg, Maryland, 1994.
[Kodratoff, 1999] Y. Kodratoff. Comparing Machine Learning and Knowledge Discovery in Databases: An Application to Knowledge Discovery in Texts. Lecture Notes on AI (LNAI)-Tutorial series. 2000.
[Kuntz et al. 2000] P. Kuntz, F. Guillet, R. Lehn and H. Briand. A User-Driven Process for Mining Association Rules. In D. Zighed, J. Komorowski and J.M. Zytkow (Eds.), Principles of Data Mining and Knowledge Discovery (PKDD2000), Lecture Notes in Computer Science, vol. 1910, pages 483-489, 2000. Springer.
[Kodratoff, 2001] Y. Kodratoff. Comparing machine learning and knowledge discovery in databases: an application to knowledge discovery in texts. Machine Learning and Its Applications, Paliouras G., Karkaletsis V., Spyropoulos C.D. (eds.), L.N.C.S. 2049, Springer, 2001, p. 1-21.
[Kuntz et al. 2001] P. Kuntz, F. Guillet, R. Lehn and H. Briand. A user-driven process for mining association rules. Proc. of Principles of Data Mining and Knowledge Discovery, LNAI 1510, pages 483-489. Springer, 2000.
[Kuntz et al. 2001b] P. Kuntz, F. Guillet, R. Lehn and H. Briand. Vers un processus d'extraction de règles d'association centré sur l'utilisateur. In Cognito, Revue francophone internationale en sciences cognitives, 1(20):13-26, 2001.
[Lallich et al. 2004] S. Lallich and O. Teytaud. Évaluation et validation de l’intérêt des règles d’association. Rapport d’activité du groupe gafoQualité de l’AS GafoDonnées. To appear in [Briand et al. 2004].
[Lehn et al. 1999] R. Lehn, F. Guillet, P. Kuntz, H. Briand and J. Philippé. Felix: An interactive rule mining interface in a kdd process. In P. Lenca (editor), Proc. of the 10th Mini-Euro Conference, Human Centered Processes, HCP’99, pages 169-174, Brest, France, September 22-24, 1999.
[Lenca et al. 2004] P. Lenca, P. Meyer, B. Vaillant, P. Picouet, S. Lallich. Evaluation et analyse multi-critères des mesures de qualité des règles d’association. Rapport d’activité du groupe gafoQualité de l’AS GafoDonnées. To appear in [Briand et al. 2004].
[Lerman et al. 1981] I. C. Lerman, R. Gras and H. Rostam. Elaboration et évaluation d’un indice d’implication pour les données binaires. Revue Mathématiques et Sciences Humaines, 75, p. 5-35, 1981.
[Lerman, 1981] I. C. Lerman. Classification et analyse ordinale des données. Paris, Dunod, 1981.
[Lerman, 1993] I. C. Lerman. Likelihood linkage analysis classification method. Biochimie 75, p. 379-397, 1993.
[Lerman & Azé 2004] I. C. Lerman and J. Azé. Indice probabiliste discriminant de vraisemblance du lien pour des données volumineuses. Rapport d’activité du groupe gafoQualité de l’AS GafoDonnées. To appear in [Briand et al. 2004].
[Liu et al., 1999] B. Liu, W. Hsu, L. Mun and H. Lee. Finding interesting patterns using user expectations. IEEE Transactions on Knowledge and Data Engineering 11, 1999, p. 817-832.
[Loevinger, 1947] J. Loevinger. A systematic approach to the construction and evaluation of tests of ability. Psychological Monographs, 61(4), 1947.
[Mannila & Pavlov, 1999] H. Mannila and D. Pavlov. Prediction with Local Patterns using Cross-Entropy. Technical Report, Information and Computer Science, University of California, Irvine, 1999.
[Matheus & Piatetsky-Shapiro, 1996] C. J. Matheus and G. Piatetsky-Shapiro. Selecting and Reporting what is Interesting: The KEFIR Application to Healthcare data. In U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth and R. Uthurusamy (eds), Advances in Knowledge Discovery and Data Mining, p. 401-419, 1996. AAAI Press/MIT Press.
[Meo 2000] R. Meo. Theory of dependence values. ACM Transactions on Database Systems 25(3), p. 380-406, 2000.
[Padmanabhan et Tuzhilin, 1998] B. Padmanabhan and A. Tuzhilin. A belief-driven method for discovering unexpected patterns. Proc. of the 4th Int. Conf. on Knowledge Discovery and Data Mining, 1998, p. 94-100.
[Pearson, 1896] K. Pearson. Mathematical contributions to the theory of evolution. III. regression, heredity and panmixia. Philosophical Transactions of the Royal Society, vol. A, 1896.
[Piatestsky-Shapiro, 1991] G. Piatetsky-Shapiro. Discovery, analysis, and presentation of strong rules. Knowledge Discovery in Databases, Piatetsky-Shapiro G., Frawley W.J. (eds.), AAAI/MIT Press, 1991, p. 229-248.
[Popovici, 2003] E. Popovici. Un atelier pour l'évaluation des indices de qualité. D.E.A. E.C.D. thesis, IRIN/Université Lyon2/RACAI Bucharest, June 2003.
[Ritschard & al., 1998] G. Ritschard, D. A. Zighed and N. Nicoloyannis. Maximiser l'association par agrégation dans un tableau croisé. In J. Zytkow and M. Quafafou, editors, Proc. of the Second European Conf. on the Principles of Data Mining and Knowledge Discovery (PKDD '98), Nantes, France, September 1998.
[Sebag et Schoenauer, 1988] M. Sebag and M. Schoenauer. Generation of rules with certainty and confidence factors from incomplete and incoherent learning bases. Proc. of the European Knowledge Acquisition Workshop (EKAW'88), Boose J., Gaines B., Linster M. (eds.), Gesellschaft für Mathematik und Datenverarbeitung mbH, 1988, p. 28.1-28.20.
[Shannon & Weaver, 1949] C.E. Shannon and W. Weaver. The mathematical theory of communication. University of Illinois Press, 1949.
[Silbershatz &Tuzhilin,1995] Avi Silberschatz and Alexander Tuzhilin. On Subjective Measures of Interestingness in Knowledge Discovery, (KD. & DM. '95) ???, 1995.
[Smyth & Goodman, 1991] P. Smyth and R.M. Goodman. Rule induction using information theory. Knowledge Discovery in Databases, Piatetsky-Shapiro G., Frawley W.J. (eds.), AAAI/MIT Press, 1991, p. 159-176.
[Tan & Kumar 2000] P. Tan, V. Kumar. Interestingness Measures for Association Patterns: A Perspective. Workshop tutorial (KDD 2000).
[Tan et al., 2002] P. Tan, V. Kumar and J. Srivastava. Selecting the right interestingness measure for association patterns. Proc. of the 8th Int. Conf. on Knowledge Discovery and Data Mining, 2002, p. 32-41.