Discovering Significant Association Rules

Dean L. Zeller

Kent State University

CS73015 – Data Mining

Dr. Ruoming Jin

“If you … beat this [cop] long enough, he’ll tell you he started the … Chicago Fire. Now that don’t necessarily make it so!”

-- Nice Guy Eddie, Reservoir Dogs (1992)


Introduction

• Association Rules
• Causation vs. Association
• Uses of Association Rules
• True and False Discoveries
• Measures of Interestingness

“We, the members of the data mining community, are doing a serious disservice to ourselves, as well as to the communities we seek to serve, if we present sets of ‘discoveries’ to our clients of which the majority are spurious.”

-- Geoffrey Webb


Association Rules

• Association rule mining is a hot new area of programming.

• Statistical measures are needed to quickly and efficiently evaluate the “interestingness” of a rule (i.e., whether it represents a non-trivial correlation).

• Avoid false discoveries


Causation vs. Association

• X → Y usually implies a causal relationship.
  – “X forces a change in Y.”
  – Causation is complex and difficult to prove.

• In rule mining, X → Y is an association relationship.
  – “X is associated with Y.”
  – Much easier to calculate and prove.
  – Of less interest for medical research than for market research.

• Association rules indicate only the existence of a statistical relationship between X and Y. They do not specify the nature of the relationship.

• Webb (2006) does not address causal relationships. Silverstein and Brin (1998) discuss causal structures.

[Diagrams: X → Y; X and Y with a third variable Z]


Causal Relationships

• A causal relationship between X and Y requires three conditions:
  – Correlation: X is associated with Y
  – Temporal priority: X precedes Y
  – Non-spuriousness: the correlation between X and Y is not a result of the causal operation of an outside influence, called a confounding variable.

• For further information on causal relationships, see appendix B.

[Diagram: X → Y]


Association Relationships

• Association:
  – “item Y is very likely to be present in baskets containing items X1, … Xm.”

• Main points of interest:
  – Are X and Y associated?
  – What is the underlying reason for the association?

• Example:
  – Does the beer drinker want to eat pretzels?
  – Do pretzels make one thirsty for beer?
  – Is there an external force causing customers to purchase beer and pretzels at the same time? (e.g. football game)

[Diagram: X and Y, with a third variable Z]


True vs. False Discoveries

• On some real-world problems there is potential for all ‘discoveries’ to be false unless appropriate safeguards are employed.

• Create definitions, requirements, and formulas for “true” and “false” discoveries based on the data.

• Specify in terms of arbitrary statistical hypothesis tests.

• Provide strict control over the risk of false discoveries.


Problem Statement

• There are different accepted operational definitions of an association rule.
  – A collection of items that co-occur frequently in data.

• Items: $I = \{item_1, item_2, \ldots, item_m\}$

• Data: $D = \langle t_1, t_2, \ldots, t_n \rangle$, $t_i \subseteq I$ (“transactions”)

• For purposes of the Webb paper and this presentation, a rule x → y is defined as:

$x \subseteq I$ (x is a subset of I)

$y \in I$ (y is an element of I)

• The hypothesis is that x is associated with y
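
The problem statement maps naturally onto set types. The sketch below is only an illustration: the item names, the toy transactions, and the helper `is_valid_rule` are assumptions made for this example, not part of the Webb paper.

```python
# Hypothetical toy data; the item names are illustrative only.
I = frozenset({"beer", "pretzels", "milk", "eggs", "cake mix"})

# D is a sequence of transactions, each a subset of I.
D = [
    frozenset({"beer", "pretzels"}),
    frozenset({"cake mix", "milk", "eggs"}),
    frozenset({"beer", "pretzels", "milk"}),
    frozenset({"milk", "eggs"}),
]

def is_valid_rule(x, y, items):
    """A rule x -> y requires x to be a subset of I and y an element of I."""
    return x <= items and y in items

# The rule {cake mix, milk} -> eggs, stored as an (antecedent, consequent) pair.
rule = (frozenset({"cake mix", "milk"}), "eggs")
print(is_valid_rule(rule[0], rule[1], I))
```

Using `frozenset` keeps itemsets hashable, so they can later serve as dictionary keys when counting support.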


Uses of Association Rules

• Market Research
  – Purchasing products in x is associated with a purchase of product y
    • {cake mix, milk} → eggs
    • {beer} → pretzels
    • {diapers} → sleeping pills

• Medical Research
  – Experiencing conditions x is associated with condition y
    • {virus} → {sinus infection} → {stuffy nose}
    • {allergy} → {irritated nasal passages} → {stuffy nose}
    • {fever, sweat} → {lack of sleep} → {lower resistance}
    • {injury, lack of treatment, chronic pain} → {swelling}
    • {swelling} → {pain} → {swelling} → {pain} → …

– Discovering significant associations among conditions and symptoms can help to determine the causal relationship.

• Linguistics Research
  – See assignment


Method Diagram

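[Diagram: Data is split into Exploratory Data and Holdout Data. Exploratory Rule Discovery on the exploratory data produces Candidate Rules; Statistical Evaluation against the holdout data produces Significant Rules.]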


Measures of “Interestingness”

[Diagram: Exploratory Rule Discovery draws on the measures support, confidence, lift, and leverage, applied through the minimum support constraint, the minimum confidence constraint, and the minimum improvement constraint.]


Insignificant Rules

• Assume {pregnant} → oedema is significant.

• Then {pregnant, female} → oedema will also be significant, but does not give any useful information beyond what {pregnant} → oedema gave.

• All cases of pregnancy will be female
  – sup({pregnant} → female) = sup({pregnant})
  – conf({pregnant} → female) = 100%

• Insignificant rules are not useful and can be eliminated without loss of generality.

• Insignificant rules can number in the thousands, so eliminating them is important.
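
The pregnancy example can be checked on a small dataset. This is a minimal sketch on made-up transactions (the six records below are assumptions chosen for illustration); it shows that adding "female" to the antecedent changes neither support nor confidence.

```python
# Hypothetical transactions: every "pregnant" record is also "female".
D = [
    frozenset({"pregnant", "female", "oedema"}),
    frozenset({"pregnant", "female", "oedema"}),
    frozenset({"pregnant", "female"}),
    frozenset({"female", "oedema"}),
    frozenset({"female"}),
    frozenset({"male"}),
]

def sup(x, y, D):
    """Number of transactions containing every item of x and the item y."""
    return sum(1 for t in D if x <= t and y in t)

def conf(x, y, D):
    """Share of the transactions containing x that also contain y."""
    return sup(x, y, D) / sum(1 for t in D if x <= t)

general = frozenset({"pregnant"})
specific = frozenset({"pregnant", "female"})

# Both rules have identical support and confidence for "oedema".
print(sup(general, "oedema", D), sup(specific, "oedema", D))
print(conf(general, "oedema", D), conf(specific, "oedema", D))
# conf({pregnant} -> female) is 100%, which is why the longer rule adds nothing.
print(conf(general, "female", D))
```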


Redundant Rules

• Assume dataminer is in no way related to oedema.

• {pregnant, dataminer} → oedema
  – Could represent a strong correlation, the only difference from {pregnant} → oedema being a reduction in support and random differences in confidence resulting from sampling error.

• Redundant rules are unproductive and of no interest.


Support

• Number of transactions containing items in x and y

• Range: 0 (no transactions) to n (all transactions)

Introduced by: Agrawal, Imielinski, and Swami (1993)

$$\mathrm{sup}(x) = |\{i : x \subseteq t_i\}|$$

$$\mathrm{sup}(x \rightarrow y) = |\{i : x \subseteq t_i \text{ and } y \in t_i\}|$$
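
Support is a simple count over the transactions. The following sketch assumes transactions are stored as Python sets; the function names `sup_itemset` and `sup_rule` and the toy data are illustrative choices, not from the source.

```python
# Hypothetical toy transactions.
D = [
    frozenset({"beer", "pretzels"}),
    frozenset({"cake mix", "milk", "eggs"}),
    frozenset({"beer", "pretzels", "milk"}),
    frozenset({"milk", "eggs"}),
]

def sup_itemset(x, D):
    """sup(x): number of transactions that contain every item of x."""
    return sum(1 for t in D if x <= t)

def sup_rule(x, y, D):
    """sup(x -> y): number of transactions that contain x and also the item y."""
    return sum(1 for t in D if x <= t and y in t)

print(sup_itemset(frozenset({"milk"}), D))       # transactions containing milk
print(sup_rule(frozenset({"milk"}), "eggs", D))  # transactions containing milk and eggs
```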


Support (normalized)

• Percentage of transactions containing items in x and y

• Range: 0 (no transactions) to 1 (all transactions)

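• Normalized results for comparison across unequal size datasets.
  – $\mathrm{sup}_n(x, D_1)$ can be compared to $\mathrm{sup}_n(x, D_2)$

$$\mathrm{sup}_n(x) = \frac{|\{i : x \subseteq t_i\}|}{n}$$

$$\mathrm{sup}_n(x \rightarrow y) = \frac{|\{i : x \subseteq t_i \text{ and } y \in t_i\}|}{n}$$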
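
A short sketch of why normalization matters when datasets differ in size. The two datasets `D1` and `D2` are invented for the example: the raw support of {beer} is the same in both, while the normalized support is not.

```python
def sup(x, D):
    """Raw support: number of transactions containing x."""
    return sum(1 for t in D if x <= t)

def sup_n(x, D):
    """Normalized support: raw support divided by the number of transactions n."""
    return sup(x, D) / len(D)

# Hypothetical datasets of unequal size (4 and 8 transactions).
D1 = [frozenset({"beer"}), frozenset({"beer"}), frozenset({"milk"}), frozenset({"eggs"})]
D2 = D1 + [frozenset({"milk"})] * 4

beer = frozenset({"beer"})
print(sup(beer, D1), sup(beer, D2))      # same raw count
print(sup_n(beer, D1), sup_n(beer, D2))  # different proportions
```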


Downward Closure Property

• All subsets of a frequent set are also frequent
  – If A ⊆ B, then sup(A) ≥ sup(B), because every transaction that contains B also contains A.
  – Thus, if B is frequent, then A is frequent.

• All supersets of an infrequent set are also infrequent
  – If A ⊇ B, then sup(A) ≤ sup(B), because every transaction that contains A also contains B.
  – Thus, if B is infrequent, then A is infrequent.

• Find frequent itemsets by exploiting the downward closure property of support to prune the search space.
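
One common way to exploit downward closure is a level-wise (Apriori-style) search. The slides do not name a specific algorithm, so the sketch below is an assumption about how the pruning could be done; the function name `frequent_itemsets` and the toy data are hypothetical.

```python
from itertools import combinations

def frequent_itemsets(D, min_sup):
    """Level-wise search: a k-itemset is counted only if all of its
    (k-1)-subsets were found frequent on the previous level."""
    def sup(x):
        return sum(1 for t in D if x <= t)

    items = sorted({item for t in D for item in t})
    level = {frozenset([i]) for i in items if sup(frozenset([i])) >= min_sup}
    frequent = {}
    k = 1
    while level:
        for x in level:
            frequent[x] = sup(x)
        k += 1
        # Join step: unions of two frequent (k-1)-itemsets that have exactly k items.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step (downward closure): drop candidates with an infrequent subset.
        candidates = {c for c in candidates
                      if all(frozenset(s) in level for s in combinations(c, k - 1))}
        level = {c for c in candidates if sup(c) >= min_sup}
    return frequent

# Hypothetical toy transactions.
D = [
    frozenset({"beer", "pretzels", "milk"}),
    frozenset({"beer", "pretzels"}),
    frozenset({"beer", "milk"}),
    frozenset({"milk", "eggs"}),
    frozenset({"beer", "pretzels", "eggs"}),
]
freq = frequent_itemsets(D, min_sup=2)
for itemset, count in sorted(freq.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

In this data the candidate {beer, milk, pretzels} is never counted, because its subset {milk, pretzels} is infrequent.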


Minimum Support Constraint

• Remove any rules that do not meet a minimum support (minSup).

• Find all rules such that sup(X → Y) ≥ minSup

• Quickly removes obviously negative rules without need for complex statistical calculations.
  – {male} → pregnant

• Support is a good first step to reduce the dataset to something more manageable. Depending on the dataset, a huge percentage of rules are eliminated.

• However, it allows many false discoveries through.
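
As a small illustration of the constraint, the sketch below filters a list of candidate rules by minimum support. The transactions, the candidate list, and the threshold `min_sup = 1` are assumptions made for the example.

```python
# Hypothetical transactions and candidate rules.
D = [
    frozenset({"male"}),
    frozenset({"male"}),
    frozenset({"female", "pregnant"}),
    frozenset({"female"}),
]

def sup(x, y, D):
    """sup(x -> y): transactions containing x and the item y."""
    return sum(1 for t in D if x <= t and y in t)

candidates = [
    (frozenset({"male"}), "pregnant"),
    (frozenset({"female"}), "pregnant"),
]

min_sup = 1
kept = [(x, y) for x, y in candidates if sup(x, y, D) >= min_sup]
print(kept)  # {male} -> pregnant has support 0 and is removed
```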


Coverage

• Measure of how often a given rule is applicable within the transaction database.

• y is ignored

• Also has normalized version (range 0..1)

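$$\mathrm{cov}(x \rightarrow y) = |\{i : x \subseteq t_i\}| = \mathrm{sup}(x)$$

$$\mathrm{cov}_n(x \rightarrow y) = \frac{|\{i : x \subseteq t_i\}|}{n} = \mathrm{sup}_n(x)$$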


Confidence

• Also called “strength”
• The ratio of transactions containing x and y to those containing just x.
• Percent of transactions with x that also contain y.
• Range: 0 (no transactions) to 1 (all transactions) [normalized by definition]
• Division by zero is not a problem provided the minimum support constraint is applied prior to the confidence calculation.
• Removes many more false discoveries, but does not remove them all.

Introduced by: Agrawal, Imielinski, and Swami (1993)

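$$\mathrm{conf}(x \rightarrow y) = \frac{|\{i : x \subseteq t_i \text{ and } y \in t_i\}|}{|\{i : x \subseteq t_i\}|} = \frac{\mathrm{sup}(x \rightarrow y)}{\mathrm{sup}(x)} = \frac{\mathrm{sup}(x \rightarrow y)}{\mathrm{cov}(x \rightarrow y)} = \frac{\mathrm{sup}_n(x \rightarrow y)}{\mathrm{sup}_n(x)} = \frac{\mathrm{sup}_n(x \rightarrow y)}{\mathrm{cov}_n(x \rightarrow y)}$$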
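
Confidence follows directly from the two support counts. A minimal sketch on invented transactions, assuming (as this slide notes) that rules with sup(x) = 0 were already removed by the minimum support constraint:

```python
# Hypothetical toy transactions.
D = [
    frozenset({"beer", "pretzels", "milk"}),
    frozenset({"beer", "pretzels"}),
    frozenset({"beer", "milk"}),
    frozenset({"milk", "eggs"}),
    frozenset({"beer", "pretzels", "eggs"}),
]

def conf(x, y, D):
    """conf(x -> y) = sup(x -> y) / sup(x); assumes sup(x) > 0."""
    sup_x = sum(1 for t in D if x <= t)
    sup_xy = sum(1 for t in D if x <= t and y in t)
    return sup_xy / sup_x

# 3 of the 4 transactions containing beer also contain pretzels.
print(conf(frozenset({"beer"}), "pretzels", D))
```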


Minimum Confidence Constraint

• Used as a second step after establishing minimum support.

• Produce rules from the frequent itemsets that exceed a minimum confidence threshold.

• Sensitive to the frequency of the consequent (Y). Consequents with higher support will automatically produce higher confidence values even if there exists no association between the items.
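
The sensitivity to the consequent's frequency can be seen on a constructed dataset. In this sketch (all data invented for illustration) "bread" appears in 90% of transactions and is independent of "batteries", yet the rule {batteries} → bread still reaches 90% confidence.

```python
# Hypothetical data: bread is in 18 of 20 transactions, batteries in 10 of 20,
# and the two co-occur exactly as often as independence predicts (9 of 20).
D = (
    [frozenset({"batteries", "bread"})] * 9
    + [frozenset({"batteries"})] * 1
    + [frozenset({"bread"})] * 9
    + [frozenset({"milk"})] * 1
)

def conf(x, y, D):
    """conf(x -> y) = sup(x -> y) / sup(x)."""
    sup_x = sum(1 for t in D if x <= t)
    sup_xy = sum(1 for t in D if x <= t and y in t)
    return sup_xy / sup_x

rule_conf = conf(frozenset({"batteries"}), "bread", D)
base_rate = sum(1 for t in D if "bread" in t) / len(D)

# High confidence, but no better than the base rate of the consequent.
print(rule_conf, base_rate)
```

A minimum confidence threshold of, say, 0.8 would accept this rule even though knowing x tells us nothing new about y.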


Minimum Improvement Constraint

• A measure of unique improvement in confidence over previously calculated confidence measures.

• If conf(x → y) is not sufficiently greater than the maximum confidence of the subsets of x, then the rule does not qualify as “interesting.”

• Careful – if the minimum improvement constraint is set high enough to exclude the majority of uninteresting cases, it is also likely to exclude many productive rules.

$$\mathrm{imp}(x \rightarrow y) = \mathrm{conf}(x \rightarrow y) - \max_{z \subset x} \mathrm{conf}(z \rightarrow y)$$
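
Improvement can be sketched by enumerating the proper subsets of x. The code below assumes the empty set counts as a proper subset, so that conf(∅ → y) is simply the overall frequency of y; that choice and the toy data are assumptions for illustration.

```python
from itertools import combinations

# Hypothetical transactions; "dataminer" is unrelated to "oedema".
D = [
    frozenset({"pregnant", "dataminer", "oedema"}),
    frozenset({"pregnant", "dataminer"}),
    frozenset({"pregnant", "oedema"}),
    frozenset({"pregnant", "oedema"}),
    frozenset({"dataminer"}),
    frozenset({"male"}),
]

def conf(x, y, D):
    """conf(x -> y); for the empty antecedent this is the frequency of y."""
    covered = [t for t in D if x <= t]
    return sum(1 for t in covered if y in t) / len(covered)

def imp(x, y, D):
    """conf(x -> y) minus the best confidence among proper subsets z of x."""
    proper_subsets = (frozenset(s) for k in range(len(x)) for s in combinations(x, k))
    return conf(x, y, D) - max(conf(z, y, D) for z in proper_subsets)

print(imp(frozenset({"pregnant"}), "oedema", D))               # positive
print(imp(frozenset({"pregnant", "dataminer"}), "oedema", D))  # negative
```

Here the longer rule has negative improvement, so any minimum improvement threshold of zero or more rejects it.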


Lift

• Also called “improvement”
• Ratio of the probability that x and y occur together to the product of the two individual probabilities for x and y.
• Measure of what is gained by using the rule relative to a base rate in which the rule is not used.
• Division by zero is not a problem provided the minimum support constraint is applied prior to the lift calculation.
• Range: 0 to ∞. A value of 1 indicates independence, values above 1 a positive relationship, and values below 1 a negative relationship.

Introduced by: Brin, Motwani, Ullman, and Tsur (1997)

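$$\mathrm{lift}(x \rightarrow y) = \frac{\mathrm{conf}(x \rightarrow y)}{\mathrm{sup}_n(y)} = \frac{\mathrm{sup}(x \rightarrow y)}{\mathrm{sup}(x)\,\mathrm{sup}_n(y)} = \frac{n \cdot \mathrm{sup}(x \rightarrow y)}{\mathrm{sup}(x)\,\mathrm{sup}(y)} = \frac{\mathrm{sup}_n(x \rightarrow y)}{\mathrm{sup}_n(x)\,\mathrm{sup}_n(y)}$$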
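
A minimal sketch of lift computed from raw counts, using the form n · sup(x → y) / (sup(x) · sup(y)). The transactions are invented, and the code assumes both sup(x) and sup(y) are non-zero.

```python
# Hypothetical toy transactions.
D = [
    frozenset({"beer", "pretzels", "milk"}),
    frozenset({"beer", "pretzels"}),
    frozenset({"beer", "milk"}),
    frozenset({"milk", "eggs"}),
    frozenset({"beer", "pretzels", "eggs"}),
]

def lift(x, y, D):
    """lift(x -> y) = n * sup(x -> y) / (sup(x) * sup(y))."""
    n = len(D)
    sup_x = sum(1 for t in D if x <= t)
    sup_y = sum(1 for t in D if y in t)
    sup_xy = sum(1 for t in D if x <= t and y in t)
    return n * sup_xy / (sup_x * sup_y)

print(lift(frozenset({"beer"}), "pretzels", D))  # above 1: positive relationship
print(lift(frozenset({"milk"}), "eggs", D))      # below 1: negative relationship
```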


Leverage

• Measures the proportion of additional transactions covered by both x and y above those expected if x and y were independent of each other.

• A rule with higher frequency and lower lift may be more interesting than an alternate rule with lower frequency and higher lift.

• Range: -0.25 to 0.25. Zero indicates independence, positive values a positive relationship, and negative values a negative relationship.

Introduced by: Piatetsky-Shapiro (1991)

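$$\mathrm{lev}(x \rightarrow y) = \mathrm{sup}_n(x \rightarrow y) - \mathrm{sup}_n(x)\,\mathrm{sup}_n(y)$$

$$\mathrm{lev}(x \rightarrow y) = \mathrm{sup}_n(x \rightarrow y) - \frac{\mathrm{sup}(x)\,\mathrm{sup}(y)}{n^2}$$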
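
The contrast between lift and leverage is easiest to see with numbers. The sketch below works from raw counts; the two rules and their counts are invented for illustration.

```python
def lift(n, sup_x, sup_y, sup_xy):
    """lift = n * sup(x -> y) / (sup(x) * sup(y))."""
    return n * sup_xy / (sup_x * sup_y)

def leverage(n, sup_x, sup_y, sup_xy):
    """lev = sup_n(x -> y) - sup_n(x) * sup_n(y)."""
    return sup_xy / n - (sup_x / n) * (sup_y / n)

n = 1000
# Hypothetical rule A: frequent items, moderate lift.
rule_a = dict(sup_x=500, sup_y=500, sup_xy=400)
# Hypothetical rule B: rare items that always occur together, very high lift.
rule_b = dict(sup_x=10, sup_y=10, sup_xy=10)

lift_a, lev_a = lift(n, **rule_a), leverage(n, **rule_a)
lift_b, lev_b = lift(n, **rule_b), leverage(n, **rule_b)
print(lift_a, lev_a)
print(lift_b, lev_b)
```

Rule B has by far the higher lift, but rule A covers many more additional transactions than independence would predict, which is what leverage rewards.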


Using the interestingness measures

• In most cases, it is sufficient to focus on a combination of support, confidence, and lift or leverage to quantitatively measure the overall “quality” or “interestingness” of a rule.

• The real value of a rule depends heavily on the particular domain and research objectives.

• Usefulness and actionability are also used to judge the value of a rule. Both are purely subjective and are not mathematically defined.


References

• Agrawal, R., Imielinski, T., and Swami, A. “Mining association rules between sets of items in large databases.” Proceedings of the ACM SIGMOD International Conference on Management of Data (ACM SIGMOD ’93), pages 207-216, Washington DC, May 1993.

• Brin, S., Motwani, R., Ullman, J. D., and Tsur, S. “Dynamic itemset counting and implication rules for market basket data.” Proceedings of the ACM SIGMOD International Conference on Management of Data (ACM SIGMOD ’97), pages 255-264, Tucson, Arizona, May 1997.

• Silverstein, C., Brin, S., Motwani, R., and Ullman, J. “Scalable Techniques for Mining Causal Structures.” Proceedings of the 24th VLDB Conference, pages 594-605, New York City, 1998.

• Piatetsky-Shapiro, G. “Discovery, analysis, and presentation of strong rules.” Knowledge Discovery in Databases, pages 229-248, 1991.

• Webb, G. I. “Discovering Significant Rules.” KDD ’06, pages 434-443, Philadelphia, Pennsylvania, August 2006.

• Zeller, R. A. Personal correspondence, October 2006.


Appendix A – Hypothesis Testing

• Stronger filter

• Can focus on independence between x and y, or to test for unproductive rules.

• Compares x → y only against the global frequency of y and against each of its immediate generalizations x\{z} → y, where z ∈ x.


Hypothesis Testing

For each rule, calculate a, b, c, and d, as follows:

$a = |\{i : x \subseteq t_i \text{ and } y \in t_i\}| = \mathrm{sup}(x \rightarrow y)$
number of transactions that contain x and y

$b = |\{i : x \subseteq t_i \text{ and } y \notin t_i\}|$
number of transactions that contain x but not y

$c = |\{i : x \setminus \{z\} \subseteq t_i \text{ and } z \notin t_i \text{ and } y \in t_i\}|$
number of transactions that contain y and all the x values other than z but not z

$d = |\{i : x \setminus \{z\} \subseteq t_i \text{ and } z \notin t_i \text{ and } y \notin t_i\}|$
number of transactions that contain all the x values other than z but neither y nor z


Hypothesis Testing

• Calculate p-value according to the following formula:

$$p = \sum_{i=0}^{\min(b,c)} \frac{(a+b)!\,(c+d)!\,(a+c)!\,(b+d)!}{(a+b+c+d)!\,(a+i)!\,(b-i)!\,(c-i)!\,(d+i)!}$$

• Avoids the problem of setting an appropriate minimum improvement constraint.

• Rejects all rules for which there is insufficient evidence that improvement is greater than zero.
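
The p-value on this slide is the one-sided Fisher exact test: a sum of hypergeometric terms over i = 0 … min(b, c), each with numerator (a+b)!(c+d)!(a+c)!(b+d)! and denominator (a+b+c+d)!(a+i)!(b−i)!(c−i)!(d+i)!. A minimal sketch of that sum follows; the function name `fisher_p` and the example counts are assumptions, and exact fractions are used to avoid rounding with large factorials.

```python
from fractions import Fraction
from math import factorial as f

def fisher_p(a, b, c, d):
    """One-sided Fisher exact test p-value for the 2x2 counts a, b, c, d."""
    numerator = f(a + b) * f(c + d) * f(a + c) * f(b + d)
    total = Fraction(0)
    for i in range(min(b, c) + 1):
        denominator = (f(a + b + c + d) * f(a + i) * f(b - i)
                       * f(c - i) * f(d + i))
        total += Fraction(numerator, denominator)
    return total

# Hypothetical counts: a=3, b=1, c=1, d=3.
p = fisher_p(3, 1, 1, 3)
print(p, float(p))  # 17/70, about 0.243
```

With these counts the rule would not be accepted at a 0.05 significance level; the threshold itself is a choice left to the analyst.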


Appendix B – Causation Requirements

• Correlation

• Temporal priority

• Non-spuriousness


Correlation

• Standard statistical measure to determine association

• Range:
  – -1 (strong negative)
  – to 0 (no correlation)
  – to 1 (strong positive)

• “Correlation does not imply causation.”
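
The slides do not say which correlation measure is meant; the sketch below assumes the Pearson correlation coefficient, which has the stated range of -1 to 1. The function name `pearson` and the sample values are illustrative.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

xs = [1, 2, 3, 4]
print(pearson(xs, [2, 4, 6, 8]))      # perfectly positive
print(pearson(xs, [8, 6, 4, 2]))      # perfectly negative
print(pearson(xs, [1, -1, -1, 1]))    # no linear relationship
```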


Correlation Examples (positive)

[Figure: six scatter plots illustrating correlation coefficients of .245, .510, .000, .731, .892, and 1.000]


Correlation Examples (negative)

[Figure: six scatter plots illustrating correlation coefficients of -.245, -.510, .000, -.731, -.892, and -1.000]


Temporal priority

• X must precede Y.

• Easy to measure in some cases.
  – “The fever occurred before the chicken pox formed.”

• Difficult to measure in others.
  – “She bought the milk before the eggs.”

• Impossible in some cases (e.g. anything male)

• Simultaneous Reverse Causation
  – “Statistical magic” to justify that X causes Y and Y causes X at the same time.

• Important note: the time of measurement is not necessarily the same as time of occurrence.


Non-spurious vs. Spurious

• Non-spurious: the correlation between X and Y is not the result of the causal influence of an external variable.

• Spurious: the correlation between X and Y is the result of the causal influence of an external variable.

[Diagrams: non-spurious, X → Y; spurious, X and Y with an external variable Z]
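
A spurious correlation is easy to simulate. In the sketch below, which is an invented example rather than anything from the cited papers, an external variable Z drives both X and Y; X and Y never influence each other, yet they are strongly correlated until Z's contribution is removed. The seed, sample size, and noise level are arbitrary assumptions.

```python
import random
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

rng = random.Random(0)  # fixed seed so the run is repeatable
z = [rng.gauss(0, 1) for _ in range(2000)]
# X and Y each depend on Z plus their own independent noise.
x = [zi + 0.5 * rng.gauss(0, 1) for zi in z]
y = [zi + 0.5 * rng.gauss(0, 1) for zi in z]

r_xy = pearson(x, y)
# Subtract Z's known contribution; what is left is independent noise.
r_residual = pearson([xi - zi for xi, zi in zip(x, z)],
                     [yi - zi for yi, zi in zip(y, z)])
print(r_xy, r_residual)
```

In real data Z's contribution is not known exactly, so it has to be measured and controlled for statistically; the subtraction here is possible only because the data are simulated.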


Spurious Family Circus


Spurious Simpsons

• An entertaining demonstration of this fallacy once appeared in an episode of The Simpsons (Season 7, "Much Apu About Nothing"). The city had just spent millions of dollars creating a highly sophisticated "Bear Patrol" in response to the sighting of a single bear the week before.

Homer: Not a bear in sight. The "Bear Patrol" is working like a charm!

Lisa: That's specious reasoning, Dad.

Homer: [uncomprehendingly] Thanks, honey.

Lisa: By your logic, I could claim that this rock keeps tigers away.

Homer: Hmm. How does it work?

Lisa: It doesn't work. (pause) It's just a stupid rock!

Homer: Uh-huh.

Lisa: But I don't see any tigers around, do you?

Homer: (pause) Lisa, I want to buy your rock.


Spurious Dilbert


Spurious Relationships

• These are all known strong correlations. What is the actual cause of each?
  – ice-cream sales and drowning occurrences
  – number of firemen at a fire and dollar value of damage caused
  – college students having more sex get better grades
  – volume of beer purchased at Mardi Gras and volume of water in the Mississippi River
  – voters cause more auto-accidents than non-voters
  – depression causes loneliness vs. loneliness causes depression


Spurious Relationships

• Sleeping with one's shoes on is strongly correlated with waking up with a headache. Therefore, sleeping with one's shoes on causes headache. The above example commits the correlation implies causation fallacy, as it prematurely concludes that sleeping with one's shoes on causes headache. A more plausible explanation is that both are caused by a third factor, in this case alcohol intoxication, which thereby gives rise to a correlation.

• Young children who sleep with the light on are much more likely to develop myopia in later life. This result of a study at University of Pennsylvania Medical Center was published in the May 13, 1999, issue of Nature and received much coverage at the time in the popular press. However a later study at Ohio State University did not find any link between infants sleeping with the light on and developing myopia but did find a strong link between parental myopia and the development of child myopia and also noted that myopic parents were more likely to leave a light on in their children's bedroom.

• Since the 1950s, both the atmospheric CO2 level and crime levels have increased sharply. Hence, atmospheric CO2 causes crime. The above example arguably makes the mistake of prematurely concluding a causal relationship where the relationship between the variables, if any, is so complex it may be labeled coincidental. The two events have no simple relationship to each other beside the fact that they are occurring at the same time.

• Not eating causes anorexia nervosa. Having the disease Anorexia Nervosa may be the cause of not eating. It is correct that not eating does cause anorexia nervosa, but it can also be claimed that having developed anorexia nervosa causes one not to eat. Empirical evidence would be necessary to make a causative statement.

• Scientific research finds that people who use cannabis (A) have a higher prevalence of psychiatric disorders compared to those who do not (B). This particular correlation is sometimes used to support the theory that the use of cannabis causes a psychiatric disorder (A is the cause of B). Although this may be possible, we cannot automatically discern a cause and effect relationship from research that has only determined people who use cannabis are more likely to develop a psychiatric disorder. From the same research, it can also be the case that (1) having the predisposition for a psychiatric disorder causes these individuals to use cannabis (B causes A), or (2) some unknown third factor (e.g., poverty) is the actual cause of the higher number of people (compared to the general public) who both use cannabis and have been diagnosed as having a psychiatric disorder. Alternatively, it may be that the effects of cannabis are found more pleasurable by persons with certain psychiatric disorders. To assume that A causes B is tempting, but further scientific investigation of the type that can isolate extraneous variables is needed when research has only determined a statistical correlation.

Source: Wikipedia
