© Copyright 2008 STI INNSBRUCK – www.sti-innsbruck.at

Rule Learning

Intelligent Systems – Lecture 10


1) Motivation (what is the problem solved)

2) Technical solution
• Definitions
• Explanations
• Procedures
• Illustrations by short examples

3) Illustration by a larger example

4) Extensions (Work only sketched)


Motivation

• Data warehouses and other large-scale data collections provide vast amounts of data and hide knowledge that is not obvious to discover.

• Rule learning (association rule learning) is a popular means for discovering interesting relations between items in large data sets.

• Rules enable the inference of knowledge and make hidden facts explicit.


Motivating Example

• Association rule learning is very popular in marketing:

– People buying Coke and beer are very likely to also buy potato chips: {Coke, beer} ⇒ {potato chips}.

– Retail Advertising and Marketing Association: "For example, convenience stores realized that new fathers came in to buy Pampers and they also bought beer. So where are the Pampers now? Near the beer. It's about getting a feel for what other things you can merchandise."

• Other popular application domains include, amongst others, Web usage mining, intrusion detection, and bioinformatics.


Definition by Agrawal

• Let I = {i1, i2, ..., in} be a set of binary attributes called items, and let D = {t1, t2, ..., tm} be a set of transactions called the database.

• Each transaction (tuple in the database) contains a subset of the items in I.

• A rule is then defined to be of the form A → C (an implication), where A, C ⊆ I and A ∩ C = ∅.

• The left-hand side (A) is called the antecedent; the right-hand side (C) is referred to as the consequent.


Example

• [Simple example from Wikipedia about milk, butter, bread and beer]
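As a stand-in for that example, here is a minimal sketch in Python of such a transaction database; the items and transactions below are illustrative assumptions, not the exact table from the Wikipedia article:

```python
# Assumed toy database over the items milk, bread, butter and beer.
I = {"milk", "bread", "butter", "beer"}      # the set of items
D = [                                        # the database of transactions
    frozenset({"milk", "bread"}),
    frozenset({"bread", "butter"}),
    frozenset({"beer"}),
    frozenset({"milk", "bread", "butter"}),
    frozenset({"bread", "butter"}),
]

# A rule A -> C is a pair of disjoint itemsets, e.g. {bread, butter} -> {milk}.
A, C = frozenset({"bread", "butter"}), frozenset({"milk"})
assert A <= I and C <= I and not (A & C)
```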


Significance of Rules

• Association rule learning only makes sense in the context of very large data sets.

• In very large data sets, hundreds if not thousands of implications are discoverable.

• The significance of and interest in a rule are therefore important selection criteria; only those rules that represent a sufficiently large share of the whole data set can be considered relevant.

• The support supp(A) of an itemset A is defined as the proportion of transactions ti ∈ D that contain A.
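A minimal sketch of this support computation, reusing the kind of toy database assumed in the example above:

```python
from fractions import Fraction

# Assumed toy database, as in the earlier sketch.
D = [
    frozenset({"milk", "bread"}),
    frozenset({"bread", "butter"}),
    frozenset({"beer"}),
    frozenset({"milk", "bread", "butter"}),
    frozenset({"bread", "butter"}),
]

def supp(itemset, transactions):
    """Proportion of transactions that contain every item of `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return Fraction(hits, len(transactions))

print(supp(frozenset({"bread", "butter"}), D))   # 3/5
```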


Confidence

• The confidence in a rule depends on the support of the itemset A and the support of the union of A and C.

• Confidence: conf(A → C) = supp(A ⋃ C) / supp(A).

• The confidence is an estimated measure of probability, formally written as P(C|A), and gives an indication of how likely the consequent is to hold when the antecedent is given.
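Confidence then follows directly from two support values; a small self-contained sketch under the same toy-database assumption:

```python
from fractions import Fraction

D = [
    frozenset({"milk", "bread"}),
    frozenset({"bread", "butter"}),
    frozenset({"beer"}),
    frozenset({"milk", "bread", "butter"}),
    frozenset({"bread", "butter"}),
]

def supp(itemset):
    return Fraction(sum(1 for t in D if itemset <= t), len(D))

def conf(A, C):
    """conf(A -> C) = supp(A u C) / supp(A), an estimate of P(C|A)."""
    return supp(A | C) / supp(A)

A, C = frozenset({"bread", "butter"}), frozenset({"milk"})
print(conf(A, C))   # supp(A u C) = 1/5, supp(A) = 3/5, so confidence 1/3
```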


Finding Rules

• Association rules need to satisfy a given level of confidence and a given degree of support at the same time.

• A two-step process is generally applied to discover rules that satisfy both requirements:

– A minimal support threshold is used to determine the frequent itemsets.

– A minimum confidence threshold is applied to determine the rules.

• The first step is significantly more challenging than the second one!
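A brute-force sketch of the two-step process (workable only for toy data; the thresholds and the toy database are assumptions):

```python
from itertools import combinations

D = [frozenset(t) for t in (
    {"milk", "bread"}, {"bread", "butter"}, {"beer"},
    {"milk", "bread", "butter"}, {"bread", "butter"},
)]
items = sorted(set().union(*D))

def supp(itemset):
    return sum(1 for t in D if itemset <= t) / len(D)

# Step 1: determine the frequent itemsets (minimal support 0.4) by brute force.
frequent = [frozenset(c)
            for k in range(1, len(items) + 1)
            for c in combinations(items, k)
            if supp(frozenset(c)) >= 0.4]

# Step 2: split each frequent itemset into antecedent and consequent and keep
# only the rules that reach the minimum confidence threshold (0.7).
rules = []
for itemset in frequent:
    for k in range(1, len(itemset)):
        for antecedent in combinations(itemset, k):
            A = frozenset(antecedent)
            if supp(itemset) / supp(A) >= 0.7:
                rules.append((set(A), set(itemset - A)))

for A, C in rules:
    print(A, "->", C)
```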


Finding Frequent Itemsets

• The number of possible itemsets is given by the power set of I and is thus equal to 2^n – 1 (excluding the empty set), with n = |I|.

• Consequently, the number of potential itemsets grows exponentially in the size of I.

• Different algorithms nonetheless allow the frequent itemsets to be computed efficiently: Apriori (BFS), Eclat (DFS), FP-Growth (FP-tree).

• These algorithms exploit the downward-closure property (aka anti-monotonicity): all subsets of a frequent itemset are frequent, and consequently all supersets of an infrequent itemset are infrequent.
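A hedged sketch of how a level-wise (Apriori-style) search can use this property to prune candidates before counting them against the database; the variable names and toy data are assumptions:

```python
from itertools import combinations

D = [frozenset(t) for t in (
    {"milk", "bread"}, {"bread", "butter"}, {"beer"},
    {"milk", "bread", "butter"}, {"bread", "butter"},
)]
min_support = 0.4

def supp(itemset):
    return sum(1 for t in D if itemset <= t) / len(D)

# Level 1: frequent single items.
level = {frozenset({i}) for i in set().union(*D)
         if supp(frozenset({i})) >= min_support}
frequent = set(level)

k = 2
while level:
    # Join frequent (k-1)-itemsets into candidate k-itemsets ...
    candidates = {a | b for a in level for b in level if len(a | b) == k}
    # ... and prune every candidate with an infrequent (k-1)-subset: by downward
    # closure it cannot be frequent, so it is never counted against the database.
    candidates = {c for c in candidates
                  if all(frozenset(s) in level for s in combinations(c, k - 1))}
    level = {c for c in candidates if supp(c) >= min_support}
    frequent |= level
    k += 1

print(sorted(sorted(s) for s in frequent))
```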


Alternative Indicators

• All-confidence: all rules that can be derived from itemset A have at least a confidence of

all-confidence(A) = supp(A) / max(supp(a ∈ A)), where max(supp(a ∈ A)) is the support of the item with the highest support in A.

• Collective strength is given by cs(A) = (1-v(A))/(1-E[v(A)]) * E[v(A)]/v(A)

with v(Z) the violation rate of an itemset Z and E[ ] the expected value under the assumption of independent items; the violation rate is defined as the fraction of transactions that contain some but not all of the items in an itemset. Collective strength is 0 for perfectly negatively correlated items, infinite for perfectly positively correlated items, and 1 if the items co-occur exactly as expected under independence.
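Both indicators can be computed from the item supports and the violation rate; a sketch under the same toy-database assumption, where the expected violation rate is taken as 1 − P(all items present) − P(no item present) for independent items:

```python
from math import prod

D = [frozenset(t) for t in (
    {"milk", "bread"}, {"bread", "butter"}, {"beer"},
    {"milk", "bread", "butter"}, {"bread", "butter"},
)]

def supp(itemset):
    return sum(1 for t in D if itemset <= t) / len(D)

def all_confidence(A):
    """supp(A) divided by the largest single-item support within A."""
    return supp(A) / max(supp(frozenset({a})) for a in A)

def collective_strength(A):
    # Observed violation rate: transactions containing some but not all of A.
    v = sum(1 for t in D if (t & A) and not (A <= t)) / len(D)
    # Expected violation rate for independent items:
    # 1 - P(all items present) - P(no item present).
    p = [supp(frozenset({a})) for a in A]
    ev = 1 - prod(p) - prod(1 - pi for pi in p)
    return (1 - v) / (1 - ev) * (ev / v)   # assumes v > 0 for this toy data

A = frozenset({"bread", "butter"})
print(all_confidence(A), collective_strength(A))
```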


Alternative Indicators (2)

• Coverage (aka antecedent support) measures how often a rule is applicable in a database:

coverage(A → C) = supp(A) = P(A)

• The conviction of a rule compares the expected frequency of the antecedent A appearing without the consequent C (assuming independence) with the observed frequency of such incorrect predictions:

conv(A → C) = (1 - supp(C)) / (1 - conf(A → C)).

• Leverage measures the difference between how often A and C appear together in the data set and how often they would be expected to appear together if A and C were statistically independent:

leverage(A → C) = P(A and C) - P(A)P(C)
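Coverage, conviction and leverage all reduce to simple arithmetic on support and confidence values; a small sketch with assumed numbers:

```python
def coverage(supp_A):
    """coverage(A -> C) = supp(A) = P(A)."""
    return supp_A

def conviction(supp_C, conf_AC):
    """conv(A -> C) = (1 - supp(C)) / (1 - conf(A -> C))."""
    return (1 - supp_C) / (1 - conf_AC)

def leverage(supp_AC, supp_A, supp_C):
    """leverage(A -> C) = P(A and C) - P(A) * P(C)."""
    return supp_AC - supp_A * supp_C

# Assumed example values: supp(A) = 0.6, supp(C) = 0.4, supp(A u C) = 0.3.
supp_A, supp_C, supp_AC = 0.6, 0.4, 0.3
conf_AC = supp_AC / supp_A
print(coverage(supp_A), conviction(supp_C, conf_AC),
      leverage(supp_AC, supp_A, supp_C))
```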


Alternative Indicators (3)

• The lift of a rule is the ratio between the rule's confidence and the support of the consequent; it measures how many times more often A and C occur together than expected if they were statistically independent:

lift(A → C) = conf(A → C) / supp(C)

= supp(A ⋃ C) / (supp(A) * supp(C)).
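A short check that the two formulations of lift agree, again with assumed support values:

```python
supp_A, supp_C, supp_AC = 0.6, 0.4, 0.3    # assumed support values

conf_AC = supp_AC / supp_A                 # confidence of A -> C
lift_via_conf = conf_AC / supp_C           # conf(A -> C) / supp(C)
lift_via_supp = supp_AC / (supp_A * supp_C)

# Both print 1.25: A and C co-occur 1.25x more often than under independence.
print(lift_via_conf, lift_via_supp)
```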


Decision Trees as Rules

• Rules can be represented by a decision tree that takes the general form

if item1 then subtree1

elseif item2 then subtree2

elseif...

• There are as many leaf nodes in a decision tree as there are rules; thus the tree represents a whole set of rules.
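One way to make the correspondence concrete is to represent the tree as nested dictionaries and enumerate every root-to-leaf path as a rule; the weather-style tree below is an assumed toy example:

```python
# A toy decision tree: inner nodes map "attribute=value" tests to subtrees,
# leaves are conclusions (strings).
tree = {
    "outlook=sunny":    {"humidity=high": "no", "humidity=normal": "yes"},
    "outlook=overcast": "yes",
    "outlook=rain":     {"wind=strong": "no", "wind=weak": "yes"},
}

def rules(node, conditions=()):
    """Yield one (conditions, conclusion) rule per leaf of the tree."""
    if isinstance(node, str):                # leaf -> one complete rule
        yield conditions, node
    else:                                    # inner node -> recurse per branch
        for test, subtree in node.items():
            yield from rules(subtree, conditions + (test,))

for conds, conclusion in rules(tree):
    print("if", " and ".join(conds), "then", conclusion)
# 5 leaves -> 5 rules
```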


Decision-Driven Rules

• The following definitions apply to rules that aim to conclude a fact from a given set of attribute-value assignments.

• The decision tree thus takes the following form

if attribute1 = value1 then subtree1

elseif attribute1 = value2 then subtree2

elseif...

• The critical question is then: which attribute should be the first one to evaluate, i.e. which attribute is the most selective determiner and should therefore come first in the decision tree?


Entropy

• Entropy is a measure of the 'degree of doubt' and is a well-studied concept in information theory.

• Let {c1, c2, ..., cn} be the set of conclusions C of a rule (consequences); let {x1, x2, ..., xm} be the set of possible values of an attribute X.

• The probability that ci is true given that X has value xj is given by p(ci|xj).

• Entropy is then defined as

entropy(xj) = - Σ p(ci|xj) log2[p(ci|xj)], summing over i ∈ 1...|C|

• The logarithm (- log2[p(ci|xj)]) indicates the 'amount of information' that xj has to offer about the conclusion ci.
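A minimal sketch of this per-value entropy, with assumed conditional probabilities p(ci|xj):

```python
from math import log2

def entropy(cond_probs):
    """- sum_i p(ci|xj) * log2 p(ci|xj) over the conclusions ci."""
    return -sum(p * log2(p) for p in cond_probs if p > 0)

# Assumed distributions of conclusions given X = xj:
print(entropy([0.5, 0.5]))   # 1.0 bit: maximal doubt between two conclusions
print(entropy([1.0]))        # 0.0 bits: the value xj fully determines ci
```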


Most Useful Determiner

• The lower the entropy of xj with respect to C, the more information xj has to offer about C.

• The entropy of an attribute X with respect to C is then given by - Σj p(xj) Σi p(ci|xj) log2[p(ci|xj)].

• The attribute with the lowest entropy is the most useful determiner, as it has the lowest 'degree of doubt'.
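Putting the two formulas together, a sketch that computes the attribute entropies on a small assumed data set and would pick the attribute with the lowest value:

```python
from collections import Counter
from math import log2

# Assumed toy data: one dict of attribute values per example, plus a conclusion.
data = [
    ({"outlook": "sunny",    "wind": "weak"},   "no"),
    ({"outlook": "sunny",    "wind": "strong"}, "no"),
    ({"outlook": "rain",     "wind": "weak"},   "yes"),
    ({"outlook": "rain",     "wind": "strong"}, "no"),
    ({"outlook": "overcast", "wind": "weak"},   "yes"),
]

def attribute_entropy(attr):
    """- sum_j p(xj) * sum_i p(ci|xj) * log2 p(ci|xj)."""
    total = 0.0
    values = Counter(x[attr] for x, _ in data)
    for value, count in values.items():
        conclusions = Counter(c for x, c in data if x[attr] == value)
        h = -sum((n / count) * log2(n / count) for n in conclusions.values())
        total += (count / len(data)) * h
    return total

for attr in ("outlook", "wind"):
    print(attr, attribute_entropy(attr))
# The attribute with the lowest entropy is the most useful determiner.
```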


ID3 Algorithm by Quinlan

1. For each attribute, compute its entropy with respect to the conclusion;

2. Select the attribute with lowest entropy (say X)

3. Divide the data into separate sets so that within a set, X has a fixed value (X=x1 in one set, X=x2 in another...)

4. Build a tree with branches:

if X=x1 then ... (subtree1)

if X=x2 then ... (subtree2)

...

5. For each subtree, repeat from step 1.

6. At each iteration, one attribute gets removed from consideration. STOP if no more attributes are left, or if all attribute values lead to the same conclusion.
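A compact sketch of these steps in Python; the data format and helper names are assumptions, and details such as tie-breaking and pruning are left out:

```python
from collections import Counter
from math import log2

def attribute_entropy(data, attr):
    """Entropy of `attr` with respect to the conclusions (step 1)."""
    total = 0.0
    for value in {x[attr] for x, _ in data}:
        subset = [c for x, c in data if x[attr] == value]
        counts = Counter(subset)
        h = -sum((n / len(subset)) * log2(n / len(subset))
                 for n in counts.values())
        total += (len(subset) / len(data)) * h
    return total

def id3(data, attributes):
    conclusions = {c for _, c in data}
    if len(conclusions) == 1:             # all examples agree -> leaf (stop)
        return conclusions.pop()
    if not attributes:                    # no attributes left -> majority leaf
        return Counter(c for _, c in data).most_common(1)[0][0]
    # Step 2: select the attribute with the lowest entropy.
    best = min(attributes, key=lambda a: attribute_entropy(data, a))
    # Steps 3-5: split the data on `best` and recurse on each subset.
    tree = {}
    for value in {x[best] for x, _ in data}:
        subset = [(x, c) for x, c in data if x[best] == value]
        tree[f"{best}={value}"] = id3(subset,
                                      [a for a in attributes if a != best])
    return tree

# Assumed toy data set, same shape as in the previous sketch.
data = [
    ({"outlook": "sunny",    "wind": "weak"},   "no"),
    ({"outlook": "sunny",    "wind": "strong"}, "no"),
    ({"outlook": "rain",     "wind": "weak"},   "yes"),
    ({"outlook": "rain",     "wind": "strong"}, "no"),
    ({"outlook": "overcast", "wind": "weak"},   "yes"),
]
print(id3(data, ["outlook", "wind"]))
```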


Example

• [Robinson Crusoe example: what is good to eat...]


Counter Example

• Consider the following data:

– X=3, Y=3 ⇒ yes
– X=2, Y=1 ⇒ no
– X=3, Y=4 ⇒ no
– X=1, Y=1 ⇒ yes
– X=2, Y=2 ⇒ yes
– X=3, Y=2 ⇒ no

• The entropy-based ID3 algorithm is incapable of spotting the obvious relationship if X = Y then yes else no, as only one attribute is considered at a time!
