A Theoretical Framework for Association Mining based on the Boolean Retrieval Model

Peter Bollmann-Sdorra




Contents

Introduction
Background
Boolean Association Mining
Expressing item-sets as queries
Conclusions
Future Work


Introduction

Researchers focus on discovering rules in the form of implications between itemsets that have adequate support.

Using frequent itemsets as both the antecedent and the consequent of a rule represents only the simplest form of predicate.

This simplicity is due in part to the lack of a theoretical framework that includes more expressive predicates.


Motivation

In information retrieval systems, a strong theoretical background gives the user the power to ask more sophisticated and pertinent questions.

Information retrieval and association mining are two complementary processes on the same data records or transactions. In information retrieval, given a query, we need to find the subset of records that matches the query. In contrast, in data mining, we need to find the queries (rules) that have an adequate number of records supporting them.


Proposed Solution

We introduce a theory of association mining based on a model of retrieval known as the Boolean Retrieval Model, where

a Boolean query that uses only the AND operator is analogous to an itemset,

a general Boolean query (using AND, OR, or NOT) can be interpreted as a generalized itemset,

the notions of support of itemsets and confidence of rules can be dealt with uniformly, and

an event algebra can be defined, involving all possible transaction subsets, to formally obtain a probability space.


Background

Deriving association rules from data: given a set of items $I = \{i_1, i_2, \ldots, i_n\}$ and a set of transactions $T = \{t_1, t_2, \ldots, t_m\}$, where each transaction $t_i \in T$ satisfies $t_i \subseteq I$,

an association rule is defined as $X \Rightarrow Y$, where $X \subseteq I$, $Y \subseteq I$, and $X \cap Y = \emptyset$. It describes the existence of a relationship between the two itemsets $X$ and $Y$.


Measure for Significance

Support: the percentage of transactions in the database that contain both $X$ and $Y$.

$$\mathrm{Support}(X \Rightarrow Y) = P(X, Y)$$


Measure for Importance

Confidence: the percentage of transactions that contain $Y$ among those transactions containing $X$.

$$\mathrm{Confidence}(X \Rightarrow Y) = \frac{P(X, Y)}{P(X)}$$


Measure for Importance

Interest: represents a test of statistical independence.

$$\mathrm{Interest}(X \Rightarrow Y) = \frac{P(X, Y)}{P(X)\,P(Y)}$$
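The three classical measures can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the five-transaction database `db` and the helper names (`prob`, `support`, `confidence`, `interest`) are hypothetical and do not come from the slides, and P(X, Y) is read as the fraction of transactions containing every item of both X and Y.

```python
from fractions import Fraction

def prob(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return Fraction(hits, len(transactions))

def support(transactions, x, y):
    """Support(X => Y) = P(X, Y)."""
    return prob(transactions, x | y)

def confidence(transactions, x, y):
    """Confidence(X => Y) = P(X, Y) / P(X); undefined when P(X) = 0."""
    return prob(transactions, x | y) / prob(transactions, x)

def interest(transactions, x, y):
    """Interest(X => Y) = P(X, Y) / (P(X) * P(Y)); 1 means independence."""
    return prob(transactions, x | y) / (
        prob(transactions, x) * prob(transactions, y)
    )

# Hypothetical toy database (not taken from the slides).
db = [
    frozenset({"beer", "bread"}),
    frozenset({"milk", "bread"}),
    frozenset({"beer", "milk", "bread"}),
    frozenset({"beer"}),
    frozenset({"milk", "bread"}),
]
x, y = frozenset({"milk"}), frozenset({"bread"})

print(support(db, x, y))     # 3/5
print(confidence(db, x, y))  # 1
print(interest(db, x, y))    # 5/4
```

An interest value above 1, as here, suggests that milk and bread occur together more often than they would if they were independent.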


Boolean Association Mining

Given a set of items $I = \{i_1, i_2, \ldots, i_n\}$, a transaction $t$ is defined as a subset of items such that $t \in 2^I$, where $2^I = \{\emptyset, \{i_1\}, \{i_2\}, \ldots, \{i_n\}, \{i_1, i_2\}, \ldots, \{i_1, i_2, \ldots, i_n\}\}$.

Let $T \subseteq 2^I$ be a given set of transactions $\{t_1, t_2, \ldots, t_m\}$. Every transaction $t \in T$ has an assigned weight $w'(t)$.


Possible Weights

(i) $w'(t) = 1$ for all transactions $t \in T$.

(ii) $w'(t) = f(t)$ for all transactions $t \in T$, where $f(t)$ is the frequency of transaction $t$, i.e., how many times $t$ is repeated in the database.

(iii) $w'(t) = |t| \cdot g(t)$ for all transactions $t \in T$, where $|t|$ is the cardinality of $t$ and $g(t)$ is either of the weight functions defined in (i) and (ii). In this case, longer transactions get higher weight.

(iv) $w'(t) = v(t) \cdot f(t)$ for all transactions $t \in T$, where $v(t)$ could be the sum of the prices or profits of the items in $t$.


The weights $w'$ are normalized to

$$w(t) = \frac{w'(t)}{\sum_{t' \in T} w'(t')}$$

so that

$$\sum_{t \in T} w(t) = 1$$


Example

Let $I$ = {beer, milk, bread} be the set of all items, where price(beer) = 5, price(milk) = 3, and price(bread) = 2. The set of transactions $T$ is given below, where $f(t)$ is the frequency of transaction $t$.

#   t                      f(t)
1   {beer}                 22
2   {milk}                 8
3   {bread}                10
4   {beer, bread}          20
5   {milk, bread}          25
6   {beer, milk, bread}    15


Case 1: $w'(t) = 1$

t                      w'(t)   w(t)
{beer}                 1       1/6
{milk}                 1       1/6
{bread}                1       1/6
{beer, bread}          1       1/6
{milk, bread}          1       1/6
{beer, milk, bread}    1       1/6


Case 2: $w'(t) = f(t)$

t                      w'(t)   w(t)
{beer}                 22      0.22
{milk}                 8       0.08
{bread}                10      0.10
{beer, bread}          20      0.20
{milk, bread}          25      0.25
{beer, milk, bread}    15      0.15


Case 3: $w'(t) = |t| \cdot g(t)$, with $g(t) = f(t)$

t                      w'(t)   w(t)
{beer}                 22      0.13
{milk}                 8       0.05
{bread}                10      0.06
{beer, bread}          40      0.23
{milk, bread}          50      0.29
{beer, milk, bread}    45      0.26

The raw weights sum to 175, and each $w(t)$ is rounded to two decimal places.


Case 4: $w'(t) = v(t) \cdot g(t)$, with $g(t) = f(t)$ and $v(t)$ = Price(t)

t                      Price(t)   w'(t)   w(t)
{beer}                 5          110     0.19
{milk}                 3          24      0.04
{bread}                2          20      0.04
{beer, bread}          7          140     0.25
{milk, bread}          5          125     0.22
{beer, milk, bread}    10         150     0.26
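The four weighting schemes and the normalization step can be recomputed with a short script. This is a minimal sketch, not the authors' code: the data (`FREQ`, `PRICE`) are the example values from the slides, while the helper names `normalize` and `value` and the dictionary layout are illustrative choices.

```python
from fractions import Fraction

# Example data from the slides: transaction -> frequency f(t).
FREQ = {
    frozenset({"beer"}): 22,
    frozenset({"milk"}): 8,
    frozenset({"bread"}): 10,
    frozenset({"beer", "bread"}): 20,
    frozenset({"milk", "bread"}): 25,
    frozenset({"beer", "milk", "bread"}): 15,
}
PRICE = {"beer": 5, "milk": 3, "bread": 2}

def normalize(raw):
    """Divide each raw weight w'(t) by the total so the weights sum to 1."""
    total = sum(raw.values())
    return {t: Fraction(v, total) for t, v in raw.items()}

def value(t):
    """v(t): here, the sum of the prices of the items in t."""
    return sum(PRICE[i] for i in t)

# Raw weights w'(t) for cases 1 to 4, taking g(t) = f(t).
raw_weights = {
    1: {t: 1 for t in FREQ},
    2: dict(FREQ),
    3: {t: len(t) * f for t, f in FREQ.items()},
    4: {t: value(t) * f for t, f in FREQ.items()},
}
weights = {case: normalize(raw) for case, raw in raw_weights.items()}

for case, w in weights.items():
    print(f"Case {case}")
    for t, wt in w.items():
        print(f"  {sorted(t)}: {float(wt):.2f}")
```

Exact fractions are kept internally and rounded only for display, so each case sums to exactly 1 even when the two-decimal values do not.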


Expressing item-sets as queries (logical expressions)

Definition 1: For a given set of items $I$, the set $Q$ of all possible queries associated with item-sets created from $I$ is defined as follows.

$i \in I \Rightarrow i \in Q$,

$q, q' \in Q \Rightarrow q \wedge q' \in Q$.

These are all the elements of $Q$.


Definition 2: For any query $q \in Q$, the response set of $q$, $RS(q)$, is defined as follows:

For all atomic $i \in Q$, $RS(i) = \{t \in T \mid i \in t\}$,

$RS(q \wedge q') = RS(q) \cap RS(q')$.


Definition 3: Let $q = (i_1 \wedge i_2 \wedge \ldots \wedge i_k)$ and let $A_q$ denote the item-set associated with $q$; that is, $A_q = \{i_1, i_2, \ldots, i_k\}$. The support of $A_q$ is defined as

$$S(A_q) = \sum_{t \in RS(q)} w(t)$$
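For conjunctive queries, the response set and the weighted support of Definitions 2 and 3 can be sketched as follows. The frequencies are the example values from the slides and the weights are the normalized frequencies (Case 2). The function names `response_set` and `support` are illustrative, and a conjunctive query is represented simply by the set of its items.

```python
from fractions import Fraction

# Example data from the slides: transaction -> frequency f(t).
FREQ = {
    frozenset({"beer"}): 22,
    frozenset({"milk"}): 8,
    frozenset({"bread"}): 10,
    frozenset({"beer", "bread"}): 20,
    frozenset({"milk", "bread"}): 25,
    frozenset({"beer", "milk", "bread"}): 15,
}
TOTAL = sum(FREQ.values())
# Normalized weights w(t) for Case 2, w'(t) = f(t).
W = {t: Fraction(f, TOTAL) for t, f in FREQ.items()}

def response_set(itemset):
    """RS(q) for the conjunctive query q = AND of all items in `itemset`."""
    return {t for t in W if itemset <= t}

def support(itemset):
    """S(A_q): total normalized weight of the transactions in RS(q)."""
    return sum((W[t] for t in response_set(itemset)), Fraction(0))

print(support({"beer", "bread"}))  # 7/20
print(support({"milk"}))           # 12/25
```

The query beer AND bread is answered by {beer, bread} and {beer, milk, bread}, whose weights 0.20 and 0.15 add up to 0.35.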


Lemma 1:

The support set of $A_q$, $SS(A_q)$, equals $RS(q)$.

Lemma 2:

For queries $q$, $q_1$, $q_2$ and $q_3$, the following axioms hold:

$RS(q \wedge q) = RS(q)$,

$RS((q_1 \wedge q_2) \wedge q_3) = RS(q_1 \wedge (q_2 \wedge q_3))$,

$RS(q_1 \wedge q_2) = RS(q_2 \wedge q_1)$.


Example:

$RS((x_1 \wedge x_2) \wedge (x_3 \wedge x_2)) = RS(x_1 \wedge x_2 \wedge x_3)$
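One way to see why this example holds is to expand both conjunctions with Definition 2 and then use the fact that set intersection is associative, commutative, and idempotent, which is what Lemma 2 expresses for response sets. The derivation below is a sketch along those lines:

```latex
\begin{align*}
RS((x_1 \wedge x_2) \wedge (x_3 \wedge x_2))
  &= \bigl(RS(x_1) \cap RS(x_2)\bigr) \cap \bigl(RS(x_3) \cap RS(x_2)\bigr) \\
  &= RS(x_1) \cap RS(x_2) \cap RS(x_2) \cap RS(x_3) \\
  &= RS(x_1) \cap RS(x_2) \cap RS(x_3) \\
  &= RS(x_1 \wedge x_2 \wedge x_3)
\end{align*}
```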


Definition 4:

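For a given set of items $I$, the set $Q^*$ of all possible queries is defined as follows.

$i \in I \Rightarrow i \in Q^*$,

$q, q' \in Q^* \Rightarrow q \wedge q' \in Q^*$,

$q, q' \in Q^* \Rightarrow q \vee q' \in Q^*$,

$q \in Q^* \Rightarrow \neg q \in Q^*$.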


Definition 5:

For any query $q \in Q^*$, the response set of transactions, $RS(q)$, is defined as follows:

For all atomic $i \in Q^*$, $RS(i) = \{t \in T \mid i \in t\}$,

$RS(q \wedge q') = RS(q) \cap RS(q')$,

$RS(q \vee q') = RS(q) \cup RS(q')$,

$RS(\neg q) = T - RS(q)$.
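The recursive definition of the response set for general Boolean queries maps directly onto set operations. The sketch below assumes a hypothetical encoding of queries as nested tuples (`("item", name)`, `("and", q, q')`, `("or", q, q')`, `("not", q)`); the encoding and the function name `rs` are illustrative, while the transaction set is the one from the slides' example.

```python
# The set T of transactions from the slides' example.
T = frozenset({
    frozenset({"beer"}),
    frozenset({"milk"}),
    frozenset({"bread"}),
    frozenset({"beer", "bread"}),
    frozenset({"milk", "bread"}),
    frozenset({"beer", "milk", "bread"}),
})

def rs(q):
    """Response set RS(q) of a query encoded as a nested tuple."""
    op = q[0]
    if op == "item":   # RS(i) = {t in T | i in t}
        return frozenset(t for t in T if q[1] in t)
    if op == "and":    # RS(q AND q') = RS(q) intersected with RS(q')
        return rs(q[1]) & rs(q[2])
    if op == "or":     # RS(q OR q') = RS(q) united with RS(q')
        return rs(q[1]) | rs(q[2])
    if op == "not":    # RS(NOT q) = T - RS(q)
        return T - rs(q[1])
    raise ValueError(f"unknown operator: {op!r}")

beer, milk, bread = (("item", name) for name in ("beer", "milk", "bread"))

# Transactions that contain beer but not milk.
print(sorted(sorted(t) for t in rs(("and", beer, ("not", milk)))))
# [['beer'], ['beer', 'bread']]

# De Morgan: NOT (beer OR milk) and (NOT beer) AND (NOT milk) agree.
assert rs(("not", ("or", beer, milk))) == rs(("and", ("not", beer), ("not", milk)))
```

Because every operator is evaluated with the corresponding set operation, two queries that are equivalent under Boolean algebra end up with the same response set on this data.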


Theorem:

If $q$ is a transformation of $q'$ that is obtained by applying the rules of Boolean algebra, then

$RS(q) = RS(q')$.

Each $q \in Q^*$ can be considered as a generalized itemset. The itemsets investigated in earlier works only consider $q \in Q$.


Lemma 3:

$\{RS(q) \mid q \in Q^*\} = 2^T$

Theorem:

$(T, 2^T, P)$ is a probability space.
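Lemma 3 can be checked on the slides' example by building, for every subset of T, a query whose response set is exactly that subset. The sketch below relies on two assumptions that the transcript does not spell out: the construction (an OR of "minterm" queries, one per transaction) is just one possible way to reach each subset, and the measure P is taken to be the total normalized weight of an event, here with the Case 2 weights. All function names are illustrative.

```python
from fractions import Fraction
from itertools import combinations

ITEMS = ("beer", "milk", "bread")
# Example data from the slides: transaction -> frequency f(t).
FREQ = {
    frozenset({"beer"}): 22,
    frozenset({"milk"}): 8,
    frozenset({"bread"}): 10,
    frozenset({"beer", "bread"}): 20,
    frozenset({"milk", "bread"}): 25,
    frozenset({"beer", "milk", "bread"}): 15,
}
T = frozenset(FREQ)
TOTAL = sum(FREQ.values())

def rs_item(i):
    """RS(i): transactions that contain item i."""
    return frozenset(t for t in T if i in t)

def rs_minterm(t):
    """RS of 'every item of t AND NOT every other item'; this is {t}."""
    result = T
    for i in ITEMS:
        result &= rs_item(i) if i in t else T - rs_item(i)
    return result

def rs_for_subset(subset):
    """RS of the OR of the minterm queries of the transactions in `subset`."""
    result = frozenset()
    for t in subset:
        result |= rs_minterm(t)
    return result

def P(event):
    """Assumed measure: total normalized weight (Case 2) of an event."""
    return Fraction(sum(FREQ[t] for t in event), TOTAL)

subsets = [frozenset(c) for r in range(len(T) + 1) for c in combinations(T, r)]

# Every subset of T is the response set of some query (Lemma 3, on this data).
assert all(rs_for_subset(s) == s for s in subsets)
# P behaves like a probability measure on this example.
assert P(T) == 1 and P(frozenset()) == 0
print(len(subsets))  # 64
```

Since the empty query over zero minterms yields the empty set and the OR over all minterms yields T, the 64 response sets cover the whole power set of the six transactions.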


Rules and Their Response Strengths

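Definition 6: The confidence of a rule $A_q \Rightarrow A_{q'}$ is defined as

$$\frac{R(q \wedge q')}{R(q)}$$

Definition 7: The interest of a rule $A_q \Rightarrow A_{q'}$ is defined as

$$\frac{R(q \wedge q')}{R(q) \cdot R(q')}$$

Definition 8: The support of a rule $A_q \Rightarrow A_{q'}$ is defined as

$$S(A_q \Rightarrow A_{q'}) = R(\neg q \vee q')$$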
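The rule measures can be tried out on the slides' example. This sketch makes two assumptions that the transcript does not state explicitly: the response strength R(q) is taken to be the total normalized weight of RS(q) (Case 2 weights), and the support of a rule is read as the response strength of the implication, that is, of NOT q OR q'. The function names are illustrative, and queries are restricted to conjunctions represented by item sets.

```python
from fractions import Fraction

# Example data from the slides: transaction -> frequency f(t).
FREQ = {
    frozenset({"beer"}): 22,
    frozenset({"milk"}): 8,
    frozenset({"bread"}): 10,
    frozenset({"beer", "bread"}): 20,
    frozenset({"milk", "bread"}): 25,
    frozenset({"beer", "milk", "bread"}): 15,
}
T = frozenset(FREQ)
TOTAL = sum(FREQ.values())

def rs(items):
    """RS of the conjunctive query over `items`."""
    return frozenset(t for t in T if items <= t)

def R(event):
    """Assumed response strength: total normalized weight of `event`."""
    return Fraction(sum(FREQ[t] for t in event), TOTAL)

def confidence(a, b):
    """R(q AND q') / R(q); undefined when R(q) = 0."""
    return R(rs(a) & rs(b)) / R(rs(a))

def interest(a, b):
    """R(q AND q') / (R(q) * R(q'))."""
    return R(rs(a) & rs(b)) / (R(rs(a)) * R(rs(b)))

def rule_support(a, b):
    """Assumed reading of rule support: R(NOT q OR q')."""
    return R((T - rs(a)) | rs(b))

milk, bread = {"milk"}, {"bread"}
print(confidence(milk, bread))    # 5/6
print(interest(milk, bread))      # 25/21
print(rule_support(milk, bread))  # 23/25
```

Under this reading, only the 8 occurrences of {milk} violate the rule milk => bread, which is why its support is 92/100.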


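Lemma 4: For a rule $A_q \Rightarrow A_{q'}$,

$$S(A_q \Rightarrow A_{q'}) = 1 - R(q) + R(q \wedge q')$$

Lemma 5: For a rule $A_q \Leftrightarrow A_{q'}$,

$$S(A_q \Leftrightarrow A_{q'}) = 1 - R(q) - R(q') + 2 \cdot R(q \wedge q')$$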
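Both lemmas can be derived in a few steps, assuming that $R$ is additive over disjoint response sets with $R$ of the whole transaction set equal to 1 (as the probability space theorem suggests), that the support of an implication is $R(\neg q \vee q')$, and that the support of an equivalence is the response strength of $(q \wedge q') \vee (\neg q \wedge \neg q')$. These readings are assumptions; the transcript gives only the final formulas.

```latex
\begin{align*}
R(\neg q \vee q')
  &= 1 - R(q \wedge \neg q') \\
  &= 1 - \bigl(R(q) - R(q \wedge q')\bigr) \\
  &= 1 - R(q) + R(q \wedge q') \\[4pt]
R\bigl((q \wedge q') \vee (\neg q \wedge \neg q')\bigr)
  &= R(q \wedge q') + 1 - R(q \vee q') \\
  &= R(q \wedge q') + 1 - \bigl(R(q) + R(q') - R(q \wedge q')\bigr) \\
  &= 1 - R(q) - R(q') + 2\,R(q \wedge q')
\end{align*}
```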


Conclusions

A theory of association mining based on a model of retrieval known as the Boolean Retrieval Model has been introduced.

The framework we develop derives from the observation that information retrieval and association mining are two complementary processes on the same data records or transactions.

Based on the theory of Boolean retrieval, we generalize the itemset structure by using all Boolean operators.


By introducing the notion of support of generalized itemsets, a uniform measure for both itemsets and rules (generalized itemsets) has been developed.

Support of a generalized itemset is extended to allow transactions to be weighted so that they can contribute to support unequally.  


Future Work

To generate only understandable queries, new restrictions or measures, such as compactness and simplicity, should be introduced. (These restrictions or measures could eliminate a large number of frequent generalized itemsets, many of which could have complex structures.)