Data Mining in Clinical Databases by using Association Rules Department of Computing Charles Lo

Upload: bailey-parris

Post on 14-Jan-2016


Data Mining in Clinical Databases by using Association Rules

Department of Computing

Charles Lo


Outline

• What is an Association Rule?
• Previous Work
• Target Problems
• Methodology and Algorithm
• Experiment and Discussion
• Q & A


What is an Association Rule? (1)

• Introduced by Agrawal, Imielinski & Swami (1993).

Database

A, B ⇒ C

30% of the transactions that contain A and B also contain C; 5% of all the transactions contain all three items.


What is an Association Rule? (2)

• In a supermarket, 20% of the transactions that contain Coca-Cola also contain Pepsi; 3% of all transactions contain both items.
  – 20% is the confidence of the rule
  – 3% is the support of the rule
• Association rules can be applied in
  – Decision Support
  – Market Strategy
  – Financial Forecast
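A minimal sketch of how the two measures above can be computed. The toy database and the function names `support` and `confidence` are illustrative assumptions, not taken from the slides.

```python
from typing import FrozenSet, Iterable, Sequence


def support(transactions: Sequence[FrozenSet[str]], itemset: Iterable[str]) -> float:
    """Fraction of all transactions that contain every item in `itemset`."""
    wanted = frozenset(itemset)
    hits = sum(1 for t in transactions if wanted <= t)
    return hits / len(transactions)


def confidence(transactions: Sequence[FrozenSet[str]],
               antecedent: Iterable[str],
               consequent: Iterable[str]) -> float:
    """Among transactions containing the antecedent, the fraction that also
    contain the consequent."""
    base = support(transactions, antecedent)
    if base == 0:
        return 0.0
    both = frozenset(antecedent) | frozenset(consequent)
    return support(transactions, both) / base


# Hypothetical toy database (not from the slides).
transactions = [
    frozenset({"A", "B", "C"}),
    frozenset({"A", "B"}),
    frozenset({"A", "C"}),
    frozenset({"B", "C"}),
    frozenset({"A", "B", "C", "D"}),
]

# Rule A, B => C
rule_support = support(transactions, {"A", "B", "C"})
rule_confidence = confidence(transactions, {"A", "B"}, {"C"})
print(rule_support, rule_confidence)
```

Here two of the five transactions contain A, B and C, and three contain A and B, so the rule's support is 2/5 and its confidence is 2/3.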


Related Work (1)

In 1993, Agrawal, Imielinski and Swami
• Generate all significant association rules between items
• Algorithm Apriori
  – Pruning Techniques
  – Buffer management

A rule is a significant association rule if support > min support and confidence ≥ min confidence.


Related Work (2)

• Pruning Technique

– Frequency Constraint

• Memory Management
  – Memory to store any itemset and all its 1-extensions
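The frequency constraint can be illustrated with a level-wise search in the Apriori style: an itemset is only counted if all of its smaller subsets were already frequent. This is a simplified sketch under assumed names (`frequent_itemsets`, `min_support`) and a made-up database; it ignores the buffer and memory management discussed above.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for every itemset with support > min_support."""
    n = len(transactions)
    items = sorted({item for t in transactions for item in t})
    level = [frozenset([item]) for item in items]
    frequent = {}
    size = 1
    while level:
        # Count the support of every candidate of the current size.
        current = {}
        for candidate in level:
            count = sum(1 for t in transactions if candidate <= t)
            if count / n > min_support:
                current[candidate] = count / n
        frequent.update(current)

        # Build candidates one item larger; keep only those whose
        # sub-itemsets of the current size are all frequent (pruning).
        size += 1
        candidates = set()
        for a, b in combinations(list(current), 2):
            union = a | b
            if len(union) == size and all(
                frozenset(sub) in current for sub in combinations(union, size - 1)
            ):
                candidates.add(union)
        level = sorted(candidates, key=sorted)
    return frequent


# Hypothetical toy database (not from the slides).
transactions = [
    frozenset("ABC"),
    frozenset("AB"),
    frozenset("AC"),
    frozenset("BC"),
    frozenset("ABCD"),
]
result = frequent_itemsets(transactions, min_support=0.5)
print(sorted("".join(sorted(s)) for s in result))
```

With a minimum support of 0.5, item D is dropped at the first level, so no itemset containing D is ever counted.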


Related Work (3)

In 1997, Srikant, Vu and Agrawal
• Consider constraints that are boolean expressions over the presence or absence of items in the rules
• Incomplete candidate generation

The boolean constraint: (B ∧ C) ∨ (X ∧ Y)


Related Work (4)

• Selected Items approach
  1. Generate a set of selected items
     • for B = (1 ∧ 2) ∨ 3, possible sets are {1, 3}, {2, 3} and {1, 2, 3, 4, 5}
     • any (non-empty) itemset that satisfies B will contain an item from this set
  2. Only count candidates that contain selected items
  3. Discard frequent itemsets that do not satisfy the boolean expression
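The selected-items idea can be checked on the example B = (1 ∧ 2) ∨ 3 with the selected set {1, 3}. The sketch below is illustrative only: the helper names are assumptions, and it enumerates all itemsets over items 1 to 5 instead of counting frequent ones.

```python
from itertools import combinations


def satisfies_b(itemset):
    """B = (1 AND 2) OR 3."""
    return (1 in itemset and 2 in itemset) or 3 in itemset


# One valid set of selected items for B.
selected = {1, 3}


def worth_counting(candidate):
    """Step 2: only candidates that contain a selected item are counted."""
    return not selected.isdisjoint(candidate)


universe = [1, 2, 3, 4, 5]
all_itemsets = [
    frozenset(c) for r in range(1, len(universe) + 1)
    for c in combinations(universe, r)
]

# Every itemset that satisfies B contains a selected item,
# so skipping the other candidates loses nothing.
assert all(worth_counting(s) for s in all_itemsets if satisfies_b(s))

counted = [s for s in all_itemsets if worth_counting(s)]
kept = [s for s in counted if satisfies_b(s)]  # step 3: final filter on B
print(len(all_itemsets), len(counted), len(kept))
```

The final filter is still needed because some counted candidates, such as {1, 4}, contain a selected item without satisfying B.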


Related Work (5)

In 1998, Ng, Lakshmanan, Han and Pang
• Achieved a maximized degree of pruning for different categories of constraints.
• Two critical properties for pruning
  – Anti-monotonicity
  – Succinctness
• Algorithm CAP
  1. Both anti-monotone and succinct
  2. Succinct but non-anti-monotone
  3. Anti-monotone and non-succinct
  4. Non-anti-monotone and non-succinct


Related Work (6)

• Anti-Monotone Constraint
  – S ⊇ S' and S satisfies C ⇒ S' satisfies C

Domain Constraint: S = v, S ≤ v, S ≥ v, S ⊆ V

Aggregate Constraint: min(S) ≥ v, max(S) ≤ v, count(S) ≤ v, sum(S) ≤ v
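To see why anti-monotonicity helps pruning, take sum(S) ≤ v over non-negative values: once an itemset exceeds the limit, every superset does too, so it never needs to be extended. A small sketch, with a hypothetical function name and made-up values:

```python
def grow_under_sum_limit(values, limit):
    """Enumerate all itemsets S with sum(S) <= limit, extending only
    itemsets that already satisfy the constraint.

    Assumes every value is non-negative, which is what makes
    sum(S) <= limit anti-monotone."""
    items = sorted(values)
    level = [(item,) for item in items if values[item] <= limit]
    satisfying = list(level)
    while level:
        next_level = []
        for itemset in level:
            for item in items:
                # Append only larger items so each itemset is built once.
                if item > itemset[-1]:
                    extended = itemset + (item,)
                    if sum(values[x] for x in extended) <= limit:
                        next_level.append(extended)
        satisfying.extend(next_level)
        level = next_level
    return satisfying


# Hypothetical item values (not from the slides).
values = {"a": 10, "b": 20, "c": 35, "d": 50}
found = grow_under_sum_limit(values, limit=60)
print(found)
```

For these values the pair (b, d) already sums to 70, so no triple containing b and d is ever generated.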


Related Work (7)

• Succinct Constraint

– pruning can be done once and for all before any iteration takes place

Domain Constraint: S = v, S ≤ v, S ≥ v, S ⊆ V, S ⊇ V

Aggregate Constraint: min(S) ≤ v, min(S) ≥ v, max(S) ≤ v, max(S) ≥ v, count(S) ≤ v, sum(S) ≤ v


Target Problems (1)

• Association of quantitative items that satisfy a given inequality constraint, composed of either (+, -) or (*, /)
  – (I_{i1} ⊕ I_{i2} ⊕ ... ⊕ I_{im}) ⊖ (I_{j1} ⊕ I_{j2} ⊕ ... ⊕ I_{jn}) θ C
    1. size m
    2. size n
    3. ⊕ is + (or *)
    4. ⊖ is - (or /)
    5. θ is one of <, >, =, ≤, ≥
    6. constant C
  – (3, 2, +, -, >, 100)
  – (1, 1, 0, /, =, 2)
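The six-part description above maps naturally onto a small evaluator. This is a sketch only: the function name `satisfies`, the operator tables and the sample values are assumptions, and the 0 in the second tuple is read as "no combining operator needed" because each side has a single item.

```python
import operator
from functools import reduce

COMBINE = {"+": operator.add, "*": operator.mul}
SEPARATE = {"-": operator.sub, "/": operator.truediv}
COMPARE = {
    "<": operator.lt,
    ">": operator.gt,
    "=": operator.eq,
    "<=": operator.le,
    ">=": operator.ge,
}


def satisfies(spec, left, right):
    """Evaluate (left combined) SEPARATE (right combined) COMPARE constant.

    spec is (m, n, combine, separate, compare, constant)."""
    m, n, combine, separate, compare, constant = spec
    if len(left) != m or len(right) != n:
        raise ValueError("operand counts must match sizes m and n")

    def fold(values):
        # A single operand needs no combining operator (written 0 in the tuple).
        if len(values) == 1:
            return values[0]
        return reduce(COMBINE[combine], values)

    result = SEPARATE[separate](fold(left), fold(right))
    return COMPARE[compare](result, constant)


# Hypothetical item values (not from the slides).
first = satisfies((3, 2, "+", "-", ">", 100), [60, 50, 40], [20, 10])
second = satisfies((1, 1, 0, "/", "=", 2), [100], [50])
third = satisfies((3, 2, "+", "-", ">", 100), [30, 20, 10], [5, 5])
print(first, second, third)
```

The first call checks (60 + 50 + 40) - (20 + 10) > 100, and the second checks 100 / 50 = 2.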


Target Problems (2)

• Temporal aspect of the data

• Hierarchies over the data

[Figure: diagrams over items A, B, C and D illustrating a serial pattern, a parallel pattern and a sequence pattern.]


Problem Statement

• V = {I_1, I_2, ..., I_M}, a set of quantitative items
• T, the transactions of a database D
• t[k] > 0 means t contains item I_k
  t[k] = 0 means I_k does not exist in t
• Association of items which satisfy
  (I_{i1} ⊕ I_{i2} ⊕ ... ⊕ I_{im}) ⊖ (I_{j1} ⊕ I_{j2} ⊕ ... ⊕ I_{jn}) θ C
  where ⊕ is + (or *), ⊖ is - (or /), θ is one of <, >, =, ≤, ≥, and C is a scalar value


Application in Clinical Database

• Relationship between the treatments and clinical diagnosis
  – nursing : 100, clinical test : 30, pharmacies : 165, . . .
  – nursing : 120, injection : 130, pharmacies : 100, . . .
  – operation : 220, injection : 542, clinical test : 60, . . .
• (X + Y) - Z > 100
• X / Y = 2
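As a rough illustration, the constraint (X + Y) - Z > 100 can be counted over the three sample records. Only the items actually listed on the slide are used (the elided ". . ." entries are left out), and the function name `constrained_support` is an assumption.

```python
def constrained_support(transactions, left_items, right_items, constant):
    """Count transactions that contain every named item (value > 0) and
    satisfy sum(left) - sum(right) > constant."""
    needed = list(left_items) + list(right_items)
    count = 0
    for t in transactions:
        # A missing item is treated as value 0, i.e. not in the transaction.
        if any(t.get(item, 0) <= 0 for item in needed):
            continue
        value = sum(t[i] for i in left_items) - sum(t[i] for i in right_items)
        if value > constant:
            count += 1
    return count


# The three records shown on the slide, without their elided entries.
transactions = [
    {"nursing": 100, "clinical test": 30, "pharmacies": 165},
    {"nursing": 120, "injection": 130, "pharmacies": 100},
    {"operation": 220, "injection": 542, "clinical test": 60},
]

a = constrained_support(transactions, ["nursing", "pharmacies"], ["clinical test"], 100)
b = constrained_support(transactions, ["nursing", "pharmacies"], ["injection"], 100)
c = constrained_support(transactions, ["operation", "injection"], ["clinical test"], 100)
print(a, b, c)
```

For example, with X = nursing, Y = pharmacies and Z = injection, only the second record contains all three items, and (120 + 100) - 130 = 90 does not exceed 100.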


QMIC (1)

• QMIC (Quantitative Mining under Inequality Constraints)
  – Candidate generation
    • reduce the number of itemsets
    • Max_Min pruning
  – Support counting
    • reduce the iterations of database scanning
    • Generation sequence
• Memory requirement
  – limitation of the available memory


QMIC (2)

• Skip generation steps by the pre-defined size m and n

• Generation Steps
  – Algorithm Apriori : L_{k-1} → L_k
  – Algorithm QMIC : L_{k/2} → L_k

  – Generation sequence s_1, s_2, ..., s_k:

    s_{i-1} = s_i / 2    if s_i is even
    s_{i-1} = s_i - 1    otherwise
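One reading of the generation steps above is that the sizes are found by working back from the target size: halve it when it is even, otherwise subtract one. The sketch below follows that reading; the function name and the choice to stop at size 1 are assumptions.

```python
def generation_sequence(target_size):
    """Itemset sizes visited on the way to `target_size`, smallest first.

    Working backwards: an even size comes from its half (doubling step),
    an odd size comes from the size just below it (step of one)."""
    if target_size < 1:
        raise ValueError("target_size must be at least 1")
    sizes = [target_size]
    while sizes[-1] > 1:
        current = sizes[-1]
        sizes.append(current // 2 if current % 2 == 0 else current - 1)
    return sizes[::-1]


seq7 = generation_sequence(7)
seq8 = generation_sequence(8)
print(seq7, seq8)
```

Under this reading a target size of 8 needs four levels (1, 2, 4, 8), whereas a strictly level-by-level search would visit eight.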


QMIC (3)

• Candidate itemsets generation

Given s_1, s_2, ..., s_k

if ((s_i - s_{i-1}) = 1) then
    for any two itemsets X, Y
        if (x_1 = y_1 and x_2 = y_2 and ... x_{i-1} < y_{i-1}) then
            insert X ∪ Y into candidate set
else
    for any two itemsets X, Y
        if ((X ∩ Y) = ∅) then
            insert X ∪ Y into candidate set
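The two branches of the candidate generation can be sketched as follows. This is an interpretation, not the author's code: a step of one is treated as the usual prefix join, and a doubling step unions any two disjoint itemsets. The function name and the sample level are hypothetical.

```python
from itertools import combinations


def generate_candidates(prev_level, prev_size, next_size):
    """Build candidate itemsets of `next_size` from large itemsets of `prev_size`."""
    candidates = set()
    if next_size - prev_size == 1:
        # Step of one: join itemsets that agree on all but their last item.
        ordered = sorted(tuple(sorted(s)) for s in prev_level)
        for i, x in enumerate(ordered):
            for y in ordered[i + 1:]:
                if x[:-1] == y[:-1]:
                    candidates.add(frozenset(x) | frozenset(y))
    elif next_size == 2 * prev_size:
        # Doubling step: union any two disjoint itemsets.
        for x, y in combinations(prev_level, 2):
            if not (x & y):
                candidates.add(x | y)
    else:
        raise ValueError("next_size must be prev_size + 1 or 2 * prev_size")
    return candidates


# Hypothetical large 2-itemsets.
level2 = {frozenset({1, 2}), frozenset({1, 3}), frozenset({2, 3}), frozenset({3, 4})}
step_one = generate_candidates(level2, 2, 3)
doubling = generate_candidates(level2, 2, 4)
print(step_one, doubling)
```

The doubling branch produces more candidates per level than the prefix join, which is the price paid for skipping levels.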


QMIC (4)

• Why in this sequence?
  – How about using 3, 4 or a larger factor?
  – Or even the power series?
• Memory Management
  – keep the previous L's to generate the next level of large itemsets
  – Only limited memory is available
  – In QMIC, only three previous L's are needed in order to generate the next level of large itemsets in the generation sequence.


QMIC (5)

• What is the trade-off of the generation sequence?
  – a larger number of candidate itemsets
  – longer processing time in pruning
• Max_Min Pruning
  – involves the inequality constraint in the pruning
  – Maximum value itemset list (maxlst)
    • Sorted list in descending order according to the maximum value of sum (product)
  – Minimum value itemset list (minlst)
    • Sorted list in ascending order according to the minimum value of sum (product)


QMIC (6)

• Max_Pruning: θ ∈ {≥, >}
  – A ⊖ B θ C
    where A = (I_{i1} ⊕ I_{i2} ⊕ ... ⊕ I_{im}), B = (I_{j1} ⊕ I_{j2} ⊕ ... ⊕ I_{jn})
  – Minimum value of A
• Over pruning?
  – Using maxlst
  – Sliding window with size m+
  – Window of maxlst_1: stop sliding if the total sum of the items inside is smaller than C
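The pruning idea can be approximated with a simple upper-bound test. The sketch below is not the slide's exact sliding-window procedure: it assumes non-negative values and the + operator, so that A - B ≥ C requires sum(A) ≥ C, and it drops an item when even its best possible partners from maxlst cannot reach C. All names and the threshold are hypothetical; the maxima are taken from the sample clinical records shown earlier.

```python
def max_prune(max_values, m, c_bound):
    """Keep items that could still appear in some size-m itemset A
    whose maximum possible sum reaches c_bound."""
    # maxlst: items in descending order of their maximum value.
    maxlst = sorted(max_values.items(), key=lambda kv: kv[1], reverse=True)
    if len(maxlst) < m:
        return []
    top_partners = sum(value for _, value in maxlst[:m - 1])
    kept = []
    for rank, (item, value) in enumerate(maxlst):
        if rank < m - 1:
            # The item is itself among the top m-1; its best sum is the top m.
            best = top_partners + maxlst[m - 1][1]
        else:
            best = top_partners + value
        if best >= c_bound:
            kept.append(item)
    return kept


# Per-item maxima over the three sample clinical records.
max_values = {
    "nursing": 120,
    "pharmacies": 165,
    "clinical test": 60,
    "injection": 542,
    "operation": 220,
}
survivors = max_prune(max_values, m=2, c_bound=700)
everyone = max_prune(max_values, m=2, c_bound=100)
print(survivors)
```

With m = 2 and a bound of 700, nursing is removed because even paired with injection it reaches only 542 + 120 = 662.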


QMIC (7)

• Max_Pruning procedure

Given the maxlst as {I_1, I_2, ..., I_k}:
  a) set the index c to its starting position
  b) set max_sum equal to the sum of the items in the current window
  c) Repeat
       increase c by 1
     until max_sum is smaller than C
  remove all the items whose index is larger than c from L and form the pruned L


Experiments (1)

• Number of items

[Figure: running time in seconds (y-axis, 0 to 7) against the number of items (x-axis: 100, 500, 1000, 2000) for QMIC and Apriori.]


Experiments (2)

• Number of transactions

[Figure: running time in seconds (y-axis, 0 to 700) against the number of transactions (x-axis: 5000, 10000, 20000, 50000, 100000) for QMIC and Apriori.]


Future Plan

• Association Rules of Sequence Patterns
  – Time constraint

• Association Rules of Multi-layer data