Data Mining in Clinical Databases by using Association Rules
Department of Computing
Charles Lo
Outline
• What is Association Rule?
• Previous Works
• Target Problems
• Methodology and Algorithm
• Experiment and Discussion
• Q & A
What is Association Rule? (1)
• It was introduced in "Agrawal, Imielinski, & Swami 1993".

Database: A, B → C

30% of the transactions that contain A and B also contain C; 5% of all the transactions contain all of them.
What is Association Rule? (2)
• In a supermarket, 20% of transactions that contain Coca-Cola also contain Pepsi, and 3% of all transactions contain both items.
– 20% is the confidence of the rule
– 3% is the support of the rule
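As a minimal sketch of these two definitions (the transaction data below is hypothetical), support and confidence can be computed directly:

```python
# Sketch: support and confidence of the rule {A, B} -> {C}
# over a small list of transactions (hypothetical data).
transactions = [
    {"A", "B", "C"},
    {"A", "B"},
    {"A", "C"},
    {"B", "C"},
    {"A", "B", "C"},
]

def support(itemset, transactions):
    """Fraction of transactions containing every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """support(antecedent union consequent) / support(antecedent)."""
    return (support(antecedent | consequent, transactions)
            / support(antecedent, transactions))

print(support({"A", "B", "C"}, transactions))       # 2/5 = 0.4
print(confidence({"A", "B"}, {"C"}, transactions))  # 2/3, about 0.667
```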
• Association rules can be applied in
– Decision Support
– Market Strategy
– Financial Forecast
Related Work (1)
In 1993, Agrawal, Imielinski and Swami
• Generate all significant association rules between items
• Algorithm Apriori
– Pruning Techniques
– Buffer management

if support > min support and confidence ≥ min confidence → significant association rule
Related Work (2)
• Pruning Technique
– Frequency Constraint
• Memory Management
– Memory to store any itemset and all its 1-extensions
Related Work (3)
In 1997, Srikant, Vu and Agrawal
• Consider constraints that are boolean expressions over the presence or absence of items in the rules
• Incomplete candidate generation

The boolean constraint: (B ∧ C) ∨ (X ∧ Y)
Related Work (4)
• Selected Items approach
1. Generate a set of selected items
• e.g. for B = (1 ∧ 2) ∨ 3, possible selected sets are {1, 3}, {2, 3} and {1, 2, 3, 4, 5}
2. Only count candidates that contain selected items
3. Discard frequent itemsets that do not satisfy the boolean expression

Any (non-empty) itemset that satisfies B will contain an item from the selected set.
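The defining property of a selected set can be checked by brute force. Assuming the constraint is B = (1 ∧ 2) ∨ 3 written in DNF, and with hypothetical helper names (`satisfies`, `is_selected_set`):

```python
from itertools import chain, combinations

# Hypothetical DNF representation of B = (1 AND 2) OR 3:
# a list of conjunctions, where satisfying any one conjunction satisfies B.
B = [[1, 2], [3]]

def satisfies(itemset, dnf):
    """True if the itemset contains every item of some conjunct of the DNF."""
    return any(all(i in itemset for i in conj) for conj in dnf)

def is_selected_set(selected, dnf, universe):
    """Check: every non-empty itemset satisfying B contains a selected item."""
    subsets = chain.from_iterable(
        combinations(universe, r) for r in range(1, len(universe) + 1))
    return all(not satisfies(set(s), dnf) or any(i in s for i in selected)
               for s in subsets)

universe = [1, 2, 3, 4, 5]
print(is_selected_set({1, 3}, B, universe))  # True: 1 covers (1 AND 2), 3 covers 3
print(is_selected_set({2, 3}, B, universe))  # True
print(is_selected_set({1, 2}, B, universe))  # False: {3} satisfies B but is missed
```

Only candidates containing an item from such a set need to be counted, which is what makes step 2 of the approach sound.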
Related Work (5)
In 1998, Ng, Lakshmanan, Han and Pang
• Achieved a maximized degree of pruning for different categories of constraints
• Two critical properties for pruning
– Anti-monotonicity
– Succinctness
• Algorithm CAP handles four categories of constraints:
1. Both anti-monotone and succinct
2. Succinct but non-anti-monotone
3. Anti-monotone and non-succinct
4. Non-anti-monotone and non-succinct
Related Work (6)
• Anti-Monotone Constraint
– If S ⊇ S′ and S satisfies C, then S′ satisfies C

Domain Constraint: S = v, S ≤ v, S ≥ v, S ⊆ V
Aggregate Constraint: min(S) ≥ v, max(S) ≤ v, count(S) ≤ v, sum(S) ≤ v
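A small sketch of why anti-monotonicity prunes: once a set violates an anti-monotone constraint such as sum(S) ≤ v, so does every superset, so none of its extensions need support counting. Item values here are hypothetical and assumed non-negative:

```python
# Sketch: anti-monotone pruning with the constraint sum(S) <= v,
# assuming all item values are non-negative (hypothetical values).
values = {"A": 40, "B": 70, "C": 20}
v = 100

def sum_ok(itemset):
    """True if the itemset satisfies sum(S) <= v."""
    return sum(values[i] for i in itemset) <= v

# If S violates sum(S) <= v, every superset of S violates it too,
# so S and all of its extensions can be pruned without counting support.
print(sum_ok({"A", "B"}))       # False: 110 > 100, prune all supersets
print(sum_ok({"A", "B", "C"}))  # False, guaranteed by anti-monotonicity
print(sum_ok({"A", "C"}))       # True: 60 <= 100, keep
```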
Related Work (7)
• Succinct Constraint
– Pruning can be done once-and-for-all before any iteration takes place

Domain Constraint: S = v, S ≤ v, S ≥ v, S ⊆ V, S ⊇ V
Aggregate Constraint: min(S) ≤ v, min(S) ≥ v, max(S) ≤ v, max(S) ≥ v, count(S) ≥ v, sum(S) ≥ v
Target Problems (1)
• Associations of quantitative items that satisfy a given inequality constraint composed of either (+, −) or (∗, /)
– (Ii1 ⊕ Ii2 ⊕ … ⊕ Iim) ⊖ (Ij1 ⊕ Ij2 ⊕ … ⊕ Ijn) θ C
– A constraint is specified by a 6-tuple:
1. size m
2. size n
3. ⊕ is + (or ∗)
4. ⊖ is − (or /)
5. θ is one of <, >, =, ≤, ≥
6. constant C
– Examples: (3, 2, +, −, >, 100) and (1, 1, 0, /, =, 2)
Target Problems (2)
• Temporal aspect of the data
• Hierarchies over the data

[Figure: three temporal patterns: serial (A → B → C), parallel (A, B, C occurring together), sequence ((A, B) → (C, D))]
Problem Statement
• V = {I1, I2, …, IM}, a set of quantitative items
• T, the transactions of a database D
• t[k] > 0 means t contains item Ik;
t[k] = 0 means Ik does not appear in t
• Find associations of items which satisfy
(Ii1 ⊕ Ii2 ⊕ … ⊕ Iim) ⊖ (Ij1 ⊕ Ij2 ⊕ … ⊕ Ijn) θ C
where ⊕ is + (or ∗), ⊖ is − (or /), θ is one of <, >, =, ≤, ≥, and C is a scalar value
Application in Clinical Database
• Relationship between the treatments and clinical diagnosis
– nursing: 100, clinical test: 30, pharmacies: 165, …
– nursing: 120, injection: 130, pharmacies: 100, …
– operation: 220, injection: 542, clinical test: 60, …
• (X + Y) − Z > 100
• X / Y = 2
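Checking such a constraint over quantitative transactions can be sketched as follows (item names and amounts are hypothetical, following the rows above; `satisfies` is an illustrative helper, not part of the author's algorithm):

```python
# Sketch: evaluating the inequality constraint (X + Y) - Z > threshold
# over quantitative clinical transactions (hypothetical data).
transactions = [
    {"nursing": 100, "clinical test": 30, "pharmacies": 165},
    {"nursing": 120, "injection": 130, "pharmacies": 100},
    {"operation": 220, "injection": 542, "clinical test": 60},
]

def satisfies(t, plus_items, minus_items, threshold):
    """True if every item occurs in t (t[k] = 0 means absent) and
    (sum of plus_items) - (sum of minus_items) > threshold."""
    if any(t.get(i, 0) == 0 for i in plus_items + minus_items):
        return False
    return (sum(t[i] for i in plus_items)
            - sum(t[i] for i in minus_items)) > threshold

# Support of ((nursing + pharmacies) - clinical test) > 100:
# only the first transaction contains all three items (100 + 165 - 30 = 235).
hits = sum(satisfies(t, ["nursing", "pharmacies"], ["clinical test"], 100)
           for t in transactions)
print(hits / len(transactions))  # 1/3
```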
QMIC (1)
• QMIC (Quantitative Mining under Inequality Constraints)
– Candidate generation
• reduce the number of itemsets
• Max_Min pruning
– Support counting
• reduce the iterations of database scanning
• generation sequence
– Memory requirement
• limitation of the available memory
QMIC (2)
• Skip generation steps using the pre-defined sizes m and n
• Generation Steps
– Algorithm Apriori: Lk−1 → Lk
– Algorithm QMIC: Lk/2 → Lk

Sm = Sm/2 join Sm/2, if m is even
Sm = S(m−1)/2 join S(m+1)/2, otherwise
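Assuming Lk is built by joining two (near-)half-size levels, as the Lk/2 → Lk step above indicates, the set of levels each algorithm must materialize can be sketched (function names are illustrative):

```python
# Sketch: which levels of large itemsets each algorithm visits
# to reach candidates of size k, assuming the doubling sequence.
def apriori_levels(k):
    """Apriori materializes every level L1 .. Lk."""
    return list(range(1, k + 1))

def qmic_levels(k):
    """Levels needed when Lk is joined from two (near-)halves."""
    need, stack = set(), [k]
    while stack:
        m = stack.pop()
        if m in need or m == 0:
            continue
        need.add(m)
        if m > 1:
            stack += [m // 2, m - m // 2]  # halves differ by 1 when m is odd
    return sorted(need)

print(apriori_levels(8))  # [1, 2, 3, 4, 5, 6, 7, 8]
print(qmic_levels(8))     # [1, 2, 4, 8]
print(qmic_levels(7))     # [1, 2, 3, 4, 7]
```

For a target size of 8, QMIC touches 4 levels instead of 8, which is where the reduced database scanning comes from.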
QMIC (3)
• Candidate itemsets generation

Given s1, s2, …, sk
if ((k mod 2) = 1) then
    for any two itemsets X, Y
        if (x1 = y1 and x2 = y2 and … and xi−1 = yi−1) then
            insert X ∪ Y into the candidate set
else
    for any two itemsets X, Y
        if ((X ∩ Y) = ∅) then
            insert X ∪ Y into the candidate set
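A Python reading of the two join rules for the candidate-generation step (the exact conditions are assumptions recovered from the slide): an Apriori-style prefix join when extending size by one, and a disjoint-union join when merging two equal-size halves.

```python
# Sketch: two candidate-join rules, with itemsets as sorted tuples.
def join_prefix(frequent):
    """Prefix join: combine X, Y that share their first k-1 items."""
    out = set()
    for x in frequent:
        for y in frequent:
            if x[:-1] == y[:-1] and x[-1] < y[-1]:
                out.add(x + (y[-1],))
    return out

def join_disjoint(frequent):
    """Doubling join: union of any two disjoint itemsets of equal size."""
    out = set()
    for x in frequent:
        for y in frequent:
            if set(x).isdisjoint(y):
                out.add(tuple(sorted(set(x) | set(y))))
    return out

l2 = [(1, 2), (1, 3), (2, 3), (3, 4)]
print(sorted(join_prefix(l2)))    # [(1, 2, 3)]
print(sorted(join_disjoint(l2)))  # [(1, 2, 3, 4)]
```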
QMIC (4)
• Why this generation sequence?
– How about using a factor of 3, 4 or larger?
– Or even a power series?
• Memory Management
– keep the previous L's to generate the next level of large itemsets
– only limited memory is available
– In QMIC, only three previous L's are needed to generate the next level of large itemsets in the generation sequence.
QMIC (5)
• What is the trade-off of the generation sequence?
– more candidate itemsets
– longer processing time in pruning
• Max_Min Pruning
– applies the inequality constraint to the pruning
– Maximum value itemset list (maxlst)
• sorted list in descending order according to the maximum value of the sum (product)
– Minimum value itemset list (minlst)
• sorted list in ascending order according to the minimum value of the sum (product)
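Building the two lists is a pair of sorts. The per-item maximum and minimum values below are hypothetical, assumed to be collected during a database scan:

```python
# Sketch: constructing maxlst and minlst from per-item extreme values
# (hypothetical quantities observed across all transactions).
item_max = {"nursing": 120, "pharmacies": 165, "clinical test": 60, "injection": 542}
item_min = {"nursing": 100, "pharmacies": 100, "clinical test": 30, "injection": 130}

# maxlst: descending by maximum value; minlst: ascending by minimum value.
maxlst = sorted(item_max, key=item_max.get, reverse=True)
minlst = sorted(item_min, key=item_min.get)

print(maxlst)  # ['injection', 'pharmacies', 'nursing', 'clinical test']
print(minlst)  # ['clinical test', 'nursing', 'pharmacies', 'injection']
```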
QMIC (6)
• Max_Pruning: θ ∈ {≥, >}
– A ⊖ B θ C
where A = (Ii1 ⊕ Ii2 ⊕ … ⊕ Iim), B = (Ij1 ⊕ Ij2 ⊕ … ⊕ Ijn)
– Minimum value of A
• Over-pruning?
– use maxlst
– sliding window with size m + 1

Window of maxlst: stop sliding if the total sum of the items inside is smaller than C
QMIC (7)
• Max_Pruning procedure

Given the maxlst as {Ii1, Ii2, …, Iik}
a) set c = ((m + n) / 2) + 1
b) set max equal to the sum of Ii1 … Iic
c) repeat: increase c by 1 until max ≥ C
remove all the items whose index is larger than c from L and form Lc
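One reading of this procedure can be sketched in Python (the starting window size and return convention are assumptions; the slide's extraction damage leaves them ambiguous):

```python
# Sketch: Max_Pruning over the maxlst. Grow the window until the sum of
# the top-c maximum values reaches C; items past index c can never be
# part of an itemset whose sum exceeds C, so they are pruned.
def max_pruning(maxvals, start, C):
    """maxvals: item maximum values sorted descending (the maxlst).
    start: initial window size (assumed derived from m and n)."""
    c = start
    while c < len(maxvals) and sum(maxvals[:c]) < C:
        c += 1
    return maxvals[:c]  # the surviving window of items

vals = [542, 165, 120, 60, 25, 10]  # hypothetical maxima, descending
print(max_pruning(vals, 2, 800))    # [542, 165, 120]: 542 + 165 + 120 >= 800
```

If even the full list cannot reach C, every item survives the window and the constraint itself eliminates all candidates later.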
Experiments (1)
• Execution time against the number of items (100, 500, 1000, 2000)

[Figure: time in seconds vs. number of items, comparing QMIC and Apriori]
Experiments (2)
• Execution time against the number of transactions (5000, 10000, 20000, 50000, 100000)

[Figure: time in seconds vs. number of transactions, comparing QMIC and Apriori]
Future Plan
• Association Rules of Sequence Patterns
– Time constraint
• Association Rules of Multi-layer data