Data Mining in Clinical Databases by using Association Rules
Department of Computing
Charles Lo
Outline
• What is Association Rule?
• Previous Works
• Target Problems
• Methodology and Algorithm
• Experiment and Discussion
• Q & A
What is Association Rule? (1)
• It was introduced in "Agrawal, Imielinski, & Swami 1993".

Database: A, B → C

30% of the transactions that contain A and B also contain C; 5% of all the transactions contain all of them.
What is Association Rule? (2)
• In a supermarket, 20% of transactions that contain Coca-Cola also contain Pepsi, and 3% of all transactions contain both items.
– 20% is the confidence of the rule
– 3% is the support of the rule
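As a minimal sketch of these two definitions (the transaction data below is hypothetical), support and confidence can be computed directly:

```python
# Sketch: support and confidence of the rule {A, B} -> {C}
# over a small list of transactions (hypothetical data).
transactions = [
    {"A", "B", "C"},
    {"A", "B"},
    {"A", "C"},
    {"B", "C"},
    {"A", "B", "C"},
]

def support(itemset, transactions):
    """Fraction of transactions containing every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """support(antecedent union consequent) / support(antecedent)."""
    return (support(antecedent | consequent, transactions)
            / support(antecedent, transactions))

print(support({"A", "B", "C"}, transactions))       # 2/5 = 0.4
print(confidence({"A", "B"}, {"C"}, transactions))  # 2/3, about 0.667
```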
• Association rules can be applied in
– Decision Support
– Market Strategy
– Financial Forecast
Related Work (1)
In 1993, Agrawal, Imielinski and Swami
• Generate all significant association rules between items
• Algorithm Apriori
– Pruning Techniques
– Buffer management

if support > min support and confidence ≥ min confidence → significant association rule
Related Work (2)
• Pruning Technique
– Frequency Constraint
• Memory Management
– Memory to store any itemset and all its 1-extensions
Related Work (3)
In 1997, Srikant, Vu and Agrawal
• Consider constraints that are boolean expressions over the presence or absence of items in the rules
• Incomplete candidate generation

The boolean constraint: (B ∧ C) ∨ (X ∧ Y)
Related Work (4)
• Selected Items approach
1. Generate a set of selected items
• e.g. for B = (1 ∧ 2) ∨ 3, possible selected sets are {1, 3}, {2, 3} and {1, 2, 3, 4, 5}
2. Only count candidates that contain selected items
3. Discard frequent itemsets that do not satisfy the boolean expression

Any (non-empty) itemset that satisfies B will contain an item from the selected set.
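The defining property of a selected set can be checked by brute force. Assuming the constraint is B = (1 ∧ 2) ∨ 3 written in DNF, and with hypothetical helper names (`satisfies`, `is_selected_set`):

```python
from itertools import chain, combinations

# Hypothetical DNF representation of B = (1 AND 2) OR 3:
# a list of conjunctions, where satisfying any one conjunction satisfies B.
B = [[1, 2], [3]]

def satisfies(itemset, dnf):
    """True if the itemset contains every item of some conjunct of the DNF."""
    return any(all(i in itemset for i in conj) for conj in dnf)

def is_selected_set(selected, dnf, universe):
    """Check: every non-empty itemset satisfying B contains a selected item."""
    subsets = chain.from_iterable(
        combinations(universe, r) for r in range(1, len(universe) + 1))
    return all(not satisfies(set(s), dnf) or any(i in s for i in selected)
               for s in subsets)

universe = [1, 2, 3, 4, 5]
print(is_selected_set({1, 3}, B, universe))  # True: 1 covers (1 AND 2), 3 covers 3
print(is_selected_set({2, 3}, B, universe))  # True
print(is_selected_set({1, 2}, B, universe))  # False: {3} satisfies B but is missed
```

Only candidates containing an item from such a set need to be counted, which is what makes step 2 of the approach sound.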
Related Work (5)
In 1998, Ng, Lakshmanan, Han and Pang
• Achieved a maximized degree of pruning for different categories of constraints
• Two critical properties for pruning
– Anti-monotonicity
– Succinctness
• Algorithm CAP handles four categories of constraints:
1. Both anti-monotone and succinct
2. Succinct but non-anti-monotone
3. Anti-monotone and non-succinct
4. Non-anti-monotone and non-succinct
Related Work (6)
• Anti-Monotone Constraint
– If S ⊇ S′ and S satisfies C, then S′ satisfies C

Domain Constraint: S = v, S ≤ v, S ≥ v, S ⊆ V
Aggregate Constraint: min(S) ≥ v, max(S) ≤ v, count(S) ≤ v, sum(S) ≤ v
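A small sketch of why anti-monotonicity prunes: once a set violates an anti-monotone constraint such as sum(S) ≤ v, so does every superset, so none of its extensions need support counting. Item values here are hypothetical and assumed non-negative:

```python
# Sketch: anti-monotone pruning with the constraint sum(S) <= v,
# assuming all item values are non-negative (hypothetical values).
values = {"A": 40, "B": 70, "C": 20}
v = 100

def sum_ok(itemset):
    """True if the itemset satisfies sum(S) <= v."""
    return sum(values[i] for i in itemset) <= v

# If S violates sum(S) <= v, every superset of S violates it too,
# so S and all of its extensions can be pruned without counting support.
print(sum_ok({"A", "B"}))       # False: 110 > 100, prune all supersets
print(sum_ok({"A", "B", "C"}))  # False, guaranteed by anti-monotonicity
print(sum_ok({"A", "C"}))       # True: 60 <= 100, keep
```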
Related Work (7)
• Succinct Constraint
– Pruning can be done once-and-for-all before any iteration takes place

Domain Constraint: S = v, S ≤ v, S ≥ v, S ⊆ V, S ⊇ V
Aggregate Constraint: min(S) ≤ v, min(S) ≥ v, max(S) ≤ v, max(S) ≥ v, count(S) ≥ v, sum(S) ≥ v
Target Problems (1)
• Associations of quantitative items that satisfy a given inequality constraint composed of either (+, −) or (∗, /)
– (Ii1 ⊕ Ii2 ⊕ … ⊕ Iim) ⊖ (Ij1 ⊕ Ij2 ⊕ … ⊕ Ijn) θ C
– A constraint is specified by a 6-tuple:
1. size m
2. size n
3. ⊕ is + (or ∗)
4. ⊖ is − (or /)
5. θ is one of <, >, =, ≤, ≥
6. constant C
– Examples: (3, 2, +, −, >, 100) and (1, 1, 0, /, =, 2)
Target Problems (2)
• Temporal aspect of the data
• Hierarchies over the data

[Figure: three temporal patterns: serial (A → B → C), parallel (A, B, C occurring together), sequence ((A, B) → (C, D))]
Problem Statement
• V = {I1, I2, …, IM}, a set of quantitative items
• T, the transactions of a database D
• t[k] > 0 means t contains item Ik;
t[k] = 0 means Ik does not appear in t
• Find associations of items which satisfy
(Ii1 ⊕ Ii2 ⊕ … ⊕ Iim) ⊖ (Ij1 ⊕ Ij2 ⊕ … ⊕ Ijn) θ C
where ⊕ is + (or ∗), ⊖ is − (or /), θ is one of <, >, =, ≤, ≥, and C is a scalar value
Application in Clinical Database
• Relationship between the treatments and clinical diagnosis
– nursing: 100, clinical test: 30, pharmacies: 165, …
– nursing: 120, injection: 130, pharmacies: 100, …
– operation: 220, injection: 542, clinical test: 60, …
• (X + Y) − Z > 100
• X / Y = 2
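Checking such a constraint over quantitative transactions can be sketched as follows (item names and amounts are hypothetical, following the rows above; `satisfies` is an illustrative helper, not part of the author's algorithm):

```python
# Sketch: evaluating the inequality constraint (X + Y) - Z > threshold
# over quantitative clinical transactions (hypothetical data).
transactions = [
    {"nursing": 100, "clinical test": 30, "pharmacies": 165},
    {"nursing": 120, "injection": 130, "pharmacies": 100},
    {"operation": 220, "injection": 542, "clinical test": 60},
]

def satisfies(t, plus_items, minus_items, threshold):
    """True if every item occurs in t (t[k] = 0 means absent) and
    (sum of plus_items) - (sum of minus_items) > threshold."""
    if any(t.get(i, 0) == 0 for i in plus_items + minus_items):
        return False
    return (sum(t[i] for i in plus_items)
            - sum(t[i] for i in minus_items)) > threshold

# Support of ((nursing + pharmacies) - clinical test) > 100:
# only the first transaction contains all three items (100 + 165 - 30 = 235).
hits = sum(satisfies(t, ["nursing", "pharmacies"], ["clinical test"], 100)
           for t in transactions)
print(hits / len(transactions))  # 1/3
```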
QMIC (1)
• QMIC (Quantitative Mining under Inequality Constraints)
– Candidate generation
• reduce the number of itemsets
• Max_Min pruning
– Support counting
• reduce the iterations of database scanning
• generation sequence
– Memory requirement
• limitation of the available memory
QMIC (2)
• Skip generation steps using the pre-defined sizes m and n
• Generation Steps
– Algorithm Apriori: Lk−1 → Lk
– Algorithm QMIC: Lk/2 → Lk

Sm = Sm/2 join Sm/2, if m is even
Sm = S(m−1)/2 join S(m+1)/2, otherwise
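Assuming Lk is built by joining two (near-)half-size levels, as the Lk/2 → Lk step above indicates, the set of levels each algorithm must materialize can be sketched (function names are illustrative):

```python
# Sketch: which levels of large itemsets each algorithm visits
# to reach candidates of size k, assuming the doubling sequence.
def apriori_levels(k):
    """Apriori materializes every level L1 .. Lk."""
    return list(range(1, k + 1))

def qmic_levels(k):
    """Levels needed when Lk is joined from two (near-)halves."""
    need, stack = set(), [k]
    while stack:
        m = stack.pop()
        if m in need or m == 0:
            continue
        need.add(m)
        if m > 1:
            stack += [m // 2, m - m // 2]  # halves differ by 1 when m is odd
    return sorted(need)

print(apriori_levels(8))  # [1, 2, 3, 4, 5, 6, 7, 8]
print(qmic_levels(8))     # [1, 2, 4, 8]
print(qmic_levels(7))     # [1, 2, 3, 4, 7]
```

For a target size of 8, QMIC touches 4 levels instead of 8, which is where the reduced database scanning comes from.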
QMIC (3)
• Candidate itemsets generation

Given s1, s2, …, sk
if ((k mod 2) = 1) then
    for any two itemsets X, Y
        if (x1 = y1 and x2 = y2 and … and xi−1 = yi−1) then
            insert X ∪ Y into the candidate set
else
    for any two itemsets X, Y
        if ((X ∩ Y) = ∅) then
            insert X ∪ Y into the candidate set
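A Python reading of the two join rules for the candidate-generation step (the exact conditions are assumptions recovered from the slide): an Apriori-style prefix join when extending size by one, and a disjoint-union join when merging two equal-size halves.

```python
# Sketch: two candidate-join rules, with itemsets as sorted tuples.
def join_prefix(frequent):
    """Prefix join: combine X, Y that share their first k-1 items."""
    out = set()
    for x in frequent:
        for y in frequent:
            if x[:-1] == y[:-1] and x[-1] < y[-1]:
                out.add(x + (y[-1],))
    return out

def join_disjoint(frequent):
    """Doubling join: union of any two disjoint itemsets of equal size."""
    out = set()
    for x in frequent:
        for y in frequent:
            if set(x).isdisjoint(y):
                out.add(tuple(sorted(set(x) | set(y))))
    return out

l2 = [(1, 2), (1, 3), (2, 3), (3, 4)]
print(sorted(join_prefix(l2)))    # [(1, 2, 3)]
print(sorted(join_disjoint(l2)))  # [(1, 2, 3, 4)]
```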
QMIC (4)
• Why this generation sequence?
– How about using a factor of 3, 4 or larger?
– Or even a power series?
• Memory Management
– keep the previous L's to generate the next level of large itemsets
– only limited memory is available
– In QMIC, only three previous L's are needed to generate the next level of large itemsets in the generation sequence.
QMIC (5)
• What is the trade-off of the generation sequence?
– more candidate itemsets
– longer processing time in pruning
• Max_Min Pruning
– applies the inequality constraint to the pruning
– Maximum value itemset list (maxlst)
• sorted list in descending order according to the maximum value of the sum (product)
– Minimum value itemset list (minlst)
• sorted list in ascending order according to the minimum value of the sum (product)
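Building the two lists is a pair of sorts. The per-item maximum and minimum values below are hypothetical, assumed to be collected during a database scan:

```python
# Sketch: constructing maxlst and minlst from per-item extreme values
# (hypothetical quantities observed across all transactions).
item_max = {"nursing": 120, "pharmacies": 165, "clinical test": 60, "injection": 542}
item_min = {"nursing": 100, "pharmacies": 100, "clinical test": 30, "injection": 130}

# maxlst: descending by maximum value; minlst: ascending by minimum value.
maxlst = sorted(item_max, key=item_max.get, reverse=True)
minlst = sorted(item_min, key=item_min.get)

print(maxlst)  # ['injection', 'pharmacies', 'nursing', 'clinical test']
print(minlst)  # ['clinical test', 'nursing', 'pharmacies', 'injection']
```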
QMIC (6)
• Max_Pruning: θ ∈ {≥, >}
– A ⊖ B θ C
where A = (Ii1 ⊕ Ii2 ⊕ … ⊕ Iim), B = (Ij1 ⊕ Ij2 ⊕ … ⊕ Ijn)
– Minimum value of A
• Over-pruning?
– use maxlst
– sliding window with size m + 1

Window of maxlst: stop sliding if the total sum of the items inside is smaller than C
QMIC (7)
• Max_Pruning procedure

Given the maxlst as {Ii1, Ii2, …, Iik}
a) set c = ((m + n) / 2) + 1
b) set max equal to the sum of Ii1 … Iic
c) repeat: increase c by 1 until max ≥ C
remove all the items whose index is larger than c from L and form Lc
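One reading of this procedure can be sketched in Python (the starting window size and return convention are assumptions; the slide's extraction damage leaves them ambiguous):

```python
# Sketch: Max_Pruning over the maxlst. Grow the window until the sum of
# the top-c maximum values reaches C; items past index c can never be
# part of an itemset whose sum exceeds C, so they are pruned.
def max_pruning(maxvals, start, C):
    """maxvals: item maximum values sorted descending (the maxlst).
    start: initial window size (assumed derived from m and n)."""
    c = start
    while c < len(maxvals) and sum(maxvals[:c]) < C:
        c += 1
    return maxvals[:c]  # the surviving window of items

vals = [542, 165, 120, 60, 25, 10]  # hypothetical maxima, descending
print(max_pruning(vals, 2, 800))    # [542, 165, 120]: 542 + 165 + 120 >= 800
```

If even the full list cannot reach C, every item survives the window and the constraint itself eliminates all candidates later.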
Experiments (1)
• Execution time against the number of items (100, 500, 1000, 2000)

[Figure: time in seconds vs. number of items, comparing QMIC and Apriori]
Experiments (2)
• Execution time against the number of transactions (5000, 10000, 20000, 50000, 100000)

[Figure: time in seconds vs. number of transactions, comparing QMIC and Apriori]
Future Plan
• Association Rules of Sequence Patterns
– Time constraint
• Association Rules of Multi-layer data