rough sets association analysis
TRANSCRIPT
-
8/3/2019 Rough Sets Association Analysis
1/14
Approximate Boolean Reasoning 459
Let us note that any i-th degree surface in IR^k can be defined as follows:

S = {(x_1, ..., x_k) ∈ IR^k : P(x_1, ..., x_k) = 0},

where P(x_1, ..., x_k) is an arbitrary i-th degree polynomial over k variables.

Any i-th degree polynomial is a linear combination of monomials, each of degree not greater than i. By σ(i, k) we denote the number of k-variable monomials of degree at most i. Then, instead of searching for i-th degree surfaces in the k-dimensional affine real space IR^k, one can search for hyperplanes in the space IR^{σ(i,k)}.
It is easy to see that the number of j-th degree monomials built from k variables is equal to C(j + k − 1, k − 1). Then we have

σ(i, k) = Σ_{j=1}^{i} C(j + k − 1, k − 1) = O(k^i).   (58)

As we can see, applying the above surfaces we have a better chance to discern objects from different decision classes with a smaller number of cuts. This is because higher degree surfaces are more flexible than normal cuts. This fact can be shown by applying the VC (Vapnik-Chervonenkis) dimension for the corresponding set of functions [154].
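As a sanity check, the monomial count from formula (58) can be computed directly. This is a brief sketch; the function name is ours:

```python
from math import comb

def num_monomials(i: int, k: int) -> int:
    """Number of k-variable monomials of degree between 1 and i:
    the sum over j = 1..i of C(j + k - 1, k - 1), as in formula (58)."""
    return sum(comb(j + k - 1, k - 1) for j in range(1, i + 1))

# Degree at most 2 in 3 variables: x, y, z, x^2, y^2, z^2, xy, xz, yz
print(num_monomials(2, 3))  # -> 9
```

For i = 1 the count is simply k, matching the plain hyperplane case.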
To search for an optimal set of i-th degree surfaces discerning objects from different decision classes of a given decision table S = (U, A ∪ {d}), one can construct a new decision table S^i = (U, A^i ∪ {d}), where A^i is the set of all monomials of degree at most i built on attributes from A. Any hyperplane found for the decision table S^i is a surface in the original decision table S. The cardinality of A^i is estimated by formula (58).

Hence, for the better solution we must pay with an increase of space and time complexity.
9 Rough Sets and Association Analysis
In this section, we consider a well-known and nowadays famous data mining technique, called association rules [3], to discover useful patterns in transactional databases. The problem is to extract all associations and correlations among data items where the presence of one set of items in a transaction implies (with a certain degree of confidence) the presence of other items. Besides market basket data, association analysis is also applicable to other application domains such as customer relationship management (CRM), bioinformatics, medical diagnosis, Web mining, and scientific data analysis.

We will also point out the contribution of the rough sets and approximate Boolean reasoning approach to association analysis, as well as the correspondence between the problem of searching for approximate reducts and the problem of generating association rules from frequent item sets.
460 H.S. Nguyen
9.1 Approximate Reducts
Let S = (U, A ∪ {dec}) be a given decision table, where U = {u_1, u_2, ..., u_n} and A = {a_1, ..., a_k}. The discernibility matrix of S was defined as the (n × n) matrix M(S) = [M_{i,j}]_{i,j=1}^{n}, where

M_{i,j} = {a_m ∈ A : a_m(x_i) ≠ a_m(x_j)}   if dec(x_i) ≠ dec(x_j),
M_{i,j} = ∅                                 otherwise.           (59)
Let us recall that a set B ⊆ A of attributes is consistent with dec (or dec-consistent) if B has a non-empty intersection with each non-empty entry M_{i,j}, i.e.,

B is consistent with dec  iff  ∀_{i,j} (M_{i,j} ≠ ∅) ⇒ (B ∩ M_{i,j} ≠ ∅).

Minimal (with respect to inclusion) dec-consistent sets of attributes are called decision reducts. In some applications (see [138], [120]), instead of reducts we prefer to use their approximations called α-reducts, where α ∈ [0, 1] is a real parameter. A set of attributes is called an α-reduct if it is minimal (with respect to inclusion) among the sets of attributes B such that

disc(B) / conflict(S) = |{M_{i,j} : B ∩ M_{i,j} ≠ ∅}| / |{M_{i,j} : M_{i,j} ≠ ∅}| ≥ α.

If α = 1, the notions of an α-reduct and a (normal) reduct coincide. One can show that for a given α, the problems of searching for a shortest α-reduct and for all α-reducts are also NP-hard [96].
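The α-consistency ratio above can be sketched in a few lines of Python; the representation of the discernibility matrix as a list of attribute sets and the function name are our assumptions:

```python
def alpha_consistency(B, matrix):
    """Fraction of non-empty discernibility entries M_ij intersected by B:
    |{M_ij : B ∩ M_ij ≠ ∅}| / |{M_ij : M_ij ≠ ∅}|."""
    nonempty = [m for m in matrix if m]
    covered = [m for m in nonempty if B & m]
    return len(covered) / len(nonempty)

# Toy discernibility matrix with three non-empty entries:
M = [{"a1", "a2"}, {"a2", "a3"}, {"a1", "a3"}, set()]
print(alpha_consistency({"a2"}, M))        # covers 2 of the 3 entries
print(alpha_consistency({"a1", "a3"}, M))  # covers all 3 entries
```

A set B is then an α-reduct candidate when `alpha_consistency(B, M) >= alpha`; minimality must be checked separately.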
9.2 From Templates to Optimal Association Rules
Let S = (U, A) be an information table. By descriptors (or simple descriptors) we mean terms of the form (a = v), where a ∈ A is an attribute and v ∈ V_a is a value in the domain of a (see [98]). By a template we mean a conjunction of descriptors:

T = D_1 ∧ D_2 ∧ ... ∧ D_m,

where D_1, ..., D_m are either simple or generalized descriptors. We denote by length(T) the number of descriptors in T.

For a given template of length m,

T = (a_{i_1} = v_1) ∧ ... ∧ (a_{i_m} = v_m),

an object u ∈ U is said to satisfy the template T if and only if ∀_j a_{i_j}(u) = v_j. In this way the template T describes the set of objects having the common property: the values of attributes a_{i_1}, ..., a_{i_m} are equal to v_1, ..., v_m, respectively. In this sense one can use templates to describe regularities in data, i.e., patterns (in data mining) or granules (in soft computing).
Templates, apart from length, are also characterized by their support. The support of a template T is defined by

support(T) = |{u ∈ U : u satisfies T}|.

From the descriptive point of view, we prefer long templates with large support. The templates that are supported by a predefined number (say min_support) of objects are called frequent templates. This notion corresponds exactly to the notion of frequent itemsets for transaction databases [1]. Many efficient algorithms for frequent itemset generation have been proposed in [1], [3], [2], [161], [44]. The problem of frequent template generation using rough set methods has also been investigated in [98], [105]. In Sect. 5.4 we considered a special kind of templates called decision templates or decision rules. Almost all objects satisfying a decision template should belong to one decision class.
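Template satisfaction and support as just defined are straightforward to compute. The following sketch assumes objects are dictionaries mapping attributes to values; all names are illustrative:

```python
def satisfies(u, template):
    """u: dict attribute -> value; template: dict of descriptors (a = v)."""
    return all(u.get(a) == v for a, v in template.items())

def support(objects, template):
    """support(T) = number of objects satisfying every descriptor of T."""
    return sum(1 for u in objects if satisfies(u, template))

U = [{"a1": 0, "a2": 1}, {"a1": 0, "a2": 2}, {"a1": 1, "a2": 1}]
print(support(U, {"a1": 0}))           # -> 2
print(support(U, {"a1": 0, "a2": 1}))  # -> 1
```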
Let us assume that the template T, supported by at least s objects, has been found (using one of the existing algorithms for frequent templates). We assume that T consists of m descriptors, i.e.,

T = D_1 ∧ D_2 ∧ ... ∧ D_m,

where D_i (for i = 1, ..., m) is a descriptor of the form (a_i = v_i) for some a_i ∈ A and v_i ∈ V_{a_i}. We denote the set of all descriptors occurring in the template T by DESC(T), i.e.,

DESC(T) = {D_1, D_2, ..., D_m}.

Any set of descriptors P ⊆ DESC(T) defines an association rule

R_P =def ( ⋀_{D_i ∈ P} D_i ⇒ ⋀_{D_j ∉ P} D_j ).
The confidence factor of the association rule R_P can be defined as

confidence(R_P) =def support(T) / support( ⋀_{D_i ∈ P} D_i ),

i.e., the ratio of the number of objects satisfying T to the number of objects satisfying all descriptors from P. The length of the association rule R_P is the number of descriptors from P.

In practice, we would like to find as many association rules with satisfactory confidence as possible (i.e., confidence(R_P) ≥ c for a given c ∈ (0; 1]). The following property holds for the confidence of association rules:

P_1 ⊆ P_2 ⇒ confidence(R_{P_1}) ≤ confidence(R_{P_2}).   (60)

This property says that if the association rule R_P generated from the descriptor set P has satisfactory confidence, then the association rule generated from any superset of P also has satisfactory confidence.

For a given confidence threshold c ∈ (0; 1] and a given set of descriptors P ⊆ DESC(T), the association rule R_P is called c-representative if
1. confidence(R_P) ≥ c;
2. for any proper subset P' ⊊ P we have confidence(R_{P'}) < c.

From Eq. (60) one can see that instead of searching for all association rules, it is enough to find all c-representative rules. Moreover, every c-representative association rule covers a family of association rules. The shorter the association rule R is, the bigger is the set of association rules covered by R. First of all, we show the following theorem:
Theorem 24. For a fixed real number c ∈ (0; 1] and a template T, the Optimal c-Association Rules Problem, i.e., searching for the shortest c-representative association rule from T in a given table A, is NP-hard.

Proof: Obviously, the Optimal c-Association Rules Problem belongs to NP. We show that the Minimal Vertex Covering Problem (which is NP-hard, see, e.g., [35]) can be transformed to the Optimal c-Association Rules Problem.

Let the graph G = (V, E) be an instance of the Minimal Vertex Cover Problem, where V = {v_1, v_2, ..., v_n} and E = {e_1, e_2, ..., e_m}. We assume that every edge e_i is represented by a two-element set of vertices, i.e., e_i = {v_{i_1}, v_{i_2}}. We construct the corresponding information table (or transaction table) A(G) = (U, A) for the Optimal c-Association Rules Problem as follows:

1. The set U consists of m objects corresponding to the m edges of the graph G and k + 1 objects added for technical purposes, i.e.,

   U = {x_1, x_2, ..., x_k} ∪ {x*} ∪ {u_{e_1}, u_{e_2}, ..., u_{e_m}},

   where k = ⌈c/(1 − c)⌉ is a constant derived from c.

2. The set A consists of n attributes corresponding to the n vertices of the graph G and one attribute a* added for technical purposes, i.e.,

   A = {a_{v_1}, a_{v_2}, ..., a_{v_n}} ∪ {a*}.

   The values of the attributes over the objects of U are defined as follows:

   (a) if u ∈ {x_1, x_2, ..., x_k}, then a(x_i) = 1 for any a ∈ A;
   (b) if u = x*, then for any j ∈ {1, ..., n}: a_{v_j}(x*) = 1 and a*(x*) = 0;
   (c) if u ∈ {u_{e_1}, u_{e_2}, ..., u_{e_m}}, then for any j ∈ {1, ..., n}:

       a_{v_j}(u_{e_i}) = 0 if v_j ∈ e_i, and 1 otherwise;   a*(u_{e_i}) = 1.
Example. Let us consider the Optimal c-Association Rules Problem for c = 0.8. We illustrate the proof of Theorem 24 by the graph G = (V, E) with five vertices V = {v_1, v_2, v_3, v_4, v_5} and six edges E = {e_1, e_2, e_3, e_4, e_5, e_6}. First we compute k = ⌈c/(1 − c)⌉ = 4. Hence, the information table A(G) consists of six attributes {a_{v_1}, a_{v_2}, a_{v_3}, a_{v_4}, a_{v_5}, a*} and (4 + 1) + 6 = 11 objects {x_1, x_2, x_3, x_4, x*, u_{e_1}, u_{e_2}, u_{e_3}, u_{e_4}, u_{e_5}, u_{e_6}}. The information table A(G) constructed from the graph G is presented in Fig. 34.
[Figure: the graph G with vertices v_1, ..., v_5 and edges e_1, ..., e_6.]
A(G)    a_{v1}  a_{v2}  a_{v3}  a_{v4}  a_{v5}  a*
x1        1       1       1       1       1     1
x2        1       1       1       1       1     1
x3        1       1       1       1       1     1
x4        1       1       1       1       1     1
x*        1       1       1       1       1     0
u_{e1}    0       0       1       1       1     1
u_{e2}    0       1       1       0       1     1
u_{e3}    1       0       1       1       0     1
u_{e4}    1       0       1       0       1     1
u_{e5}    0       1       0       1       1     1
u_{e6}    1       1       0       1       0     1

Fig. 34. The construction of the information table A(G) from the graph G = (V, E) with five vertices and six edges for c = 0.8
The illustration of our construction is presented in Fig. 34. We will show that a set of vertices W ⊆ V is a minimal covering set for the graph G if and only if the set of descriptors

P_W = {(a_{v_j} = 1) : v_j ∈ W}

defined by W encodes the shortest c-representative association rule for A(G) from the template

T = (a_{v_1} = 1) ∧ ... ∧ (a_{v_n} = 1) ∧ (a* = 1).

One implication is obvious; we show that the converse also holds. The only objects satisfying T are x_1, ..., x_k, hence we have support(T) = k.
Let P ⇒ Q be an optimal c-confidence association rule derived from T. Then we have support(T)/support(P) ≥ c, hence

support(P) ≤ (1/c) · support(T) = (1/c) · k = (1/c) · ⌈c/(1 − c)⌉.

Since (1/c) · c/(1 − c) = 1/(1 − c) = c/(1 − c) + 1 and support(P) is an integer number, we have

support(P) ≤ ⌈c/(1 − c)⌉ + 1 = k + 1.
Thus, there is at most one object from the set {x*} ∪ {u_{e_1}, u_{e_2}, ..., u_{e_m}} satisfying the template P. We consider two cases:

1. The object x* satisfies P: then the template P cannot contain the descriptor (a* = 1), i.e.,

   P = (a_{v_{i_1}} = 1) ∧ ... ∧ (a_{v_{i_t}} = 1),

   and there is no object from {u_{e_1}, u_{e_2}, ..., u_{e_m}} which satisfies P, i.e., for any edge e_j ∈ E there exists a vertex v_i ∈ {v_{i_1}, ..., v_{i_t}} such that a_{v_i}(u_{e_j}) = 0 (which means that v_i ∈ e_j). Hence, the set of vertices W = {v_{i_1}, ..., v_{i_t}} ⊆ V is a solution of the Minimal Vertex Cover Problem.

2. An object u_{e_j} satisfies P: then P contains the descriptor (a* = 1); thus

   P = (a_{v_{i_1}} = 1) ∧ ... ∧ (a_{v_{i_t}} = 1) ∧ (a* = 1).

   Let us assume that e_j = {v_{j_1}, v_{j_2}}. We consider two templates P_1, P_2 obtained from P by replacing the last descriptor by (a_{v_{j_1}} = 1) and (a_{v_{j_2}} = 1), respectively, i.e.,

   P_1 = (a_{v_{i_1}} = 1) ∧ ... ∧ (a_{v_{i_t}} = 1) ∧ (a_{v_{j_1}} = 1),
   P_2 = (a_{v_{i_1}} = 1) ∧ ... ∧ (a_{v_{i_t}} = 1) ∧ (a_{v_{j_2}} = 1).

   One can prove that both templates are supported by exactly k + 1 objects: x_1, x_2, ..., x_k and x*. Hence, similarly to the previous case, the two sets of vertices W_1 = {v_{i_1}, ..., v_{i_t}, v_{j_1}} and W_2 = {v_{i_1}, ..., v_{i_t}, v_{j_2}} establish solutions of the Minimal Vertex Cover Problem.

We have shown that any instance I of the Minimal Vertex Cover Problem can be transformed in polynomial time to a corresponding instance I' of the Optimal c-Association Rules Problem, and that any solution of I can be obtained from the solutions of I'. This reasoning shows that the Optimal c-Association Rules Problem is NP-hard.
Since the problem of searching for the shortest representative association rule is NP-hard, the problem of searching for all association rules must be at least as hard, since having all association rules one can easily find the shortest representative one. Hence, we have the following:

Theorem 25. The problem of searching for all (representative) association rules from a given template is at least NP-hard.

The NP-hardness of the presented problems forces us to develop efficient approximate algorithms for solving them. In the next section we show that they can be developed using rough set methods.
9.3 Searching for Optimal Association Rules by Rough Set Methods
To solve the presented problem, we show that the problem of searching for optimal association rules from a given template is equivalent to the problem of searching for local α-reducts of a decision table, which is a well-known problem in rough set theory. We propose the Boolean reasoning approach for association rule generation.

[Fig. 35. The Boolean reasoning scheme for association rule generation: an association rule problem (A, T) is encoded as a new decision table A|T; the α-reducts P_1, ..., P_t of A|T yield the association rules R_{P_1}, ..., R_{P_t}.]
We construct a new decision table A|T = (U, A|T ∪ {d}) from the original information table A and the template T as follows:

– A|T = {a_{D_1}, a_{D_2}, ..., a_{D_m}} is a set of attributes corresponding to the descriptors of the template T:

  a_{D_i}(u) = 1 if the object u satisfies D_i, and 0 otherwise;   (61)

– the decision attribute d determines whether a given object satisfies the template T, i.e.,

  d(u) = 1 if the object u satisfies T, and 0 otherwise.   (62)
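The construction in Eqs. (61) and (62) can be sketched as follows, assuming objects and templates are dictionaries; the helper name and the encoding of descriptors as column labels are ours:

```python
def build_decision_table(objects, template):
    """Build A|T: one binary attribute per descriptor of T (Eq. 61) and the
    decision d(u) = 1 iff u satisfies the whole template (Eq. 62)."""
    rows = []
    for u in objects:
        # a_D(u) = 1 iff u satisfies the descriptor D = (a = v)
        row = {f"{a}={v}": int(u.get(a) == v) for a, v in template.items()}
        # d(u) = 1 iff all descriptors of T are satisfied
        row["d"] = int(all(row[f"{a}={v}"] for a, v in template.items()))
        rows.append(row)
    return rows

U = [{"a1": 0, "a3": 2}, {"a1": 0, "a3": 1}, {"a1": 1, "a3": 2}]
for row in build_decision_table(U, {"a1": 0, "a3": 2}):
    print(row)
```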
The following theorems describe the relationship between the association rule problem and the reduct searching problem.

Theorem 26. For a given information table A = (U, A) and a template T, the set of descriptors P is a reduct in A|T if and only if the rule

⋀_{D_i ∈ P} D_i ⇒ ⋀_{D_j ∉ P} D_j

is a 100%-representative association rule from T.

Proof: A set of descriptors P is a reduct in the decision table A|T if and only if every object u with decision 0 is discerned from the objects with decision 1 by one of the descriptors from P (i.e., there is at least one 0 in the information vector inf_P(u)). Thus any such u does not satisfy the template ⋀_{D_i ∈ P} D_i. Hence

support( ⋀_{D_i ∈ P} D_i ) = support(T).

The last equality means that ⋀_{D_i ∈ P} D_i ⇒ ⋀_{D_j ∉ P} D_j is a 100%-confidence association rule for the table A.
Analogously, one can show the following fact:

Theorem 27. For a given information table A = (U, A), a template T, and a set of descriptors P ⊆ DESC(T), the rule

⋀_{D_i ∈ P} D_i ⇒ ⋀_{D_j ∉ P} D_j

is a c-representative association rule obtained from T if and only if P is an α-reduct of A|T, where

α = 1 − (1/c − 1) / (n/s − 1),

n is the total number of objects from U, and s = support(T). In particular, the problem of searching for optimal association rules can be solved using methods for α-reduct finding.
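Theorem 27 gives a direct recipe for translating a confidence threshold c into an α value. A small sketch (the function name is ours), checked against the values n = 18 and support(T) = 10 used in the example of Sect. 9.3.1:

```python
def alpha_for_confidence(c, n, s):
    """alpha = 1 - (1/c - 1)/(n/s - 1): the alpha-reduct threshold matching
    confidence level c for n objects and s = support(T) (Theorem 27)."""
    return 1 - (1 / c - 1) / (n / s - 1)

print(round(alpha_for_confidence(0.9, 18, 10), 2))  # -> 0.86
print(alpha_for_confidence(1.0, 18, 10))            # -> 1.0 (plain reducts)
```

Note that c = 1 gives α = 1, recovering the equivalence of 100%-representative rules and ordinary reducts stated in Theorem 26.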
Proof: Assume that support(⋀_{D_i ∈ P} D_i) = s + e, where s = support(T). Then we have

confidence( ⋀_{D_i ∈ P} D_i ⇒ ⋀_{D_j ∉ P} D_j ) = s / (s + e) ≥ c.

This condition is equivalent to

e ≤ (1/c − 1) · s.

Hence, one can evaluate the discernibility degree of P by

disc_degree(P) = e / (n − s) ≤ (1/c − 1) · s / (n − s) = (1/c − 1) / (n/s − 1) = 1 − α.

Thus

α = 1 − (1/c − 1) / (n/s − 1).

Searching for minimal α-reducts is a well-known problem in rough set theory. One can show that the problem of searching for a shortest α-reduct is NP-hard [96] and the problem of searching for all α-reducts is at least NP-hard. However, there exist many approximate algorithms solving the following problems:
1. Searching for a shortest reduct (see [143]);
2. Searching for a number of short reducts (see, e.g., [158]);
3. Searching for all reducts (see, e.g., [7]).

The algorithms for the first two problems are quite efficient from the computational complexity point of view. Moreover, in practical applications, the reducts generated by them are quite close to the optimal ones.

In Sect. 9.3.1, we present some heuristics for these problems in terms of association rule generation.
9.3.1 Example

The following example illustrates the main idea of our method. Let us consider the information table A (Table 18) with 18 objects and 9 attributes.

Assume that the template

T = (a_1 = 0) ∧ (a_3 = 2) ∧ (a_4 = 1) ∧ (a_6 = 0) ∧ (a_8 = 1)

has been extracted from the information table A. One can see that support(T) = 10 and length(T) = 5. The new decision table A|T is presented in Table 19.

The discernibility function for the decision table A|T is as follows:

f(D_1, D_2, D_3, D_4, D_5) = (D_2 ∨ D_4 ∨ D_5) ∧ (D_1 ∨ D_3 ∨ D_4) ∧ (D_2 ∨ D_3 ∨ D_4)
                           ∧ (D_1 ∨ D_2 ∨ D_3 ∨ D_4) ∧ (D_1 ∨ D_3 ∨ D_5)
                           ∧ (D_2 ∨ D_3 ∨ D_5) ∧ (D_3 ∨ D_4 ∨ D_5) ∧ (D_1 ∨ D_5)
Table 18. The example of information table A and template T with support 10

A     a1  a2  a3  a4  a5  a6  a7  a8  a9
u1     0   *   1   1   *   2   *   2   *
u2     0   *   2   1   *   0   *   1   *
u3     0   *   2   1   *   0   *   1   *
u4     0   *   2   1   *   0   *   1   *
u5     1   *   2   2   *   1   *   1   *
u6     0   *   1   2   *   1   *   1   *
u7     1   *   1   2   *   1   *   1   *
u8     0   *   2   1   *   0   *   1   *
u9     0   *   2   1   *   0   *   1   *
u10    0   *   2   1   *   0   *   1   *
u11    1   *   2   2   *   0   *   2   *
u12    0   *   3   2   *   0   *   2   *
u13    0   *   2   1   *   0   *   1   *
u14    0   *   2   2   *   2   *   2   *
u15    0   *   2   1   *   0   *   1   *
u16    0   *   2   1   *   0   *   1   *
u17    0   *   2   1   *   0   *   1   *
u18    1   *   2   1   *   0   *   2   *
T      0   *   2   1   *   0   *   1   *
-
8/3/2019 Rough Sets Association Analysis
10/14
468 H.S. Nguyen
Table 19. The new decision table A|T constructed from A and template T

A|T    D1      D2      D3      D4      D5      d
       a1=0    a3=2    a4=1    a6=0    a8=1
u1     1       0       1       0       0       0
u2     1       1       1       1       1       1
u3     1       1       1       1       1       1
u4     1       1       1       1       1       1
u5     0       1       0       0       1       0
u6     1       0       0       0       1       0
u7     0       0       0       0       1       0
u8     1       1       1       1       1       1
u9     1       1       1       1       1       1
u10    1       1       1       1       1       1
u11    0       1       0       1       0       0
u12    1       0       0       1       0       0
u13    1       1       1       1       1       1
u14    1       1       0       0       0       0
u15    1       1       1       1       1       1
u16    1       1       1       1       1       1
u17    1       1       1       1       1       1
u18    0       1       1       1       0       0
After the discernibility function (presented in simplified form in Table 20) is simplified, we obtain six reducts for the decision table A|T:

f(D_1, D_2, D_3, D_4, D_5) = (D_3 ∧ D_5) ∨ (D_4 ∧ D_5) ∨ (D_1 ∧ D_2 ∧ D_3)
                           ∨ (D_1 ∧ D_2 ∧ D_4) ∨ (D_1 ∧ D_2 ∧ D_5) ∨ (D_1 ∧ D_3 ∧ D_4)

Thus, from the template T we have found six association rules with (100%)-confidence (see Table 20).
For c = 90%, we would like to find α-reducts for the decision table A|T, where

α = 1 − (1/c − 1) / (n/s − 1) ≈ 0.86.

Hence, we would like to search for sets of descriptors that cover at least

⌈(n − s) · α⌉ = ⌈8 · 0.86⌉ = 7

entries of the discernibility matrix M(A|T). One can see that the following sets of descriptors:

{D_1, D_2}, {D_1, D_3}, {D_1, D_4}, {D_1, D_5}, {D_2, D_3}, {D_2, D_5}, {D_3, D_4}

have non-empty intersections with exactly 7 entries of the discernibility matrix M(A|T). Table 20 presents all association rules obtained from those sets.
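The coverage counts claimed above can be verified mechanically from the rows of the simplified discernibility matrix; this is a sketch in which descriptors D_1, ..., D_5 are encoded as the integers 1..5:

```python
from itertools import combinations

# Rows of the simplified discernibility matrix M(A|T) (cf. Table 20):
rows = [{2, 4, 5}, {1, 3, 4}, {2, 3, 4}, {1, 2, 3, 4},
        {1, 3, 5}, {2, 3, 5}, {3, 4, 5}, {1, 5}]

# For every two-descriptor set, count how many rows it intersects:
for pair in combinations(range(1, 6), 2):
    covered = sum(1 for r in rows if r & set(pair))
    if covered >= 7:
        print(sorted(pair), covered)
```

The run lists the seven pairs covering exactly 7 rows, plus {D_3, D_5} and {D_4, D_5}, which cover all 8 rows; the latter two are precisely the two-element reducts found earlier.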
Table 20. The simplified version of the discernibility matrix M(A|T); representative association rules with (100%)-confidence and representative association rules with at least (90%)-confidence

M(A|T)   u2, u3, u4, u8, u9, u10, u13, u15, u16, u17
u1       D2 ∨ D4 ∨ D5
u5       D1 ∨ D3 ∨ D4
u6       D2 ∨ D3 ∨ D4
u7       D1 ∨ D2 ∨ D3 ∨ D4
u11      D1 ∨ D3 ∨ D5
u12      D2 ∨ D3 ∨ D5
u14      D3 ∨ D4 ∨ D5
u18      D1 ∨ D5

100%-representative rules:
D3 ∧ D5 ⇒ D1 ∧ D2 ∧ D4
D4 ∧ D5 ⇒ D1 ∧ D2 ∧ D3
D1 ∧ D2 ∧ D3 ⇒ D4 ∧ D5
D1 ∧ D2 ∧ D4 ⇒ D3 ∧ D5
D1 ∧ D2 ∧ D5 ⇒ D3 ∧ D4
D1 ∧ D3 ∧ D4 ⇒ D2 ∧ D5

90%-representative rules:
D1 ∧ D2 ⇒ D3 ∧ D4 ∧ D5
D1 ∧ D3 ⇒ D2 ∧ D4 ∧ D5
D1 ∧ D4 ⇒ D2 ∧ D3 ∧ D5
D1 ∧ D5 ⇒ D2 ∧ D3 ∧ D4
D2 ∧ D3 ⇒ D1 ∧ D4 ∧ D5
D2 ∧ D5 ⇒ D1 ∧ D3 ∧ D4
D3 ∧ D4 ⇒ D1 ∧ D2 ∧ D5
In Fig. 36, we present the set of all 100%-association rules (light gray region) and 90%-association rules (dark gray region). The corresponding representative association rules are marked with bold frames.
9.3.2 The Approximate Algorithms

From the previous example it follows that the problem of searching for representative association rules can be treated as a search problem in the lattice of attribute subsets (see Fig. 36). In general, there are two searching strategies: bottom-up and top-down. The top-down strategy starts with the whole descriptor set and tries to go down through the lattice. In every step we reduce the most superfluous subsets, keeping the subsets which most probably can be reduced in the next step. Almost all existing methods realize this strategy (e.g., the Apriori algorithm [2]). The advantages of these methods are as follows:

1. They generate all association rules during the searching process.
2. They are easy to implement for parallel or concurrent computers.

However, this process can take a very long computation time because of the NP-hardness of the problem (see Theorem 25).

The rough set based method realizes the bottom-up strategy. We start with the empty set of descriptors. Here we describe a modified version of the greedy heuristic for the decision table A|T. In practice, we do not construct this additional decision table. The main problem is to compute the number of occurrences of descriptors in the discernibility matrix M(A|T). For any descriptor D, this
Algorithm 8. Searching for a shortest representative association rule

Input: information table A, template T, minimal confidence c.
Output: a short c-representative association rule.
begin
  P := ∅; U_P := U;
  min_support := (1/c) · support(T);
  repeat
    select the descriptor D from DESC(T) \ P which is satisfied
      by the smallest number of objects from U_P;
    P := P ∪ {D};
    U_P := satisfy(P);   // the set of objects satisfying all descriptors from P
  until |U_P| ≤ min_support
end
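Algorithm 8 can be sketched in Python as follows; the dictionary-based data representation and the stopping threshold support(T)/c (our reading of the garbled min_support line, chosen so that the rule reaches confidence c) are assumptions:

```python
def shortest_representative_rule(objects, template, c):
    """Greedy sketch of Algorithm 8: repeatedly add the descriptor of T
    satisfied by the fewest remaining objects, until
    support(P) <= support(T) / c, which gives confidence(R_P) >= c."""
    sat = lambda u, a, v: u.get(a) == v
    s = sum(1 for u in objects
            if all(sat(u, a, v) for a, v in template.items()))
    min_support = s / c
    P, U_P = {}, list(objects)
    while len(U_P) > min_support and len(P) < len(template):
        # descriptor from DESC(T) \ P satisfied by the fewest objects of U_P
        a, v = min((d for d in template.items() if d[0] not in P),
                   key=lambda d: sum(1 for u in U_P if sat(u, *d)))
        P[a] = v
        U_P = [u for u in U_P if sat(u, a, v)]
    return P

# Toy table; the template T = (a = 1) AND (b = 1) has support 2:
U = [{"a": 1, "b": 1}, {"a": 1, "b": 1}, {"a": 1, "b": 0},
     {"a": 0, "b": 1}, {"a": 0, "b": 1}]
print(shortest_representative_rule(U, {"a": 1, "b": 1}, 0.6))  # -> {'a': 1}
```

As the text notes, the resulting set P may still contain superfluous descriptors and should be pruned afterwards to obtain a truly c-representative rule.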
[Fig. 36. The illustration of 100%- and 90%-representative association rules: the lattice of subsets of {D_1, D_2, D_3, D_4, D_5}, with the region of association rules with confidence = 100%, the region of rules with confidence > 90%, and the region of rules with confidence < 90%.]
number is equal to the number of 0's occurring in the column a_D represented by this descriptor, and it can be computed using simple SQL queries of the form

SELECT COUNT ... WHERE ...

We present two algorithms: the first (Algorithm 8) finds an almost shortest c-representative association rule. This algorithm does not guarantee that the resulting descriptor set P is c-representative, but one can achieve that by removing from P (which is in general small) all unnecessary descriptors.

The second algorithm (Algorithm 9) finds k short c-representative association rules, where k and c are parameters given by the user. This algorithm makes use of the beam search strategy, which expands the k most promising nodes at each depth of the search tree.
Algorithm 9. Searching for k short representative association rules

Input: information table A, template T, minimal confidence c, number of representative rules k ∈ N.
Output: k short c-representative association rules R_{P_1}, ..., R_{P_k}.
begin
  for i := 1 to k do
    P_i := ∅; U_{P_i} := U;
  end
  min_support := (1/c) · support(T);
  Result_set := ∅; Working_set := {P_1, ..., P_k};
  repeat
    Candidate_set := ∅;
    for each P_i ∈ Working_set do
      select k descriptors D_1^i, ..., D_k^i from DESC(T) \ P_i which are
        satisfied by the smallest number of objects from U_{P_i};
      insert P_i ∪ {D_1^i}, ..., P_i ∪ {D_k^i} into the Candidate_set;
    end
    select the k descriptor sets P'_1, ..., P'_k from the Candidate_set
      (if they exist) which are satisfied by the smallest number of objects from U;
    Working_set := {P'_1, ..., P'_k};
    for each P_i ∈ Working_set do
      U_{P_i} := satisfy(P_i);
      if |U_{P_i}| < min_support then
        move P_i from the Working_set to the Result_set;
      end
    end
  until |Result_set| ≥ k or Working_set is empty
end
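Algorithm 9's beam search can be sketched similarly. The data representation and tie-breaking details are our assumptions, and for brevity every one-descriptor extension of a beam node is generated before pruning to the k best candidates:

```python
def k_short_rules(objects, template, c, k):
    """Beam-search sketch of Algorithm 9: keep the k most promising descriptor
    sets at each depth; a set moves to the results once its support drops to
    support(T) / c or below (i.e., its rule reaches confidence c)."""
    sat = lambda u, d: u.get(d[0]) == d[1]
    desc = list(template.items())
    s = sum(1 for u in objects if all(sat(u, d) for d in desc))
    min_support = s / c
    beam, results = [frozenset()], []
    while beam and len(results) < k:
        # extend every beam node by one descriptor, then keep the k candidate
        # sets satisfied by the fewest objects
        candidates = {P | {d} for P in beam for d in desc if d not in P}
        scored = sorted(candidates,
                        key=lambda P: sum(1 for u in objects
                                          if all(sat(u, d) for d in P)))
        beam = []
        for P in scored[:k]:
            supp = sum(1 for u in objects if all(sat(u, d) for d in P))
            (results if supp <= min_support else beam).append(P)
    return [dict(P) for P in results[:k]]

U = [{"a": 1, "b": 1}, {"a": 1, "b": 1}, {"a": 1, "b": 0},
     {"a": 0, "b": 1}, {"a": 0, "b": 1}]
print(k_short_rules(U, {"a": 1, "b": 1}, 0.6, 2))
```

On this toy table the search returns the short rule built from (a = 1) first and only then the full two-descriptor set, mirroring how the beam promotes sufficiently supported sets to the result list.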
[Fig. 37. The illustration of the k short representative association rules algorithm: each set P_1, ..., P_k in the old working set is extended by k candidate descriptors D_1^i, ..., D_k^i; from the resulting candidate set the k best sets P'_1, ..., P'_k form the new working set.]
10 Rough Set and Boolean Reasoning Approach to Mining Large Data Sets

Mining large data sets is one of the biggest challenges in KDD. In many practical applications, there is a need for data mining algorithms running on terminals of a client-server database system, where the only access to the database (located on the server) is via SQL queries.

Unfortunately, the data mining methods based on rough sets and Boolean reasoning proposed so far are characterized by high computational complexity, and their straightforward implementations are not applicable to large data sets. The critical factor for the time complexity of algorithms solving the discussed problems is the number of simple SQL queries like

SELECT COUNT FROM aTable WHERE aCondition

In this section, we present some efficient modifications of these methods that overcome this problem. We consider the following issues:

– searching for short reducts from large data sets;
– induction of rule-based rough classifiers from large data sets;
– searching for best partitions defined by cuts on continuous attributes;
– soft cuts: a new paradigm for the discretization problem.
10.1 Searching for Reducts

The application of the ABR approach to the reduct problem was described in Sect. 5. We have shown (see Algorithm 2 on page 389) that the greedy heuristic for the minimal reduct problem uses only two functions: