Rough Sets Association Analysis

Upload: golgeman

Post on 06-Apr-2018


  • 8/3/2019 Rough Sets Association Analysis

    1/14

    Approximate Boolean Reasoning 459

Let us note that any i-th degree surface in IR^k can be defined as follows:

S = {(x1, . . . , xk) ∈ IR^k : P(x1, . . . , xk) = 0},

where P(x1, . . . , xk) is an arbitrary i-th degree polynomial over k variables.

Any i-th degree polynomial is a linear combination of monomials, each of degree not greater than i. By m(i, k) we denote the number of k-variable monomials of degree at most i. Then, instead of searching for i-th degree surfaces in the k-dimensional affine real space IR^k, one can search for hyperplanes in the space IR^m(i,k).

It is easy to see that the number of j-th degree monomials built from k variables is equal to the binomial coefficient C(j + k − 1, k − 1). Then we have

m(i, k) = Σ_{j=1}^{i} C(j + k − 1, k − 1) = O(k^i).   (58)

As we can see, applying the above surfaces gives a better chance to discern objects from different decision classes with a smaller number of cuts, because higher-degree surfaces are more flexible than normal cuts. This fact can be shown by applying the VC (Vapnik-Chervonenkis) dimension to the corresponding set of functions [154].
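Formula (58) is easy to check numerically; a minimal sketch (the function name is ours):

```python
from math import comb

def num_monomials(i, k):
    """m(i, k): number of k-variable monomials of degree between 1 and i,
    i.e., the sum over j = 1..i of C(j + k - 1, k - 1) -- formula (58)."""
    return sum(comb(j + k - 1, k - 1) for j in range(1, i + 1))

# Degree <= 2 monomials in 2 variables: x, y, x^2, xy, y^2, so 5 of them.
print(num_monomials(2, 2))   # -> 5
```

For fixed i the count indeed grows like k^i, which is the price paid for the richer family of surfaces.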

To search for an optimal set of i-th degree surfaces discerning objects from different decision classes of a given decision table S = (U, A ∪ {d}), one can construct a new decision table S^i = (U, A^i ∪ {d}), where A^i is the set of all monomials of degree at most i built on attributes from A. Any hyperplane found for the decision table S^i is a surface in the original decision table S. The cardinality of A^i is estimated by formula (58).

Hence, for the better solution we must pay with an increase of space and time complexity.

    9 Rough Sets and Association Analysis

In this section, we consider a well-known data mining technique, called association rules [3], for discovering useful patterns in transactional databases. The problem is to extract all associations and correlations among data items where the presence of one set of items in a transaction implies (with a certain degree of confidence) the presence of other items. Besides market basket data, association analysis is also applicable to other application domains such as customer relationship management (CRM), bioinformatics, medical diagnosis, Web mining, and scientific data analysis.

We will also point out the contribution of rough sets and the approximate Boolean reasoning approach to association analysis, as well as the correspondence between the problem of searching for approximate reducts and the problem of generating association rules from frequent itemsets.



    460 H.S. Nguyen

    9.1 Approximate Reducts

Let S = (U, A ∪ {dec}) be a given decision table, where U = {u1, u2, . . . , un} and A = {a1, . . . , ak}. The discernibility matrix of S was defined as the (n × n) matrix M(S) = [M_{i,j}]_{i,j=1}^{n}, where

M_{i,j} = {a_m ∈ A : a_m(x_i) ≠ a_m(x_j)}  if dec(x_i) ≠ dec(x_j),
M_{i,j} = ∅                                 otherwise.   (59)

Let us recall that a set of attributes B ⊆ A is consistent with dec (or dec-consistent) if B has a non-empty intersection with each non-empty entry M_{i,j}, i.e.,

B is consistent with dec iff ∀_{i,j} (M_{i,j} ≠ ∅) ⇒ (B ∩ M_{i,j} ≠ ∅).

Minimal (with respect to inclusion) dec-consistent sets of attributes are called decision reducts. In some applications (see [138], [120]), instead of reducts we prefer to use their approximations called α-reducts, where α ∈ [0, 1] is a real parameter. A set of attributes is called an α-reduct if it is minimal (with respect to inclusion) among the sets of attributes B such that

disc(B) / conflict(S) = |{M_{i,j} : B ∩ M_{i,j} ≠ ∅}| / |{M_{i,j} : M_{i,j} ≠ ∅}| ≥ α.

If α = 1, the notions of an α-reduct and a (normal) reduct coincide. One can show that, for a given α, the problems of searching for the shortest α-reduct and for all α-reducts are also NP-hard [96].
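The definitions above translate into a few lines of code; a hedged sketch (function names are ours, objects are attribute-value tuples and attributes are referred to by index):

```python
from itertools import combinations

def discernibility_entries(rows, dec):
    """Non-empty entries of the discernibility matrix (59): for each pair of
    objects with different decisions, the set of attribute indices on which
    the two objects differ."""
    entries = []
    for i, j in combinations(range(len(rows)), 2):
        if dec[i] != dec[j]:
            diff = {a for a in range(len(rows[i])) if rows[i][a] != rows[j][a]}
            if diff:
                entries.append(diff)
    return entries

def alpha_of(B, entries):
    """disc(B)/conflict(S): fraction of non-empty entries intersected by B."""
    return sum(1 for m in entries if m & B) / len(entries)

rows = [(0, 0), (0, 1), (1, 1)]
dec = [0, 0, 1]
M = discernibility_entries(rows, dec)       # [{0, 1}, {0}]
print(alpha_of({0}, M), alpha_of({1}, M))   # 1.0 0.5
```

Here {0} is dec-consistent (a 1-reduct), while {1} only reaches the ratio 0.5.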

    9.2 From Templates to Optimal Association Rules

Let S = (U, A) be an information table. By descriptors (or simple descriptors) we mean terms of the form (a = v), where a ∈ A is an attribute and v ∈ V_a is a value in the domain of a (see [98]). By a template we mean a conjunction of descriptors:

T = D1 ∧ D2 ∧ . . . ∧ Dm,

where D1, . . . , Dm are either simple or generalized descriptors. We denote by length(T) the number of descriptors in T.

For a given template of length m:

T = (a_{i1} = v1) ∧ . . . ∧ (a_{im} = vm),

an object u ∈ U is said to satisfy the template T if and only if ∀_j a_{ij}(u) = v_j. In this way the template T describes the set of objects having the common property: the values of attributes a_{i1}, . . . , a_{im} are equal to v1, . . . , vm, respectively. In this sense one can use templates to describe regularities in data, i.e., patterns (in data mining) or granules (in soft computing).

  • 8/3/2019 Rough Sets Association Analysis

    3/14

    Approximate Boolean Reasoning 461

Templates, apart from length, are also characterized by their support. The support of a template T is defined by

support(T) = |{u ∈ U : u satisfies T}|.

From the descriptive point of view, we prefer long templates with large support. The templates that are supported by a predefined number (say min_support) of objects are called frequent templates. This notion corresponds exactly to the notion of frequent itemsets for transaction databases [1]. Many efficient algorithms for frequent itemset generation have been proposed in [1], [3], [2], [161], [44]. The problem of frequent template generation using rough set methods has also been investigated in [98], [105]. In Sect. 5.4 we considered a special kind of templates called decision templates or decision rules. Almost all objects satisfying a decision template should belong to one decision class.

Let us assume that the template T, supported by at least s objects, has been found (using one of the existing algorithms for frequent templates). We assume that T consists of m descriptors, i.e.,

T = D1 ∧ D2 ∧ . . . ∧ Dm,

where Di (for i = 1, . . . , m) is a descriptor of the form (a_i = v_i) for some a_i ∈ A and v_i ∈ V_{a_i}. We denote the set of all descriptors occurring in the template T by DESC(T), i.e.,

DESC(T) = {D1, D2, . . . , Dm}.

Any set of descriptors P ⊆ DESC(T) defines an association rule

R_P =def ( ⋀_{Di ∈ P} Di ⇒ ⋀_{Dj ∉ P} Dj ).

The confidence factor of the association rule R_P can be defined as

confidence(R_P) =def support(T) / support( ⋀_{Di ∈ P} Di ),

i.e., the ratio of the number of objects satisfying T to the number of objects satisfying all descriptors from P. The length of the association rule R_P is the number of descriptors from P.

In practice, we would like to find as many association rules with satisfactory confidence as possible (i.e., confidence(R_P) ≥ c for a given c ∈ (0; 1)). The following property holds for the confidence of association rules:

P1 ⊆ P2 ⇒ confidence(R_{P1}) ≤ confidence(R_{P2}).   (60)

This property says that if the association rule R_P generated from the descriptor set P has satisfactory confidence, then the association rule generated from any superset of P also has satisfactory confidence.
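Property (60) holds because adding descriptors to the premise can only shrink its support; a quick check on toy data (names and data are ours):

```python
def confidence(table, template, P):
    """confidence(R_P) = support(T) / support(premise built from P),
    where P is a subset of the descriptors of the template T."""
    sat = lambda row, descs: all(row[a] == v for a, v in descs)
    return sum(sat(r, template) for r in table) / sum(sat(r, P) for r in table)

table = [{"a": 0, "b": 0}, {"a": 0, "b": 0}, {"a": 0, "b": 1}, {"a": 1, "b": 0}]
T = [("a", 0), ("b", 0)]
print(confidence(table, T, [("a", 0)]))   # 2/3
print(confidence(table, T, T))            # 1.0 -- larger premise, higher confidence
```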

For a given confidence threshold c ∈ (0; 1] and a given set of descriptors P ⊆ DESC(T), the association rule R_P is called c-representative if


1. confidence(R_P) ≥ c;
2. for any proper subset P' ⊂ P we have confidence(R_{P'}) < c.

From Eqn. (60) one can see that instead of searching for all association rules, it is enough to find all c-representative rules. Moreover, every c-representative association rule covers a family of association rules: the shorter the association rule R, the bigger the set of association rules covered by R. First of all, we show the following theorem:

Theorem 24. For a fixed real number c ∈ (0; 1] and a template T, the Optimal c-Association Rules Problem, i.e., searching for the shortest c-representative association rule from T in a given table A, is NP-hard.

Proof: Obviously, the Optimal c-Association Rules Problem belongs to NP. We show that the Minimal Vertex Cover Problem (which is NP-hard, see e.g. [35]) can be transformed to the Optimal c-Association Rules Problem.

Let the graph G = (V, E) be an instance of the Minimal Vertex Cover Problem, where V = {v1, v2, . . . , vn} and E = {e1, e2, . . . , em}. We assume that every edge e_i is represented by a two-element set of vertices, i.e., e_i = {v_{i1}, v_{i2}}. We construct the corresponding information table (or transaction table) A(G) = (U, A) for the Optimal c-Association Rules Problem as follows:

1. The set U consists of m objects corresponding to the m edges of the graph G and k + 1 objects added for technical purposes, i.e.,

U = {x1, x2, . . . , xk} ∪ {x*} ∪ {u_{e1}, u_{e2}, . . . , u_{em}},

where k = ⌈c/(1 − c)⌉ is a constant derived from c.

2. The set A consists of n attributes corresponding to the n vertices of the graph G and one attribute a* added for technical purposes, i.e.,

A = {a_{v1}, a_{v2}, . . . , a_{vn}} ∪ {a*}.

The value of an attribute a ∈ A on an object u ∈ U is defined as follows:

(a) if u ∈ {x1, x2, . . . , xk} then a(u) = 1 for every a ∈ A;

(b) if u = x* then, for any j ∈ {1, . . . , n}, a_{vj}(x*) = 1 and a*(x*) = 0;

(c) if u ∈ {u_{e1}, u_{e2}, . . . , u_{em}} then, for any j ∈ {1, . . . , n},

a_{vj}(u_{ei}) = 0 if v_j ∈ e_i and 1 otherwise, and a*(u_{ei}) = 1.


Example. Let us consider the Optimal c-Association Rules Problem for c = 0.8. We illustrate the proof of Theorem 24 by the graph G = (V, E) with five vertices V = {v1, v2, v3, v4, v5} and six edges E = {e1, e2, e3, e4, e5, e6}. First we compute k = ⌈c/(1 − c)⌉ = 4. Hence, the information table A(G) consists of six attributes {a_{v1}, a_{v2}, a_{v3}, a_{v4}, a_{v5}, a*} and (4 + 1) + 6 = 11 objects {x1, x2, x3, x4, x*, u_{e1}, u_{e2}, u_{e3}, u_{e4}, u_{e5}, u_{e6}}.

[The drawing of the graph G with vertices v1, . . . , v5 and edges e1, . . . , e6 is not reproduced here.]

A(G)    a_v1  a_v2  a_v3  a_v4  a_v5  a*
x1       1     1     1     1     1    1
x2       1     1     1     1     1    1
x3       1     1     1     1     1    1
x4       1     1     1     1     1    1
x*       1     1     1     1     1    0
u_e1     0     0     1     1     1    1
u_e2     0     1     1     0     1    1
u_e3     1     0     1     1     0    1
u_e4     1     0     1     0     1    1
u_e5     0     1     0     1     1    1
u_e6     1     1     0     1     0    1

Fig. 34. The construction of the information table A(G) from the graph G = (V, E) with five vertices and six edges, for c = 0.8

The illustration of our construction is presented in Fig. 34. We will show that a set of vertices W ⊆ V is a minimal covering set for the graph G if and only if the set of descriptors

P_W = {(a_{vj} = 1) : v_j ∈ W}

defined by W encodes the shortest c-representative association rule for A(G) obtained from the template

T = (a_{v1} = 1) ∧ . . . ∧ (a_{vn} = 1) ∧ (a* = 1).

The first implication is obvious. We show that the converse implication also holds. The only objects satisfying T are x1, . . . , xk, hence we have support(T) = k.

Let P ⇒ Q be an optimal c-confidence association rule derived from T. Then we have support(T)/support(P) ≥ c, hence

support(P) ≤ (1/c) · support(T) = (1/c) · k = (1/c) · ⌈c/(1 − c)⌉ ≤ 1/(1 − c) = c/(1 − c) + 1.

Because support(P) is an integer number, we have

support(P) ≤ ⌈c/(1 − c) + 1⌉ = ⌈c/(1 − c)⌉ + 1 = k + 1.


Thus, there is at most one object from the set {x*} ∪ {u_{e1}, u_{e2}, . . . , u_{em}} satisfying the template P. We consider two cases:

1. The object x* satisfies P: then the template P cannot contain the descriptor (a* = 1), i.e.,

P = (a_{vi1} = 1) ∧ . . . ∧ (a_{vit} = 1),

and there is no object from {u_{e1}, u_{e2}, . . . , u_{em}} which satisfies P, i.e., for any edge e_j ∈ E there exists a vertex v_i ∈ {v_{i1}, . . . , v_{it}} such that a_{vi}(u_{ej}) = 0 (which means that v_i ∈ e_j). Hence, the set of vertices W = {v_{i1}, . . . , v_{it}} ⊆ V is a solution of the Minimal Vertex Cover Problem.

2. An object u_{ej} satisfies P: then P contains the descriptor (a* = 1); thus

P = (a_{vi1} = 1) ∧ . . . ∧ (a_{vit} = 1) ∧ (a* = 1).

Let us assume that e_j = {v_{j1}, v_{j2}}. We consider two templates P1, P2 obtained from P by replacing the last descriptor by (a_{vj1} = 1) and (a_{vj2} = 1), respectively, i.e.,

P1 = (a_{vi1} = 1) ∧ . . . ∧ (a_{vit} = 1) ∧ (a_{vj1} = 1),
P2 = (a_{vi1} = 1) ∧ . . . ∧ (a_{vit} = 1) ∧ (a_{vj2} = 1).

One can prove that both templates are supported by exactly k + 1 objects: x1, x2, . . . , xk and x*. Hence, similarly to the previous case, the two sets of vertices W1 = {v_{i1}, . . . , v_{it}, v_{j1}} and W2 = {v_{i1}, . . . , v_{it}, v_{j2}} establish solutions of the Minimal Vertex Cover Problem.

We showed that any instance I of the Minimal Vertex Cover Problem can be transformed in polynomial time to a corresponding instance I' of the Optimal c-Association Rules Problem, and that any solution of I can be obtained from the solutions of I'. Our reasoning shows that the Optimal c-Association Rules Problem is NP-hard.

Since the problem of searching for the shortest representative association rule is NP-hard, the problem of searching for all association rules must be at least as hard, because this is a more complex problem: having all association rules, one can easily find the shortest representative one. Hence, we have the following:

Theorem 25. The problem of searching for all (representative) association rules from a given template is at least NP-hard.

The NP-hardness of the presented problems forces us to develop efficient approximate algorithms for solving them. In the next section we show that they can be developed using rough set methods.


    9.3 Searching for Optimal Association Rules by Rough Set Methods

To solve the presented problem, we show that the problem of searching for optimal association rules from a given template is equivalent to the problem of searching for local α-reducts for a decision table, which is a well-known problem in rough set theory. We propose the Boolean reasoning approach for association rule generation.

Association rule problem (A, T)           ——→  New decision table A|T
Association rules R_{P1}, . . . , R_{Pt}  ←——  α-reducts P1, . . . , Pt of A|T

Fig. 35. The Boolean reasoning scheme for association rule generation

We construct a new decision table A|T = (U, A|T ∪ {d}) from the original information table A and the template T as follows:

– A|T = {a_{D1}, a_{D2}, . . . , a_{Dm}} is a set of attributes corresponding to the descriptors of the template T, where

a_{Di}(u) = 1 if the object u satisfies Di, and 0 otherwise;   (61)

– the decision attribute d determines whether a given object satisfies the template T, i.e.,

d(u) = 1 if the object u satisfies T, and 0 otherwise.   (62)
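Equations (61) and (62) translate directly into code; a small sketch (the function name is ours, rows are dictionaries):

```python
def build_A_T(table, template):
    """Build A|T: one binary attribute a_Di per descriptor of T (eq. 61),
    plus the decision d = 1 iff the object satisfies the whole template
    (eq. 62)."""
    rows = []
    for obj in table:
        bits = [1 if obj[a] == v else 0 for a, v in template]
        rows.append(bits + [1 if all(bits) else 0])
    return rows

template = [("a1", 0), ("a3", 2)]
objs = [{"a1": 0, "a3": 2}, {"a1": 0, "a3": 1}]
print(build_A_T(objs, template))   # [[1, 1, 1], [1, 0, 0]]
```

Applied to Table 18 with the five-descriptor template of Sect. 9.3.1, this yields exactly Table 19.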

The following theorems describe the relationship between the association rules problem and the reduct searching problem.

Theorem 26. For a given information table A = (U, A) and a template T, a set of descriptors P is a reduct in A|T if and only if the rule

⋀_{Di ∈ P} Di ⇒ ⋀_{Dj ∉ P} Dj

is a 100%-representative association rule from T.

Proof: Any set of descriptors P is a reduct in the decision table A|T if and only if every object u with decision 0 is discerned from the objects with decision 1 by one


of the descriptors from P (i.e., there is at least one 0 in the information vector inf_P(u)). Thus u does not satisfy the template ⋀_{Di ∈ P} Di. Hence

support( ⋀_{Di ∈ P} Di ) = support(T).

The last equality means that

⋀_{Di ∈ P} Di ⇒ ⋀_{Dj ∉ P} Dj

is a 100%-confidence association rule for the table A.

    Analogously, one can show the following fact:

Theorem 27. For a given information table A = (U, A), a template T, and a set of descriptors P ⊆ DESC(T), the rule

⋀_{Di ∈ P} Di ⇒ ⋀_{Dj ∉ P} Dj

is a c-representative association rule obtained from T if and only if P is an α-reduct of A|T, where

α = 1 − (1/c − 1) / (n/s − 1),

n is the total number of objects from U, and s = support(T). In particular, the problem of searching for optimal association rules can be solved using methods for α-reduct finding.
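The translation from the confidence threshold c to the reduct threshold α is a one-liner; with the numbers of the example in Sect. 9.3.1 (n = 18, s = 10, c = 0.9) it gives α ≈ 0.86 (function name is ours):

```python
def alpha_for(c, n, s):
    """alpha = 1 - (1/c - 1) / (n/s - 1)  (Theorem 27),
    where n = |U| and s = support(T)."""
    return 1 - (1 / c - 1) / (n / s - 1)

print(round(alpha_for(0.9, 18, 10), 2))   # 0.86
print(alpha_for(1.0, 18, 10))             # 1.0 -- c = 100% demands a full reduct
```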

Proof: Assume that support( ⋀_{Di ∈ P} Di ) = s + e, where s = support(T). Then we have

confidence( ⋀_{Di ∈ P} Di ⇒ ⋀_{Dj ∉ P} Dj ) = s/(s + e) ≥ c.

This condition is equivalent to

e ≤ (1/c − 1) · s.

Hence, one can evaluate the discernibility degree of P by

disc_degree(P) = e/(n − s) ≤ (1/c − 1) · s / (n − s) = (1/c − 1) / (n/s − 1) = 1 − α.

Thus

α = 1 − (1/c − 1) / (n/s − 1).

Searching for minimal α-reducts is a well-known problem in rough set theory. One can show that the problem of searching for the shortest α-reduct is NP-hard [96] and the problem of searching for all α-reducts is at least NP-hard. However, there exist many approximate algorithms for solving the following problems:


1. Searching for the shortest reduct (see [143]);
2. Searching for a number of short reducts (see, e.g., [158]);
3. Searching for all reducts (see, e.g., [7]).

The algorithms for the first two problems are quite efficient from the computational complexity point of view. Moreover, in practical applications, the reducts generated by them are quite close to the optimal ones.

In Sect. 9.3.1, we present some heuristics for these problems in terms of association rule generation.

    9.3.1 Example

The following example illustrates the main idea of our method. Let us consider the information table A (Table 18) with 18 objects and 9 attributes.

Assume that the template

T = (a1 = 0) ∧ (a3 = 2) ∧ (a4 = 1) ∧ (a6 = 0) ∧ (a8 = 1)

has been extracted from the information table A. One can see that support(T) = 10 and length(T) = 5. The new decision table A|T is presented in Table 19.

The discernibility function for the decision table A|T is as follows:

f(D1, D2, D3, D4, D5) = (D2 ∨ D4 ∨ D5) ∧ (D1 ∨ D3 ∨ D4) ∧ (D2 ∨ D3 ∨ D4)
                       ∧ (D1 ∨ D2 ∨ D3 ∨ D4) ∧ (D1 ∨ D3 ∨ D5)
                       ∧ (D2 ∨ D3 ∨ D5) ∧ (D3 ∨ D4 ∨ D5) ∧ (D1 ∨ D5)

Table 18. The example of information table A and template T with support 10

A     a1 a2 a3 a4 a5 a6 a7 a8 a9
u1    0  *  1  1  *  2  *  2  *
u2    0  *  2  1  *  0  *  1  *
u3    0  *  2  1  *  0  *  1  *
u4    0  *  2  1  *  0  *  1  *
u5    1  *  2  2  *  1  *  1  *
u6    0  *  1  2  *  1  *  1  *
u7    1  *  1  2  *  1  *  1  *
u8    0  *  2  1  *  0  *  1  *
u9    0  *  2  1  *  0  *  1  *
u10   0  *  2  1  *  0  *  1  *
u11   1  *  2  2  *  0  *  2  *
u12   0  *  3  2  *  0  *  2  *
u13   0  *  2  1  *  0  *  1  *
u14   0  *  2  2  *  2  *  2  *
u15   0  *  2  1  *  0  *  1  *
u16   0  *  2  1  *  0  *  1  *
u17   0  *  2  1  *  0  *  1  *
u18   1  *  2  1  *  0  *  2  *

T     0  *  2  1  *  0  *  1  *


Table 19. The new decision table A|T constructed from A and template T

A|T   D1     D2     D3     D4     D5     d
      a1=0   a3=2   a4=1   a6=0   a8=1
u1    1      0      1      0      0      0
u2    1      1      1      1      1      1
u3    1      1      1      1      1      1
u4    1      1      1      1      1      1
u5    0      1      0      0      1      0
u6    1      0      0      0      1      0
u7    0      0      0      0      1      0
u8    1      1      1      1      1      1
u9    1      1      1      1      1      1
u10   1      1      1      1      1      1
u11   0      1      0      1      0      0
u12   1      0      0      1      0      0
u13   1      1      1      1      1      1
u14   1      1      0      0      0      0
u15   1      1      1      1      1      1
u16   1      1      1      1      1      1
u17   1      1      1      1      1      1
u18   0      1      1      1      0      0

After the condition presented in Table 20 is simplified, we obtain six reducts for the decision table A|T:

f(D1, D2, D3, D4, D5) = (D3 ∧ D5) ∨ (D4 ∧ D5) ∨ (D1 ∧ D2 ∧ D3)
                       ∨ (D1 ∧ D2 ∧ D4) ∨ (D1 ∧ D2 ∧ D5) ∨ (D1 ∧ D3 ∧ D4)

Thus, we have found from the template T six association rules with (100%)-confidence (see Table 20).
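The six reducts are exactly the minimal "hitting sets" of the eight clauses of the discernibility function; for a table this small they can be found by brute force (a sketch; names are ours, descriptors are referred to by index):

```python
from itertools import combinations

# One clause per row of the simplified discernibility matrix M(A|T),
# given as sets of descriptor indices (1 = D1, ..., 5 = D5).
CLAUSES = [{2, 4, 5}, {1, 3, 4}, {2, 3, 4}, {1, 2, 3, 4},
           {1, 3, 5}, {2, 3, 5}, {3, 4, 5}, {1, 5}]

def reducts(clauses, n_vars):
    """Prime implicants of the monotone Boolean function f: minimal sets of
    descriptor indices that intersect every clause."""
    hitting = [set(s)
               for r in range(1, n_vars + 1)
               for s in combinations(range(1, n_vars + 1), r)
               if all(c & set(s) for c in clauses)]
    return [s for s in hitting if not any(h < s for h in hitting)]

# Six reducts: {3,5}, {4,5}, {1,2,3}, {1,2,4}, {1,2,5}, {1,3,4}
print(reducts(CLAUSES, 5))
```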

For c = 90%, we would like to find α-reducts for the decision table A|T, where

α = 1 − (1/c − 1)/(n/s − 1) = 0.86.

Hence, we would like to search for sets of descriptors that cover at least

⌈(n − s) · α⌉ = ⌈8 · 0.86⌉ = 7

entries of the discernibility matrix M(A|T). One can see that the following sets of descriptors:

{D1, D2}, {D1, D3}, {D1, D4}, {D1, D5}, {D2, D3}, {D2, D5}, {D3, D4}

have non-empty intersection with exactly 7 members of the discernibility matrix M(A|T). Table 20 presents all association rules obtained from those sets.
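One can verify the claimed coverage directly against the rows of Table 20 (a sketch; M lists the descriptor indices of each matrix entry):

```python
# Entries of M(A|T) by descriptor index, one per row of Table 20.
M = [{2, 4, 5}, {1, 3, 4}, {2, 3, 4}, {1, 2, 3, 4},
     {1, 3, 5}, {2, 3, 5}, {3, 4, 5}, {1, 5}]

def coverage(B):
    """Number of matrix entries having non-empty intersection with B."""
    return sum(1 for m in M if m & set(B))

pairs = [{1, 2}, {1, 3}, {1, 4}, {1, 5}, {2, 3}, {2, 5}, {3, 4}]
print([coverage(p) for p in pairs])   # [7, 7, 7, 7, 7, 7, 7]
print(coverage({2, 4}))               # 6 -- misses {1,3,5} and {1,5}
```

Each of the seven listed pairs misses exactly one of the eight entries, so it meets the threshold ⌈8 · 0.86⌉ = 7, while e.g. {D2, D4} does not.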


Table 20. The simplified version of the discernibility matrix M(A|T); representative association rules with (100%)-confidence and representative association rules with at least (90%)-confidence

M(A|T)   u2, u3, u4, u8, u9, u10, u13, u15, u16, u17
u1       D2 ∨ D4 ∨ D5
u5       D1 ∨ D3 ∨ D4
u6       D2 ∨ D3 ∨ D4
u7       D1 ∨ D2 ∨ D3 ∨ D4
u11      D1 ∨ D3 ∨ D5
u12      D2 ∨ D3 ∨ D5
u14      D3 ∨ D4 ∨ D5
u18      D1 ∨ D5

100%-representative rules:
D3 ∧ D5 ⇒ D1 ∧ D2 ∧ D4
D4 ∧ D5 ⇒ D1 ∧ D2 ∧ D3
D1 ∧ D2 ∧ D3 ⇒ D4 ∧ D5
D1 ∧ D2 ∧ D4 ⇒ D3 ∧ D5
D1 ∧ D2 ∧ D5 ⇒ D3 ∧ D4
D1 ∧ D3 ∧ D4 ⇒ D2 ∧ D5

90%-representative rules:
D1 ∧ D2 ⇒ D3 ∧ D4 ∧ D5
D1 ∧ D3 ⇒ D2 ∧ D4 ∧ D5
D1 ∧ D4 ⇒ D2 ∧ D3 ∧ D5
D1 ∧ D5 ⇒ D2 ∧ D3 ∧ D4
D2 ∧ D3 ⇒ D1 ∧ D4 ∧ D5
D2 ∧ D5 ⇒ D1 ∧ D3 ∧ D4
D3 ∧ D4 ⇒ D1 ∧ D2 ∧ D5

In Fig. 36, we present the set of all 100%-association rules (light gray region) and 90%-association rules (dark gray region). The corresponding representative association rules are shown in bold frames.

    9.3.2 The Approximate Algorithms

From the previous example it follows that the problem of searching for representative association rules can be treated as a search problem in the lattice of attribute subsets (see Fig. 36). In general, there are two searching strategies: bottom-up and top-down. The top-down strategy starts with the whole descriptor set and tries to go down through the lattice. In every step, we reduce the most superfluous subsets, keeping the subsets which most probably can be reduced in the next step. Almost all existing methods realize this strategy (e.g., the Apriori algorithm [2]). The advantages of these methods are as follows:

1. They generate all association rules during the searching process.
2. They are easy to implement for parallel or concurrent computation.

However, this process can take very long computation time because of the NP-hardness of the problem (see Theorem 25).

The rough set based method realizes the bottom-up strategy. We start with the empty set of descriptors. Here we describe a modified version of the greedy heuristics for the decision table A|T. In practice, we do not construct this additional decision table. The main problem is to compute the number of occurrences of descriptors in the discernibility matrix M(A|T). For any descriptor D, this


Algorithm 8. Searching for the shortest representative association rule

Input: an information table A, a template T, a minimal confidence c.
Output: a short c-representative association rule.
begin
1.  P := ∅; U_P := U;
2.  min_support := (1/c) · support(T);
3.  select the descriptor D from DESC(T) \ P which is satisfied by the
    smallest number of objects from U_P;
4.  P := P ∪ {D};
5.  U_P := satisfy(P);  // i.e., the set of objects satisfying all descriptors from P
6.  if |U_P| > min_support then GOTO Step 3 else STOP;
end

[The drawing in Fig. 36 shows the lattice of subsets of {D1, . . . , D5} (singletons, pairs, triples, and so on), with three regions marked: association rules with confidence = 100%, with confidence > 90%, and with confidence < 90%.]

Fig. 36. The illustration of 100% and 90% representative association rules


number is equal to the number of 0's occurring in the column a_D represented by this descriptor, and it can be computed using simple SQL queries of the form

SELECT COUNT ... WHERE ...

We present two algorithms: the first (Algorithm 8) finds an almost-shortest c-representative association rule. The presented algorithm does not guarantee that the descriptor set P is c-representative, but one can achieve this by removing from P (which is in general small) all unnecessary descriptors.

The second algorithm (Algorithm 9) finds k short c-representative association rules, where k and c are parameters given by the user. This algorithm makes use of the beam search strategy, which expands the k most promising nodes at each depth of the search tree.

Algorithm 9. Searching for k short representative association rules

Input: an information table A, a template T, a minimal confidence c, a number of representative rules k ∈ N.
Output: k short c-representative association rules R_{P1}, . . . , R_{Pk}.
begin
1.  for i := 1 to k do P_i := ∅; U_{Pi} := U;
2.  min_support := (1/c) · support(T);
3.  Result_set := ∅; Working_set := {P_1, . . . , P_k};
4.  Candidate_set := ∅;
5.  for each P_i ∈ Working_set do
      select k descriptors D^i_1, . . . , D^i_k from DESC(T) \ P_i which are
      satisfied by the smallest numbers of objects from U_{Pi};
      insert P_i ∪ {D^i_1}, . . . , P_i ∪ {D^i_k} into Candidate_set;
6.  select k descriptor sets P'_1, . . . , P'_k from Candidate_set (if they
    exist) which are satisfied by the smallest numbers of objects from U;
7.  Working_set := {P'_1, . . . , P'_k};
8.  for each P_i ∈ Working_set do
      U_{Pi} := satisfy(P_i);
      if |U_{Pi}| < min_support then move P_i from Working_set to Result_set;
9.  if |Result_set| ≥ k or Working_set is empty then STOP else GOTO Step 4;
end


[The drawing in Fig. 37 shows one level of the beam search: each set P_i of the old working set P_1, . . . , P_k is extended by its k best descriptors D^i_1, . . . , D^i_k, the results form the candidate set, and the new working set P'_1, . . . , P'_k is selected from it.]

Fig. 37. The illustration of the k short representative association rules algorithm

10 Rough Set and Boolean Reasoning Approach to Mining Large Data Sets

Mining large data sets is one of the biggest challenges in KDD. In many practical applications, there is a need for data mining algorithms that run on terminals of a client-server database system, where the only access to the database (located on the server) is through SQL queries.

Unfortunately, the data mining methods based on rough sets and Boolean reasoning proposed so far are characterized by high computational complexity, and their straightforward implementations are not applicable to large data sets. The critical factor for the time complexity of algorithms solving the discussed problems is the number of simple SQL queries like

SELECT COUNT FROM aTable WHERE aCondition

In this section, we present some efficient modifications of these methods to overcome this problem. We consider the following issues:

– Searching for short reducts from large data sets;
– Induction of rule-based rough classifiers from large data sets;
– Searching for best partitions defined by cuts on continuous attributes;
– Soft cuts: a new paradigm for the discretization problem.

    10.1 Searching for Reducts

The application of the ABR approach to the reduct problem was described in Sect. 5.

We have shown (see Algorithm 2 on page 389) that the greedy heuristic for the minimal reduct problem uses only two functions: