Mining Optimized Gain Rules for Numeric Attributes



. Gain of R is maximized and confidence of R is at least the user-specified minimum confidence (referred to as the optimized gain rule).

Optimized association rules are useful for unraveling ranges for numeric attributes where certain trends or correlations are strong (that is, have high support, confidence, or gain). For example, suppose the telecom service provider mentioned earlier was interested in offering a promotion to NY customers who make calls to France. In this case, the timing of the promotion may be critical for its success; it would be advantageous to offer it close to a period of consecutive days in which the percentage of calls from NY that are directed to France is maximum. The framework developed in [6] can be used to determine such periods. Consider, for example, the association rule

date ∈ [l1, u1] ∧ src_city = NY → dst_country = France.

With a minimum confidence of 0.5, the optimized gain rule results in the period in which the calls from NY to France exceed 50 percent of the total calls from NY and, furthermore, the number of these excess calls is maximum.

A limitation of the optimized association rules dealt with in [6] is that only a single optimal interval for a single numeric attribute can be determined. However, in a number of applications, a single interval may be an inadequate description of local trends in the underlying data. For example, suppose the telecom service provider is interested in doing up to k promotions for customers in NY calling France. For this purpose, we need a mechanism to identify up to k periods during which a sizeable fraction of calls are from NY to France. If association rules were permitted to contain disjunctions of uninstantiated conditions, then we could determine the optimal k (or fewer) periods by finding optimal instantiations for the rule:

date ∈ [l1, u1] ∨ ··· ∨ date ∈ [lk, uk] ∧ src_city = NY → dst_country = France.

This information can be used by the telecom service provider to determine the most suitable periods for offering discounts on international long distance calls to France. The above framework can be further strengthened by enriching association rules to contain more than one uninstantiated attribute, as is done in [7]. Thus, optimal instantiations for the rule

date ∈ [l1, u1] ∧ duration ∈ [l1′, u1′] ∨ ··· ∨ date ∈ [lk, uk] ∧ duration ∈ [lk′, uk′] → dst_country = France

would yield valuable information about types of calls (in terms of their duration) and periods in which a substantial portion of the call volume is directed to France.

    1.2 Our Contributions

In this paper, we consider the generalized optimized gain problem. Unlike [6] and [7], we permit rules to contain up to k disjunctions over one or two uninstantiated numeric attributes. Thus, unlike [6] and [7], which compute only a single optimal region, our generalized rules enable up to k optimal regions to be computed.

Furthermore, unlike [15], in which we only addressed the optimized support problem, in this paper, we focus on the optimized gain problem and consider both the one and two attribute cases. In addition, for rules containing a single numeric attribute, we develop an algorithm for computing the optimized gain rule whose complexity is O(nk), where n is the number of values in the domain of the uninstantiated attribute (the dynamic programming algorithm for optimized support that we presented in [15] had complexity O(n²k)). We also propose a bucketing optimization that can result in significant reductions in input size by coalescing contiguous values. For two numeric attributes, we present a dynamic programming algorithm that computes approximate association rules. Using recent results on binary space partitioning trees, we show that, for the optimized gain case, the approximations are within a constant factor (of 1/4)

of the optimal solution. Our experimental results with synthetic data sets for a single numeric attribute demonstrate that our algorithms scale up linearly with the attribute's domain size as well as the number of disjunctions. In addition, we show that applying our optimized rule framework to a population survey real-life data set enables us to discover interesting underlying correlations among the attributes.

The remainder of the paper is organized as follows: In Section 2, we discuss related work and, in Section 3, we introduce the necessary definitions and problem formulation for the optimized gain problem. We present our linear time complexity algorithm for computing the optimized gain rule for a single numeric attribute in Section 4. In Section 5, we develop a dynamic programming algorithm for two numeric attributes and show that the computed gain is within a constant factor of the optimal. We present the results of our experiments with synthetic and real-life data sets in Section 6. Finally, we offer concluding remarks in Section 7.

    2 RELATED WORK

In [15], we generalized the optimized association rules problem for support, described in [6]. We allowed association rules to contain up to k disjunctions over one uninstantiated numeric attribute. For one attribute, we presented a dynamic programming algorithm for computing the optimized support rule, whose complexity is O(n²k), where n is the number of values in the domain of the uninstantiated attribute. In [14], we considered a

different formulation of the optimized support problem, which we showed to be NP-hard even for the case of one uninstantiated attribute. The optimized support problem described in [14] required the confidence over all the optimal regions, considered together, to be greater than a certain minimum threshold. Thus, the confidence of an individual optimal region could fall below the threshold, and this was the reason for its intractability. In [15], we redefined the optimized support problem such that each optimal region is required to have the minimum confidence. This made the problem tractable for the one attribute case.

    Schemes for clustering quantitative association rules withtwo uninstantiated numeric attributes in the left-hand side

    are presented in [11]. For a given support and confidence, the

    BRIN ET AL.: MINING OPTIMIZED GAIN RULES FOR NUMERIC ATTRIBUTES 325


authors present a clustering algorithm to generate a set of nonoverlapping rectangles such that every point in each rectangle has the required confidence and support. Our schemes, on the other hand, compute an optimal set of nonoverlapping rectangles with the maximum gain. Further, in our approach, we only require that each rectangle in the optimal set have minimum confidence; however, individual points in a rectangle may not have the required confidence.

Recent work on histogram construction, presented in [9],

is somewhat related to our optimized rule computation problem. In [9], the authors propose a dynamic programming algorithm to compute V-optimal histograms for a single numeric attribute. The problem is to split the attribute domain into k buckets such that the sum squared error over the k buckets is minimum. Our algorithm for computing the optimized gain rule (for one attribute) differs from the histogram construction algorithm of [9] in a number of respects. First, our algorithm attempts to maximize the gain, which is very different from minimizing the sum squared error. Second, histogram construction typically involves identifying bucket boundaries, while our optimized gain problem requires us to compute optimal regions (that may not share a common boundary). Finally, our algorithm has a linear time dependency on the size of the attribute domain; in contrast, the histogram construction algorithm of [9] has a time complexity that is quadratic in the number of distinct values of the attribute under consideration.

In [4], the authors propose a general framework for optimized rule mining, which can be used to express our optimized gain problem as a special case. However, the generality precludes the development of efficient algorithms for computing optimized rules. Specifically, the authors use a variant of Dense-Miner from [5], which essentially relies on enumerating optimized rules in order to explore the search space. Since there are an exponential number of optimized rules, the authors propose pruning strategies to reduce the search space and, thus, improve the efficiency of the search. However, in the worst case, the time complexity of the algorithm from [5] is still exponential in the size of attribute domains. In contrast, our algorithms for computing the optimized gain rule have polynomial time complexity (linear complexity for one numeric attribute) since they exploit the specific properties of gain and one and two-dimensional spaces.

    3 PROBLEM FORMULATION

In this section, we define the optimized association rules problem addressed in the paper. The data is assumed to be stored in a relation defined over categorical and numeric attributes. Association rules are built from atomic conditions, each of which has the form Ai = vi (Ai could be either categorical or numeric) or Ai ∈ [li, ui] (only if Ai is numeric). For the atomic condition Ai ∈ [li, ui], if li and ui are values from the domain of Ai, the condition is referred to as instantiated; otherwise, if they are variables, we refer to the condition as uninstantiated.

Atomic conditions can be combined using the operators ∧ or ∨ to yield more complex conditions. Instantiated association rules, which we study in this paper, have the form C1 → C2, where C1 and C2 are arbitrary instantiated conditions. Let the support for an instantiated condition C, denoted by sup(C), be the ratio of the number of tuples satisfying the condition C to the total number of tuples in the relation. Then, for the association rule R: C1 → C2, sup(R) is defined as sup(C1) and conf(R) is defined as sup(C1 ∧ C2) / sup(C1). Note that our definition of sup(R) is different from the definition in [2], where sup(R) was defined to be sup(C1 ∧ C2). Instead, we have adopted the definition of support used in [6], [7], [14], [15]. Also, let minConf denote the user-specified minimum confidence. Then, gain(R) is defined to be the difference between sup(C1 ∧ C2) and minConf times sup(C1). In other words, gain(R) is

    sup(C1 ∧ C2) - minConf · sup(C1) = sup(R) · (conf(R) - minConf).

The optimized association rule problem requires optimal instantiations to be computed for an uninstantiated association rule that has the form U ∧ C1 → C2, where U is a conjunction of one or two uninstantiated atomic conditions over distinct numeric attributes and C1 and C2 are arbitrary instantiated conditions. For simplicity, we assume that the domain of an uninstantiated numeric attribute is {1, 2, ..., n}. Depending on the number, one or two, of uninstantiated numeric attributes, consider a one or two-dimensional space with an axis for each uninstantiated attribute and values along each axis corresponding to increasing values from the domain of the attributes. Note that if we consider a single interval in the domain of each uninstantiated attribute, then their combination results in a region. For the one-dimensional case, this region is simply the interval [l1, u1] for the attribute; for the two-dimensional case, the region ⟨(l1, l2), (u1, u2)⟩ is the rectangle bounded along each axis by the endpoints of the intervals [l1, u1] and [l2, u2] along the two axes.
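To make these definitions concrete, the following sketch computes sup(R), conf(R), and gain(R) for a rule C1 → C2 over a toy relation and checks that the two forms of gain(R) above agree. The relation, attribute names, and minConf value are hypothetical, chosen to mirror the telecom example.

```python
# Sketch of the sup/conf/gain definitions; the relation and minConf
# value are hypothetical illustrations, not data from the paper.
MIN_CONF = 0.5

# A relation over two categorical attributes (src_city, dst_country).
tuples = (
    [{"src_city": "NY", "dst_country": "France"}] * 3
    + [{"src_city": "NY", "dst_country": "Spain"}] * 1
    + [{"src_city": "LA", "dst_country": "France"}] * 6
)

def sup(cond):
    """sup(C): fraction of tuples satisfying condition C."""
    return sum(1 for t in tuples if cond(t)) / len(tuples)

c1 = lambda t: t["src_city"] == "NY"                           # C1
c1_and_c2 = lambda t: c1(t) and t["dst_country"] == "France"   # C1 ∧ C2

sup_r = sup(c1)                               # sup(R) = sup(C1)
conf_r = sup(c1_and_c2) / sup(c1)             # conf(R) = sup(C1 ∧ C2) / sup(C1)
gain_r = sup(c1_and_c2) - MIN_CONF * sup(c1)  # gain(R)

# The two forms of gain(R) in the text agree:
assert abs(gain_r - sup_r * (conf_r - MIN_CONF)) < 1e-12
```

Here 4 of 10 tuples satisfy C1 and 3 of those also satisfy C2, so sup(R) = 0.4, conf(R) = 0.75, and gain(R) = 0.3 - 0.5 × 0.4 = 0.1.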

Suppose, for a region R = [l1, u1], we define conf(R), sup(R), and gain(R) to be the conf, sup, and gain, respectively, for the rule

    A1 ∈ [l1, u1] ∧ C1 → C2

(similarly, for R = ⟨(l1, l2), (u1, u2)⟩, conf(R), sup(R), and gain(R) are defined to be the conf, sup, and gain for A1 ∈ [l1, u1] ∧ A2 ∈ [l2, u2] ∧ C1 → C2). In addition, for a set of nonoverlapping regions

    S = {R1, R2, ..., Rj}, Ri = [li1, ui1],

suppose we define conf(S), sup(S), and gain(S) to be the conf, sup, and gain, respectively, of the rule

    (A1 ∈ [l11, u11] ∨ ··· ∨ A1 ∈ [lj1, uj1]) ∧ C1 → C2.

For two dimensions, in which case each

    Ri = ⟨(li1, li2), (ui1, ui2)⟩,

conf(S), sup(S), and gain(S) are defined to be the conf, sup, and gain, respectively, of the rule

    326 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 15, NO. 2, MARCH/APRIL 2003


    (A1 ∈ [l11, u11] ∧ A2 ∈ [l12, u12]) ∨ ··· ∨ (A1 ∈ [lj1, uj1] ∧ A2 ∈ [lj2, uj2]) ∧ C1 → C2.

Then, since R1, ..., Rj are nonoverlapping regions, the following hold for set S:

    sup(S) = sup(R1) + ··· + sup(Rj)

    conf(S) = (sup(R1) · conf(R1) + ··· + sup(Rj) · conf(Rj)) / (sup(R1) + ··· + sup(Rj))

    gain(S) = gain(R1) + ··· + gain(Rj).

Having defined the above notation, we present below the formulation of the optimized association rule problem for gain.

Problem Definition (Optimized Gain). Given k, determine a set S containing at most k regions such that, for each region Ri ∈ S, conf(Ri) ≥ minConf and gain(S) is maximized.

We refer to the set S as the optimized gain set.

Example 3.1. Consider the telecom service provider database (discussed in Section 1) containing call detail data for a one week period. Fig. 1 presents the summary of the relation for the seven days; the summary information includes, for each date, the total # of calls made on the date, the # of calls from NY, and the # of calls from NY to France. Also included in the summary are the support, confidence, and gain, for each date v, of the rule

    date = v ∧ src_city = NY → dst_country = France.

The total number of calls made during the week is 2,000. Suppose we are interested in discovering the interesting periods with heavy call volume from NY to France (a period is a range of consecutive days). Then, the following uninstantiated association rule can be used:

    date ∈ [l, u] ∧ src_city = NY → dst_country = France.

In the above rule, U is date ∈ [l, u], C1 is src_city = NY and C2 is dst_country = France. Let us assume that we are interested in at most two periods (that is, k = 2) with minConf = 0.50. An optimized gain set is {[5, 5], [7, 7]}; we require up to two periods such that the percentage of calls during each of the periods from NY that are to France is at least 50 percent and the gain is maximized. Of the possible periods [1, 1], [5, 5], and [7, 7], the gain in period [5, 5] is 12.5 × 10⁻³ and both [1, 1] and [7, 7] have gains of 2.5 × 10⁻³. Thus, both {[5, 5], [7, 7]} and {[1, 1], [5, 5]} are optimized gain sets.
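The example can be checked by brute force. Since Fig. 1's per-day counts are not reproduced in this transcript, the sketch below uses hypothetical per-day gains (only days 1, 5, and 7 match the values quoted in the example) and enumerates all sets of at most k disjoint periods. Note that a period satisfies conf ≥ minConf exactly when its gain is nonnegative, because gain(R) = sup(R) · (conf(R) - minConf).

```python
from itertools import combinations

# Per-day gains (in units of 10^-3) for the rule date = v ∧ src_city = NY
# → dst_country = France. Days 1, 5, and 7 match the gains quoted in
# Example 3.1; the negative values for the other days are hypothetical.
gain = {1: 2.5, 2: -5.0, 3: -5.0, 4: -5.0, 5: 12.5, 6: -5.0, 7: 2.5}

def interval_gain(l, u):
    return sum(gain[d] for d in range(l, u + 1))

# A period [l, u] is feasible iff its gain is >= 0, which is equivalent
# to its confidence being at least minConf.
periods = [(l, u) for l in gain for u in gain
           if l <= u and interval_gain(l, u) >= 0]

def disjoint(a, b):
    return a[1] < b[0] or b[1] < a[0]

# Enumerate all sets of at most k = 2 pairwise disjoint feasible periods
# and pick one with maximum total gain.
best = max(
    (s for r in (1, 2) for s in combinations(periods, r)
     if all(disjoint(x, y) for x, y in combinations(s, 2))),
    key=lambda s: sum(interval_gain(l, u) for l, u in s),
)
```

With these gains, the maximum total is 15 × 10⁻³, attained by {[5, 5], [7, 7]} and by {[1, 1], [5, 5]}, matching the two optimized gain sets in the example.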

In the remainder of the paper, we shall assume that the support, confidence, and gain for every point in a region are available; these can be computed by performing a single pass over the relation. The points, along with their supports, confidences, and gains, thus constitute the input to our algorithms. Thus, the input size is n for the one-dimensional case, while, for the two-dimensional case, it is n².

4 ONE NUMERIC ATTRIBUTE

In this section, we tackle the problem of computing the optimized gain set when association rules contain a single uninstantiated numeric attribute. Thus, the uninstantiated rule has the form A1 ∈ [l1, u1] ∧ C1 → C2, where A1 is the uninstantiated numeric attribute. We propose an algorithm with linear time complexity for computing the optimized gain set (containing up to k nonoverlapping intervals) in Section 4.2. But first, in Section 4.1, we present preprocessing algorithms for collapsing certain contiguous ranges of values in the domain of the attribute into a single bucket, thus reducing the size of the input n.

    4.1 Bucketing

For the one-dimensional case, each region is an interval and, since the domain size is n, the number of possible intervals is O(n²). Now, suppose we could split the range 1, 2, ..., n into b buckets, where b < n, and map every value in A1's domain to the bucket containing it. Then, the new domain of A1 becomes {1, 2, ..., b} and the number of intervals to be considered becomes O(b²), which could be much smaller, thus reducing the time and space complexity of our algorithms. Note that the reduction in space complexity also results in reduced memory requirements for our algorithms.

In the following, we present a bucketing algorithm that 1) does not compromise the optimality of the optimized set (that is, the optimized set computed on the buckets is identical to the one computed using the raw domain values) and 2) has time complexity O(n). The output of the algorithm is the b buckets with their supports, confidences, and gains, and this becomes the input to the algorithm for computing the optimized gain set in Section 4.2.

For optimized gain sets, we begin by making the following simple observation: values in A1's domain whose confidence is exactly minConf have a gain of 0 and can thus be ignored. Including these values in the optimized gain set does not affect the gain of the set and, so, we can assume that, for every value in {1, 2, ..., n}, either the confidence is greater than minConf or less than minConf.

The bucketing algorithm for optimized gain collapses contiguous values whose confidence is greater than minConf into a single bucket. It also combines contiguous values each of whose confidence is less than minConf into a


    Fig. 1. Summary of call detail data for a one week period.


single bucket. Thus, for any interval assigned to a bucket, it is the case that either all values in the interval have confidence greater than minConf or all values in the interval have confidence less than minConf.

For instance, let the domain of A1 be {1, 2, ..., 6} and the confidences of 1, 2, 5, and 6 be greater than minConf, while the confidences of 3 and 4 are less than minConf. This is illustrated in Fig. 2, with symbols indicating a positive or negative gain for each domain value. Then, our bucketing scheme generates three buckets: the first containing values 1 and 2, the second values 3 and 4, and the third containing values 5 and 6. It is straightforward to observe that assigning values to buckets can be achieved by performing a single pass over the input data and thus has linear time complexity.
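The single-pass bucketing just described can be sketched as follows; the gain magnitudes are made up, but their signs follow the Fig. 2 illustration (values 1, 2, 5, and 6 positive; 3 and 4 negative).

```python
def buckets(gains):
    """Collapse contiguous domain values whose gain has the same sign
    into one bucket; zero-gain values (conf == minConf) are ignored,
    as observed above. Returns (bucket_gain, [values]) pairs.
    A single pass over the n values, so O(n) time."""
    out = []
    for v, g in enumerate(gains, start=1):
        if g == 0:                      # conf exactly minConf: skip
            continue
        if out and (out[-1][0] > 0) == (g > 0):
            prev_g, prev_vals = out[-1]         # same sign: extend bucket
            out[-1] = (prev_g + g, prev_vals + [v])
        else:
            out.append((g, [v]))                # sign change: new bucket
    return out

# Hypothetical per-value gains with the sign pattern of Fig. 2.
print(buckets([4, 2, -3, -1, 5, 1]))
# [(6, [1, 2]), (-4, [3, 4]), (6, [5, 6])]
```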

In order to show that the above bucketing algorithm does not violate the optimality of the optimized set, we use the result of the following theorem.

Theorem 4.1. Let S be an optimized gain set. Then, for any interval [u, v] in S, it is the case that

    conf([u - 1, u - 1]) < minConf,
    conf([v + 1, v + 1]) < minConf,
    conf([u, u]) > minConf,

and conf([v, v]) > minConf.

Proof. Note that conf([u, v]) > minConf. As a result, if conf([u - 1, u - 1]) > minConf, then conf([u - 1, v]) > minConf and, since gain([u - 1, u - 1]) > 0,

    gain([u - 1, v]) > gain([u, v]).

Thus, the set S - {[u, v]} ∪ {[u - 1, v]} has higher gain and is the optimized gain set, thus leading to a contradiction. A similar argument can be used to show that conf([v + 1, v + 1]) < minConf.

On the other hand, if conf([u, u]) < minConf, then gain([u, u]) < 0 and, since conf([u, v]) > minConf, conf([u + 1, v]) > minConf. Also,

    gain([u + 1, v]) > gain([u, v])

and, thus, the set S - {[u, v]} ∪ {[u + 1, v]} has higher gain and is the optimized gain set, thus leading to a contradiction. A similar argument can be used to show that conf([v, v]) > minConf. □

From the above theorem, it follows that if [u, v] is an interval in the optimized set, then the values u and u - 1 cannot both have confidences greater than or less than minConf; the same holds for the values v and v + 1. Thus, for a set of contiguous values, if the confidence of each and every value is greater than (or is less than) minConf, then the optimized gain set either contains all of the values or none of them. Thus, an interval in the optimized set either contains all the values in a bucket or none of them; as a result, the optimized set can be computed using the buckets instead of the original values in the domain.

    4.2 Algorithm for Computing Optimized Gain Set

In this section, we present an O(bk) algorithm for the optimized gain problem for one dimension. The input to the algorithm is the b buckets generated by our bucketing scheme in Section 4.1, along with their confidences, supports, and gains. The problem is to determine a set of at most k (nonoverlapping) intervals such that the confidence of each interval is greater than or equal to minConf and the gain of the set is maximized.

    Note that, due to our bucketing algorithm, bucketsadjacent to a bucket with positive gain have negative gainand vice versa. Thus, if there are at most k buckets withpositive gain, then these buckets constitute the desired


    Fig. 2. Example of buckets generated.

    Fig. 3. Algorithm for computing optimized gain set.


optimized gain set. Otherwise, procedure optGain1D, shown in Fig. 3, is used to compute the optimized set. For an interval I, we denote by max(I) the subinterval of I with maximum gain. Also, we denote by min(I) the subinterval of I whose gain is minimum. Note that, for an interval I, min(I) and max(I) can be computed in time that is linear in the size of the interval. This is due to the following dynamic programming relationship for the gain of the subinterval of I with the maximum gain and ending at point u (denoted by max(u)):

    max(u) = max{gain([u, u]), max(u - 1) + gain([u, u])}.
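This recurrence is the classic linear-time maximum-sum-subinterval recursion; a short sketch (with hypothetical bucket gains) that also tracks the endpoints of the best subinterval:

```python
def max_subinterval(gains):
    """Best (gain, (l, u)) over subintervals of buckets 1..len(gains),
    using max(u) = max{gain([u,u]), max(u-1) + gain([u,u])}."""
    best = None
    cur, start = 0, 1
    for u, g in enumerate(gains, start=1):
        if cur <= 0:            # restarting at u beats extending
            cur, start = g, u
        else:
            cur += g
        if best is None or cur > best[0]:
            best = (cur, (start, u))
    return best

def min_subinterval(gains):
    """The analogous recurrence for the minimum-gain subinterval,
    obtained by negating the gains."""
    g, (l, u) = max_subinterval([-x for x in gains])
    return -g, (l, u)

print(max_subinterval([10, -15, 20, -15, 20, -15]))   # (25, (3, 5))
print(min_subinterval([10, -15, 20, -15, 20, -15]))
```

A single left-to-right pass suffices, which is why min(I) and max(I) are computable in time linear in the size of the interval.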

(A similar relationship can be derived for the subinterval with minimum gain.)

The k desired intervals are computed by optGain1D in k iterations; the ith iteration computes the i intervals with the maximum gain using the results of the (i - 1)th iteration. After the (i - 1)th iteration, PSet is the optimized gain set containing i - 1 intervals, while the remaining intervals not in PSet are stored in NSet. After Pq and Nq have been computed, as described in Steps 3-4, if

    gain(min(Pq)) + gain(max(Nq)) < 0,

then it follows that the gain of min(Pq) is more negative than the gain of max(Nq) is positive. Thus, the best strategy for maximizing gain is to split Pq into two subintervals, using min(Pq) as the splitting interval, and include the two subintervals in the optimized gain set (Steps 6-8). On the other hand, if gain(min(Pq)) + gain(max(Nq)) ≥ 0, then the gain can be maximized by adding max(Nq) to the optimized gain set (Steps 11-13). Note that if PSet/NSet is empty, then we cannot compute Pq/Nq and, so, gain(min(Pq))/gain(max(Nq)) in Step 5 is 0.

Example 4.2. Consider the six buckets 1, 2, ..., 6 with gains 10, -15, 20, -15, 20, and -15 shown in Fig. 4a. We trace the execution of optGain1D assuming that we are interested in computing the optimized gain set containing two intervals.

Initially, NSet is set to {[1, 6]} (see Fig. 4a). During the first iteration of optGain1D, Nq is [1, 6] since it is the only interval in NSet. Furthermore, max(Nq) = [3, 5] (the dark subinterval in Fig. 4a) and gain(max(Nq)) = 25. Since PSet is empty, gain(min(Pq)) = 0 and Nq is split into three intervals [1, 2], [3, 5], and [6, 6], of which [3, 5] is added to PSet and [1, 2] and [6, 6] are added to NSet (after deleting [1, 6] from it). The sets PSet and NSet at the end of the first iteration are depicted in Fig. 4b.

In the second iteration, Pq = [3, 5] (min(Pq) = [4, 4]) and Nq = [1, 2] (max(Nq) = [1, 1]) (since gain(max([1, 2])) = 10 is larger than gain(max([6, 6])) = -15). Thus, since

    gain(min(Pq)) + gain(max(Nq)) = -5,

[3, 5] is split into three intervals [3, 3], [4, 4], and [5, 5], of which [3, 3] and [5, 5] are added to PSet (after deleting [3, 5] from it), which is the desired optimized gain set. The dark subintervals in Fig. 4c denote the minimum and maximum gain subintervals of Pq and Nq, respectively, and the final intervals in PSet and NSet (after the second iteration) are depicted in Fig. 4d.
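The trace above can be reproduced with a compact sketch of optGain1D. Fig. 3 itself is not part of this transcript, so the step structure, the empty-set convention (a missing Pq or Nq contributes gain 0), and the early-exit guard below are assumptions based on the description in the text.

```python
def best_sub(gain, l, u, sign):
    """Max-gain (sign=+1) or min-gain (sign=-1) subinterval of [l, u],
    as (signed_gain, left, right), via the linear-time recurrence."""
    best = cur = None
    for x in range(l, u + 1):
        g = sign * gain[x - 1]
        cur = (g, x, x) if cur is None or cur[0] <= 0 else (cur[0] + g, cur[1], x)
        if best is None or cur[0] > best[0]:
            best = cur
    return (sign * best[0], best[1], best[2])

def opt_gain_1d(gain, k):
    """Greedy optGain1D sketch over buckets 1..len(gain)."""
    b = len(gain)
    pset, nset = [], [(1, b)]
    for _ in range(k):
        # Pq: interval in PSet with the most negative min-subinterval;
        # Nq: interval in NSet with the largest max-subinterval.
        pq = min(((best_sub(gain, l, u, -1), (l, u)) for l, u in pset),
                 default=((0, 0, 0), None))
        nq = max(((best_sub(gain, l, u, +1), (l, u)) for l, u in nset),
                 default=((0, 0, 0), None))
        (gmin, a, c), p = pq
        (gmax, x, y), n = nq
        if p is not None and gmin + gmax < 0:
            # Split Pq around its min-gain subinterval [a, c].
            pset.remove(p)
            pset += [iv for iv in [(p[0], a - 1), (c + 1, p[1])] if iv[0] <= iv[1]]
            nset.append((a, c))
        elif n is not None and gmax > 0:
            # Move max(Nq) = [x, y] into PSet; keep the leftovers in NSet.
            nset.remove(n)
            pset.append((x, y))
            nset += [iv for iv in [(n[0], x - 1), (y + 1, n[1])] if iv[0] <= iv[1]]
        else:
            break   # no gain-improving move left (an added robustness guard)
    return sorted(pset)

# The buckets of Example 4.2:
print(opt_gain_1d([10, -15, 20, -15, 20, -15], k=2))   # [(3, 3), (5, 5)]
```

Running it on the Example 4.2 gains reproduces the trace: after one iteration PSet is {[3, 5]}, and after the second it is {[3, 3], [5, 5]}.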

We can show that the above simple greedy strategy computes the i intervals with the maximum gain (in the ith iteration). We first show that, after the ith iteration, the intervals in PSet and NSet satisfy the following conditions (let the i intervals in PSet be P1, ..., Pi and the remaining intervals in NSet be N1, ..., Nj).

. Cond 1. Let [u, v] be an interval in PSet. For all u ≤ l ≤ v, gain([u, l]) ≥ 0 and gain([l, v]) ≥ 0.

. Cond 2. Let [u, v] be an interval in NSet. For all u ≤ l ≤ v, gain([u, l]) ≤ 0 (except when u = 1) and gain([l, v]) ≤ 0 (except when v = b).

. Cond 3. For all 1 ≤ l ≤ i, 1 ≤ m ≤ j, gain(Pl) ≥ gain(max(Nm)).

. Cond 4. For all 1 ≤ l ≤ i, 1 ≤ m ≤ j, gain(min(Pl)) ≥ gain(Nm) (except for Nm that contains one of the endpoints, 1 or b).


Fig. 4. Execution trace of procedure optGain1D. (a) Before first iteration, (b) after first iteration, (c) before second iteration, and (d) after second iteration.

  • 8/12/2019 Mining Optimized Gain Rules

    7/15

. Cond 5. For all 1 ≤ l, m ≤ i, l ≠ m, gain(min(Pl)) + gain(Pm) ≥ 0.

. Cond 6. For all 1 ≤ l, m ≤ j, l ≠ m, gain(max(Nl)) + gain(Nm) ≤ 0 (except for Nm that contain one of the endpoints, 1 or b).

For an interval [u, v] in PSet or NSet, Conditions 1 and 2 state properties about the gain of its subintervals that contain u or v. Simply put, they state that extending or shrinking the intervals in PSet does not cause its gain to increase. Condition 3 states that the gain of PSet cannot be increased by replacing an interval in PSet by one contained in NSet, while Conditions 4 and 5 state that splitting an interval in PSet and merging two other adjacent intervals in it, or deleting an interval from it, cannot increase its gain either. Finally, Condition 6 covers the case in which two adjacent intervals in PSet are merged and an additional interval from NSet is added to it; Condition 6 states that these actions cannot cause PSet's gain to increase.

Lemma 4.3. After the ith iteration of procedure optGain1D, the intervals in PSet and NSet satisfy Conditions 1-6.

Proof. See the Appendix. □

We can also show that any set of i intervals (in PSet) that satisfies all six of the above conditions is optimal with respect to gain.

Lemma 4.4. Any set of i intervals satisfying Conditions 1-6 is an optimized gain set.

Proof. See the Appendix. □

From the above two lemmas, we can conclude that, at the end of the ith iteration, procedure optGain1D computes the optimized gain set containing i intervals (in PSet).

Theorem 4.5. Procedure optGain1D computes the optimized gain set.

It is straightforward to observe that the time complexity of procedure optGain1D is O(bk) since it performs k iterations and, in each iteration, the intervals Pq and Nq can be computed in O(b) steps.

5 TWO NUMERIC ATTRIBUTES

We next consider the problem of mining the optimized gain set for the case when there are two uninstantiated numeric attributes. In this case, we need to compute a set of k nonoverlapping rectangles in two-dimensional space whose gain is maximum. Unfortunately, this problem is NP-hard [10]. In the following section, we describe a dynamic programming algorithm with polynomial time complexity that computes approximations to optimized sets.

5.1 Approximation Algorithm Using Dynamic Programming

The procedure optGain2D (see Fig. 5) for computing approximate optimized gain sets is a dynamic programming algorithm that uses simple end-to-end horizontal and vertical cuts for splitting each rectangle into two subrectangles. Procedure optGain2D accepts as input parameters the coordinates of the lower left (i, j) and upper right (p, q) points of the rectangle for which the optimized set is to be computed. These two points completely define the rectangle. The final parameter is the bound on the number of rectangles that the optimized set can contain. The array optSet[(i, j), (p, q), k] is used to store the optimized set with size at most k for the rectangle, thus preventing recomputations of the optimized set for the rectangle. The confidence, support, and gain for each rectangle is precomputed; this


Fig. 5. Dynamic programming algorithm for computing optimized gain set.


can be done in O(n⁴) steps, which is proportional to the total number of rectangles possible.

In optGain2D, the rectangle ⟨(i, j), (p, q)⟩ is first split into two subrectangles using vertical cuts (Steps 6-13), and later horizontal cuts are employed (Steps 14-21). For k > 1, vertical cuts between i and i + 1, i + 1 and i + 2, ..., p - 1 and p are used to divide rectangle ⟨(i, j), (p, q)⟩ into subrectangles ⟨(i, j), (l, q)⟩ and ⟨(l + 1, j), (p, q)⟩ for all i ≤ l ≤ p - 1. For every pair of subrectangles generated above, optimized sets of size k1 and k2 are computed by recursively invoking optGain2D for all k1, k2 such that k1 + k2 = k. An optimization can be employed in case k = 1 (Step 7): instead of considering every vertical cut, it suffices to consider only the vertical cuts at the ends, since the single optimized rectangle must be contained in either ⟨(i, j), (p - 1, q)⟩ or ⟨(i + 1, j), (p, q)⟩. After similarly generating pairs of subrectangles using horizontal cuts, the optimized set for the original rectangle is set to the union of the optimized sets for the pair with the maximum gain (function maxGainSet returns the set with the maximum gain from among its inputs).

Example 5.1. Consider the two-dimensional rectangle ⟨(0, 0), (2, 2)⟩ in Fig. 6 for two numeric attributes, each with domain {0, 1, 2}. We trace the execution of optGain2D for computing the optimized gain set containing two nonoverlapping rectangles.

Consider the first invocation of optGain2D. In the body of the procedure, the variable l is varied from 0 to 1 for both vertical and horizontal cuts (Steps 10 and 18). Further, the only value for variable m is 1 since k is 2 in the first invocation. Thus, in Steps 10-12 and 18-20, the rectangle ⟨(0, 0), (2, 2)⟩ is cut at two points in each of the vertical and horizontal directions and optGain2D is called recursively for the two subrectangles due to each cut. These subrectangle pairs for the vertical and horizontal cuts are illustrated in Figs. 7 and 8, respectively.

Continuing further with the execution of the first recursive call of optGain2D with rectangle ⟨(0, 0), (0, 2)⟩ and k = 1, observe that the boundary points of the rectangle along the horizontal axis are the same. As a result, since k = 1, in Step 16, optGain2D recursively invokes itself with two subrectangles, each of whose size is one unit smaller along the vertical axis. These subrectangles are depicted in Fig. 9. The process of recursively splitting rectangles is repeated for the newly generated subrectangles as well as rectangles due to previous cuts.
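The recursion can be sketched as a memoized guillotine-cut DP. Fig. 5 is not reproduced in this transcript, so this is a simplified reading of optGain2D rather than the exact procedure: cell gains are given directly for a hypothetical 3 × 3 grid, k1 is allowed to be 0 (which lets single rectangles be carved out), and the constraint conf(R) ≥ minConf is encoded as gain(R) ≥ 0, which is equivalent because gain(R) = sup(R) · (conf(R) - minConf).

```python
from functools import lru_cache

# Hypothetical 3x3 grid of cell gains (domains {0,1,2} x {0,1,2}).
GAIN = [[5, -1, -1],
        [-1, -1, -1],
        [-1, -1, 7]]

def rect_gain(i, j, p, q):
    return sum(GAIN[x][y] for x in range(i, p + 1) for y in range(j, q + 1))

@lru_cache(maxsize=None)
def opt_gain_2d(i, j, p, q, k):
    """Best (gain, rectangles) using at most k nonoverlapping
    rectangles inside <(i,j),(p,q)>, combining end-to-end cuts."""
    best = (0, ())                                    # empty set
    if k >= 1 and rect_gain(i, j, p, q) >= 0:         # the whole rectangle
        best = max(best, (rect_gain(i, j, p, q), (((i, j), (p, q)),)))
    if k >= 1:
        for l in range(i, p):                         # vertical cuts
            for k1 in range(k + 1):
                g1, r1 = opt_gain_2d(i, j, l, q, k1)
                g2, r2 = opt_gain_2d(l + 1, j, p, q, k - k1)
                best = max(best, (g1 + g2, r1 + r2))
        for l in range(j, q):                         # horizontal cuts
            for k1 in range(k + 1):
                g1, r1 = opt_gain_2d(i, j, p, l, k1)
                g2, r2 = opt_gain_2d(i, l + 1, p, q, k - k1)
                best = max(best, (g1 + g2, r1 + r2))
    return best

g, rects = opt_gain_2d(0, 0, 2, 2, 2)
print(g, sorted(rects))   # 12 [((0, 0), (0, 0)), ((2, 2), (2, 2))]
```

On this grid, the two unit rectangles at opposite corners are reachable through guillotine cuts, so the approximation happens to be exact; Section 5.2 discusses configurations where no sequence of end-to-end cuts can separate the optimal rectangles.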

The number of points input to our dynamic programming algorithm for the two-dimensional case is N = n², since n is the size of the domain of each of the two uninstantiated numeric attributes.


Fig. 6. Rectangle ⟨(0, 0), (2, 2)⟩.

    Fig. 7. Vertical cuts for optGain2D((0,0),(2,2),2).

    Fig. 8. Horizontal cuts for optGain2D((0,0),(2,2),2).


Theorem 5.2. The time complexity of Procedure optGain2D is O(N^2.5 k²).

Proof. The complexity of procedure optGain2D is simply the number of times procedure optGain2D is invoked, multiplied by a constant. The reason for this is that steps in optGain2D that do not have constant overhead (e.g., the for loops in Steps 12 and 13) result in recursive calls to optGain2D. Thus, the overhead of these steps is accounted for in the count of the number of calls to optGain2D. Consider an arbitrary rectangle ⟨(i, j), (p, q)⟩ and consider an arbitrary 1 ≤ l ≤ k. We show that optGain2D is invoked with the above parameters at most 4nk times. Thus, since the number of rectangles is at most n⁴ and l can take k possible values, the complexity of the algorithm is O(n⁵k²), which equals O(N^2.5 k²) since N = n².

We now show that optGain2D with a given set of parameters can be invoked at most 4nk times. The first observation is that optGain2D with rectangle ⟨(i, j), (p, q)⟩ and l is invoked only from a different invocation of optGain2D with a rectangle that, on being cut vertically or horizontally, yields rectangle ⟨(i, j), (p, q)⟩, and with a value for k that lies between l and k. The number of such rectangles is at most 4n; each of these rectangles can be obtained from ⟨(i, j), (p, q)⟩ by stretching it in one of four directions, and there are only n possibilities for stretching a rectangle in any direction. Thus, each invocation of optGain2D can result from 4nk different invocations of optGain2D. Furthermore, since the body of optGain2D for a given set of input parameters is executed only once, optGain2D with a given input is invoked at most 4nk times. □

5.2 Optimality Results

Procedure optGain2D's approach of splitting each rectangle into two subrectangles and then combining the optimized sets for each subrectangle may not yield the optimized set for the original rectangle. This point is further illustrated in Fig. 10a, which shows a rectangle and the optimized set of rectangles for it. It is obvious that there is no way to split the rectangle into two subrectangles such that each rectangle in the optimized set is completely contained in one of the subrectangles. Thus, a dynamic programming approach that considers all possible splits of the rectangle into two subrectangles (using horizontal and vertical end-to-end cuts) and then combines the optimized sets for the subrectangles may not result in the optimized set for the original rectangle being computed.

In the following, we first identify restrictions under which optGain2D yields optimized sets. We then show bounds on how far the computed approximation for the general case can deviate from the optimal solution.

Let us define a set of rectangles to be binary space partitionable if it is possible to recursively partition the plane such that no rectangle is cut and each partition contains at most one rectangle. The set of rectangles in Fig. 10b is binary space partitionable (the bold lines are a partitioning of the rectangles); however, the set in Fig. 10a is not.

If we are willing to restrict the optimized set to only binary space partitionable rectangles, then we can show that procedure optGain2D computes the optimized set. Note that any set of three or fewer rectangles in a plane is always binary space partitionable. Thus, for k ≤ 3, optGain2D computes the optimized gain set.
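A small Python sketch (our own; the function name and representation are not from the paper) makes the definition operational by searching for a sequence of end-to-end cuts that never slices a rectangle:

```python
def is_bsp(rects):
    """rects: list of (x0, y0, x1, y1) with x0 < x1, y0 < y1, pairwise
    nonoverlapping. Returns True iff the set is binary space partitionable:
    it can be recursively split by full horizontal/vertical cuts so that no
    rectangle is cut and each final partition holds at most one rectangle."""
    if len(rects) <= 1:
        return True
    for axis in (0, 1):  # 0: vertical cuts on x, 1: horizontal cuts on y
        lo, hi = axis, axis + 2
        # candidate cut positions: rectangle edges along this axis
        for c in {r[lo] for r in rects} | {r[hi] for r in rects}:
            left = [r for r in rects if r[hi] <= c]
            right = [r for r in rects if r[lo] >= c]
            # valid cut: slices no rectangle and makes progress on both sides
            if left and right and len(left) + len(right) == len(rects):
                if is_bsp(left) and is_bsp(right):
                    return True
    return False
```

On a pinwheel-style arrangement of four rectangles analogous to Fig. 10a, every candidate cut slices some rectangle, so the search fails; any three of them pass, matching the k ≤ 3 claim above.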

Theorem 5.3. Procedure optGain2D computes the optimized set of binary space partitionable rectangles.

Proof. The proof is by induction on the size of the rectangles that optGain2D is invoked with.

Basis. For all 1 ≤ l ≤ k, optGain2D can be trivially shown to compute the optimized binary space partitionable set for the unit rectangle ((i,i),(i,i)) (if the confidence of the rectangle is at least minConf, then the optimized set is the rectangle itself).

Induction. We next show that, for any 1 ≤ l ≤ k and rectangle ((i,j),(p,q)), the algorithm computes the optimized binary space partitionable set (assuming that, for all its subrectangles, for all 1 ≤ l ≤ k, the algorithm computes the optimized binary space partitionable set). We need to consider three cases. The first is when the optimized set is the rectangle ((i,j),(p,q)) itself. In this case, conf((i,j),(p,q)) ≥ minConf and optSet is correctly set to {((i,j),(p,q))} by optGain2D. In case l = 1, then, since conf((i,j),(p,q)) < minConf, the optimized rectangle must be contained in one of the four largest subrectangles in ((i,j),(p,q)) and, thus, among the optimized sets of size 1 for these subrectangles, the one whose support is maximum is the optimized set for ((i,j),(p,q)). Finally, if l > 1, then, since we are interested in computing the optimized binary space partitionable set, there must exist a horizontal or vertical cut that cleanly partitions the optimized rectangles, one of the two subrectangles due

    332 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 15, NO. 2, MARCH/APRIL 2003

Fig. 9. Smaller rectangles recursively considered by optGain2D((0,0),(0,2),1).

Fig. 10. Binary space partitionable rectangles.


to the cut containing r rectangles of the optimized set, for some 1 ≤ r < l, and the other containing the remaining l - r rectangles. Thus, since, in optGain2D, all possible cuts are considered and optSet is then set to the union of the optimized sets for the subrectangles such that the resulting gain is maximum, due to the induction hypothesis, it follows that this is the optimized gain set for the rectangle ((i,j),(p,q)). □

We next use this result in order to show that, in the general case, the approximate optimized gain set computed by procedure optGain2D is within a factor of 1/4 of the optimized gain set. The proof also uses a result from [1] in which it is shown that, for any set of rectangles in a plane, there exists a binary space partitioning (that is, a recursive partitioning) of the plane such that each rectangle is cut into at most four subrectangles and each partition contains at most one subrectangle.

Theorem 5.4. Procedure optGain2D computes an optimized gain set whose gain is greater than or equal to 1/4 times the gain of the optimized gain set.

Proof. From the result in [1], it follows that it is possible to partition each rectangle in the optimized set into four subrectangles such that the set of subrectangles is binary space partitionable. Furthermore, for each rectangle, consider its subrectangle with the highest gain. The gain of each such subrectangle is at least 1/4 times the gain of the original rectangle. Thus, the set of these subrectangles is binary space partitionable and has at least 1/4 of the gain of the optimized set. As a result, due to Theorem 5.3 above, it follows that the optimized set computed by optGain2D has gain that is at least 1/4 times the gain of the optimized set. □

    6 EXPERIMENTAL RESULTS

In this section, we study the performance of our algorithms for computing optimized gain sets for the one-dimensional and two-dimensional cases. In particular, we show that our algorithm is highly scalable for one dimension. For instance, we can tackle attribute domains with sizes as high as one million in a few minutes. For two dimensions, however, the high time and space complexities of our dynamic programming algorithm make it less suitable for large domain sizes and a large number of disjunctions. We also present results of our experiments with a real-life population survey data set, where optimized gain sets enable us to discover interesting correlations among attributes.

In our experiments, the data file is read only once at the beginning in order to compute the gain for every point. The time for this, in most cases, constitutes a tiny fraction of the total execution time of our algorithms. Thus, we do not include the time spent on reading the data file in our results. Furthermore, note that the performance of our algorithms does not depend on the number of tuples in the data file; it is more sensitive to the size of the attribute's domain n and the number of intervals k. We fixed the number of tuples in the data file to be 10 million in all our experiments. Our experiments were performed on a Sun Ultra-2/200 machine with 512 MB of RAM, running Solaris 2.5.

6.1 Performance Results on Synthetic Data Sets

The association rule that we experimented with has the form U ∧ C1 → C2, where U contains one or two uninstantiated attributes (see Section 3) whose domains consist of integers ranging from 1 to n. Every instantiation of U ∧ C1 → C2 (that is, every point in m-dimensional space) is assigned a randomly generated confidence between 0 and 1 with uniform distribution. Each value in m-dimensional space is also assigned a randomly generated support between 0 and 2/n^m with uniform distribution; thus, the average support for a value is 1/n^m.
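This generator is easy to reproduce. The sketch below is our reconstruction for the one-dimensional case (m = 1); the per-point gain formula sup × (conf - minConf) is our reading of the gain definition and is included only to show what an optimized interval accumulates:

```python
import random

def gen_points(n, min_conf=0.5, seed=42):
    """One-dimensional synthetic data as in the experiments: each domain
    value gets a confidence uniform in [0, 1] and a support uniform in
    [0, 2/n], so the average support is 1/n. The third field is the
    per-point gain sup * (conf - min_conf) (our reading of the paper)."""
    rng = random.Random(seed)
    pts = []
    for _ in range(n):
        conf = rng.random()
        sup = rng.uniform(0.0, 2.0 / n)
        pts.append((sup, conf, sup * (conf - min_conf)))
    return pts

pts = gen_points(1000)
avg_sup = sum(p[0] for p in pts) / len(pts)  # close to 1/n = 0.001
```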

    6.1.1 One-Dimensional Data

Bucketing. We begin by studying the reduction in input size due to the bucketing optimization. Table 1 illustrates the number of buckets for domain sizes ranging from 500 to 100,000 when minConf is set to 0.5. From the table, it follows that bucketing can result in reductions in input size as high as 65 percent.
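The bucketing optimization itself is defined earlier in the paper; as a rough reconstruction, contiguous domain values whose confidences all lie on the same side of minConf can be coalesced, since an optimized interval's endpoints can be assumed to align with the boundaries of such runs:

```python
def bucketize(confs, min_conf):
    """Coalesce contiguous values that are all >= min_conf or all < min_conf.
    Returns a list of (start, end) index ranges, one per bucket."""
    runs = []
    for i, c in enumerate(confs):
        above = c >= min_conf
        if runs and runs[-1][2] == above:
            runs[-1][1] = i               # extend the current run
        else:
            runs.append([i, i, above])    # start a new run
    return [(s, e) for s, e, _ in runs]
```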

Scale-up with n. The graph in Fig. 11a plots the execution times of our algorithm for computing optimized gain sets as the domain size is increased from 100,000 to 1 million, for a minConf value of 0.5. Note that, for this experiment, we turned off the bucketing optimization; the running times would be even smaller if we were to employ bucketing to reduce the input size. The experiments validate our earlier analytical results on the O(bk) time complexity of procedure optGain1D. As can be seen from Fig. 11, our optimized gain set algorithm scales linearly with the domain size as well as k.
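For intuition about this linear behavior, the following O(nk) dynamic program over per-point gains is our illustration in the spirit of optGain1D (the paper's actual pseudocode appears in an earlier section); it returns only the best total gain of up to k disjoint intervals:

```python
def opt_gain_1d(gain, k):
    """Max total gain of up to k disjoint intervals over the point gains,
    in O(n * k) time: a left-to-right DP whose states track the number of
    intervals used and whether the latest interval is still open."""
    NEG = float("-inf")
    closed = [0.0] + [NEG] * k   # closed[j]: best with j intervals, none open
    opened = [NEG] * (k + 1)     # opened[j]: best with j intervals, last open
    for g in gain:
        new_opened = [NEG] * (k + 1)
        for j in range(1, k + 1):
            # extend the currently open interval, or open the j-th one here
            new_opened[j] = max(opened[j], closed[j - 1]) + g
        opened = new_opened
        for j in range(1, k + 1):
            closed[j] = max(closed[j], opened[j])  # option: close it here
    return max(closed)  # closed[0] = 0 covers "use no interval"
```

With bucketing, the n domain values would first shrink to b buckets, after which the same DP runs in O(bk).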

Sensitivity to minConf. Fig. 11b depicts the running times of our algorithm for a range of confidence values and a domain size of 500,000. From the graphs, it follows that the performance of procedure optGain1D is not affected by values of minConf.

    6.1.2 Two-Dimensional Data

Scale-up with n. The graph in Fig. 12a plots the execution times of our dynamic programming algorithm for computing optimized gain sets as the domain sizes n × n are increased from 10 × 10 to 50 × 50 in increments of 10. The value of minConf is set to 0.5 and, in the figure, each point is represented along the x-axis with the value of N = n^2 (e.g., 10 × 10 is represented as 100). Note that, for this experiment, we do not perform bucketing since this optimization is applicable only to one-dimensional data. The running times corroborate our earlier analysis of the O(N^2.5 k^2) time complexity of procedure optGain2D. Note that, due to the high space complexity of O(N^2 k) for our dynamic programming algorithm, we could not measure the execution times for large values of N and k. For these large parameter settings, the system returned an out-of-memory error message.


TABLE 1
Values of b for Different Domain Sizes


Sensitivity to minConf. Fig. 12b depicts the running times of our algorithm for a range of confidence values and a domain size of 30 × 30. From the graph, it follows that the performance of procedure optGain2D improves with increasing values for minConf. The reason for this is that, at high confidence values, fewer points have positive gain and, thus, optSet, in most cases, is either empty or contains a small number of rectangles. As a result, operations on optSet like union (Steps 12 and 20) and assignment (Steps 8, 12, 16, 20, and 22) are much faster and, consequently, the execution time of the procedure is lower for high confidence values.

6.2 Experiences with Real-Life Datasets

In order to gauge the efficacy of our optimized rule framework for discovering interesting patterns, we conducted experiments with a real-life data set consisting of the current population survey (CPS) data for the year 1995.¹ The CPS is a monthly survey of about 50,000 households and is the primary source of information on the labor force characteristics of the US population. The CPS data consists of a variety of attributes that include age (A_AGE), group health insurance coverage (COV_HI), hours of work (HRS_WK), household income (HTOTVAL), temporary work experience for a few days (WTEMP), and unemployment compensation benefits received Y/N-Person (UC_YN). The number of records in the 1995 survey is 149,642.

Suppose we are interested in finding the age groups that contain a large concentration of temporary workers. This may be of interest to a job placement/recruiting company that is actively seeking temporary workers. Knowledge of the demographics of temporary workers can help the company target its advertisements better and locate candidate temporary workers more effectively. This correlation between age and temporary worker status can be computed in our optimized rule framework using the following rule: A_AGE ∈ [l, u] → WTEMP = YES. The optimized gain regions found with minConf = 0.01 (1 percent) for this rule are presented in Table 2. (The domain of the A_AGE attribute varies from 0 to 90 years.) From the table, it follows that there is a high concentration of temporary workers among young adults (age between 15 and 23) and seniors (age between 62 and 69). Thus, advertising on television programs or Web sites that are popular among these age groups would be an effective strategy to reach candidate temporary workers. Also, observe that the improvement in the total gain slows for increasing k until it reaches a point where increasing the number of gain regions does not affect the total gain.


    1. This data can be downloaded from http://www.bls.census.gov/cps.

    Fig. 12. Performance results for two-dimensional data. (a) Scale-up with N. (b) Sensitivity to minConf.

Fig. 11. Performance results for one-dimensional data. (a) Scale-up with n. (b) Sensitivity to minConf.


We also ran our algorithm to find optimized gain regions for the rule A_AGE ∈ [l, u] → UC_YN = NO with minConf = 0.95 (95 percent). The results of this experiment are presented in Table 3 and can be used by the government, for example, to design appropriate training programs for unemployed individuals based on their age. The time to compute the optimized set for both real-life experiments was less than 1 second.

    7 CONCLUDING REMARKS

In this paper, we generalized the optimized gain association rule problem by permitting rules to contain up to k disjunctions over one or two uninstantiated numeric attributes. For one attribute, we presented an O(nk) algorithm for computing the optimized gain rule, where n is the number of values in the domain of the uninstantiated attribute. We also presented a bucketing optimization that coalesces contiguous values, all of which have confidence either greater than the minimum specified confidence or less than the minimum confidence. For two attributes, we presented a dynamic programming algorithm that computes approximate gain rules; we showed that the approximations are within a constant factor of the optimized rule, using recent results on binary space partitioning. For a single numeric attribute, our experimental results with synthetic data sets demonstrate the effectiveness of our bucketing optimization and the linear scale-up of our algorithm for computing optimized gain sets. For two numeric attributes, however, the high time and space complexities of our dynamic programming-based approximation algorithm make it less suitable for large domain sizes and a large number of disjunctions. Finally, our experiments with the real-life population survey data set indicate that our optimized gain rules can indeed be used to unravel interesting correlations among attributes.

    APPENDIX

Proof of Lemma 4.3. We use induction to show that, after iteration i, the intervals in PSet and NSet satisfy Conditions 1-6.

Basis (i = 0). The six conditions trivially hold initially, before the first iteration begins.

Induction Step. Let us assume that the i-1 intervals in PSet satisfy Conditions 1-6 after the (i-1)th iteration completes. Let P1, ..., P(i-1) be the i-1 intervals in PSet and N1, ..., Nj be the intervals in NSet after the (i-1)th iteration. During each iteration, either Pq is split into two subintervals using minPq, which are then added to PSet, or maxNq is added to PSet. Both actions result in three subintervals that satisfy Conditions 1 and 2. For instance, let interval Pq = [u, v] be split into three intervals and let [s, t] be minPq. For all u ≤ l ≤ s-1, it must be the case that gain(l, s-1) ≥ 0 since, otherwise, [s, t] would not be the interval with the minimum gain. Similarly, it can be shown that, for all s ≤ l ≤ t, it must be the case that gain(s, l) ≤ 0 since, otherwise, [s, t] would not be the minimum gain subinterval in [u, v]. The other cases can be shown using a similar argument.

Next, we show that splitting interval Pq = [u, v] in Plist into three subintervals with [s, t] = min[u, v] as the middle interval preserves the remaining four conditions.

• Cond 3. We need to show that gain(u, s-1) and gain(t+1, v) ≥ gain(max[s, t]); that gain(u, s-1) and gain(t+1, v) ≥ gain(maxNm); and that gain(Pl) ≥ gain(max[s, t]). We can show that gain(s, t) + gain(max[s, t]) ≤ 0 since, otherwise, the subinterval of [s, t] preceding max[s, t] would have smaller gain than [s, t] (every subinterval of [s, t] with endpoint at s has gain ≤ 0 and, so, the sum of the gains of max[s, t] and the subinterval of [s, t] preceding max[s, t] is ≤ 0). Since every subinterval of [u, v] with u as an endpoint has gain ≥ 0, it follows that

gain(u, s-1) + gain(s, t) ≥ 0.

Thus, it follows that

gain(u, s-1) ≥ gain(max[s, t]).

Similarly, we can show that

gain(t+1, v) ≥ gain(max[s, t]).

TABLE 2
Optimized Gain Regions for age ∈ [l, u] → WTEMP = YES

TABLE 3
Optimized Gain Regions for age ∈ [l, u] → UC_YN = NO

Note that, since Pq = [u, v] is chosen for splitting, it must be the case that

gain(maxNm) + gain(s, t) ≤ 0.

Also, since gain(u, s-1) + gain(s, t) ≥ 0, we can deduce that

gain(u, s-1) ≥ gain(maxNm).

Using an identical argument, it can be shown that gain(t+1, v) ≥ gain(maxNm). Finally, for every Pl, gain(Pl) + gain(s, t) ≥ 0 (due to Condition 5). Combining this with

gain(s, t) + gain(max[s, t]) ≤ 0,

we obtain that gain(Pl) ≥ gain(max[s, t]).

• Cond 4. The preservation of this condition follows because gain(minPl) ≥ gain(s, t), since [s, t] is chosen such that it has the minimum gain from among all the minPl. The same argument can be used to show that gain(min[u, s-1]) and gain(min[t+1, v]) ≥ gain(s, t). Also, since gain(s, t) ≥ gain(Nm) (due to Condition 4) and [s, t] is the subinterval of Pq with the minimum gain, it follows that gain(min[u, s-1]) and

gain(min[t+1, v]) ≥ gain(Nm).

• Cond 5. gain(min[u, s-1]) + gain(Pm) ≥ 0, since gain(s, t) + gain(Pm) ≥ 0 (due to Condition 5) and gain(min[u, s-1]) ≥ gain(s, t) ([s, t] is the subinterval with the minimum gain in [u, v]). Similarly, we can show that

gain(min[t+1, v]) + gain(Pm) ≥ 0.

Also, we need to show that

gain(minPl) + gain(u, s-1) ≥ 0.

This follows since gain(u, s-1) + gain(s, t) ≥ 0 and gain(minPl) ≥ gain(s, t). Similarly, it can be shown that

gain(minPl) + gain(t+1, v) ≥ 0.

• Cond 6. Since [s, t] is used to split Pq, it is the case that

gain(maxNm) + gain(s, t) ≤ 0.

We next need to show that

gain(max[s, t]) + gain(Nm) ≤ 0.

Due to Condition 4, gain(s, t) ≥ gain(Nm) and, as shown in the proof for Condition 3 above, gain(s, t) + gain(max[s, t]) ≤ 0. Combining the two, we obtain gain(max[s, t]) + gain(Nm) ≤ 0.

Next, we show that splitting interval Nq = [u, v] in Nlist into three intervals with [s, t] = max[u, v] as the middle interval preserves the remaining four conditions.

• Cond 3. Due to Condition 3, gain(Pl) ≥ gain(s, t), and gain(s, t) ≥ gain(max[u, s-1]) since [s, t] is the subinterval with maximum gain in [u, v]. Thus, gain(Pl) ≥ gain(max[u, s-1]), and we can similarly show that

gain(Pl) ≥ gain(max[t+1, v]).

Also, since [s, t] has the maximum gain from among the maxNm, it follows that

gain(s, t) ≥ gain(maxNm).

• Cond 4. We first show that gain(min[s, t]) + gain(s, t) ≥ 0 since, if

gain(min[s, t]) + gain(s, t) < 0,

then the subinterval of [s, t] preceding min[s, t] would have gain greater than [s, t] (since every subinterval of [s, t] beginning at s has gain greater than or equal to 0, it follows that the sum of the gains of min[s, t] and the subinterval preceding min[s, t] in [s, t] is ≥ 0). Also, since every subinterval of Nq beginning at u has gain less than or equal to 0 (assuming u ≠ 1), we get

gain(u, s-1) + gain(s, t) ≤ 0.

Combining the two, we get gain(min[s, t]) ≥ gain(u, s-1) (except when [u, s-1] contains an endpoint). Using an identical argument, we can also show that gain(min[s, t]) ≥ gain(t+1, v) (except when [t+1, v] contains an endpoint).

Also, due to Condition 6,

gain(s, t) + gain(Nm) ≤ 0,

which, when combined with

gain(min[s, t]) + gain(s, t) ≥ 0,

implies that gain(min[s, t]) ≥ gain(Nm) (except when Nm contains an endpoint). Finally, since [s, t] is used to split interval Nq, it follows that gain(minPl) + gain(s, t) ≥ 0. Furthermore, if u ≠ 1, we have gain(u, s-1) + gain(s, t) ≤ 0. Combining the two, we get gain(minPl) ≥ gain(u, s-1) (except when [u, s-1] contains an endpoint). Similarly, we can show that gain(minPl) ≥ gain(t+1, v) (except when [t+1, v] contains an endpoint).

• Cond 5. Since Nq is the interval chosen for splitting, gain(minPl) + gain(s, t) ≥ 0. Due to Condition 3, gain(Pl) ≥ gain(s, t) and, as shown in the proof of Condition 4 above,

gain(min[s, t]) + gain(s, t) ≥ 0,

from which we can deduce that

gain(min[s, t]) + gain(Pl) ≥ 0.

• Cond 6. Since Nq is the interval chosen for splitting, gain(maxNl) ≤ gain(s, t). Also, since

gain(u, s-1) + gain(s, t) ≤ 0

(except when u = 1), we get

gain(maxNl) + gain(u, s-1) ≤ 0

(except when [u, s-1] contains an endpoint). Similarly, we can show that gain(maxNl) + gain(t+1, v) ≤ 0 (except when [t+1, v] contains an endpoint).

Again, since [s, t] is used to split Nq, gain(max[u, s-1]) ≤ gain(s, t). Also, due to Condition 6, gain(s, t) + gain(Nm) ≤ 0, thus yielding

gain(max[u, s-1]) + gain(Nm) ≤ 0

(except when Nm contains an endpoint). Using a similar argument, it can be shown that

gain(max[t+1, v]) + gain(Nm) ≤ 0

(except when Nm contains an endpoint). □

Proof of Lemma 4.4. Let P1, P2, ..., Pi be the i intervals in PSet satisfying Conditions 1-6. In order to show that this is an optimized gain set, we show that the gain of every other set of i intervals is no larger than that of P1, P2, ..., Pi. Consider any set of i intervals PSet′ = {P′1, P′2, ..., P′i}. We transform this set of intervals, in a series of steps, into P1, P2, ..., Pi. Each step ensures that the gain of the successive set of intervals is at least as high as that of the preceding set. As a result, it follows that the gain of P1, P2, ..., Pi is at least as high as that of any other set of i intervals and, thus, {P1, ..., Pi} is an optimized gain set.

The steps involved in the transformation are as follows:

1. For an interval P′j = [u, v] that intersects one of the Pl's, do the following: If u ∈ Pl for some Pl = [s, t], then, if [s, u-1] does not intersect any other P′m, modify P′j to be [s, v] (that is, delete [u, v] from PSet′ and add [s, v] to PSet′). Note that, due to Condition 1, gain(s, u-1) ≥ 0 and, so, gain(s, v) ≥ gain(u, v). Similarly, if v ∈ Pl for some Pl = [s, t], then, if [v+1, t] does not intersect any other P′m, modify P′j to be [u, t]. On the other hand, if, for an interval P′j = [u, v] that intersects one of the Pl's, u ∉ Pl for all Pl, then let m be the maximum value such that [u, m] does not intersect any of the Pl's. Modify P′j to be [m+1, v]. Note that, due to Condition 2, gain(u, m) ≤ 0 and, so,

gain(m+1, v) ≥ gain(u, v).

Similarly, if v ∉ Pl for all Pl, then let m be the minimum value such that [m, v] does not intersect any of the Pl's. Modify P′j to be [u, m-1].

Thus, at the end of Step 1, each interval P′j in PSet′ (some of which may have been modified) either does not intersect any Pl or, if it does intersect an interval Pl, then each endpoint of P′j lies in some interval Pm and each endpoint of Pl lies in some interval P′m. Also, note that, if two intervals Pl and P′j overlap and intersect no other intervals, then, at the end of Step 1, P′j = Pl.

2. In this step, we transform all P′j's in PSet′ that intersect multiple Pl's. Consider a P′j = [u, v] that intersects multiple Pl's (that is, spans an Nm = [s, t]). Thus, since there are k intervals, we need to consider two possible cases: 1) there is a P′m that does not intersect any Pl and 2) some Pl intersects multiple P′m's. For Case 1, due to Condition 6, it follows that gain(P′m) + gain(s, t) ≤ 0 and, thus, deleting P′m from PSet′ and splitting P′j into [u, s-1] and [t+1, v] (that is, deleting [u, v] from PSet′ and adding [u, s-1] and [t+1, v] to it) does not cause the resulting gain of PSet′ to decrease. For Case 2, due to Condition 4, it follows that merging any two adjacent intervals that intersect Pl and splitting P′j into [u, s-1] and [t+1, v] does not cause the gain of PSet′ to decrease. This procedure can be repeated to get rid of all P′j's in PSet′ that overlap multiple Pl's. At the end of Step 2, each P′j in PSet′ overlaps at most one Pl. As a result, for every Pl that overlaps multiple P′j's, or every P′j that does not overlap any Pl, there exists a Pm that overlaps no P′j's.

3. Finally, consider the Pm's that do not intersect any of the P′j's in PSet′. We need to consider two possible cases: 1) there is a P′j that does not intersect any Pl, or 2) some Pl intersects multiple P′j's. For Case 1, due to Condition 3, it follows that gain(Pm) ≥ gain(P′j) and, thus, deleting P′j and adding Pm to PSet′ does not cause the overall gain of PSet′ to decrease. For Case 2, due to Condition 5, it follows that merging any two adjacent intervals P′j that intersect Pl and adding Pm to PSet′ does not cause PSet′'s gain to decrease. This procedure can be repeated until every Pl intersects exactly one P′j, thus making them identical. □

    ACKNOWLEDGMENTS

Without the support of Yesook Shim, it would have been impossible to complete this work. The work of Kyuseok Shim was partially supported by the Korea Science and Engineering Foundation (KOSEF) through the Advanced Information Technology Research Center (AITrc).

REFERENCES

[1] F.D. Amore and P.G. Franciosa, "On the Optimal Binary Plane Partition for Sets of Isothetic Rectangles," Information Processing Letters, vol. 44, no. 5, pp. 255-259, Dec. 1992.
[2] R. Agrawal, T. Imielinski, and A. Swami, "Mining Association Rules between Sets of Items in Large Databases," Proc. ACM SIGMOD Conf. Management of Data, pp. 207-216, May 1993.
[3] R. Agrawal and R. Srikant, "Fast Algorithms for Mining Association Rules," Proc. VLDB Conf., Sept. 1994.
[4] R.J. Bayardo and R. Agrawal, "Mining the Most Interesting Rules," Proc. ACM SIGKDD Conf. Knowledge Discovery and Data Mining, 1999.
[5] R.J. Bayardo, R. Agrawal, and D. Gunopulos, "Constraint-Based Rule Mining in Large, Dense Databases," Proc. Int'l Conf. Data Eng., 1997.
[6] T. Fukuda, Y. Morimoto, S. Morishita, and T. Tokuyama, "Mining Optimized Association Rules for Numeric Attributes," Proc. ACM SIGACT-SIGMOD-SIGART Symp. Principles of Database Systems, June 1996.
[7] T. Fukuda, Y. Morimoto, S. Morishita, and T. Tokuyama, "Data Mining Using Two-Dimensional Optimized Association Rules: Scheme, Algorithms, and Visualization," Proc. ACM SIGMOD Conf. Management of Data, June 1996.
[8] J. Han and Y. Fu, "Discovery of Multiple-Level Association Rules from Large Databases," Proc. VLDB Conf., Sept. 1995.
[9] H.V. Jagadish, N. Koudas, S. Muthukrishnan, V. Poosala, K. Sevcik, and T. Suel, "Optimal Histograms with Quality Guarantees," Proc. VLDB Conf., Aug. 1998.
[10] S. Khanna, S. Muthukrishnan, and M. Paterson, "On Approximating Rectangle Tiling and Packing," Proc. Ninth Ann. Symp. Discrete Algorithms (SODA), pp. 384-393, 1998.
[11] B. Lent, A. Swami, and J. Widom, "Clustering Association Rules," Proc. Int'l Conf. Data Eng., Apr. 1997.
[12] H. Mannila, H. Toivonen, and A. Inkeri Verkamo, "Efficient Algorithms for Discovering Association Rules," Proc. AAAI Workshop Knowledge Discovery in Databases (KDD-94), pp. 181-192, July 1994.
[13] J.S. Park, M.-S. Chen, and P.S. Yu, "An Effective Hash Based Algorithm for Mining Association Rules," Proc. ACM SIGMOD Conf. Management of Data, May 1995.
[14] R. Rastogi and K. Shim, "Mining Optimized Association Rules for Categorical and Numeric Attributes," Proc. Int'l Conf. Data Eng., 1998.
[15] R. Rastogi and K. Shim, "Mining Optimized Support Rules for Numeric Attributes," Proc. Int'l Conf. Data Eng., 1999.
[16] R. Srikant and R. Agrawal, "Mining Generalized Association Rules," Proc. VLDB Conf., Sept. 1995.
[17] R. Srikant and R. Agrawal, "Mining Quantitative Association Rules in Large Relational Tables," Proc. ACM SIGMOD Conf. Management of Data, June 1996.
[18] A. Savasere, E. Omiecinski, and S. Navathe, "An Efficient Algorithm for Mining Association Rules in Large Databases," Proc. VLDB Conf., Sept. 1995.

Sergey Brin received the bachelor of science degree with honors in mathematics and computer science from the University of Maryland at College Park. He is currently on leave from the PhD program in computer science at Stanford University, where he received his master's degree. He is currently cofounder and president of Google, Inc. He met Larry Page at Stanford and worked on the project that became Google. Together, they founded Google, Inc. in 1998. He is a recipient of a US National Science Foundation Graduate Fellowship. He has been a featured speaker at a number of national and international academic, business, and technology forums, including the Academy of American Achievement, European Technology Forum, Technology, Entertainment and Design, and Silicon Alley, 2001. His research interests include search engines, information extraction from unstructured sources, and data mining of large text collections and scientific data. He has published more than a dozen publications in leading academic journals, including "Extracting Patterns and Relations from the World Wide Web"; "Dynamic Data Mining: A New Architecture for Data with High Dimensionality," which he published with Larry Page; "Scalable Techniques for Mining Causal Structures"; "Dynamic Itemset Counting and Implication Rules for Market Basket Data"; and "Beyond Market Baskets: Generalizing Association Rules to Correlations."

Rajeev Rastogi received the BTech degree in computer science from the Indian Institute of Technology, Bombay, in 1988 and the master's and PhD degrees in computer science from the University of Texas, Austin, in 1990 and 1993, respectively. He is the director of the Internet Management Research Department at Bell Laboratories, Lucent Technologies. He joined Bell Laboratories in Murray Hill, New Jersey, in 1993 and became a distinguished member of the technical staff (DMTS) in 1998. Dr. Rastogi is active in the field of databases and has served as a program committee member for several conferences in the area. His writings have appeared in a number of ACM and IEEE publications and other professional conferences and journals. His research interests include database systems, storage systems, knowledge discovery, and network management. His most recent research has focused on the areas of network management, data mining, high-performance transaction systems, continuous-media storage servers, tertiary storage systems, and multidatabase transaction management.

Kyuseok Shim received the BS degree in electrical engineering from Seoul National University in 1986 and the MS and PhD degrees in computer science from the University of Maryland, College Park, in 1988 and 1993, respectively. He is currently an assistant professor at Seoul National University in Korea. Previously, he was an assistant professor at the Korea Advanced Institute of Science and Technology (KAIST), Korea. Before joining KAIST, he was a member of the technical staff (MTS) and one of the key contributors to the Serendip data mining project at Bell Laboratories. Before that, he worked on the Quest data mining project at the IBM Almaden Research Center. He also worked as a summer intern for two summers at Hewlett-Packard Laboratories. Dr. Shim has been working in the area of databases, focusing on data mining, data warehousing, query processing and query optimization, and XML and semistructured data. He is currently an advisory committee member for ACM SIGKDD and an editor of the VLDB Journal. He has published several research papers in prestigious conferences and journals. He has also served as a program committee member of the ICDE '97, KDD '98, SIGMOD '99, SIGKDD '99, and VLDB '00 conferences. He did a data mining tutorial with Rajeev Rastogi at ACM SIGKDD '99 and a tutorial with Surajit Chaudhuri on storage and retrieval of XML data using relational DB at VLDB '01.

