Mining Optimized Gain Rules for Numeric Attributes



. Gain of R is maximized and confidence of R is at least the user-specified minimum confidence (referred to as the optimized gain rule).

Optimized association rules are useful for unraveling ranges for numeric attributes where certain trends or correlations are strong (that is, have high support, confidence, or gain). For example, suppose the telecom service provider mentioned earlier was interested in offering a promotion to NY customers who make calls to France. In this case, the timing of the promotion may be critical for its success; it would be advantageous to offer it close to a period of consecutive days in which the percentage of calls from NY that are directed to France is maximum. The framework developed in [6] can be used to determine such periods. Consider, for example, the association rule

date ∈ [l1, u1] ∧ src_city = NY → dst_country = France.

With a minimum confidence of 0.5, the optimized gain rule results in the period in which the calls from NY to France exceed 50 percent of the total calls from NY and, furthermore, the number of these excess calls is maximum.

A limitation of the optimized association rules dealt with in [6] is that only a single optimal interval for a single numeric attribute can be determined. However, in a number of applications, a single interval may be an inadequate description of local trends in the underlying data. For example, suppose the telecom service provider is interested in doing up to k promotions for customers in NY calling France. For this purpose, we need a mechanism to identify up to k periods during which a sizeable fraction of calls are from NY to France. If association rules were permitted to contain disjunctions of uninstantiated conditions, then we could determine the optimal k (or fewer) periods by finding optimal instantiations for the rule:

date ∈ [l1, u1] ∨ ··· ∨ date ∈ [lk, uk] ∧ src_city = NY → dst_country = France.

This information can be used by the telecom service provider to determine the most suitable periods for offering discounts on international long distance calls to France. The above framework can be further strengthened by enriching association rules to contain more than one uninstantiated attribute, as is done in [7]. Thus, optimal instantiations for the rule

date ∈ [l1, u1] ∧ duration ∈ [l1′, u1′] ∨ ··· ∨ date ∈ [lk, uk] ∧ duration ∈ [lk′, uk′] → dst_country = France

would yield valuable information about types of calls (in terms of their duration) and periods in which a substantial portion of the call volume is directed to France.

    1.2 Our Contributions

In this paper, we consider the generalized optimized gain problem. Unlike [6] and [7], we permit rules to contain up to k disjunctions over one or two uninstantiated numeric attributes. Thus, unlike [6] and [7], which compute only a single optimal region, our generalized rules enable up to k optimal regions to be computed.

Furthermore, unlike [15], in which we only addressed the optimized support problem, in this paper, we focus on the optimized gain problem and consider both the one and two attribute cases. In addition, for rules containing a single numeric attribute, we develop an algorithm for computing the optimized gain rule whose complexity is O(nk), where n is the number of values in the domain of the uninstantiated attribute (the dynamic programming algorithm for optimized support that we presented in [15] had complexity O(n²k)). We also propose a bucketing optimization that can result in significant reductions in input size by coalescing contiguous values. For two numeric attributes, we present a dynamic programming algorithm that computes approximate association rules. Using recent results on binary space partitioning trees, we show that, for the optimized gain case, the approximations are within a constant factor (of 1/4)

of the optimal solution. Our experimental results with synthetic data sets for a single numeric attribute demonstrate that our algorithms scale up linearly with the attribute's domain size as well as the number of disjunctions. In addition, we show that applying our optimized rule framework to a population survey real-life data set enables us to discover interesting underlying correlations among the attributes.

The remainder of the paper is organized as follows: In Section 2, we discuss related work and, in Section 3, we introduce the necessary definitions and problem formulation for the optimized gain problem. We present our linear time complexity algorithm for computing the optimized gain rule for a single numeric attribute in Section 4. In Section 5, we develop a dynamic programming algorithm for two numeric attributes and show that the computed gain is within a constant factor of the optimal. We present the results of our experiments with synthetic and real-life data sets in Section 6. Finally, we offer concluding remarks in Section 7.

    2 RELATED WORK

In [15], we generalized the optimized association rules problem for support, described in [6]. We allowed association rules to contain up to k disjunctions over one uninstantiated numeric attribute. For one attribute, we presented a dynamic programming algorithm for computing the optimized support rule, whose complexity is O(n²k), where n is the number of values in the domain of the uninstantiated attribute. In [14], we considered a

different formulation of the optimized support problem, which we showed to be NP-hard even for the case of one uninstantiated attribute. The optimized support problem described in [14] required the confidence over all the optimal regions, considered together, to be greater than a certain minimum threshold. Thus, the confidence of an individual optimal region could fall below the threshold, and this was the reason for its intractability. In [15], we redefined the optimized support problem such that each optimal region is required to have the minimum confidence. This made the problem tractable for the one attribute case.

    Schemes for clustering quantitative association rules withtwo uninstantiated numeric attributes in the left-hand side

    are presented in [11]. For a given support and confidence, the

    BRIN ET AL.: MINING OPTIMIZED GAIN RULES FOR NUMERIC ATTRIBUTES 325


authors present a clustering algorithm to generate a set of nonoverlapping rectangles such that every point in each rectangle has the required confidence and support. Our schemes, on the other hand, compute an optimal set of nonoverlapping rectangles with the maximum gain. Further, in our approach, we only require that each rectangle in the optimal set have minimum confidence; however, individual points in a rectangle may not have the required confidence.

Recent work on histogram construction, presented in [9],

is somewhat related to our optimized rule computation problem. In [9], the authors propose a dynamic programming algorithm to compute V-optimal histograms for a single numeric attribute. The problem is to split the attribute domain into k buckets such that the sum squared error over the k buckets is minimum. Our algorithm for computing the optimized gain rule (for one attribute) differs from the histogram construction algorithm of [9] in a number of respects. First, our algorithm attempts to maximize the gain, which is very different from minimizing the sum squared error. Second, histogram construction typically involves identifying bucket boundaries, while our optimized gain problem requires us to compute optimal regions (that may not share a common boundary). Finally, our algorithm has a linear time dependency on the size of the attribute domain; in contrast, the histogram construction algorithm of [9] has a time complexity that is quadratic in the number of distinct values of the attribute under consideration.

In [4], the authors propose a general framework for optimized rule mining, which can be used to express our optimized gain problem as a special case. However, the generality precludes the development of efficient algorithms for computing optimized rules. Specifically, the authors use a variant of Dense-Miner from [5], which essentially relies on enumerating optimized rules in order to explore the search space. Since there are an exponential number of optimized rules, the authors propose pruning strategies to reduce the search space and, thus, improve the efficiency of the search. However, in the worst case, the time complexity of the algorithm from [5] is still exponential in the size of attribute domains. In contrast, our algorithms for computing the optimized gain rule have polynomial time complexity (linear complexity for one numeric attribute) since they exploit the specific properties of gain and one and two-dimensional spaces.

    3 PROBLEM FORMULATION

In this section, we define the optimized association rules problem addressed in the paper. The data is assumed to be stored in a relation defined over categorical and numeric attributes. Association rules are built from atomic conditions, each of which has the form Ai = vi (Ai could be either categorical or numeric) or Ai ∈ [li, ui] (only if Ai is numeric). For the atomic condition Ai ∈ [li, ui], if li and ui are values from the domain of Ai, the condition is referred to as instantiated; otherwise, if they are variables, we refer to the condition as uninstantiated.

Atomic conditions can be combined using the operators ∧ or ∨ to yield more complex conditions. Instantiated association rules, which we study in this paper, have the form C1 → C2, where C1 and C2 are arbitrary instantiated conditions. Let the support for an instantiated condition C, denoted by sup(C), be the ratio of the number of tuples satisfying the condition C to the total number of tuples in the relation. Then, for the association rule R: C1 → C2, sup(R) is defined as sup(C1) and conf(R) is defined as sup(C1 ∧ C2) / sup(C1). Note that our definition of sup(R) is different from the definition in [2], where sup(R) was defined to be sup(C1 ∧ C2). Instead, we have adopted the definition of support used in [6], [7], [14], [15]. Also, let minConf denote the user-specified minimum confidence. Then, gain(R) is defined to be the difference between sup(C1 ∧ C2) and minConf times sup(C1). In other words, gain(R) is

    sup(C1 ∧ C2) - minConf · sup(C1) = sup(R) · (conf(R) - minConf).

The optimized association rule problem requires optimal instantiations to be computed for an uninstantiated association rule that has the form U ∧ C1 → C2, where U is a conjunction of one or two uninstantiated atomic conditions over distinct numeric attributes and C1 and C2 are arbitrary instantiated conditions. For simplicity, we assume that the domain of an uninstantiated numeric attribute is {1, 2, ..., n}. Depending on the number, one or two, of uninstantiated numeric attributes, consider a one or two-dimensional space with an axis for each uninstantiated attribute and values along each axis corresponding to increasing values from the domain of the attributes. Note that if we consider a single interval in the domain of each uninstantiated attribute, then their combination results in a region. For the one-dimensional case, this region is simply the interval [l1, u1] for the attribute; for the two-dimensional case, the region ⟨(l1, l2), (u1, u2)⟩ is the rectangle bounded along each axis by the endpoints of the intervals [l1, u1] and [l2, u2] along the two axes.
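To make these definitions concrete, the following sketch computes sup(R), conf(R), and gain(R) for a rule C1 → C2 over a toy relation and checks that the two forms of gain(R) above agree. The relation, attribute names, and minConf value are hypothetical, chosen to mirror the telecom example.

```python
# Sketch of the sup/conf/gain definitions; the relation and minConf
# value are hypothetical illustrations, not data from the paper.
MIN_CONF = 0.5

# A relation over two categorical attributes (src_city, dst_country).
tuples = (
    [{"src_city": "NY", "dst_country": "France"}] * 3
    + [{"src_city": "NY", "dst_country": "Spain"}] * 1
    + [{"src_city": "LA", "dst_country": "France"}] * 6
)

def sup(cond):
    """sup(C): fraction of tuples satisfying condition C."""
    return sum(1 for t in tuples if cond(t)) / len(tuples)

c1 = lambda t: t["src_city"] == "NY"                           # C1
c1_and_c2 = lambda t: c1(t) and t["dst_country"] == "France"   # C1 ∧ C2

sup_r = sup(c1)                               # sup(R) = sup(C1)
conf_r = sup(c1_and_c2) / sup(c1)             # conf(R) = sup(C1 ∧ C2) / sup(C1)
gain_r = sup(c1_and_c2) - MIN_CONF * sup(c1)  # gain(R)

# The two forms of gain(R) in the text agree:
assert abs(gain_r - sup_r * (conf_r - MIN_CONF)) < 1e-12
```

Here 4 of 10 tuples satisfy C1 and 3 of those also satisfy C2, so sup(R) = 0.4, conf(R) = 0.75, and gain(R) = 0.3 - 0.5 × 0.4 = 0.1.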

Suppose, for a region R = [l1, u1], we define conf(R), sup(R), and gain(R) to be the conf, sup, and gain, respectively, for the rule

    A1 ∈ [l1, u1] ∧ C1 → C2

(similarly, for R = ⟨(l1, l2), (u1, u2)⟩, conf(R), sup(R), and gain(R) are defined to be the conf, sup, and gain for A1 ∈ [l1, u1] ∧ A2 ∈ [l2, u2] ∧ C1 → C2). In addition, for a set of nonoverlapping regions

    S = {R1, R2, ..., Rj}, Ri = [li1, ui1],

suppose we define conf(S), sup(S), and gain(S) to be the conf, sup, and gain, respectively, of the rule

    (A1 ∈ [l11, u11] ∨ ··· ∨ A1 ∈ [lj1, uj1]) ∧ C1 → C2.

For two dimensions, in which case each

    Ri = ⟨(li1, li2), (ui1, ui2)⟩,

conf(S), sup(S), and gain(S) are defined to be the conf, sup, and gain, respectively, of the rule

    326 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 15, NO. 2, MARCH/APRIL 2003


    (A1 ∈ [l11, u11] ∧ A2 ∈ [l12, u12]) ∨ ··· ∨ (A1 ∈ [lj1, uj1] ∧ A2 ∈ [lj2, uj2]) ∧ C1 → C2.

Then, since R1, ..., Rj are nonoverlapping regions, the following hold for set S:

    sup(S) = sup(R1) + ··· + sup(Rj)

    conf(S) = (sup(R1) · conf(R1) + ··· + sup(Rj) · conf(Rj)) / (sup(R1) + ··· + sup(Rj))

    gain(S) = gain(R1) + ··· + gain(Rj).

Having defined the above notation, we present below the formulation of the optimized association rule problem for gain.

Problem Definition (Optimized Gain). Given k, determine a set S containing at most k regions such that, for each region Ri ∈ S, conf(Ri) ≥ minConf and gain(S) is maximized.

We refer to the set S as the optimized gain set.

Example 3.1. Consider the telecom service provider database (discussed in Section 1) containing call detail data for a one week period. Fig. 1 presents the summary of the relation for the seven days; the summary information includes, for each date, the total # of calls made on the date, the # of calls from NY, and the # of calls from NY to France. Also included in the summary are the support, confidence, and gain, for each date v, of the rule

    date = v ∧ src_city = NY → dst_country = France.

The total number of calls made during the week is 2,000. Suppose we are interested in discovering the interesting periods with heavy call volume from NY to France (a period is a range of consecutive days). Then, the following uninstantiated association rule can be used:

    date ∈ [l, u] ∧ src_city = NY → dst_country = France.

In the above rule, U is date ∈ [l, u], C1 is src_city = NY and C2 is dst_country = France. Let us assume that we are interested in at most two periods (that is, k = 2) with minConf = 0.50. An optimized gain set is {[5, 5], [7, 7]}; we require up to two periods such that the percentage of calls during each of the periods from NY that are to France is at least 50 percent and the gain is maximized. Of the possible periods [1, 1], [5, 5], and [7, 7], the gain in period [5, 5] is 12.5 × 10⁻³ and both [1, 1] and [7, 7] have gains of 2.5 × 10⁻³. Thus, both {[5, 5], [7, 7]} and {[1, 1], [5, 5]} are optimized gain sets.
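The example can be checked by brute force. Since Fig. 1's per-day counts are not reproduced in this transcript, the sketch below uses hypothetical per-day gains (only days 1, 5, and 7 match the values quoted in the example) and enumerates all sets of at most k disjoint periods. Note that a period satisfies conf ≥ minConf exactly when its gain is nonnegative, because gain(R) = sup(R) · (conf(R) - minConf).

```python
from itertools import combinations

# Per-day gains (in units of 10^-3) for the rule date = v ∧ src_city = NY
# → dst_country = France. Days 1, 5, and 7 match the gains quoted in
# Example 3.1; the negative values for the other days are hypothetical.
gain = {1: 2.5, 2: -5.0, 3: -5.0, 4: -5.0, 5: 12.5, 6: -5.0, 7: 2.5}

def interval_gain(l, u):
    return sum(gain[d] for d in range(l, u + 1))

# A period [l, u] is feasible iff its gain is >= 0, which is equivalent
# to its confidence being at least minConf.
periods = [(l, u) for l in gain for u in gain
           if l <= u and interval_gain(l, u) >= 0]

def disjoint(a, b):
    return a[1] < b[0] or b[1] < a[0]

# Enumerate all sets of at most k = 2 pairwise disjoint feasible periods
# and pick one with maximum total gain.
best = max(
    (s for r in (1, 2) for s in combinations(periods, r)
     if all(disjoint(x, y) for x, y in combinations(s, 2))),
    key=lambda s: sum(interval_gain(l, u) for l, u in s),
)
```

With these gains, the maximum total is 15 × 10⁻³, attained by {[5, 5], [7, 7]} and by {[1, 1], [5, 5]}, matching the two optimized gain sets in the example.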

In the remainder of the paper, we shall assume that the support, confidence, and gain for every point in a region are available; these can be computed by performing a single pass over the relation. The points, along with their supports, confidences, and gains, thus constitute the input to our algorithms. Thus, the input size is n for the one-dimensional case, while, for the two-dimensional case, it is n².

4 ONE NUMERIC ATTRIBUTE

In this section, we tackle the problem of computing the optimized gain set when association rules contain a single uninstantiated numeric attribute. Thus, the uninstantiated rule has the form A1 ∈ [l1, u1] ∧ C1 → C2, where A1 is the uninstantiated numeric attribute. We propose an algorithm with linear time complexity for computing the optimized gain set (containing up to k nonoverlapping intervals) in Section 4.2. But first, in Section 4.1, we present preprocessing algorithms for collapsing certain contiguous ranges of values in the domain of the attribute into a single bucket, thus reducing the size of the input n.

    4.1 Bucketing

For the one-dimensional case, each region is an interval and, since the domain size is n, the number of possible intervals is O(n²). Now, suppose we could split the range 1, 2, ..., n into b buckets, where b < n, and map every value in A1's domain to the bucket containing it. Then, the new domain of A1 becomes {1, 2, ..., b} and the number of intervals to be considered becomes O(b²), which could be much smaller, thus reducing the time and space complexity of our algorithms. Note that the reduction in space complexity also results in reduced memory requirements for our algorithms.

In the following, we present a bucketing algorithm that 1) does not compromise the optimality of the optimized set (that is, the optimized set computed on the buckets is identical to the one computed using the raw domain values) and 2) has time complexity O(n). The output of the algorithm is the b buckets with their supports, confidences, and gains, and this becomes the input to the algorithm for computing the optimized gain set in Section 4.2.

For optimized gain sets, we begin by making the following simple observation: values in A1's domain whose confidence is exactly minConf have a gain of 0 and can thus be ignored. Including these values in the optimized gain set does not affect the gain of the set and, so, we can assume that, for every value in {1, 2, ..., n}, either the confidence is greater than minConf or less than minConf.

The bucketing algorithm for optimized gain collapses contiguous values whose confidence is greater than minConf into a single bucket. It also combines contiguous values each of whose confidence is less than minConf into a


    Fig. 1. Summary of call detail data for a one week period.


single bucket. Thus, for any interval assigned to a bucket, it is the case that either all values in the interval have confidence greater than minConf or all values in the interval have confidence less than minConf.

For instance, let the domain of A1 be {1, 2, ..., 6} and the confidences of 1, 2, 5, and 6 be greater than minConf, while the confidences of 3 and 4 are less than minConf. This is illustrated in Fig. 2, with symbols indicating a positive or negative gain for each domain value. Then, our bucketing scheme generates three buckets: the first containing values 1 and 2, the second values 3 and 4, and the third containing values 5 and 6. It is straightforward to observe that assigning values to buckets can be achieved by performing a single pass over the input data and thus has linear time complexity.
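The single-pass bucketing just described can be sketched as follows; the gain magnitudes are made up, but their signs follow the Fig. 2 illustration (values 1, 2, 5, and 6 positive; 3 and 4 negative).

```python
def buckets(gains):
    """Collapse contiguous domain values whose gain has the same sign
    into one bucket; zero-gain values (conf == minConf) are ignored,
    as observed above. Returns (bucket_gain, [values]) pairs.
    A single pass over the n values, so O(n) time."""
    out = []
    for v, g in enumerate(gains, start=1):
        if g == 0:                      # conf exactly minConf: skip
            continue
        if out and (out[-1][0] > 0) == (g > 0):
            prev_g, prev_vals = out[-1]         # same sign: extend bucket
            out[-1] = (prev_g + g, prev_vals + [v])
        else:
            out.append((g, [v]))                # sign change: new bucket
    return out

# Hypothetical per-value gains with the sign pattern of Fig. 2.
print(buckets([4, 2, -3, -1, 5, 1]))
# [(6, [1, 2]), (-4, [3, 4]), (6, [5, 6])]
```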

In order to show that the above bucketing algorithm does not violate the optimality of the optimized set, we use the result of the following theorem.

Theorem 4.1. Let S be an optimized gain set. Then, for any interval [u, v] in S, it is the case that

    conf([u - 1, u - 1]) < minConf,
    conf([v + 1, v + 1]) < minConf,
    conf([u, u]) > minConf,

and conf([v, v]) > minConf.

Proof. Note that conf([u, v]) > minConf. As a result, if conf([u - 1, u - 1]) > minConf, then conf([u - 1, v]) > minConf and, since gain([u - 1, u - 1]) > 0,

    gain([u - 1, v]) > gain([u, v]).

Thus, the set S - {[u, v]} ∪ {[u - 1, v]} has higher gain and is the optimized gain set, thus leading to a contradiction. A similar argument can be used to show that conf([v + 1, v + 1]) < minConf.

On the other hand, if conf([u, u]) < minConf, then gain([u, u]) < 0 and, since conf([u, v]) > minConf, conf([u + 1, v]) > minConf. Also,

    gain([u + 1, v]) > gain([u, v])

and, thus, the set S - {[u, v]} ∪ {[u + 1, v]} has higher gain and is the optimized gain set, thus leading to a contradiction. A similar argument can be used to show that conf([v, v]) > minConf. □

From the above theorem, it follows that if [u, v] is an interval in the optimized set, then the values u and u - 1 cannot both have confidences greater than or less than minConf; the same holds for the values v and v + 1. Thus, for a set of contiguous values, if the confidence of each and every value is greater than (or is less than) minConf, then the optimized gain set either contains all of the values or none of them. Thus, an interval in the optimized set either contains all the values in a bucket or none of them; as a result, the optimized set can be computed using the buckets instead of the original values in the domain.

    4.2 Algorithm for Computing Optimized Gain Set

In this section, we present an O(bk) algorithm for the optimized gain problem for one dimension. The input to the algorithm is the b buckets generated by our bucketing scheme in Section 4.1, along with their confidences, supports, and gains. The problem is to determine a set of at most k (nonoverlapping) intervals such that the confidence of each interval is greater than or equal to minConf and the gain of the set is maximized.

    Note that, due to our bucketing algorithm, bucketsadjacent to a bucket with positive gain have negative gainand vice versa. Thus, if there are at most k buckets withpositive gain, then these buckets constitute the desired


    Fig. 2. Example of buckets generated.

    Fig. 3. Algorithm for computing optimized gain set.


optimized gain set. Otherwise, procedure optGain1D, shown in Fig. 3, is used to compute the optimized set. For an interval I, we denote by max(I) the subinterval of I with maximum gain. Also, we denote by min(I) the subinterval of I whose gain is minimum. Note that, for an interval I, min(I) and max(I) can be computed in time that is linear in the size of the interval. This is due to the following dynamic programming relationship for the gain of the subinterval of I with the maximum gain and ending at point u (denoted by max(u)):

    max(u) = max{gain([u, u]), max(u - 1) + gain([u, u])}.
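This recurrence is the classic linear-time maximum-sum-subinterval recursion; a short sketch (with hypothetical bucket gains) that also tracks the endpoints of the best subinterval:

```python
def max_subinterval(gains):
    """Best (gain, (l, u)) over subintervals of buckets 1..len(gains),
    using max(u) = max{gain([u,u]), max(u-1) + gain([u,u])}."""
    best = None
    cur, start = 0, 1
    for u, g in enumerate(gains, start=1):
        if cur <= 0:            # restarting at u beats extending
            cur, start = g, u
        else:
            cur += g
        if best is None or cur > best[0]:
            best = (cur, (start, u))
    return best

def min_subinterval(gains):
    """The analogous recurrence for the minimum-gain subinterval,
    obtained by negating the gains."""
    g, (l, u) = max_subinterval([-x for x in gains])
    return -g, (l, u)

print(max_subinterval([10, -15, 20, -15, 20, -15]))   # (25, (3, 5))
print(min_subinterval([10, -15, 20, -15, 20, -15]))
```

A single left-to-right pass suffices, which is why min(I) and max(I) are computable in time linear in the size of the interval.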

(A similar relationship can be derived for the subinterval with minimum gain.)

The k desired intervals are computed by optGain1D in k iterations; the ith iteration computes the i intervals with the maximum gain using the results of the (i - 1)th iteration. After the (i - 1)th iteration, PSet is the optimized gain set containing i - 1 intervals, while the remaining intervals not in PSet are stored in NSet. After Pq and Nq have been computed, as described in Steps 3-4, if

    gain(min(Pq)) + gain(max(Nq)) < 0,

then it follows that the gain of min(Pq) is more negative than the gain of max(Nq) is positive. Thus, the best strategy for maximizing gain is to split Pq into two subintervals, using min(Pq) as the splitting interval, and include the two subintervals in the optimized gain set (Steps 6-8). On the other hand, if gain(min(Pq)) + gain(max(Nq)) ≥ 0, then the gain can be maximized by adding max(Nq) to the optimized gain set (Steps 11-13). Note that if PSet/NSet is empty, then we cannot compute Pq/Nq and, so, gain(min(Pq))/gain(max(Nq)) in Step 5 is 0.

Example 4.2. Consider the six buckets 1, 2, ..., 6 with gains 10, -15, 20, -15, 20, and -15 shown in Fig. 4a. We trace the execution of optGain1D assuming that we are interested in computing the optimized gain set containing two intervals.

Initially, NSet is set to {[1, 6]} (see Fig. 4a). During the first iteration of optGain1D, Nq is [1, 6] since it is the only interval in NSet. Furthermore, max(Nq) = [3, 5] (the dark subinterval in Fig. 4a) and gain(max(Nq)) = 25. Since PSet is empty, gain(min(Pq)) = 0 and Nq is split into three intervals [1, 2], [3, 5], and [6, 6], of which [3, 5] is added to PSet and [1, 2] and [6, 6] are added to NSet (after deleting [1, 6] from it). The sets PSet and NSet at the end of the first iteration are depicted in Fig. 4b.

In the second iteration, Pq = [3, 5] (min(Pq) = [4, 4]) and Nq = [1, 2] (max(Nq) = [1, 1]) (since gain(max([1, 2])) = 10 is larger than gain(max([6, 6])) = -15). Thus, since

    gain(min(Pq)) + gain(max(Nq)) = -5,

[3, 5] is split into three intervals [3, 3], [4, 4], and [5, 5], of which [3, 3] and [5, 5] are added to PSet (after deleting [3, 5] from it), which is the desired optimized gain set. The dark subintervals in Fig. 4c denote the minimum and maximum gain subintervals of Pq and Nq, respectively, and the final intervals in PSet and NSet (after the second iteration) are depicted in Fig. 4d.
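The trace above can be reproduced with a compact sketch of optGain1D. Fig. 3 itself is not part of this transcript, so the step structure, the empty-set convention (a missing Pq or Nq contributes gain 0), and the early-exit guard below are assumptions based on the description in the text.

```python
def best_sub(gain, l, u, sign):
    """Max-gain (sign=+1) or min-gain (sign=-1) subinterval of [l, u],
    as (signed_gain, left, right), via the linear-time recurrence."""
    best = cur = None
    for x in range(l, u + 1):
        g = sign * gain[x - 1]
        cur = (g, x, x) if cur is None or cur[0] <= 0 else (cur[0] + g, cur[1], x)
        if best is None or cur[0] > best[0]:
            best = cur
    return (sign * best[0], best[1], best[2])

def opt_gain_1d(gain, k):
    """Greedy optGain1D sketch over buckets 1..len(gain)."""
    b = len(gain)
    pset, nset = [], [(1, b)]
    for _ in range(k):
        # Pq: interval in PSet with the most negative min-subinterval;
        # Nq: interval in NSet with the largest max-subinterval.
        pq = min(((best_sub(gain, l, u, -1), (l, u)) for l, u in pset),
                 default=((0, 0, 0), None))
        nq = max(((best_sub(gain, l, u, +1), (l, u)) for l, u in nset),
                 default=((0, 0, 0), None))
        (gmin, a, c), p = pq
        (gmax, x, y), n = nq
        if p is not None and gmin + gmax < 0:
            # Split Pq around its min-gain subinterval [a, c].
            pset.remove(p)
            pset += [iv for iv in [(p[0], a - 1), (c + 1, p[1])] if iv[0] <= iv[1]]
            nset.append((a, c))
        elif n is not None and gmax > 0:
            # Move max(Nq) = [x, y] into PSet; keep the leftovers in NSet.
            nset.remove(n)
            pset.append((x, y))
            nset += [iv for iv in [(n[0], x - 1), (y + 1, n[1])] if iv[0] <= iv[1]]
        else:
            break   # no gain-improving move left (an added robustness guard)
    return sorted(pset)

# The buckets of Example 4.2:
print(opt_gain_1d([10, -15, 20, -15, 20, -15], k=2))   # [(3, 3), (5, 5)]
```

Running it on the Example 4.2 gains reproduces the trace: after one iteration PSet is {[3, 5]}, and after the second it is {[3, 3], [5, 5]}.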

We can show that the above simple greedy strategy computes the i intervals with the maximum gain (in the ith iteration). We first show that, after the ith iteration, the intervals in PSet and NSet satisfy the following conditions (let the i intervals in PSet be P1, ..., Pi and the remaining intervals in NSet be N1, ..., Nj).

. Cond 1. Let [u, v] be an interval in PSet. For all u ≤ l ≤ v, gain([u, l]) ≥ 0 and gain([l, v]) ≥ 0.

. Cond 2. Let [u, v] be an interval in NSet. For all u ≤ l ≤ v, gain([u, l]) ≤ 0 (except when u = 1) and gain([l, v]) ≤ 0 (except when v = b).

. Cond 3. For all 1 ≤ l ≤ i, 1 ≤ m ≤ j, gain(Pl) ≥ gain(max(Nm)).

. Cond 4. For all 1 ≤ l ≤ i, 1 ≤ m ≤ j, gain(min(Pl)) ≥ gain(Nm) (except for Nm that contains one of the endpoints, 1 or b).


Fig. 4. Execution trace of procedure optGain1D. (a) Before first iteration, (b) after first iteration, (c) before second iteration, and (d) after second iteration.

  • 8/12/2019 Mining Optimized Gain Rules

    7/15

. Cond 5. For all 1 ≤ l, m ≤ i, l ≠ m, gain(min(Pl)) + gain(Pm) ≥ 0.

. Cond 6. For all 1 ≤ l, m ≤ j, l ≠ m, gain(max(Nl)) + gain(Nm) ≤ 0 (except for Nm that contain one of the endpoints, 1 or b).

For an interval [u, v] in PSet or NSet, Conditions 1 and 2 state properties about the gain of its subintervals that contain u or v. Simply put, they state that extending or shrinking the intervals in PSet does not cause its gain to increase. Condition 3 states that the gain of PSet cannot be increased by replacing an interval in PSet by one contained in NSet, while Conditions 4 and 5 state that splitting an interval in PSet and merging two other adjacent intervals in it, or deleting an interval from it, cannot increase its gain either. Finally, Condition 6 covers the case in which two adjacent intervals in PSet are merged and an additional interval from NSet is added to it; Condition 6 states that these actions cannot cause PSet's gain to increase.

Lemma 4.3. After the ith iteration of procedure optGain1D, the intervals in PSet and NSet satisfy Conditions 1-6.

Proof. See the Appendix. □

We can also show that any set of i intervals (in PSet) that satisfies all six of the above conditions is optimal with respect to gain.

Lemma 4.4. Any set of i intervals satisfying Conditions 1-6 is an optimized gain set.

Proof. See the Appendix. □

From the above two lemmas, we can conclude that, at the end of the ith iteration, procedure optGain1D computes the optimized gain set containing i intervals (in PSet).

Theorem 4.5. Procedure optGain1D computes the optimized gain set.

It is straightforward to observe that the time complexity of procedure optGain1D is O(bk) since it performs k iterations and, in each iteration, the intervals Pq and Nq can be computed in O(b) steps.

5 TWO NUMERIC ATTRIBUTES

We next consider the problem of mining the optimized gain set for the case when there are two uninstantiated numeric attributes. In this case, we need to compute a set of k nonoverlapping rectangles in two-dimensional space whose gain is maximum. Unfortunately, this problem is NP-hard [10]. In the following section, we describe a dynamic programming algorithm with polynomial time complexity that computes approximations to optimized sets.

5.1 Approximation Algorithm Using Dynamic Programming

The procedure optGain2D (see Fig. 5) for computing approximate optimized gain sets is a dynamic programming algorithm that uses simple end-to-end horizontal and vertical cuts for splitting each rectangle into two subrectangles. Procedure optGain2D accepts as input parameters the coordinates of the lower left (i, j) and upper right (p, q) points of the rectangle for which the optimized set is to be computed. These two points completely define the rectangle. The final parameter is the bound on the number of rectangles that the optimized set can contain. The array optSet[(i, j), (p, q), k] is used to store the optimized set with size at most k for the rectangle, thus preventing recomputations of the optimized set for the rectangle. The confidence, support, and gain for each rectangle is precomputed; this


Fig. 5. Dynamic programming algorithm for computing optimized gain set.


can be done in O(n⁴) steps, which is proportional to the total number of rectangles possible.

In optGain2D, the rectangle ⟨(i, j), (p, q)⟩ is first split into two subrectangles using vertical cuts (Steps 6-13), and later horizontal cuts are employed (Steps 14-21). For k > 1, vertical cuts between i and i + 1, i + 1 and i + 2, ..., p - 1 and p are used to divide rectangle ⟨(i, j), (p, q)⟩ into subrectangles ⟨(i, j), (l, q)⟩ and ⟨(l + 1, j), (p, q)⟩ for all i ≤ l ≤ p - 1. For every pair of subrectangles generated above, optimized sets of size k1 and k2 are computed by recursively invoking optGain2D for all k1, k2 such that k1 + k2 = k. An optimization can be employed in case k = 1 (Step 7): instead of considering every vertical cut, it suffices to consider only the vertical cuts at the ends, since the single optimized rectangle must be contained in either ⟨(i, j), (p - 1, q)⟩ or ⟨(i + 1, j), (p, q)⟩. After similarly generating pairs of subrectangles using horizontal cuts, the optimized set for the original rectangle is set to the union of the optimized sets for the pair with the maximum gain (function maxGainSet returns the set with the maximum gain from among its inputs).

Example 5.1. Consider the two-dimensional rectangle ⟨(0, 0), (2, 2)⟩ in Fig. 6 for two numeric attributes, each with domain {0, 1, 2}. We trace the execution of optGain2D for computing the optimized gain set containing two nonoverlapping rectangles.

Consider the first invocation of optGain2D. In the body of the procedure, the variable l is varied from 0 to 1 for both vertical and horizontal cuts (Steps 10 and 18). Further, the only value for variable m is 1 since k is 2 in the first invocation. Thus, in Steps 10-12 and 18-20, the rectangle ⟨(0, 0), (2, 2)⟩ is cut at two points in each of the vertical and horizontal directions and optGain2D is called recursively for the two subrectangles due to each cut. These subrectangle pairs for the vertical and horizontal cuts are illustrated in Figs. 7 and 8, respectively.

Continuing further with the execution of the first recursive call of optGain2D with rectangle ⟨(0, 0), (0, 2)⟩ and k = 1, observe that the boundary points of the rectangle along the horizontal axis are the same. As a result, since k = 1, in Step 16, optGain2D recursively invokes itself with two subrectangles, each of whose size is one unit smaller along the vertical axis. These subrectangles are depicted in Fig. 9. The process of recursively splitting rectangles is repeated for the newly generated subrectangles as well as rectangles due to previous cuts.
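The recursion can be sketched as a memoized guillotine-cut DP. Fig. 5 is not reproduced in this transcript, so this is a simplified reading of optGain2D rather than the exact procedure: cell gains are given directly for a hypothetical 3 × 3 grid, k1 is allowed to be 0 (which lets single rectangles be carved out), and the constraint conf(R) ≥ minConf is encoded as gain(R) ≥ 0, which is equivalent because gain(R) = sup(R) · (conf(R) - minConf).

```python
from functools import lru_cache

# Hypothetical 3x3 grid of cell gains (domains {0,1,2} x {0,1,2}).
GAIN = [[5, -1, -1],
        [-1, -1, -1],
        [-1, -1, 7]]

def rect_gain(i, j, p, q):
    return sum(GAIN[x][y] for x in range(i, p + 1) for y in range(j, q + 1))

@lru_cache(maxsize=None)
def opt_gain_2d(i, j, p, q, k):
    """Best (gain, rectangles) using at most k nonoverlapping
    rectangles inside <(i,j),(p,q)>, combining end-to-end cuts."""
    best = (0, ())                                    # empty set
    if k >= 1 and rect_gain(i, j, p, q) >= 0:         # the whole rectangle
        best = max(best, (rect_gain(i, j, p, q), (((i, j), (p, q)),)))
    if k >= 1:
        for l in range(i, p):                         # vertical cuts
            for k1 in range(k + 1):
                g1, r1 = opt_gain_2d(i, j, l, q, k1)
                g2, r2 = opt_gain_2d(l + 1, j, p, q, k - k1)
                best = max(best, (g1 + g2, r1 + r2))
        for l in range(j, q):                         # horizontal cuts
            for k1 in range(k + 1):
                g1, r1 = opt_gain_2d(i, j, p, l, k1)
                g2, r2 = opt_gain_2d(i, l + 1, p, q, k - k1)
                best = max(best, (g1 + g2, r1 + r2))
    return best

g, rects = opt_gain_2d(0, 0, 2, 2, 2)
print(g, sorted(rects))   # 12 [((0, 0), (0, 0)), ((2, 2), (2, 2))]
```

On this grid, the two unit rectangles at opposite corners are reachable through guillotine cuts, so the approximation happens to be exact; Section 5.2 discusses configurations where no sequence of end-to-end cuts can separate the optimal rectangles.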

The number of points input to our dynamic programming algorithm for the two-dimensional case is N = n², since n is the size of the domain of each of the two uninstantiated numeric attributes.


Fig. 6. Rectangle ⟨(0, 0), (2, 2)⟩.

    Fig. 7. Vertical cuts for optGain2D((0,0),(2,2),2).

    Fig. 8. Horizontal cuts for optGain2D((0,0),(2,2),2).


Theorem 5.2. The time complexity of Procedure optGain2D is O(N^2.5 k²).

Proof. The complexity of procedure optGain2D is simply the number of times procedure optGain2D is invoked, multiplied by a constant. The reason for this is that steps in optGain2D that do not have constant overhead (e.g., the for loops in Steps 12 and 13) result in recursive calls to optGain2D. Thus, the overhead of these steps is accounted for in the count of the number of calls to optGain2D. Consider an arbitrary rectangle ⟨(i, j), (p, q)⟩ and consider an arbitrary 1 ≤ l ≤ k. We show that optGain2D is invoked with the above parameters at most 4nk times. Thus, since the number of rectangles is at most n⁴ and l can take k possible values, the complexity of the algorithm is O(n⁵k²), which equals O(N^2.5 k²) since N = n².

We now show that optGain2D with a given set of parameters can be invoked at most 4nk times. The first observation is that optGain2D with rectangle ⟨(i, j), (p, q)⟩ and l is invoked only from a different invocation of optGain2D with a rectangle that, on being cut vertically or horizontally, yields rectangle ⟨(i, j), (p, q)⟩, and with a value for k that lies between l and k. The number of such rectangles is at most 4n; each of these rectangles can be obtained from ⟨(i, j), (p, q)⟩ by stretching it in one of four directions, and there are only n possibilities for stretching a rectangle in any direction. Thus, each invocation of optGain2D can result from 4nk different invocations of optGain2D. Furthermore, since the body of optGain2D for a given set of input parameters is executed only once, optGain2D with a given input is invoked at most 4nk times. □

5.2 Optimality Results

Procedure optGain2D's approach of splitting each rectangle into two subrectangles and then combining the optimized sets for each subrectangle may not yield the optimized set for the original rectangle. This point is further illustrated in Fig. 10a, which shows a rectangle and the optimized set of rectangles for it. It is obvious that there is no way to split the rectangle into two subrectangles such that each rectangle in the optimized set is completely contained in one of the subrectangles. Thus, a dynamic programming approach that considers all possible splits of the rectangle into two subrectangles (using horizontal and vertical end-to-end cuts) and then combines the optimized sets for the subrectangles may not result in the optimized set for the original rectangle being computed.

In the following, we first identify restrictions under which optGain2D yields optimized sets. We then show bounds on how far the computed approximation for the general case can deviate from the optimal solution.

Let us define a set of rectangles to be binary space partitionable if it is possible to recursively partition the plane such that no rectangle is cut and each partition contains at most one rectangle. The set of rectangles in Fig. 10b is binary space partitionable (the bold lines are a partitioning of the rectangles); however, the set in Fig. 10a is not.

If we are willing to restrict the optimized set to only binary space partitionable rectangles, then we can show that procedure optGain2D computes the optimized set. Note that any set of three or fewer rectangles in a plane is always binary space partitionable. Thus, for k ≤ 3, optGain2D computes the optimized gain set.
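A small Python sketch (our own; the function name and representation are not from the paper) makes the definition operational by searching for a sequence of end-to-end cuts that never slices a rectangle:

```python
def is_bsp(rects):
    """rects: list of (x0, y0, x1, y1) with x0 < x1, y0 < y1, pairwise
    nonoverlapping. Returns True iff the set is binary space partitionable:
    it can be recursively split by full horizontal/vertical cuts so that no
    rectangle is cut and each final partition holds at most one rectangle."""
    if len(rects) <= 1:
        return True
    for axis in (0, 1):  # 0: vertical cuts on x, 1: horizontal cuts on y
        lo, hi = axis, axis + 2
        # candidate cut positions: rectangle edges along this axis
        for c in {r[lo] for r in rects} | {r[hi] for r in rects}:
            left = [r for r in rects if r[hi] <= c]
            right = [r for r in rects if r[lo] >= c]
            # valid cut: slices no rectangle and makes progress on both sides
            if left and right and len(left) + len(right) == len(rects):
                if is_bsp(left) and is_bsp(right):
                    return True
    return False
```

On a pinwheel-style arrangement of four rectangles analogous to Fig. 10a, every candidate cut slices some rectangle, so the search fails; any three of them pass, matching the k ≤ 3 claim above.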

Theorem 5.3. Procedure optGain2D computes the optimized set of binary space partitionable rectangles.

Proof. The proof is by induction on the size of the rectangles that optGain2D is invoked with.

Basis. For all 1 ≤ l ≤ k, optGain2D can be trivially shown to compute the optimized binary space partitionable set for the unit rectangle ((i,i),(i,i)) (if the confidence of the rectangle is at least minConf, then the optimized set is the rectangle itself).

Induction. We next show that, for any 1 ≤ l ≤ k and rectangle ((i,j),(p,q)), the algorithm computes the optimized binary space partitionable set (assuming that, for all its subrectangles, for all 1 ≤ l ≤ k, the algorithm computes the optimized binary space partitionable set). We need to consider three cases. The first is when the optimized set is the rectangle ((i,j),(p,q)) itself. In this case, conf((i,j),(p,q)) ≥ minConf and optSet is correctly set to {((i,j),(p,q))} by optGain2D. In case l = 1, then, since conf((i,j),(p,q)) < minConf, the optimized rectangle must be contained in one of the four largest subrectangles in ((i,j),(p,q)) and, thus, among the optimized sets of size 1 for these subrectangles, the one whose support is maximum is the optimized set for ((i,j),(p,q)). Finally, if l > 1, then, since we are interested in computing the optimized binary space partitionable set, there must exist a horizontal or vertical cut that cleanly partitions the optimized rectangles, one of the two subrectangles due

    332 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 15, NO. 2, MARCH/APRIL 2003

Fig. 9. Smaller rectangles recursively considered by optGain2D((0,0),(0,2),1).

Fig. 10. Binary space partitionable rectangles.


to the cut containing r rectangles of the optimized set, for some 1 ≤ r < l, and the other containing the remaining l - r rectangles. Thus, since, in optGain2D, all possible cuts are considered and optSet is then set to the union of the optimized sets for the subrectangles such that the resulting gain is maximum, due to the induction hypothesis, it follows that this is the optimized gain set for the rectangle ((i,j),(p,q)). □

We next use this result in order to show that, in the general case, the approximate optimized gain set computed by procedure optGain2D is within a factor of 1/4 of the optimized gain set. The proof also uses a result from [1] in which it is shown that, for any set of rectangles in a plane, there exists a binary space partitioning (that is, a recursive partitioning) of the plane such that each rectangle is cut into at most four subrectangles and each partition contains at most one subrectangle.

Theorem 5.4. Procedure optGain2D computes an optimized gain set whose gain is greater than or equal to 1/4 times the gain of the optimized gain set.

Proof. From the result in [1], it follows that it is possible to partition each rectangle in the optimized set into four subrectangles such that the set of subrectangles is binary space partitionable. Furthermore, for each rectangle, consider its subrectangle with the highest gain. The gain of each such subrectangle is at least 1/4 times the gain of the original rectangle. Thus, the set of these subrectangles is binary space partitionable and has at least 1/4 of the gain of the optimized set. As a result, due to Theorem 5.3 above, it follows that the optimized set computed by optGain2D has gain that is at least 1/4 times the gain of the optimized set. □

    6 EXPERIMENTAL RESULTS

In this section, we study the performance of our algorithms for computing optimized gain sets for the one-dimensional and two-dimensional cases. In particular, we show that our algorithm is highly scalable for one dimension. For instance, we can tackle attribute domains with sizes as high as one million in a few minutes. For two dimensions, however, the high time and space complexities of our dynamic programming algorithm make it less suitable for large domain sizes and a large number of disjunctions. We also present results of our experiments with a real-life population survey data set, where optimized gain sets enable us to discover interesting correlations among attributes.

In our experiments, the data file is read only once at the beginning in order to compute the gain for every point. The time for this, in most cases, constitutes a tiny fraction of the total execution time of our algorithms. Thus, we do not include the time spent on reading the data file in our results. Furthermore, note that the performance of our algorithms does not depend on the number of tuples in the data file; it is more sensitive to the size of the attribute's domain n and the number of intervals k. We fixed the number of tuples in the data file to be 10 million in all our experiments. Our experiments were performed on a Sun Ultra-2/200 machine with 512 MB of RAM, running Solaris 2.5.

6.1 Performance Results on Synthetic Data Sets

The association rule that we experimented with has the form U ∧ C1 → C2, where U contains one or two uninstantiated attributes (see Section 3) whose domains consist of integers ranging from 1 to n. Every instantiation of U ∧ C1 → C2 (that is, every point in m-dimensional space) is assigned a randomly generated confidence between 0 and 1 with uniform distribution. Each value in m-dimensional space is also assigned a randomly generated support between 0 and 2/n^m with uniform distribution; thus, the average support for a value is 1/n^m.
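This generator is easy to reproduce. The sketch below is our reconstruction for the one-dimensional case (m = 1); the per-point gain formula sup × (conf - minConf) is our reading of the gain definition and is included only to show what an optimized interval accumulates:

```python
import random

def gen_points(n, min_conf=0.5, seed=42):
    """One-dimensional synthetic data as in the experiments: each domain
    value gets a confidence uniform in [0, 1] and a support uniform in
    [0, 2/n], so the average support is 1/n. The third field is the
    per-point gain sup * (conf - min_conf) (our reading of the paper)."""
    rng = random.Random(seed)
    pts = []
    for _ in range(n):
        conf = rng.random()
        sup = rng.uniform(0.0, 2.0 / n)
        pts.append((sup, conf, sup * (conf - min_conf)))
    return pts

pts = gen_points(1000)
avg_sup = sum(p[0] for p in pts) / len(pts)  # close to 1/n = 0.001
```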

    6.1.1 One-Dimensional Data

Bucketing. We begin by studying the reduction in input size due to the bucketing optimization. Table 1 illustrates the number of buckets for domain sizes ranging from 500 to 100,000 when minConf is set to 0.5. From the table, it follows that bucketing can result in reductions in input size as high as 65 percent.
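The bucketing optimization itself is defined earlier in the paper; as a rough reconstruction, contiguous domain values whose confidences all lie on the same side of minConf can be coalesced, since an optimized interval's endpoints can be assumed to align with the boundaries of such runs:

```python
def bucketize(confs, min_conf):
    """Coalesce contiguous values that are all >= min_conf or all < min_conf.
    Returns a list of (start, end) index ranges, one per bucket."""
    runs = []
    for i, c in enumerate(confs):
        above = c >= min_conf
        if runs and runs[-1][2] == above:
            runs[-1][1] = i               # extend the current run
        else:
            runs.append([i, i, above])    # start a new run
    return [(s, e) for s, e, _ in runs]
```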

Scale-up with n. The graph in Fig. 11a plots the execution times of our algorithm for computing optimized gain sets as the domain size is increased from 100,000 to 1 million, for a minConf value of 0.5. Note that, for this experiment, we turned off the bucketing optimization; the running times would be even smaller if we were to employ bucketing to reduce the input size. The experiments validate our earlier analytical results on the O(bk) time complexity of procedure optGain1D. As can be seen from Fig. 11, our optimized gain set algorithm scales linearly with the domain size as well as k.
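For intuition about this linear behavior, the following O(nk) dynamic program over per-point gains is our illustration in the spirit of optGain1D (the paper's actual pseudocode appears in an earlier section); it returns only the best total gain of up to k disjoint intervals:

```python
def opt_gain_1d(gain, k):
    """Max total gain of up to k disjoint intervals over the point gains,
    in O(n * k) time: a left-to-right DP whose states track the number of
    intervals used and whether the latest interval is still open."""
    NEG = float("-inf")
    closed = [0.0] + [NEG] * k   # closed[j]: best with j intervals, none open
    opened = [NEG] * (k + 1)     # opened[j]: best with j intervals, last open
    for g in gain:
        new_opened = [NEG] * (k + 1)
        for j in range(1, k + 1):
            # extend the currently open interval, or open the j-th one here
            new_opened[j] = max(opened[j], closed[j - 1]) + g
        opened = new_opened
        for j in range(1, k + 1):
            closed[j] = max(closed[j], opened[j])  # option: close it here
    return max(closed)  # closed[0] = 0 covers "use no interval"
```

With bucketing, the n domain values would first shrink to b buckets, after which the same DP runs in O(bk).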

Sensitivity to minConf. Fig. 11b depicts the running times of our algorithm for a range of confidence values and a domain size of 500,000. From the graphs, it follows that the performance of procedure optGain1D is not affected by values of minConf.

    6.1.2 Two-Dimensional Data

Scale-up with n. The graph in Fig. 12a plots the execution times of our dynamic programming algorithm for computing optimized gain sets as the domain sizes n × n are increased from 10 × 10 to 50 × 50 in increments of 10. The value of minConf is set to 0.5 and, in the figure, each point is represented along the x-axis with the value of N = n^2 (e.g., 10 × 10 is represented as 100). Note that, for this experiment, we do not perform bucketing since this optimization is applicable only to one-dimensional data. The running times corroborate our earlier analysis of the O(N^2.5 k^2) time complexity of procedure optGain2D. Note that, due to the high space complexity of O(N^2 k) for our dynamic programming algorithm, we could not measure the execution times for large values of N and k. For these large parameter settings, the system returned an out-of-memory error message.


TABLE 1
Values of b for Different Domain Sizes


Sensitivity to minConf. Fig. 12b depicts the running times of our algorithm for a range of confidence values and a domain size of 30 × 30. From the graph, it follows that the performance of procedure optGain2D improves with increasing values for minConf. The reason for this is that, at high confidence values, fewer points have positive gain and, thus, optSet, in most cases, is either empty or contains a small number of rectangles. As a result, operations on optSet like union (Steps 12 and 20) and assignment (Steps 8, 12, 16, 20, and 22) are much faster and, consequently, the execution time of the procedure is lower for high confidence values.

6.2 Experiences with Real-Life Datasets

In order to gauge the efficacy of our optimized rule framework for discovering interesting patterns, we conducted experiments with a real-life data set consisting of the current population survey (CPS) data for the year 1995.¹ The CPS is a monthly survey of about 50,000 households and is the primary source of information on the labor force characteristics of the US population. The CPS data consists of a variety of attributes that include age (A_AGE), group health insurance coverage (COV_HI), hours of work (HRS_WK), household income (HTOTVAL), temporary work experience for a few days (WTEMP), and unemployment compensation benefits received Y/N-Person (UC_YN). The number of records in the 1995 survey is 149,642.

Suppose we are interested in finding the age groups that contain a large concentration of temporary workers. This may be of interest to a job placement/recruiting company that is actively seeking temporary workers. Knowledge of the demographics of temporary workers can help the company target its advertisements better and locate candidate temporary workers more effectively. This correlation between age and temporary worker status can be computed in our optimized rule framework using the following rule: A_AGE ∈ [l, u] → WTEMP = YES. The optimized gain regions found with minConf = 0.01 (1 percent) for this rule are presented in Table 2. (The domain of the A_AGE attribute varies from 0 to 90 years.) From the table, it follows that there is a high concentration of temporary workers among young adults (age between 15 and 23) and seniors (age between 62 and 69). Thus, advertising on television programs or Web sites that are popular among these age groups would be an effective strategy to reach candidate temporary workers. Also, observe that the improvement in the total gain slows for increasing k until it reaches a point where increasing the number of gain regions does not affect the total gain.


    1. This data can be downloaded from http://www.bls.census.gov/cps.

    Fig. 12. Performance results for two-dimensional data. (a) Scale-up with N. (b) Sensitivity to minConf.

Fig. 11. Performance results for one-dimensional data. (a) Scale-up with n. (b) Sensitivity to minConf.


We also ran our algorithm to find optimized gain regions for the rule A_AGE ∈ [l, u] → UC_YN = NO with minConf = 0.95 (95 percent). The results of this experiment are presented in Table 3 and can be used by the government, for example, to design appropriate training programs for unemployed individuals based on their age. The time to compute the optimized set for both real-life experiments was less than 1 second.

    7 CONCLUDING REMARKS

In this paper, we generalized the optimized gain association rule problem by permitting rules to contain up to k disjunctions over one or two uninstantiated numeric attributes. For one attribute, we presented an O(nk) algorithm for computing the optimized gain rule, where n is the number of values in the domain of the uninstantiated attribute. We also presented a bucketing optimization that coalesces contiguous values, all of which have confidence either greater than the minimum specified confidence or less than the minimum confidence. For two attributes, we presented a dynamic programming algorithm that computes approximate gain rules; we showed that the approximations are within a constant factor of the optimized rule, using recent results on binary space partitioning. For a single numeric attribute, our experimental results with synthetic data sets demonstrate the effectiveness of our bucketing optimization and the linear scale-up of our algorithm for computing optimized gain sets. For two numeric attributes, however, the high time and space complexities of our dynamic programming-based approximation algorithm make it less suitable for large domain sizes and a large number of disjunctions. Finally, our experiments with the real-life population survey data set indicate that our optimized gain rules can indeed be used to unravel interesting correlations among attributes.

    APPENDIX

Proof of Lemma 4.3. We use induction to show that, after iteration i, the intervals in PSet and NSet satisfy Conditions 1-6.

Basis (i = 0). The six conditions trivially hold initially, before the first iteration begins.

Induction Step. Let us assume that the i-1 intervals in PSet satisfy Conditions 1-6 after the (i-1)th iteration completes. Let P1, ..., P(i-1) be the i-1 intervals in PSet and N1, ..., Nj be the intervals in NSet after the (i-1)th iteration. During each iteration, either Pq is split into two subintervals using minPq, which are then added to PSet, or maxNq is added to PSet. Both actions result in three subintervals that satisfy Conditions 1 and 2. For instance, let interval Pq = [u, v] be split into three intervals and let [s, t] be minPq. For all u ≤ l ≤ s-1, it must be the case that gain(l, s-1) ≥ 0 since, otherwise, [s, t] would not be the interval with the minimum gain. Similarly, it can be shown that, for all s ≤ l ≤ t, it must be the case that gain(s, l) ≤ 0 since, otherwise, [s, t] would not be the minimum gain subinterval in [u, v]. The other cases can be shown using a similar argument.

Next, we show that splitting interval Pq = [u, v] in Plist into three subintervals with [s, t] = min[u, v] as the middle interval preserves the remaining four conditions.

• Cond 3. We need to show that gain(u, s-1) and gain(t+1, v) ≥ gain(max[s, t]); that gain(u, s-1) and gain(t+1, v) ≥ gain(maxNm); and that gain(Pl) ≥ gain(max[s, t]). We can show that gain(s, t) + gain(max[s, t]) ≤ 0 since, otherwise, the subinterval of [s, t] preceding max[s, t] would have smaller gain than [s, t] (every subinterval of [s, t] with endpoint at s has gain ≤ 0 and, so, the sum of the gains of max[s, t] and the subinterval of [s, t] preceding max[s, t] is ≤ 0). Since every subinterval of [u, v] with u as an endpoint has gain ≥ 0, it follows that

gain(u, s-1) + gain(s, t) ≥ 0.

Thus, it follows that

gain(u, s-1) ≥ gain(max[s, t]).

Similarly, we can show that

gain(t+1, v) ≥ gain(max[s, t]).

TABLE 2
Optimized Gain Regions for age ∈ [l, u] → WTEMP = YES

TABLE 3
Optimized Gain Regions for age ∈ [l, u] → UC_YN = NO

Note that, since Pq = [u, v] is chosen for splitting, it must be the case that

gain(maxNm) + gain(s, t) ≤ 0.

Also, since gain(u, s-1) + gain(s, t) ≥ 0, we can deduce that

gain(u, s-1) ≥ gain(maxNm).

Using an identical argument, it can be shown that gain(t+1, v) ≥ gain(maxNm). Finally, for every Pl, gain(Pl) + gain(s, t) ≥ 0 (due to Condition 5). Combining this with

gain(s, t) + gain(max[s, t]) ≤ 0,

we obtain that gain(Pl) ≥ gain(max[s, t]).

• Cond 4. The preservation of this condition follows because gain(minPl) ≥ gain(s, t), since [s, t] is chosen such that it has the minimum gain from among all the minPl. The same argument can be used to show that gain(min[u, s-1]) and gain(min[t+1, v]) ≥ gain(s, t). Also, since gain(s, t) ≥ gain(Nm) (due to Condition 4) and [s, t] is the subinterval of Pq with the minimum gain, it follows that gain(min[u, s-1]) and

gain(min[t+1, v]) ≥ gain(Nm).

• Cond 5. gain(min[u, s-1]) + gain(Pm) ≥ 0, since gain(s, t) + gain(Pm) ≥ 0 (due to Condition 5) and gain(min[u, s-1]) ≥ gain(s, t) ([s, t] is the subinterval with the minimum gain in [u, v]). Similarly, we can show that

gain(min[t+1, v]) + gain(Pm) ≥ 0.

Also, we need to show that

gain(minPl) + gain(u, s-1) ≥ 0.

This follows since gain(u, s-1) + gain(s, t) ≥ 0 and gain(minPl) ≥ gain(s, t). Similarly, it can be shown that

gain(minPl) + gain(t+1, v) ≥ 0.

• Cond 6. Since [s, t] is used to split Pq, it is the case that

gain(maxNm) + gain(s, t) ≤ 0.

We next need to show that

gain(max[s, t]) + gain(Nm) ≤ 0.

Due to Condition 4, gain(s, t) ≥ gain(Nm) and, as shown in the proof for Condition 3 above, gain(s, t) + gain(max[s, t]) ≤ 0. Combining the two, we obtain gain(max[s, t]) + gain(Nm) ≤ 0.

Next, we show that splitting interval Nq = [u, v] in Nlist into three intervals with [s, t] = max[u, v] as the middle interval preserves the remaining four conditions.

• Cond 3. Due to Condition 3, gain(Pl) ≥ gain(s, t), and gain(s, t) ≥ gain(max[u, s-1]) since [s, t] is the subinterval with maximum gain in [u, v]. Thus, gain(Pl) ≥ gain(max[u, s-1]), and we can similarly show that

gain(Pl) ≥ gain(max[t+1, v]).

Also, since [s, t] has the maximum gain from among the maxNm, it follows that

gain(s, t) ≥ gain(maxNm).

• Cond 4. We first show that gain(min[s, t]) + gain(s, t) ≥ 0 since, if

gain(min[s, t]) + gain(s, t) < 0,

then the subinterval of [s, t] preceding min[s, t] would have gain greater than [s, t] (since every subinterval of [s, t] beginning at s has gain greater than or equal to 0, it follows that the sum of the gains of min[s, t] and the subinterval preceding min[s, t] in [s, t] is ≥ 0). Also, since every subinterval of Nq beginning at u has gain less than or equal to 0 (assuming u ≠ 1), we get

gain(u, s-1) + gain(s, t) ≤ 0.

Combining the two, we get gain(min[s, t]) ≥ gain(u, s-1) (except when [u, s-1] contains an endpoint). Using an identical argument, we can also show that gain(min[s, t]) ≥ gain(t+1, v) (except when [t+1, v] contains an endpoint).

Also, due to Condition 6,

gain(s, t) + gain(Nm) ≤ 0,

which, when combined with

gain(min[s, t]) + gain(s, t) ≥ 0,

implies that gain(min[s, t]) ≥ gain(Nm) (except when Nm contains an endpoint). Finally, since [s, t] is used to split interval Nq, it follows that gain(minPl) + gain(s, t) ≥ 0. Furthermore, if u ≠ 1, we have gain(u, s-1) + gain(s, t) ≤ 0. Combining the two, we get gain(minPl) ≥ gain(u, s-1) (except when [u, s-1] contains an endpoint). Similarly, we can show that gain(minPl) ≥ gain(t+1, v) (except when [t+1, v] contains an endpoint).

• Cond 5. Since Nq is the interval chosen for splitting, gain(minPl) + gain(s, t) ≥ 0. Due to Condition 3, gain(Pl) ≥ gain(s, t) and, as shown in the proof of Condition 4 above,

gain(min[s, t]) + gain(s, t) ≥ 0,

from which we can deduce that

gain(min[s, t]) + gain(Pl) ≥ 0.

• Cond 6. Since Nq is the interval chosen for splitting, gain(maxNl) ≤ gain(s, t). Also, since

gain(u, s-1) + gain(s, t) ≤ 0

(except when u = 1), we get

gain(maxNl) + gain(u, s-1) ≤ 0

(except when [u, s-1] contains an endpoint). Similarly, we can show that gain(maxNl) + gain(t+1, v) ≤ 0 (except when [t+1, v] contains an endpoint).

Again, since [s, t] is used to split Nq, gain(max[u, s-1]) ≤ gain(s, t). Also, due to Condition 6, gain(s, t) + gain(Nm) ≤ 0, thus yielding

gain(max[u, s-1]) + gain(Nm) ≤ 0

(except when Nm contains an endpoint). Using a similar argument, it can be shown that

gain(max[t+1, v]) + gain(Nm) ≤ 0

(except when Nm contains an endpoint). □

Proof of Lemma 4.4. Let P1, P2, ..., Pi be the i intervals in PSet satisfying Conditions 1-6. In order to show that this is an optimized gain set, we show that the gain of every other set of i intervals is no larger than that of P1, P2, ..., Pi. Consider any set of i intervals PSet′ = {P′1, P′2, ..., P′i}. We transform this set of intervals, in a series of steps, into P1, P2, ..., Pi. Each step ensures that the gain of the successive set of intervals is at least as high as that of the preceding set. As a result, it follows that the gain of P1, P2, ..., Pi is at least as high as that of any other set of i intervals and, thus, {P1, ..., Pi} is an optimized gain set.

The steps involved in the transformation are as follows:

1. For an interval P′j = [u, v] that intersects one of the Pl's, do the following: If u ∈ Pl for some Pl = [s, t], then, if [s, u-1] does not intersect any other P′m, modify P′j to be [s, v] (that is, delete [u, v] from PSet′ and add [s, v] to PSet′). Note that, due to Condition 1, gain(s, u-1) ≥ 0 and, so, gain(s, v) ≥ gain(u, v). Similarly, if v ∈ Pl for some Pl = [s, t], then, if [v+1, t] does not intersect any other P′m, modify P′j to be [u, t]. On the other hand, if, for an interval P′j = [u, v] that intersects one of the Pl's, u ∉ Pl for all Pl, then let m be the maximum value such that [u, m] does not intersect any of the Pl's. Modify P′j to be [m+1, v]. Note that, due to Condition 2, gain(u, m) ≤ 0 and, so,

gain(m+1, v) ≥ gain(u, v).

Similarly, if v ∉ Pl for all Pl, then let m be the minimum value such that [m, v] does not intersect any of the Pl's. Modify P′j to be [u, m-1].

Thus, at the end of Step 1, each interval P′j in PSet′ (some of which may have been modified) either does not intersect any Pl or, if it does intersect an interval Pl, then each endpoint of P′j lies in some interval Pm and each endpoint of Pl lies in some interval P′m. Also, note that, if two intervals Pl and P′j overlap and intersect no other intervals, then, at the end of Step 1, P′j = Pl.

2. In this step, we transform all P′j's in PSet′ that intersect multiple Pl's. Consider a P′j = [u, v] that intersects multiple Pl's (that is, spans an Nm = [s, t]). Thus, since there are k intervals, we need to consider two possible cases: 1) there is a P′m that does not intersect any Pl and 2) some Pl intersects multiple P′m's. For Case 1, due to Condition 6, it follows that gain(P′m) + gain(s, t) ≤ 0 and, thus, deleting P′m from PSet′ and splitting P′j into [u, s-1] and [t+1, v] (that is, deleting [u, v] from PSet′ and adding [u, s-1] and [t+1, v] to it) does not cause the resulting gain of PSet′ to decrease. For Case 2, due to Condition 4, it follows that merging any two adjacent intervals that intersect Pl and splitting P′j into [u, s-1] and [t+1, v] does not cause the gain of PSet′ to decrease. This procedure can be repeated to get rid of all P′j's in PSet′ that overlap multiple Pl's. At the end of Step 2, each P′j in PSet′ overlaps at most one Pl. As a result, for every Pl that overlaps multiple P′j's, or every P′j that does not overlap any Pl, there exists a Pm that overlaps no P′j's.

3. Finally, consider the Pm's that do not intersect any of the P′j's in PSet′. We need to consider two possible cases: 1) there is a P′j that does not intersect any Pl, or 2) some Pl intersects multiple P′j's. For Case 1, due to Condition 3, it follows that gain(Pm) ≥ gain(P′j) and, thus, deleting P′j and adding Pm to PSet′ does not cause the overall gain of PSet′ to decrease. For Case 2, due to Condition 5, it follows that merging any two adjacent intervals P′j that intersect Pl and adding Pm to PSet′ does not cause PSet′'s gain to decrease. This procedure can be repeated until every Pl intersects exactly one P′j, thus making them identical. □

    ACKNOWLEDGMENTS

Without the support of Yesook Shim, it would have been impossible to complete this work. The work of Kyuseok Shim was partially supported by the Korea Science and Engineering Foundation (KOSEF) through the Advanced Information Technology Research Center (AITrc).

REFERENCES

[1] F.D. Amore and P.G. Franciosa, "On the Optimal Binary Plane Partition for Sets of Isothetic Rectangles," Information Processing Letters, vol. 44, no. 5, pp. 255-259, Dec. 1992.
[2] R. Agrawal, T. Imielinski, and A. Swami, "Mining Association Rules between Sets of Items in Large Databases," Proc. ACM SIGMOD Conf. Management of Data, pp. 207-216, May 1993.
[3] R. Agrawal and R. Srikant, "Fast Algorithms for Mining Association Rules," Proc. VLDB Conf., Sept. 1994.
[4] R.J. Bayardo and R. Agrawal, "Mining the Most Interesting Rules," Proc. ACM SIGKDD Conf. Knowledge Discovery and Data Mining, 1999.
[5] R.J. Bayardo, R. Agrawal, and D. Gunopulos, "Constraint-Based Rule Mining in Large, Dense Databases," Proc. Int'l Conf. Data Eng., 1997.
[6] T. Fukuda, Y. Morimoto, S. Morishita, and T. Tokuyama, "Mining Optimized Association Rules for Numeric Attributes," Proc. ACM SIGACT-SIGMOD-SIGART Symp. Principles of Database Systems, June 1996.
[7] T. Fukuda, Y. Morimoto, S. Morishita, and T. Tokuyama, "Data Mining Using Two-Dimensional Optimized Association Rules: Scheme, Algorithms, and Visualization," Proc. ACM SIGMOD Conf. Management of Data, June 1996.
[8] J. Han and Y. Fu, "Discovery of Multiple-Level Association Rules from Large Databases," Proc. VLDB Conf., Sept. 1995.
[9] H.V. Jagadish, N. Koudas, S. Muthukrishnan, V. Poosala, K. Sevcik, and T. Suel, "Optimal Histograms with Quality Guarantees," Proc. VLDB Conf., Aug. 1998.
[10] S. Khanna, S. Muthukrishnan, and M. Paterson, "On Approximating Rectangle Tiling and Packing," Proc. Ninth Ann. Symp. Discrete Algorithms (SODA), pp. 384-393, 1998.
[11] B. Lent, A. Swami, and J. Widom, "Clustering Association Rules," Proc. Int'l Conf. Data Eng., Apr. 1997.
[12] H. Mannila, H. Toivonen, and A. Inkeri Verkamo, "Efficient Algorithms for Discovering Association Rules," Proc. AAAI Workshop Knowledge Discovery in Databases (KDD-94), pp. 181-192, July 1994.
[13] J.S. Park, M.-S. Chen, and P.S. Yu, "An Effective Hash Based Algorithm for Mining Association Rules," Proc. ACM SIGMOD Conf. Management of Data, May 1995.
[14] R. Rastogi and K. Shim, "Mining Optimized Association Rules for Categorical and Numeric Attributes," Proc. Int'l Conf. Data Eng., 1998.
[15] R. Rastogi and K. Shim, "Mining Optimized Support Rules for Numeric Attributes," Proc. Int'l Conf. Data Eng., 1999.
[16] R. Srikant and R. Agrawal, "Mining Generalized Association Rules," Proc. VLDB Conf., Sept. 1995.
[17] R. Srikant and R. Agrawal, "Mining Quantitative Association Rules in Large Relational Tables," Proc. ACM SIGMOD Conf. Management of Data, June 1996.
[18] A. Savasere, E. Omiecinski, and S. Navathe, "An Efficient Algorithm for Mining Association Rules in Large Databases," Proc. VLDB Conf., Sept. 1995.

Sergey Brin received the bachelor of science degree with honors in mathematics and computer science from the University of Maryland at College Park. He is currently on leave from the PhD program in computer science at Stanford University, where he received his master's degree. He is currently cofounder and president of Google, Inc. He met Larry Page at Stanford and worked on the project that became Google. Together, they founded Google, Inc. in 1998. He is a recipient of a US National Science Foundation Graduate Fellowship. He has been a featured speaker at a number of national and international academic, business, and technology forums, including the Academy of American Achievement, European Technology Forum, Technology, Entertainment and Design, and Silicon Alley, 2001. His research interests include search engines, information extraction from unstructured sources, and data mining of large text collections and scientific data. He has published more than a dozen publications in leading academic journals, including "Extracting Patterns and Relations from the World Wide Web"; "Dynamic Data Mining: A New Architecture for Data with High Dimensionality," which he published with Larry Page; "Scalable Techniques for Mining Causal Structures"; "Dynamic Itemset Counting and Implication Rules for Market Basket Data"; and "Beyond Market Baskets: Generalizing Association Rules to Correlations."

Rajeev Rastogi received the BTech degree in computer science from the Indian Institute of Technology, Bombay, in 1988 and the master's and PhD degrees in computer science from the University of Texas, Austin, in 1990 and 1993, respectively. He is the director of the Internet Management Research Department at Bell Laboratories, Lucent Technologies. He joined Bell Laboratories in Murray Hill, New Jersey, in 1993 and became a distinguished member of the technical staff (DMTS) in 1998. Dr. Rastogi is active in the field of databases and has served as a program committee member for several conferences in the area. His writings have appeared in a number of ACM and IEEE publications and other professional conferences and journals. His research interests include database systems, storage systems, knowledge discovery, and network management. His most recent research has focused on the areas of network management, data mining, high-performance transaction systems, continuous-media storage servers, tertiary storage systems, and multidatabase transaction management.

Kyuseok Shim received the BS degree in electrical engineering from Seoul National University in 1986 and the MS and PhD degrees in computer science from the University of Maryland, College Park, in 1988 and 1993, respectively. He is currently an assistant professor at Seoul National University in Korea. Previously, he was an assistant professor at the Korea Advanced Institute of Science and Technology (KAIST), Korea. Before joining KAIST, he was a member of the technical staff (MTS) and one of the key contributors to the Serendip data mining project at Bell Laboratories. Before that, he worked on the Quest data mining project at the IBM Almaden Research Center. He also worked as a summer intern for two summers at Hewlett-Packard Laboratories. Dr. Shim has been working in the area of databases, focusing on data mining, data warehousing, query processing and query optimization, and XML and semistructured data. He is currently an advisory committee member for ACM SIGKDD and an editor of the VLDB Journal. He has published several research papers in prestigious conferences and journals. He has also served as a program committee member of the ICDE '97, KDD '98, SIGMOD '99, SIGKDD '99, and VLDB '00 conferences. He did a data mining tutorial with Rajeev Rastogi at ACM SIGKDD '99 and a tutorial with Surajit Chaudhuri on storage and retrieval of XML data using relational DB at VLDB '01.

