
J Grid Computing, DOI 10.1007/s10723-013-9277-0

A Multi-Domain Architecture for Mining Frequent Items and Itemsets from Distributed Data Streams

Eugenio Cesario · Carlo Mastroianni · Domenico Talia

Received: 9 October 2012 / Accepted: 17 September 2013
© Springer Science+Business Media Dordrecht 2013

Abstract Real-time analysis of distributed data streams is a challenging task since it requires scalable solutions to handle streams of data that are generated very rapidly by multiple sources. This paper presents the design and the implementation of an architecture for the analysis of data streams in distributed environments. In particular, data stream analysis has been carried out for the computation of items and itemsets that exceed a frequency threshold. The mining approach is hybrid, that is, frequent items are calculated with a single pass, using a sketch algorithm, while frequent itemsets are calculated by a further multi-pass analysis. The architecture combines parallel and distributed processing to keep pace with the rate of distributed data streams. In order to keep computation close to data, miners are distributed among the domains where data streams are generated. The paper reports the experimental results obtained with a prototype of the architecture, tested on a Grid composed of three domains, each one handling a data stream.

E. Cesario (B) · C. Mastroianni
ICAR-CNR, Via P. Bucci 41C, Rende (CS) 87036, Italy
e-mail: [email protected]; [email protected]

D. Talia
ICAR-CNR and DIMES, University of Calabria, Via P. Bucci 41C, Rende (CS) 87036, Italy
e-mail: [email protected]

Keywords Distributed data mining · Frequent items · Frequent itemsets · Grid · Stream mining

1 Introduction

Mining data streams is a very important research topic and has recently attracted a lot of attention, because in many cases data is generated by external sources so rapidly that it may be infeasible to store and analyze it offline. Moreover, in some cases streams of data must be analyzed in real time to provide information about trends, outlier values or regularities that must be signaled as soon as possible. The need for online computation is a notable challenge with respect to classical data mining algorithms [2, 19]. Important application fields for stream mining are as diverse as financial applications, network monitoring, security problems, telecommunication networks, Web applications, sensor networks, analysis of atmospheric data, etc.

A further difficulty occurs when streams are distributed, and mining models must be derived not only from a single stream, but from multiple and heterogeneous data streams [11]. This scenario can occur in all the application domains mentioned before.


For example, in a Content Distribution Network, user requests delivered to a Web system can be forwarded to any of several servers located in different and possibly distant places, in order to serve requests more efficiently and balance the load. In such a context, the analysis of user requests, for example to discover frequent patterns, must be performed with the inspection of data streams detected by different servers. Another notable application field is the analysis of packets processed by routers of an IP network. In any distributed scenario, it is essential that miners are located as close to data sources as possible, in order to limit the overhead of data communication. When there is the need for performing multiple passes on data, the presence of data cachers can help, provided that they are also appropriately distributed.

Sometimes the rate of a single data stream can be so fast that a single computing node can have difficulties keeping pace with the generation of data. In these cases, it can be useful to sample the data stream instead of processing all the data [15], but of course this can lower the accuracy of derived models, depending on the sampling frequency and the adopted algorithm. A different, or complementary, solution is to partition a data stream among a set of miners, so that each miner processes only a fraction of data. This solution can be achieved by parallelizing the computation over the nodes of a cluster or a high-speed computer network, or can also be implemented by exploiting the multiple CPUs/GPUs offered by modern multicore and manycore machines.

Two important and recurrent problems regarding the analysis of data streams are the computation of frequent items and frequent itemsets from transactional datasets. The first problem is very popular both for its simplicity and because it is often used as a subroutine for more complex problems. The goal is to find, in a sequence of items, those whose frequency exceeds a specified threshold. When the items are generated in the form of transactions, i.e., sets of distinct items, it is also useful to discover frequent sets of items. A k-itemset, i.e., a set of k distinct items, is said to be frequent if those items concurrently appear in a given fraction of transactions. The discovery of frequent itemsets is essential to cope with many data mining problems, such as the computation of association rules, classification models, data clusters, etc. This task can be severely time consuming, since the number of candidates grows combinatorially with the allowed itemset size. The usually adopted technique is to first discover frequent items, and then build candidate itemsets incrementally, exploiting the Apriori property [4], which states that a k-itemset can be frequent only if all of its subsets are also frequent. While there are some proposals in the literature to mine frequent itemsets in a single pass, it is recognized that in the general case, in which the generation rate is fast, it is very difficult to solve the problem without allowing multiple passes on the data stream [22].

The architecture we designed and present in this paper addresses the issues mentioned above by exploiting the following main features:

– it combines the parallel and distributed paradigms, the first one to keep pace with the rate of a single data stream, by using multiple miners (processors or cores), the second one to cope with the distributed nature of data streams. Miners are distributed among the domains where data streams are generated, in order to keep computation close to data.

– the computation of frequent items is performed through sketch algorithms. These algorithms maintain a matrix of counters, and each item of the input stream is associated with a set of counters, one for each row of the matrix, through hash functions. The statistical analysis of counter values allows item frequencies to be estimated with the desired accuracy. Sketch algorithms compute a linear projection of the input: thanks to this property, sketches of data can be computed separately for different stream sources, and can then be integrated to produce the overall sketch [13].

– the approach is hybrid, meaning that frequent items are calculated online, with a single pass, while frequent itemsets are calculated by a further multi-pass analysis. This approach allows important information to be derived on the fly without imposing too strict time constraints on more complex tasks, such as the extraction of frequent k-itemsets, as this could excessively lower the accuracy of models. Moreover, the hybrid approach improves the flexibility of the architecture, which can be used both when the computation of frequent itemsets is required and when only frequent items are needed.



– to support the mentioned hybrid approach, the architecture exploits the presence of data cachers on which recent data can be stored. In particular, miners can turn to data cachers to retrieve the statistics about frequent items and use them to identify frequent sets of items. To avoid excessive communication overhead, data cachers are distributed and placed close to stream sources and miners.

To the best of our knowledge, this is one of the first attempts to combine the four mentioned characteristics. In particular, we are not aware of attempts to combine the parallel and distributed paradigms in stream mining, nor of implemented systems that adopt the hybrid single-pass/multi-pass approach, though this kind of strategy is suggested and fostered in the recent literature [31]. The major advantages of the proposed architecture are its scalability and flexibility. Indeed, the architecture can efficiently exploit the presence of multiple miners, when this is required by the amount of computation and the generation rate of stream data. And the model can be easily adapted to the requirements of specific scenarios: (i) the pull approach allows the algorithm to work with any set of miners, as it is the miners that declare their availability to perform part of the work; (ii) the use and degree of parallelism can be adjusted depending on the stream rate and the miners' capabilities; (iii) the model can be quickly adapted to the cases in which only frequent items are to be computed, or both frequent items and itemsets are needed.

Beyond presenting the architecture, we describe an implemented prototype and discuss a set of experiments performed in a Grid environment composed of three domains, each one handling a data stream. In this scenario, we computed frequent items and itemsets for two well-known datasets, Kosarak and WebDocs, and analyzed the processing time while changing the number of miners available in each domain and the rate of the data streams.

This work extends the contribution presented in [7] and offers deeper insights concerning the scalability properties, the amount of exchanged data and the computational efficiency of the developed architecture. The results of experiments on a real distributed platform are reported to analyze the mentioned aspects. The paper is structured as follows: Section 2 summarizes the main issues regarding the mining of data streams and illustrates the most common algorithms for the computation of frequent items and itemsets; Section 3 describes the proposed parallel/distributed architecture for mining data streams and discusses the adopted hybrid approach; Section 4 presents the prototype and the testbed scenario, and reports the related results; Section 5 discusses related work and Section 6 concludes the paper.

2 Mining Frequent Items and Frequent Itemsets in Distributed Data Streams

Data stream analysis is often performed with randomized and approximated algorithms, since exact and deterministic algorithms would require too much computing time and memory space. Accordingly, data mining algorithms for data stream analysis are generally evaluated with respect to three metrics [13]:

– Processing time needed to update the data structures and the mining models after the arrival of new stream items;

– Storage space used by the algorithm;
– Accuracy of the approximated algorithm, in general specified through two parameters set by the user: the accuracy parameter ε and the failure probability δ, which means that the estimation error is at most ε with probability 1 − δ. Of course, processing time and storage size strongly depend on these parameters.

As mentioned, the discovery of frequent items [12] is a very important task, consisting of identifying the items whose frequency in a stream exceeds a specified fraction σ of the overall stream size. This problem has a huge number of applications in a variety of scenarios: frequent items can be the most popular destinations of IP packets, the most frequent queries submitted to a search engine, the most common values observed by sensors in a wireless environment, etc. Moreover, frequent items are often used as the basis for more complex analysis processes.



The problem is formalized as follows [13]:

Problem Statement Given a stream S of n items e_1, ..., e_n, where the frequency of an item i is f_i = |{j | e_j = i}|, and a frequency threshold σ, the ε-approximate-frequent-items problem consists of finding the set F of items such that F = {i | f_i ≥ (σ − ε)n}.

Two basic categories of algorithms can be used to solve this problem: the Counter-Based and the Sketch-Based algorithms. The algorithms of the first type maintain counters for a subset of elements, and counters are updated every time one of these elements is observed in the stream. If the observed element has no associated counter, the algorithm must choose whether to ignore the element or replace an existing counter with a counter for the new item. At the end of the first pass, frequent items will surely be among those associated with counters, but the inverse is not true, which requires at least a second pass to verify which counters actually correspond to frequent items. Some of the most used Counter-Based algorithms are the SpaceSaving algorithm [28] and the LossyCounting algorithm [26].
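To make the counter-based idea concrete, the following is a minimal Java sketch of the core replacement policy of SpaceSaving [28]. It is a simplified illustration written for this text, not the authors' code: it omits the per-counter error bookkeeping of the full algorithm and uses a linear scan to find the minimum counter, where an efficient implementation would use a dedicated data structure.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Simplified core of the SpaceSaving policy [28]: at most k items are
 * monitored; an unmonitored item evicts the item with the minimum counter
 * and inherits its count (plus one), which may over-estimate its frequency.
 */
public class SpaceSaving {
    private final int k;                            // number of counters
    private final Map<Long, Long> counts = new HashMap<>();

    public SpaceSaving(int k) { this.k = k; }

    public void observe(long item) {
        Long c = counts.get(item);
        if (c != null) {                            // monitored: increment
            counts.put(item, c + 1);
        } else if (counts.size() < k) {             // free slot: start at 1
            counts.put(item, 1L);
        } else {                                    // evict the minimum
            long minItem = 0, minCount = Long.MAX_VALUE;
            for (Map.Entry<Long, Long> e : counts.entrySet()) {
                if (e.getValue() < minCount) {
                    minCount = e.getValue();
                    minItem = e.getKey();
                }
            }
            counts.remove(minItem);
            counts.put(item, minCount + 1);         // bounded over-estimate
        }
    }

    /** Monitored items and their (possibly over-estimated) counts. */
    public Map<Long, Long> candidates() { return counts; }
}
```

All truly frequent items end up among the monitored ones, but the converse does not hold, which is why the verification pass mentioned above is needed.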

Conversely, Sketch-Based algorithms [12] do not monitor a subset of elements but provide, with a given accuracy, an estimation of the frequency of all stream elements, using a matrix of counters C with d rows and w columns. A set of d hash functions h_1, ..., h_d, chosen from a family of pairwise-independent functions, are associated with the different matrix rows. Each item i observed in the stream is mapped, for each row r, to the matrix element C[r, h_r(i)]. This counter is then modified depending on the specific sketch algorithm: in the sketch-based CountMin algorithm [14], at the arrival of a new item i, the counter is incremented as follows (see Fig. 1):

for r ∈ [1, d]: C[r, h_r(i)] += 1

The number of counters in a row, w, is lower than the number of elements, so there are conflicts, because several distinct elements will be mapped by a hash function to the same counter.

Fig. 1 CountMin algorithm. A new item is associated, for each row, with a different entry, computed with a hash function, which is then incremented

However, different elements are in conflict for different rows, which enables the adoption of statistical techniques to estimate the actual frequencies of elements. In CountMin, collisions always cause extra increments of counters, therefore the best estimation for the frequency f_i of element i is the minimum value of the counters associated with i:

f_i = min_{r ∈ [1, d]} C[r, h_r(i)]

Of course, the accuracy of sketch algorithms increases with the size of the matrix, since a larger matrix reduces the frequency of collisions of different elements on the same counter. In CountMin, setting d = ⌈ln(1/δ)⌉ and w = ⌈e/ε⌉ ensures that f_i, in a stream with n elements, has error at most εn with probability of at least 1 − δ. The spatial complexity is O((e/ε) · ln(1/δ)), while the time for an update is O(ln(1/δ)).
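The following Java sketch, written for illustration rather than taken from the prototype, implements the CountMin structure exactly as formalized above: d = ⌈ln(1/δ)⌉ rows, w = ⌈e/ε⌉ columns, the per-row increment on update, and the minimum-counter estimate. The seeded multiply-shift hash is a stand-in for a proper pairwise-independent family; the entry-wise merge and subtract operations support the linearity property and the sliding window discussed later.

```java
import java.util.Random;

/**
 * Illustrative CountMin sketch with d = ceil(ln(1/delta)) rows and
 * w = ceil(e/epsilon) columns, as in the formulas above.
 */
public class CountMinSketch {
    private final int d, w;
    private final long[][] counters;   // the matrix C, d rows by w columns
    private final long[] seeds;        // one hash seed per row

    public CountMinSketch(double epsilon, double delta, long randomSeed) {
        this.d = (int) Math.ceil(Math.log(1.0 / delta));
        this.w = (int) Math.ceil(Math.E / epsilon);
        this.counters = new long[d][w];
        this.seeds = new long[d];
        Random rnd = new Random(randomSeed);
        for (int r = 0; r < d; r++) seeds[r] = rnd.nextLong();
    }

    private int hash(int row, long item) {
        long h = item * 0x9E3779B97F4A7C15L + seeds[row];
        h ^= (h >>> 32);
        return (int) Math.floorMod(h, (long) w);
    }

    /** Update rule of the text: for r in [1, d], C[r, h_r(i)] += count. */
    public void update(long item, long count) {
        for (int r = 0; r < d; r++) counters[r][hash(r, item)] += count;
    }

    /** Estimate of the text: f_i = min over r of C[r, h_r(i)]. */
    public long estimate(long item) {
        long min = Long.MAX_VALUE;
        for (int r = 0; r < d; r++)
            min = Math.min(min, counters[r][hash(r, item)]);
        return min;
    }

    /** Linearity: the sketch of a union of streams is the entry-wise sum. */
    public void merge(CountMinSketch other) {
        for (int r = 0; r < d; r++)
            for (int c = 0; c < w; c++) counters[r][c] += other.counters[r][c];
    }

    /** Entry-wise subtraction, used to age out the oldest block of a window. */
    public void subtract(CountMinSketch other) {
        for (int r = 0; r < d; r++)
            for (int c = 0; c < w; c++) counters[r][c] -= other.counters[r][c];
    }
}
```

Note that merge and subtract are only meaningful between sketches built with the same sizes and the same hash seeds; in the distributed setting this means that all nodes must agree on these parameters in advance.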

More details about CountMin can be found in [14]. Here it is worth recalling that this algorithm, like all sketch algorithms, has the important property that the sketch is a linear projection of the input. This means that the overall sketch of multiple streams can be computed by adding the sketches of the single streams. This is the main reason why we decided to adopt CountMin: in a distributed architecture, it would be prohibitive to transmit source data to a central processing node, while the mere transmission of sketch summaries allows communication overhead to be drastically reduced.

The computation of frequent itemsets can either be performed directly, or by exploiting the statistics of frequent items. The direct computation in a single pass is feasible only if the stream rate is moderate, due to the large number of candidate frequent itemsets. For example, in [22] a hybrid approach is used: first, a counter-based algorithm computes the candidate 2-itemsets; then a second pass is necessary to eliminate the false candidates; finally, the Apriori property is exploited to find the frequent i-itemsets, for i > 2. The merit of hybrid approaches is that they try to combine the best of single-pass and multiple-pass algorithms [31], and they can be particularly efficient in a distributed scenario. In our architecture, frequent items are computed on the fly with CountMin, and the results, stored in distributed data cachers, are used to compute frequent itemsets.



3 A Hybrid Multi-Domain Architecture

This section presents the stream mining architecture that aims at solving the problem of computing frequent items and frequent itemsets from distributed data streams, exploiting a hybrid single-pass/multiple-pass strategy. We assume that stream sources, though belonging to different domains, are homogeneous, so that it is useful to extract knowledge from their union. Typical cases are the analysis of the traffic experienced by several routers of a wide area network, or the analysis of client requests forwarded to multiple web servers. Miner nodes are located close to the streams, so that the data transmitted between different domains only consists of models (sketches), not raw data.

The architecture, depicted in Fig. 2, includes the following components:

– Data Streams (DS), located in different domains.

– Miners (M). They are placed close to the respective Data Streams, and perform two basic mining tasks: the computation of sketches for the discovery of frequent items, and the computation of the support count of candidate frequent itemsets. If a single Miner is unable to keep pace with the local DS, the stream items can be partitioned and forwarded to a set of Miners, which operate in parallel. Each Miner computes the sketch only for the data it receives, and then forwards the results to the local Stream Manager. Parallel Miners can be associated with the nodes of a cluster or a high-speed computer network, or with the cores of a manycore machine.

Fig. 2 Distributed architecture for data stream mining

– Stream Managers (SM): in each domain, the Stream Manager collects the sketches computed by local miners, and derives the sketch for the local DS. Moreover, each SM cooperates with the Stream Manager Coordinator to compute global statistics, valid for the union of all the Data Streams.

– Stream Manager Coordinator (SMC): this node collects mining models from different domains and computes overall statistics regarding frequent items and frequent itemsets. The SMC can coincide with one of the Stream Managers, and can be chosen with an election algorithm. In the figure, the SM of the domain on the left also takes the role of SMC.

– Data Cachers (DC) are essential to enable the hybrid strategy and the computation of frequent itemsets, when this is needed. Each Data Cacher stores the statistics about frequent items discovered in the local domain. These results are then re-used by Miners to discover frequent itemsets composed of increasing numbers of items.

Fig. 3 Schema of the algorithm for mining frequent items

The algorithm for the computation of frequent items, outlined in Fig. 3, is performed continuously, for each new block of data generated by the data streams. A block is defined here as the set of transactions that are generated in a time interval P. The algorithm includes the following steps, also shown in the figure:

1. a filter is used to partition the block into as many sub-blocks as the number of available Miners;
2. each Miner computes the sketch related to the received sub-block;
3. the Miner transmits the sketch to the SM, which overlaps the sketches, thanks to the linearity property of sketch algorithms, and extracts the frequent items for the local domain;
4. two concurrent operations are executed: every SM sends the local sketch to the SMC (step 4a), and the Miners send the most recent blocks of transactions to the local Data Cacher (step 4b). The last operation is only needed when frequent itemsets are to be computed, otherwise it can be skipped;
5. the SMC aggregates the sketches received from the SMs and identifies the items that are frequent for the union of data streams (a code sketch of this aggregation step is given after the list).
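As a minimal illustration of steps 3 and 5, the fragment below merges partial sketches and applies the frequency test from the problem statement of Section 2. It reuses the illustrative CountMinSketch class sketched earlier; class and method names are assumptions of this sketch, not prototype identifiers.

```java
import java.util.List;

/** Illustrative SM/SMC-side aggregation of partial sketches (steps 3 and 5). */
public class SketchAggregation {

    /**
     * Merges the partial sketches produced by parallel Miners into the domain
     * sketch; the same code applies at the SMC over the domain sketches.
     */
    public static CountMinSketch aggregate(List<CountMinSketch> partials,
                                           double eps, double delta, long seed) {
        CountMinSketch merged = new CountMinSketch(eps, delta, seed);
        for (CountMinSketch s : partials) merged.merge(s);   // linearity
        return merged;
    }

    /** Frequency test from the problem statement: f_i >= (sigma - eps) * n. */
    public static boolean isFrequent(CountMinSketch sketch, long item,
                                     double sigma, double eps, long n) {
        return sketch.estimate(item) >= (sigma - eps) * n;
    }
}
```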

Frequent items are computed for a window containing the most recent W blocks. This can be done easily thanks to the linearity of the sketch algorithm: at the arrival of a new block, the sketch of this block is added to the current sketch of the window, while the sketch of the least recent block is subtracted. The window-based approach is common because the most interesting results are generally related to recent data [16].
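The add/subtract mechanics of the window can be captured in a few lines, again using the illustrative CountMinSketch class above (the names WindowSketch and addBlock are assumptions of this sketch, not prototype identifiers):

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Sliding window over the W most recent block sketches, kept as one sketch. */
public class WindowSketch {
    private final Deque<CountMinSketch> blocks = new ArrayDeque<>();
    private final CountMinSketch window;
    private final int w;

    public WindowSketch(int windowSize, double eps, double delta, long seed) {
        this.w = windowSize;
        this.window = new CountMinSketch(eps, delta, seed);
    }

    /** On a new block: add its sketch and subtract the least recent one. */
    public void addBlock(CountMinSketch blockSketch) {
        window.merge(blockSketch);
        blocks.addLast(blockSketch);
        if (blocks.size() > w) window.subtract(blocks.removeFirst());
    }

    public long estimate(long item) { return window.estimate(item); }
}
```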

Sketch-based algorithms are only capable of computing frequent items. To discover frequent itemsets, it is necessary to perform multiple passes on the data. Candidate k-itemsets are constructed starting from frequent (k−1)-itemsets. More specifically, at the first step the candidate 2-itemsets are all the possible pairs of frequent items: Miners must compute the support for these pairs to determine which of them are frequent. In the following steps, a candidate k-itemset is obtained by adding any frequent item to a frequent (k−1)-itemset. Thanks to the Apriori property, candidates can be pruned by checking whether all the (k−1)-subsets are frequent: a k-itemset can be frequent only if all its subsets are frequent.
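The candidate construction and Apriori pruning just described can be sketched as follows; this is a generic illustration of the technique, with items modeled as Long identifiers, not the prototype's code:

```java
import java.util.*;

/** Apriori-style candidate generation, as described in the text. */
public class CandidateGeneration {

    /** Candidate 2-itemsets: all possible pairs of frequent items. */
    public static List<Set<Long>> candidatePairs(List<Long> frequentItems) {
        List<Set<Long>> candidates = new ArrayList<>();
        for (int i = 0; i < frequentItems.size(); i++)
            for (int j = i + 1; j < frequentItems.size(); j++)
                candidates.add(new TreeSet<>(List.of(frequentItems.get(i),
                                                     frequentItems.get(j))));
        return candidates;
    }

    /**
     * Candidate k-itemsets: extend each frequent (k-1)-itemset with each
     * frequent item, then prune candidates that have an infrequent
     * (k-1)-subset (Apriori property).
     */
    public static Set<Set<Long>> candidates(Set<Set<Long>> frequentKMinus1,
                                            List<Long> frequentItems) {
        Set<Set<Long>> result = new HashSet<>();
        for (Set<Long> itemset : frequentKMinus1) {
            for (Long item : frequentItems) {
                if (itemset.contains(item)) continue;
                Set<Long> candidate = new TreeSet<>(itemset);
                candidate.add(item);
                if (allSubsetsFrequent(candidate, frequentKMinus1))
                    result.add(candidate);
            }
        }
        return result;
    }

    /** A candidate survives only if every (k-1)-subset is frequent. */
    private static boolean allSubsetsFrequent(Set<Long> candidate,
                                              Set<Set<Long>> frequentKMinus1) {
        for (Long item : candidate) {
            Set<Long> subset = new TreeSet<>(candidate);
            subset.remove(item);
            if (!frequentKMinus1.contains(subset)) return false;
        }
        return true;
    }
}
```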

The approach allows us to compute both the itemsets that are frequent for a single domain and those that are frequent for the union of the distributed streams. Figure 4 shows an example of how frequent 3-itemsets are computed. The top part of the figure reports the items and 2-itemsets that are frequent for the two considered domains and for the whole system. The candidate 3-itemsets, computed by the two SMs and by the SMC, are then reported, before and after the pruning based on the Apriori property. In the bottom part, the figure reports the support counts computed for the two domains and for the whole system. Finally, the SMs check which candidates exceed the specified threshold (in this case, set to 10 %): notice that the {abc} itemset is frequent globally though it is locally frequent in only one of the two domains. In general, it can happen that an itemset occurs frequently for a single domain and infrequently globally, or vice versa: therefore, it is necessary to perform the two kinds of computations separately.

Fig. 4 Example of the computation of frequent 3-itemsets

The schema of the algorithm for mining frequent itemsets is illustrated in Fig. 5, which assumes that the steps indicated in Fig. 3 have already been performed.

Fig. 5 Schema of the algorithm for mining frequent itemsets

The successive steps are:

6. each SM builds the candidate k-itemsets for the local domain (6a), and the SMC also builds the global candidate k-itemsets (6b);

7. the SMC sends the global candidates to the SMs for the computation of their support at the different domains;


8. SMs send both local and global candidates to the Miners;

9. Miners turn to the Data Cacher to retrieve the transactions included in the current window (this operation is performed only at the first iteration of the algorithm). Then, the Miners compute the support count for all the candidates, using the window-aware technique presented in [20] (a simplified counting sketch is given after the list);

10. Miners transmit the results to the local SM;
11. the SM aggregates the support counts received from the Miners and selects the k-itemsets that are frequent in the local domain;

12. analogously, the SMs send to the SMC the support counts of the global candidates;


13. the SMC computes the itemsets that are frequent over the whole system. At this point, the algorithm restarts from step 6 to find frequent itemsets with increasing numbers of items. The cycle stops either when the maximum allowed size of itemsets is reached or when no frequent itemset was found in the last iteration.
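For step 9, a deliberately naive support count over the transactions of the window can be written as below. The prototype uses the window-aware technique of [20], which is not reproduced here; this fragment only illustrates what computing the support count of a set of candidates means:

```java
import java.util.*;

/** Naive support counting for candidate itemsets over a window of transactions. */
public class SupportCounter {

    /** Returns, for each candidate, the number of transactions containing it. */
    public static Map<Set<Long>, Integer> countSupport(
            List<Set<Long>> windowTransactions, Collection<Set<Long>> candidates) {
        Map<Set<Long>, Integer> support = new HashMap<>();
        for (Set<Long> candidate : candidates) support.put(candidate, 0);
        for (Set<Long> transaction : windowTransactions) {
            for (Set<Long> candidate : candidates) {
                if (transaction.containsAll(candidate)) {
                    support.merge(candidate, 1, Integer::sum);
                }
            }
        }
        return support;
    }
}
```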

4 Prototype and Performance Evaluation

The architecture described in the previous section was implemented starting from the Mining@Home system. Mining@Home, a Java-based framework partly inspired by the Public Computing paradigm, was adopted to perform several classes of data mining computations, among which the analysis of astronomical data to search for gravitational waves [27], and the discovery of closed frequent itemsets with parallel algorithms [24]. The main features of the stream mining prototype inherited from Mining@Home are the pull approach (Miners are assigned jobs on the basis of their availability) and the adoption of Data Cachers to store reusable data. Moreover, some important modifications were necessary to adapt the framework to the stream mining scenario. For example, the selection of the Miners that are the most appropriate to perform the mining tasks is subject to vicinity constraints, because in a streaming environment it is very important that the analysis of data be performed close to the data source. Another notable modification is the adoption of the hybrid approach for the single-pass computation of frequent items and the multi-pass computation of frequent itemsets.



Experiments were performed on a Grid composed of two remote ICAR networks (ICAR-CS located in Cosenza and ICAR-NA in Naples, 300 kilometers away) and one network owned by the University of Calabria (UNICAL), in Cosenza. The topology of the Grid, as well as the inter-domain and intra-domain transfer rates, are depicted in Fig. 6. The Miners and the Stream Managers were installed on the nodes of clusters, while the Data Sources and the Data Cachers were put on different nodes, external to the clusters. The UNICAL cluster has twelve Intel Xeon E5520 nodes with four 2.27 GHz processors and 24 GB RAM; the ICAR-CS cluster has twelve Intel Itanium nodes with two 1.5 GHz CPUs and 4 GB RAM; finally, the ICAR-NA cluster is an HP XC6000 with 64 nodes, each equipped with two 1.4 GHz CPUs and 8 GB RAM. All the nodes run Linux, and the software components are written in Java.

To assess the prototype, we used the transactional datasets published by the FIMI Repository [18]. Some of these datasets originate from data streams, so they are appropriate for our analysis. In particular:

• The “kosarak” dataset contains a list of click-streams generated by users of an online portal. The analysis of user visits can be useful to identify the most popular sections of the portal, the preferences and requirements of users, etc.;

• the “webDocs” dataset is generated from a set of Web pages. Each page, after the application of a filtering algorithm, is represented with a set of significant words included in it. The analysis of the most frequent words, or sets of words, can be useful to devise caching policies, indexing techniques, etc.

Basic information about the two datasets is summarized below:

Dataset   Size (MB)   No. of tuples   No. of distinct items   Tuple size (no. of items): min / med / max
kosarak   30.5        990,002         41,270                  1 / 8 / 2,498
webDocs   1,413       1,692,082       5,267,656               1 / 177 / 71,472

The parameters used to assess the prototype are listed below:

– P: the time interval to receive a block of data. This interval determines the average number of transactions generated within a block, denoted as Nt, and the average size of a block in bytes, B;

– NMD: the number of available miners per domain. In our experiments, this number is the same for the three domains;

– NM: the total number of available miners in the Grid, equal to 3 · NMD;

– FCPU: the fraction of CPU time reserved on miners for the experiments, set to 30 %. This setting was used to make the results independent of the execution of other processes on the same nodes;¹

– S: the support threshold used to determine frequent items and itemsets;

– W: the size of the sliding window, i.e., the number of consecutive blocks of data on which computation is performed;

– CM: the capacity of the miner buffer. Unless otherwise stated, it is equal to the size of a data block B;

– ε and δ, the accuracy parameters of the sketch algorithm, which are both set to 0.01 (a worked example of the resulting sketch size is given after this list);

¹ The fraction of CPU is tuned using the program “cpulimit”, http://cpulimit.sourceforge.net.


Fig. 6 Topology of the Grid used for experimental evaluation

– the maximum size of candidate itemsets, set to 5 for Kosarak and to 8 for Webdocs.
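As a worked example of the accuracy settings (a back-of-the-envelope computation based on the formulas of Section 2, not a figure from the experiments): with ε = δ = 0.01, the CountMin matrix has d = ⌈ln(1/0.01)⌉ = 5 rows and w = ⌈e/0.01⌉ = 272 columns, i.e., 1,360 counters per sketch. Assuming 4-byte counters, a sketch occupies roughly 5.3 KB, which illustrates why exchanging sketches instead of raw transactions keeps the SM→SMC traffic small compared to the intra-domain traffic reported later in Figs. 9 and 15.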

Prior to the experiments, the datasets are retrieved and stored on the Data Source nodes of the three domains. During the experiments, data is sent to the local Miners at a specified transfer rate, to reproduce the original data stream. The transmission rate is adjusted by setting the parameter Nt, the number of transactions generated during the time interval P.

The main performance index assessed during the experiments is the execution time, defined as the time interval between the transmission of a new block of stream data and the time at which the analysis of this block has been completed by the Stream Manager Coordinator (SMC). If this value is not longer than the time interval P, the system is able to keep pace with data production.

4.1 Experiments with the Dataset Kosarak

The experiments were executed assuming a time period P equal to 15 s. The generation rates were compatible with those of the Web sites of Wikipedia, Microsoft and eBay, as estimated using the Web site http://www.webtraffic24.com. These rates correspond, respectively, to values of Nt equal to about 40,000, 30,000 and 20,000 transactions per block, equally partitioned among the three domains. These are very high generation rates, and allowed the prototype to be tested in challenging conditions.

Figures 7 and 8 report the execution time experienced for the computation of frequent items exclusively (I), and for the computation of both frequent items and itemsets (I+IS), vs. the total number of miners NM. The execution time was averaged over 20 time periods to obtain more robust statistics. In these experiments, S was set to 0.02, the Miner cache size CM was set to the average size of a block B, and the window size W was set to 5. Plots are reported for three different values of Nt. Both figures show that the processing time decreases as the number of miners increases, which is a sign of the good scalability of the architecture. Scalable behavior is ensured by two main factors: the linearity property of the sketch algorithm, and the placement of Data Cachers close to the miners. In this evaluation, the value NM = 3 (or NMD = 1) corresponds to the case in which the mining computation is sequential in each domain, i.e., it is performed by a single node.

The system is stable when the execution time is lower than the time period P (15 s): in such a case, the system is able to keep pace with the generation of stream data. This condition is always verified when the system is only asked to compute frequent items, as is clear from Fig. 7. On the other hand, the computation of frequent itemsets is much more time consuming. The dashed line depicted in Fig. 8 corresponds to the time period P, and makes it easy to check in which cases the system is stable. Results show that a single miner per domain is not sufficient: depending on the generation rate, three, five, or six miners per domain are needed to keep the processing time below the period length P.

Fig. 7 Analysis of Kosarak: execution time for the computation of frequent items (I), vs. the number of miners, for different values of the number of transactions per block, Nt


Fig. 8 Analysis of Kosarak: execution time for the computation of frequent items (I) and itemsets (IS), vs. the number of miners, for different values of the number of transactions per block, Nt


To better assess the system behavior it is useful to analyze the data traffic involved in the computation. Figure 9 shows the amount of data transmitted over the network at the generation of a new block of data, for different numbers of miners per domain. In these experiments, Nt was set to 30,000 while the other parameters were set as detailed before. The first three groups of bars show the overall amount of data transferred from nodes of type A to nodes of type B in a single domain, denoted as A→B. For example, DS→M is the amount of data transmitted by the Data Source to the Miners of a single domain at every time period. The values of DS→M and M→DC are equal since each Miner sends to the local DC the data received from the DS. The fourth group reports DTDomain, the overall amount of data transferred within a single domain, computed as the sum of the contributions of the first three groups (the contribution of the first group is considered twice). The contribution SM→SMC is the amount of data transferred between remote domains, i.e., between the Stream Managers and the Stream Manager Coordinator. Finally, the last group of bars reports the amount of data transferred over the whole network, DTNet. This is computed as the term DTDomain times the number of domains (three in this case) plus the term SM→SMC.

It is interesting to notice that the contribution DC→M decreases when the number of miners per domain increases, and becomes null when NMD is equal to or greater than 5. This can be explained by considering that each miner must compute the frequency of items and itemsets over a window of 5 time periods, and possesses a cache that can contain one complete block of data. If there is only one miner per domain, this miner can store one block and must retrieve the remaining four blocks from the local Data Cacher. As the number of miners increases, each Miner needs to request less data from the Data Cacher, and needs no data when NMD equals or exceeds the window size. Indeed, with NMD miners per domain, each miner receives from the Data Source an amount of data approximately equal to B/NMD (because the block is partitioned and distributed among the Miners), and can therefore store the data corresponding to NMD consecutive blocks, since the size of such data amounts to B.

Fig. 9 Data traffic per each block of data generated by the Data Streams. In these experiments, Nt = 30,000, W = 5 and CM = B



Conversely, the component M→SM slightly increases with the number of miners, because every miner sends the local SM a set of results (the sketch and the support counts of itemsets) having approximately the same size. However, as the contribution of DC→M has a larger weight than M→SM, the overall amount of data exchanged within a domain decreases as NMD increases from 1 to 5. For larger values of NMD, DTDomain starts to increase, because M→SM slightly increases and DC→M gives no contribution. A similar trend is observed for the values of DTNet.

An efficiency analysis was performed in accordance with the study of parallel architectures presented in [21]. Specifically, we extracted the overall computation time TC, i.e., the sum of the computation times measured on the different miners, and the overall overhead time TO, defined as the sum of all the times spent in other activities, which practically coincide with the transfer times. These two indexes are shown in Fig. 10 using a logarithmic scale.

Fig. 10 Analysis of Kosarak: overall computing time (TC) and overhead time (TO), vs. the number of miners, for different values of the number of transactions per block, Nt

Not surprisingly, the overhead time follows a trend similar to that of the overall amount of transferred data, DTNet, reported in Fig. 9. The overall computation time TC, on the other hand, has a nearly constant trend.

The efficiency of the computation can be defined as the fraction of time that the miners actually devote to computation with respect to the sum of computation and overhead time:

E = TC / (TC + TO)

Figure 11 reports the efficiency values, which are very high, always larger than 0.9. Notably, efficiency increases as the number of miners per domain increases up to 5, i.e., the window size. This effect is induced by the use of caching, as explained in [21]. Specifically, when NMD increases up to the window size, each miner needs to request less data from the Data Cacher, because more data can be stored in the local cache: this leads to a higher efficiency. For larger values of NMD, this effect does not hold anymore and the efficiency slightly decreases, while remaining very high. Moreover, the efficiency increases with the rate of data streams. This means that the distributed architecture is increasingly convenient as the problem size increases, which is another sign of good scalability properties.

Figure 12 shows the execution time measured when setting the value of Nt to 30,000 and varying the support threshold S. This parameter does not influence the time to execute the frequent items algorithm, which is executed in a single pass: therefore the figure reports the total execution times regarding the combined computation of frequent items and frequent itemsets (I+IS). As expected, a lower value of the threshold leads to an increase of the processing time, since the number of frequent itemsets, computed at each step of the algorithm (see Fig. 5), is larger. Again, when computation is more demanding the absolute processing time increases, but the system is more efficient. This is visible in Fig. 13: the trend of the curves is similar, but efficiency is higher with lower values of the support threshold.

Fig. 11 Analysis of Kosarak: efficiency vs. the number of miners, for different values of the number of transactions per block, Nt


Fig. 12 Analysis of Kosarak: execution time for the computation of frequent items and itemsets vs. the number of miners per domain, for different values of the threshold S. The value of Nt is set to 30,000


Fig. 13 Analysis of Kosarak: efficiency vs. the number of miners per domain, for different values of the threshold S. The value of Nt is set to 30,000


4.2 Experiments with the Dataset Webdocs

A second set of experiments was performed taking the dataset Webdocs as input of the data stream. As the dataset contains representative words of Web pages filtered by a search engine, the data rate was set to typical values registered by servers of Google and Altavista, again using the site http://www.webtraffic24.com for the estimation. The considered values for Nt were 1000, 3000 and 6000 transactions, with the time period P set to 15 s. A single transaction contains on average many more items than the typical Kosarak transaction, so the block size is larger: for example, with Nt = 3000, the value of B is about 750 KBytes, while it was about 150 KBytes in the Kosarak experiments. As for the previous set of tests, the frequency threshold S was set to 0.02 and the window size to 5 blocks.

Figure 14 shows the average time needed to compute frequent items and itemsets after the arrival of a new block. Owing to the higher complexity of the dataset, in particular the larger number of items per transaction, the computation of frequent itemsets is more challenging. To perform this task, 15 miners are sufficient if the data rate is equal to 1000 transactions per time period, but as many as 30 miners are needed when Nt is equal to 6000. It is worth noting that in this scenario any centralized architecture would have few chances to keep pace with the data, which means that the computation of frequent itemsets would have to be done offline, while the architecture proposed here can achieve the goal of online processing by using an appropriate degree of parallelism.

Fig. 14 Analysis of Webdocs: execution time for the computation of frequent items and itemsets (I+IS), vs. the number of miners, for different values of the number of transactions per block, Nt


Fig. 15 Data traffic per each block of data generated by the Data Streams. In these experiments, Nt = 3000, W = 5 and CM = B


Figure 15 reports the amount of data exchanged within single domains and in the whole network. It can be noticed that the absolute amount of transmitted data is higher than that experienced with Kosarak, reported in Fig. 9.

Fig. 16 Analysis of Webdocs: efficiency vs. the number of miners per domain, for different values of the number of transactions per block, Nt

Figure 16 displays the efficiency values for the same experiments. Trends are very similar to those obtained with the Kosarak data, but here the absolute values of efficiency are higher. The reason is that, although more data must be transmitted in the Webdocs scenario, the increase in computation complexity is even more remarkable. Overall, the fraction of computation time with respect to the total time is higher than in the Kosarak scenario, therefore efficiency is higher. As done for Kosarak, we also analyzed for Webdocs the architecture behavior with varying support threshold and window size. These experiments led us to very similar conclusions, so the related results are not reported here.

5 Related Work

The increase in the data produced by large-scale scientific and commercial applications necessitates innovative solutions for the efficient transfer and analysis of data [32]. The analysis of data streams has recently attracted a lot of attention owing to the wide range of applications for which it can be extremely useful. Important challenges arise from the necessity of performing most computation with a single pass on stream data, because of limitations in time and memory space. Stream mining algorithms deal with problems as diverse as clustering and classification of data streams, change detection, stream cube analysis, indexing, forecasting, etc. [1].



For many important application domains previously mentioned in this paper, a major need is to identify frequent patterns in data streams, either single frequent elements or frequent sets of items in transactional databases. A rich survey of algorithms for discovering frequent items is provided by Cormode and Hadjieleftheriou [13]. Their discussion focuses on the two main classes of algorithms for finding frequent items. Counter-based algorithms have their foundation in techniques proposed in the early 1980s to solve the Majority problem [17], i.e., the problem of finding a majority element in a stream using a single counter. Variants of this algorithm were devised, sometimes decades later, to discover items whose frequencies exceed any given threshold. LossyCounting is perhaps the most popular algorithm of this type [26].

The second class of algorithms computes a sketch, i.e., a linear projection of the input, and provides an approximated estimation of item frequencies using limited computing and memory resources. Popular algorithms of this kind are CountSketch [8] and CountMin [14]; the latter is adopted in this paper. Advantages and limitations of sketch algorithms are discussed in [3]. Important advantages are the notable space efficiency (the required space is logarithmic in the number of distinct items), the possibility of naturally dealing with negative updates and item deletions, and the linearity property, which allows the sketch of multiple streams to be computed by overlapping the sketches of the single streams. The main limitation is the underlying assumption that the domain size of the data stream is large; however, this assumption holds in many significant domains.

Even if modern single-pass algorithms are extremely sophisticated and powerful, multi-pass algorithms are still necessary either when the stream rate is too rapid, or when the problem is inherently related to the execution of multiple passes, which is the case, for example, of the frequent itemsets problem [9]. Single-pass algorithms can be forced to check the frequency of 2- or 3-itemsets, but this approach cannot be generalized easily, as the number of candidate k-itemsets is combinatorial, and it can become very large when increasing the value of k [22]. Therefore, a very promising avenue could be to devise hybrid approaches, which try to combine the best of single- and multiple-pass algorithms. A strategy of this kind, discussed in [31], is adopted in the mining architecture presented in this paper.

The analysis of streams is even more challenging when data is produced by different sources spread over a distributed environment, an ever more frequent case. For example, NASA [5] produces many terabytes of data per month through thousands of observation instruments located at multiple and distant data centers, and needs to extract information from this massive amount of stream data using efficient data mining applications. Another example comes from the multiplicity of data streams occurring in complex scientific workflows or in large-scale distributed collaborations, which exacerbates this problem, particularly when different streams have different performance requirements [6].

A thorough discussion of the approaches currently used to mine multiple data streams can be found in [30]. The paper distinguishes between the centralized model, under which streams are directed to a central location before they are mined, and the distributed model, in which distributed computing nodes perform part of the computation close to the data, and send to a central site only the models, not the data. Of course, the distributed approach has notable advantages in terms of degree of parallelism and scalability.

An interesting approach for the continuous tracking of complex queries over collections of distributed streams is presented in [11]. To reduce the communication overhead, the adopted strategy combines two technical solutions: (i) remote sites only communicate to the coordinator concise summary information on local streams (in the form of sketches); (ii) even such communications are avoided when the behavior of local streams remains reasonably stable, or predictable: updates of sketches are only transmitted when a certain amount of change is observed locally. The success of this strategy depends on the level of approximation of the results that is tolerated. A similar approach is adopted in [29]: here stream data is sent to the central processor after being filtered at remote data sources. The filters adapt to changing conditions to minimize stream rates while guaranteeing that the central processor still receives the updates necessary to provide answers of adequate precision.



In [25], the problem of finding frequent items in the union of multiple distributed streams is tackled by setting up a hierarchical communication topology in which the streams are the leaves, a central processor is the root, and intermediate nodes can compress data flowing from the leaves to the root. The amount of compression is dynamically adapted to make the tolerated error (the difference between the estimated and the actual frequency of items) follow a precision gradient: the error must be very low at nodes close to the sources, but it can gradually increase as the communication hierarchy is climbed. The objective of this strategy is to minimize the load on the central node while providing acceptable error guarantees on answers.

In [10], the authors deal with the problem of mining closed frequent itemsets from a data stream over a sliding window using limited memory space. They propose the Moment algorithm and an efficient compact data structure, the Closed Enumeration Tree (CET), aimed at dynamically maintaining a selected set of itemsets over a sliding window. The number of nodes needed by the CET is shown to be proportional to the number of discovered closed frequent itemsets, which guarantees the compactness of the structure. Another approach for mining frequent itemsets on streaming data, claimed to be the first one requiring no out-of-core memory structure, is presented in [23]. The adopted method, named StreamMining, is based on two main steps. The first is implemented by a one-pass algorithm executed on the streaming data, with deterministic bounds on the accuracy. The second step consists in the execution of a two-pass algorithm, which improves the accuracy by pruning possible false negatives detected in the first step. The algorithm has been shown to achieve good performance, but it tends to degrade when the average size of the frequent itemsets increases.

6 Conclusions

In recent years, progress in digital data production and pervasive computing technology has made it possible to produce and store large streams of data. Data mining techniques have become vital to analyze such large and continuous streams of data, detecting regularities or outlier values in them. In particular, when data production is massive and/or distributed, decentralized architectures and algorithms are needed for its analysis.

The distributed stream mining system presented in this paper is a contribution to the field, and it aims at solving the problem of computing frequent items and frequent itemsets from distributed data streams by exploiting a hybrid single-pass/multiple-pass strategy. Beyond presenting the system architecture, we described a prototype that implements it and discussed a set of experiments performed in a real Grid environment. The experimental results confirm that the approach is scalable and can manage large data production by using an appropriate number of miners in the distributed architecture.

References

1. Aggarwal, C.: An introduction to data streams. In: Aggarwal, C. (ed.) Data Streams: Models and Algorithms, pp. 1–8. Springer, New York (2007)
2. Aggarwal, C.C.: Data Streams: Models and Algorithms. Springer, New York (2007)
3. Aggarwal, C.C., Yu, P.S.: A survey of synopsis construction in data streams. In: Aggarwal, C. (ed.) Data Streams: Models and Algorithms, pp. 169–207. Springer, New York (2007)
4. Agrawal, R., Mannila, H., Srikant, R., Toivonen, H., Verkamo, A.I.: Fast Discovery of Association Rules. American Association for Artificial Intelligence, Menlo Park (1996)
5. Barkstrom, B., Hinke, T., Gavali, S., Smith, W., Seufzer, W., Hu, C., Cordner, D.: Distributed generation of NASA Earth science data products. J. Grid Computing 1(2), 101–116 (2003)
6. Cai, Z., Kumar, V., Schwan, K.: IQ-Paths: predictably high performance data streams across dynamic network overlays. J. Grid Computing 5(2), 129–150 (2007)
7. Cesario, E., Grillo, A., Mastroianni, C., Talia, D.: A sketch-based architecture for mining frequent items and itemsets from distributed data streams. In: Proceedings of the 11th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid 2011), pp. 245–253. Newport Beach, CA (2011)
8. Charikar, M., Chen, K., Farach-Colton, M.: Finding frequent items in data streams. In: Proceedings of the International Colloquium on Automata, Languages and Programming (ICALP) (2002)
9. Cheng, J., Ke, Y., Ng, W.: A survey on algorithms for mining frequent itemsets over data streams. Knowl. Inf. Syst. 16, 1–27 (2008)
10. Chi, Y., Wang, H., Yu, P.S., Muntz, R.R., Fischer, M.: Catch the moment: maintaining closed frequent itemsets over a data stream sliding window. Knowl. Inf. Syst. 10(3), 265–294 (2006)
11. Cormode, G., Garofalakis, M.: Approximate continuous querying over distributed streams. ACM Trans. Database Syst. 33(2), 9:1–9:39 (2008)
12. Cormode, G., Hadjieleftheriou, M.: Finding frequent items in data streams. In: International Conference on Very Large Data Bases (2008)
13. Cormode, G., Hadjieleftheriou, M.: Finding the frequent items in streams of data. Commun. ACM 52(10), 97–105 (2009)
14. Cormode, G., Muthukrishnan, S.: An improved data stream summary: the count-min sketch and its applications. J. Algorithms 55(1), 58–75 (2005)
15. Cormode, G., Muthukrishnan, S., Yi, K., Zhang, Q.: Optimal sampling from distributed streams. In: ACM Principles of Database Systems (PODS) (2010)
16. Datar, M., Gionis, A., Indyk, P., Motwani, R.: Maintaining stream statistics over sliding windows. SIAM J. Comput. 31(6), 635–644 (2002)
17. Fischer, M., Salzberg, S.: Finding a majority among n votes: solution to problem 81-5. J. Algorithms 3(4), 376–379 (1982)
18. Frequent Itemset Mining Dataset Repository. Available at http://fimi.ua.ac.be/data/. Accessed June 2013
19. Gaber, M.M., Zaslavsky, A., Krishnaswamy, S.: Mining data streams: a review. ACM SIGMOD Rec. 34(1), 18–26 (2005)
20. Giannella, C., Han, J., Pei, J., Yan, X., Yu, P.S.: Mining frequent patterns in data streams at multiple time granularities. In: Kargupta, H., Joshi, A., Sivakumar, K., Yesha, Y. (eds.) Data Mining: Next Generation Challenges and Future Directions, chap. 3, pp. 191–210. MIT Press, Menlo Park (2004)
21. Grama, A.Y., Gupta, A., Kumar, V.: Isoefficiency: measuring the scalability of parallel algorithms and architectures. IEEE Parallel Distrib. Technol. 1(3), 12–21 (1993)
22. Jin, R., Agrawal, G.: An algorithm for in-core frequent itemset mining on streaming data. In: Proceedings of the 5th IEEE International Conference on Data Mining (ICDM 2005), pp. 210–217. Houston, TX (2005)
23. Jin, R., Agrawal, G.: An algorithm for in-core frequent itemset mining on streaming data. In: Proceedings of the 5th IEEE International Conference on Data Mining (ICDM 2005), pp. 210–217. Houston, TX (2005)
24. Lucchese, C., Mastroianni, C., Orlando, S., Talia, D.: Mining@home: toward a public resource computing framework for distributed data mining. Concurr. Comput.: Pract. Exper. 22(5), 658–682 (2009)
25. Manjhi, A., Shkapenyuk, V., Dhamdhere, K., Olston, C.: Finding (recently) frequent items in distributed data streams. In: ICDE '05: Proceedings of the 21st International Conference on Data Engineering, pp. 767–778. Tokyo, Japan (2005)
26. Manku, G., Motwani, R.: Approximate frequency counts over data streams. In: International Conference on Very Large Data Bases (2002)
27. Mastroianni, C., Cozza, P., Talia, D., Kelley, I., Taylor, I.: A scalable super-peer approach for public scientific computation. Futur. Gener. Comput. Syst. 25(3), 213–223 (2009)
28. Metwally, A., Agrawal, D., Abbadi, A.: Efficient computation of frequent and top-k elements in data streams. In: International Conference on Database Theory (2005)
29. Olston, C., Jiang, J., Widom, J.: Adaptive filters for continuous queries over distributed data streams. In: SIGMOD '03: Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data. San Diego, CA (2003)
30. Parthasarathy, S., Ghoting, A., Otey, M.E.: A survey of distributed mining of data streams. In: Aggarwal, C. (ed.) Data Streams: Models and Algorithms, pp. 289–307. Springer, New York (2007)
31. Wright, A.: Data streaming 2.0. Commun. ACM 53(4), 13–14 (2010)
32. Yildirim, E., Kosar, T.: End-to-end data-flow parallelism for throughput optimization in high-speed networks. J. Grid Computing 10(3), 395–418 (2012)