
Combined use of ARM and graph clustering methods to find association in urban routes

(Case study: database of Tehran traffic's status)

V. Dehghan, Sh. Khadivi, A. Farahi

Abstract: Finding meaningful associations in basket data is one of the oldest problems in data mining. The solution is to analyze and mine association rules. In this paper we propose an approach for finding the influence of traffic between routes in the form of association rules. The approach combines data mining techniques with clustering methods. First, routes are grouped into homogeneous clusters by different clustering methods, and then rules are extracted from each detected cluster. This approach reduces the need to search massive data lists and produces interesting rules at low time cost. Finally, the best clustering method is proposed with respect to the association rules it produces.

Index Terms — Association Rules, Community Detection, Data Mining, Graph Clustering, Transaction Data, Traffic.

—————————— ——————————

1 INTRODUCTION

Urban traffic is one of the major concerns of many governments in their metropolises, and developing civic infrastructure such as highways and tunnels has not yet solved it properly. The causes of heavy traffic appear to be a lack of attention to traffic guidance, the lack of evacuation plans for closed roads at critical times, and a lack of attention to measuring traffic on routes and predicting urban traffic. Data mining is one of the helpful methods for addressing this issue. Finding hot route segments is one of the important subjects, and it is useful to urban designers, police stations and many other organizations involved in urban traffic.

Detecting routes whose traffic is similar and that influence one another leads to smoother traffic and better traffic steering. This problem has been addressed with domain knowledge of the city [1]. In this research we have a large database of the statuses that occurred on Tehran road segments, and we plan to analyze this data by association rule mining. The problem of association rule mining in large databases has been studied for the past few decades, and a variety of algorithms and methods have been introduced for it. Our aim in this paper is to find such rules in the traffic data of Tehran. Because of the massive volume of transactional records, current algorithms cannot find these rules directly. In this paper we therefore focus on this application and extract the rules while overcoming the input/output problem and still producing interesting rules, which is why we use clustering methods. Many clustering approaches rely on similarity measures to divide objects into homogeneous groups. A variety of algorithms have been introduced for this purpose, but all of them require parameters that must be set by the user. Determining these values needs specific knowledge of the field, changing the parameters usually changes the results, and repeating such changes on huge data is very expensive.

Examples of similarity measures are the Minkowski, Chebyshev, Manhattan and Euclidean distances. However, the case study database of this research has only two variables, road segment¹ and status. The second variable takes three values: heavy, very heavy and smooth. The similarity considered here is the relation between routes, not the similarity of variables. For this reason we must move the problem into a network representation: to find these relations we need to construct the graph of connected routes. By creating the graph and detecting the groups whose internal density is higher than that of other groups, and then applying the solution explained in the rest of the paper, we extract from these groups the records that are appropriate for association rule mining.

Many complex systems can be represented as networks, where the relationships between their objects are illustrated by nodes and edges [11]. Complex systems are usually organized in partitions, each of which has its own function. In the network representation, such partitions appear as sets of nodes with a high density of internal links, while the density of links between partitions is low. These partitions are called communities or modules, and they occur in a variety of large networked systems. Finding and studying these

1 routes

————————————————

Vahid Dehghan is a graduate student at Payame Noor University and works at Tehran Municipality. +9809144001756

Shahram Khadivi was with Amirkabir University, Tehran. He is now with the Department of Computer.

Author is with the Computer Engineering Department, University of Payame Noor, Tehran.


JOURNAL OF COMPUTING, VOLUME 4, ISSUE 2, FEBRUARY 2012, ISSN 2151-9617

https://sites.google.com/site/journalofcomputing

WWW.JOURNALOFCOMPUTING.ORG 19



partitions may clarify the organization of systems and their functions [4, 5]. The network of urban streets is a complex system. In 2008 Blondel and colleagues presented a new algorithm and used it to identify language communities in a Belgian mobile phone network of 2 million customers and to analyze a web graph of 118 million nodes and more than one billion links [6]. The results of applying this algorithm to such a huge number of nodes were very good. Consider each route as a node and each link between routes as an edge. Our constructed graph has 7975 nodes and 16976 edges, and each node of the Tehran graph has a list of transactions related to its events during the year. By mining these transactional lists we produce association rules that express route influences.

2 GRAPH CLUSTERING ALGORITHMS

As mentioned in the introduction, the aim in networks is optimal community detection. Many methods have been developed, using tools and techniques from disciplines such as physics, biology, applied mathematics, computer science and social science. However, it is still not clear which algorithms are reliable and can be used in applications. The question of reliability is itself tricky, because the concept requires a shared definition of community and partition, which does not yet exist despite much research in this field. There is also no agreement among researchers on how to define indexes such as optimality and reliability. There has, however, been tacit acceptance of a simple network model that serves as a basis for comparing and developing clustering algorithms, namely the planted l-partition model [5]. The purpose of this model is to construct graphs with communities of predefined size and count, called benchmark graphs, which serve as a reference for comparing the efficiency and detection speed of clustering algorithms.

In this model a partition consists of a certain number of nodes. Each node is linked to nodes within its own group with probability p_in and to nodes of other groups with probability p_out. As long as p_in > p_out, the groups can be called communities. If this condition does not hold, the network is a random graph without any structure. Different benchmark graphs have been provided. One is the GN benchmark, named after Girvan and Newman, which has 128 nodes in 4 groups of 32 members, each node having degree 16. It has two fundamental problems: first, all nodes have the same degree, and second, all communities have the same size. These features are not realistic, because complex networks have heterogeneous degree distributions [6]. The other benchmark, called LFR, is the basis for comparing clustering algorithms. In this benchmark graph the node degree distribution is heterogeneous. It is emphasized, however, that even when such communities are present, different algorithms may not be able to detect them [6]. Most existing algorithms work well on GN because of its simplicity, but LFR tests algorithms heavily and shows their limitations. The results of that study, in terms of both time complexity and agreement between the detected communities and the structures that exist in the graph, show that two algorithms, Infomap and Blondel, had the best performance [5].

A remaining question is whether an algorithm may fail to produce any output, for example how these algorithms behave on random graphs. The experiments performed show that the currently available algorithms are in many cases the fastest and most reliable. In the following, the measures used by the two algorithms are discussed briefly.
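The planted l-partition model described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `planted_partition`, the probabilities `p_in=0.3` and `p_out=0.05`, and the fixed seed are hypothetical choices, and only the group count and group size (4 groups of 32 nodes) are taken from the GN description in the text.

```python
import random
from itertools import combinations


def planted_partition(groups, group_size, p_in, p_out, seed=0):
    """Build an undirected benchmark graph with planted communities.

    Every pair of nodes in the same group is linked with probability p_in,
    every pair in different groups with probability p_out.
    Returns (edges, membership), where membership[node] is the group index.
    """
    rng = random.Random(seed)
    n = groups * group_size
    membership = [node // group_size for node in range(n)]
    edges = []
    for u, v in combinations(range(n), 2):
        p = p_in if membership[u] == membership[v] else p_out
        if rng.random() < p:
            edges.append((u, v))
    return edges, membership


# Assumed illustrative parameters: 4 groups of 32 nodes, p_in > p_out.
edges, membership = planted_partition(groups=4, group_size=32, p_in=0.3, p_out=0.05)
internal = sum(1 for u, v in edges if membership[u] == membership[v])
external = len(edges) - internal
print("internal links:", internal, "external links:", external)
```

With p_in set equal to p_out, the same function produces a random graph with no planted structure, which is the degenerate case the text mentions.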

2.1 Fast Greedy Modularity

This is a heuristic method based on optimization: it maximizes an objective function, the modularity. The modularity of a partition is a scalar value between -1 and 1 that measures the density of links inside communities relative to the links between communities. The Blondel algorithm is the fastest algorithm developed on this basis: identifying communities in a network of 118 million nodes took only 152 minutes [8].
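To make the objective function concrete, the following sketch computes the modularity of a given partition of an unweighted, undirected graph using the standard Newman-Girvan formula (sum over communities of the fraction of internal edges minus the squared fraction of edge ends). It only evaluates a partition; it is not the greedy optimization itself, and the toy graph and the names `modularity` and `community_of` are hypothetical.

```python
from collections import defaultdict


def modularity(edges, community_of):
    """Modularity Q of a partition of an undirected, unweighted graph.

    Q = sum over communities c of  L_c / m - (d_c / (2 m)) ** 2
    where m is the number of edges, L_c the number of edges inside c,
    and d_c the sum of the degrees of the nodes in c.
    """
    m = len(edges)
    internal = defaultdict(int)  # edges with both ends in the community
    degree = defaultdict(int)    # total degree of the community
    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        degree[cu] += 1
        degree[cv] += 1
        if cu == cv:
            internal[cu] += 1
    return sum(internal[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)


# Toy graph (assumption): two triangles joined by a single edge.
toy_edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_groups = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
one_group = {node: "all" for node in range(6)}

q_split = modularity(toy_edges, two_groups)   # 5/14, about 0.357
q_merged = modularity(toy_edges, one_group)   # 0.0
```

Splitting the toy graph into its two triangles scores higher than keeping everything in one community, which is the kind of comparison a modularity-maximizing method makes when it decides whether to merge groups.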

2.2 Random walks

A random walk is a mathematical formalization of a trajectory made of consecutive random steps, for example the path of a molecule travelling in a gas or the path of an animal searching for food. Urban routes today generally do not form a perfect square grid. Suppose a person reaches a certain junction and continues along one of the available routes chosen at random: if the junction has seven exits, each of them is taken with probability one in seven. This is called a random walk on a graph. On an undirected weighted graph, a random walk is a process that starts from some node and moves to another node at each time step. When the graph is not weighted, the next node is selected uniformly at random from among the neighbor nodes. When the graph is weighted, the probability of following an edge is proportional to its weight. The Infomap algorithm selects communities based on this method [6].
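The walk described above can be sketched as follows. This is a minimal illustration, not the Infomap algorithm: the adjacency dictionary, its weights and the function name `random_walk` are assumptions made for the example. Setting all weights to 1 gives the unweighted case, in which each neighbor is equally likely.

```python
import random


def random_walk(adjacency, start, steps, seed=0):
    """Walk `steps` steps on a weighted undirected graph.

    adjacency[node] maps each neighbor to the weight of the connecting edge.
    The next node is drawn with probability proportional to the edge weight.
    Every node must have at least one neighbor.
    """
    rng = random.Random(seed)
    path = [start]
    node = start
    for _ in range(steps):
        neighbours = list(adjacency[node])
        weights = [adjacency[node][n] for n in neighbours]
        node = rng.choices(neighbours, weights=weights, k=1)[0]
        path.append(node)
    return path


# Hypothetical junctions A, B, C with symmetric edge weights.
adjacency = {
    "A": {"B": 0.9, "C": 0.1},
    "B": {"A": 0.9, "C": 0.5},
    "C": {"A": 0.1, "B": 0.5},
}
path = random_walk(adjacency, "A", steps=10)
print(" -> ".join(path))
```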

3 GRAPH WEIGHTING

Edge weight is one of the parameters used by most graph clustering methods, and it affects how the homogeneity of a cluster is measured. In this section, using the information recorded in the database of traffic events, we apply a simple solution: we study the behavior of adjacent nodes at the same event time and so determine the similarity of traffic between nodes. For example, consider two adjacent nodes and their events in Table I. Comparing nodes 5689 and 2656 over the n recorded transactions gives a value between 0 and 1; when the behavior of the two nodes is the same the value is close to 1, and vice versa. With this solution all routes in the Tehran graph are weighted.

TABLE I. EXAMPLE RECORDS

Status      Event Date   Time Stamp       Routes
Very heavy  2010/19/12   13:15 to 13:30   2526
Smooth      2010/19/12   13:15 to 13:30   5689
Heavy       2010/20/12   15:10 to 15:30   2656
Heavy       2010/20/12   15:10 to 15:30   5689
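The paper does not give the exact weighting formula, so the sketch below is an assumption: it takes the weight of an edge to be the fraction of shared time slots in which the two adjacent routes report the same status, which yields a value between 0 and 1 as described. The function name `edge_weight` and the tuple layout are hypothetical; the records are those of Table I.

```python
def edge_weight(records, route_a, route_b):
    """Assumed similarity: share of common time slots with identical status.

    Each record is (status, event_date, time_stamp, route).
    Returns 0.0 when the two routes have no time slot in common.
    """
    status_a = {(d, t): s for s, d, t, r in records if r == route_a}
    status_b = {(d, t): s for s, d, t, r in records if r == route_b}
    shared = status_a.keys() & status_b.keys()
    if not shared:
        return 0.0
    same = sum(1 for slot in shared if status_a[slot] == status_b[slot])
    return same / len(shared)


# Records copied from Table I.
records = [
    ("Very heavy", "2010/19/12", "13:15 to 13:30", 2526),
    ("Smooth", "2010/19/12", "13:15 to 13:30", 5689),
    ("Heavy", "2010/20/12", "15:10 to 15:30", 2656),
    ("Heavy", "2010/20/12", "15:10 to 15:30", 5689),
]

w_same = edge_weight(records, 5689, 2656)    # 1.0: both Heavy in the shared slot
w_differ = edge_weight(records, 5689, 2526)  # 0.0: Smooth vs Very heavy
```

Other choices, such as treating heavy and very heavy as partially similar, would fit the same description; the exact rule used by the authors is not stated.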






4 ASSOCIATION RULES

Each rule has the form A ⇒ B. The left part of the rule is called the antecedent and the right part the consequent, and the two sets have no member in common. Two parameters, called minimum support and minimum confidence, apply here; support and confidence are defined in Eq. 1 and 2:

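$$s = \frac{|A \cap B|}{|T|} \qquad (1)$$

$$c = \frac{|A \cap B|}{|A|} \qquad (2)$$

Here |T| is the total number of transactions, |A| is the number of transactions that contain A, and |A ∩ B| is the number of transactions that contain both A and B.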

A and B appear together in at least s% of the transactions. B occurs in at least c% of the transactions in which A occurs.

A set of {road segment, status} pairs is called an itemset, and itemsets that satisfy the minimum support are frequent itemsets. The first efficient algorithm for mining association rules is Apriori [2]. Other algorithms, such as Partition [3], [4], reduce the number of database reads and improve computational efficiency. We use FP-Growth² because it performs fastest on our data. Each of these algorithms has limitations when faced with massive databases. One of our goals in this study is to provide a guideline for exploring association rules in urban routes, where the usual methods for discovering these rules are not feasible because of the limitations mentioned.
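Support and confidence can be computed directly from a list of transactions, as in the sketch below. It is a brute-force illustration rather than Apriori or FP-Growth. The first two transactions follow the itemset examples used in this paper; the other two, and the function names, are hypothetical.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Support of antecedent and consequent together, divided by the
    support of the antecedent. The antecedent must occur at least once."""
    both = support(transactions, set(antecedent) | set(consequent))
    return both / support(transactions, antecedent)


transactions = [
    {"2526_Veryheavy", "5689_Smooth"},
    {"2656_Heavy", "5689_Heavy"},
    {"2526_Heavy", "2656_Heavy", "5689_Heavy"},  # hypothetical
    {"2656_Heavy", "5689_Smooth"},               # hypothetical
]

s = support(transactions, {"2656_Heavy", "5689_Heavy"})       # 2 of 4 -> 0.5
c = confidence(transactions, {"2656_Heavy"}, {"5689_Heavy"})  # 0.5 / 0.75
```

A rule is kept only when `s` and `c` reach the chosen minimum support and minimum confidence.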

5 GATHERING TRAFFIC PROPAGATION

As Table I shows, the database records are in transactional form, not in itemset form, so we need to combine these transactions. We use the concept of propagation: based on event time and at specified intervals, such as t = 15 min, the statuses of all nodes in each community are formed into itemsets, for example:

1) 2526_Veryheavy,5689_Smooth

2) 2656_Heavy,5689_Heavy
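The grouping step above can be sketched as follows, assuming each record carries an event start time and that an itemset collects every route status of one community that falls into the same fixed-length interval. The function name `build_itemsets`, the record layout and the community set are assumptions; the interval length is taken from the text (15 minutes) and must divide 60 for this simple flooring to work.

```python
from collections import defaultdict
from datetime import datetime


def build_itemsets(records, community, interval_minutes=15):
    """Group (time, route, status) records of one community into itemsets.

    Records whose start time falls into the same interval form one itemset
    of "route_status" items. Routes outside the community are ignored.
    """
    buckets = defaultdict(set)
    for when, route, status in records:
        if route not in community:
            continue
        # Floor the start time to the beginning of its interval.
        minute = when.minute - when.minute % interval_minutes
        slot = when.replace(minute=minute, second=0, microsecond=0)
        buckets[slot].add(f"{route}_{status.replace(' ', '')}")
    return [sorted(items) for _, items in sorted(buckets.items())]


# Start times and statuses taken from Table I.
records = [
    (datetime(2010, 12, 19, 13, 15), 2526, "Very heavy"),
    (datetime(2010, 12, 19, 13, 15), 5689, "Smooth"),
    (datetime(2010, 12, 20, 15, 10), 2656, "Heavy"),
    (datetime(2010, 12, 20, 15, 10), 5689, "Heavy"),
]
itemsets = build_itemsets(records, community={2526, 2656, 5689})
# itemsets == [['2526_Veryheavy', '5689_Smooth'], ['2656_Heavy', '5689_Heavy']]
```

Running the function once per detected community keeps each input to the rule miner small, which is the point of clustering before mining.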

6 APPLYING CLUSTERING METHODS TO GENERATE HOMOGENEOUS COMMUNITIES

We applied the chosen clustering methods to the case study graph and identified the communities with each method. The results of these algorithms are illustrated in Fig. I.

2 An association rule mining algorithm that reduces database reads by constructing a frequent pattern tree.

Fig. I: The top illustration is the result of Infomap, with 485 communities, and the bottom is the Blondel result, with 41 communities.

Fig. II: Example of a generated rule, presented in the Google Maps service.






7 ANALYZING RESULTS ON REAL DATA

Our purpose in this section is to answer this question: which approach produces interesting and beneficial rules in optimum time in this application? As Table II shows, the Blondel approach divided the graph, with high modularity, into 41 subgraphs. With minimum support and confidence of 5% and 30% respectively, it produced 3656 association rules, of which 682 are comprehensive³, equivalent to 18% of all rules. Because this data differs from basket data, the minimum support and confidence values used here are low. The Infomap approach divided the graph into 485 subgraphs according to the random walk measure. With minimum support and confidence of 5% and 30% it produced 3082 rules, of which 418 are comprehensive, equivalent to 21% of all rules. Comparing the averages of the comprehensibility values produced over the various support and confidence settings, we conclude that the Infomap approach gives good results on this measure. Rule comprehensibility is important here because it shows that the condition of one route affects many routes, which means identifying traffic bottlenecks.

There are many measures that complement the support-confidence approach in selecting interesting rules [12]. One way to assess the importance of rules is to use a variety of indexes; however, there is no agreement on a specific index, and the choice depends on the data and the application [11]. Here we use the lift and confidence indexes to assess rules. Lift is one of the well-known statistical measures of rule interestingness [10] and is shown in Eq. 3. Comparing Eq. 3 and 4 in light of our data type in this application, we can say that the confidence index is suitable, because it considers the relation between A and B without reference to all transactions; it also takes into account the transactions that include A but not B [11]. Suppose two road segments have a direct traffic relation in a specific month of the year, as in Fig. III. Even if A and B occur together in 100% of those cases, equivalent to 10 records, indexes like lift delete the rule from the rule list, because the total is 1000 records that include A, B, C, D, E and F, and by Eq. 1 A and B rarely appear together as a proportion of all transactions. Comparing the average confidence values produced by the Infomap method with those of the Blondel method in Table III, the Infomap result is the more effective.
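The comprehensibility count used in this section can be sketched as below. The paper defines a comprehensive rule as one whose antecedent is "less than" its consequent; the sketch assumes this means the antecedent has fewer items than the consequent. The rules listed are hypothetical and the function names are illustrative only.

```python
def is_comprehensive(antecedent, consequent):
    """Assumed reading: the antecedent has fewer items than the consequent."""
    return len(antecedent) < len(consequent)


def comprehensibility_ratio(rules):
    """Share of rules that are comprehensive; 0.0 for an empty rule list."""
    if not rules:
        return 0.0
    hits = sum(1 for a, c in rules if is_comprehensive(a, c))
    return hits / len(rules)


# Hypothetical rules as (antecedent, consequent) pairs of route_status items.
rules = [
    ({"2656_Heavy"}, {"5689_Heavy", "2526_Heavy"}),    # one route affects two
    ({"2656_Heavy"}, {"5689_Heavy"}),
    ({"2526_Veryheavy", "2656_Heavy"}, {"5689_Heavy"}),
    ({"5689_Smooth"}, {"2526_Smooth"}),
]
ratio = comprehensibility_ratio(rules)  # 1 of 4 rules -> 0.25
```

Under this reading, a high ratio means many rules in which a small set of routes is associated with the status of a larger set, which is how the section links comprehensibility to bottlenecks.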

$$\mathrm{Lift}(A \Rightarrow B) = \frac{p(A \cap B)}{p(A)\,p(B)} \qquad (3)$$

$$\mathrm{Confidence}(A \Rightarrow B) = \frac{|A \cap B|}{|A|} \qquad (4)$$

3 A rule is comprehensive when its antecedent is smaller than its consequent.

Fig. III: A and B have direct traffic in 100% of the inserted transactions.

8 CONCLUSION

In this paper, by converting the problem space into a graph network, we overcame the complexity issue, and by defining a required application on this complex space, namely the effect of routes on each other, we were able to produce interesting rules; Fig. II shows one of them. With this approach, the I/O problems that ARM⁴ algorithms face on large databases were resolved for this application, and interesting rules were generated that could not be extracted with ordinary approaches. By clustering highly connected routes and mining each cluster, we are confident that the generated rules are meaningful. The FP-Growth algorithm was tested and, as expected, is extremely efficient when facing a high attribute count. The results of this application are very useful for all organizations involved in traffic problems in their cities.

4 Association rule mining






TABLE II. RESULT OF ARM WITH FPGROWTH

Clustering algorithm  Detected clusters  Average time  All itemsets count
Blondel               41                 00:31:32      34926
Infomap               485                00:03:44      36770

Clustering algorithm  supp-min  conf-min  Rules count  Comprehensibility
Blondel               50%       80%       4            0
Blondel               30%       50%       15           0
Blondel               10%       50%       428          44
Blondel               5%        30%       3656         682
Infomap               50%       80%       177          3
Infomap               30%       50%       503          38
Infomap               10%       50%       1346         116
Infomap               5%        30%       3082         418

TABLE III. RESULT OF ARM WITH FPGROWTH AND CALCULATION OF LIFT AND CONFIDENCE MEASUREMENTS

Clustering algorithm  Detected clusters  supp-min  conf-min  Rules  Lift  Confidence
Blondel               41                 50%       80%       4      1.37  0.94
Blondel               41                 30%       50%       15     1.57  0.84
Blondel               41                 10%       50%       428    2.89  0.73
Blondel               41                 5%        30%       3656   3.83  0.64
Infomap               485                50%       80%       177    1.12  0.93
Infomap               485                30%       50%       503    1.18  0.80
Infomap               485                10%       50%       1346   1.66  0.77
Infomap               485                5%        30%       3082   2.24  0.66

REFERENCES

[1] X. Li, J. Han, J. Lee, Hector, 2007, 'Traffic Density-Based Discovery of Hot Routes in Road Networks', Advances in Temporal and Spatial Databases, pp. 441-459.

[2] R. Agrawal, R. Srikant, 1994, 'Fast Algorithms for Mining Association Rules', in VLDB Conference, pp. 487-499.

[3] A. Savasere, E. Omiecinski, S. Navathe, 1995, 'An Efficient Algorithm for Mining Association Rules in Large Databases', Proceedings of the 21st International Conference on VLDB, pp. 432-444.

[4] J. Han, J. Pei, Y. Yin, 2000, 'Mining Frequent Patterns without Candidate Generation', ACM-SIGMOD Int, Vol. 29 Issue. 2.

[5] A. Lancichinetti, S. Fortunato, 2010, 'Community Detection Algorithms: a Comparative Analysis', Physical Review E, Vol. 80 Issue. 5.

[6] V. D. Blondel, J.-L. Guillaume, R. Lambiotte, E. Lefebvre, 2008, 'Fast Unfolding of Communities in Large Networks', stacks.iop.org, P1008.

[7] S. Schaeffer, 2007, 'Graph Clustering', Journal of Elsevier, Vol. 1 Issue. 1, pp. 27-64.

[8] S. Fortunato, 2010, 'Community Detection in Graph', Journal of Elsevier, Vol. 486 Issue. 3-5, pp. 75-174.

[9] D. A. Spielman, 2008, 'Spectral Graph Theory, Random Walks on Graph', Vol. 1, pp. 1-75.

[10] P. P. Wakabi-Waiswa, V. Baryamureeba, 2008, 'Extraction of Interesting Association Rules Using Genetic Algorithms', International Journal of Computing and ICT, Vol. 2, No. 1, pp. 26-33.

[11] T. Raeder, Nitesh V. Chawla, 2010, 'Market Basket Analysis with Networks', Journal of Targeting Measurement and Analysis for Marketing, Vol. 11 Issue. 4, pp. 373-386.

[12] M. Plasse, N. Niang, G. Saporta, A. Villeminot, L. Leblond, 2007, 'Combined Use of Association Rules Mining and Clustering Methods to Find Relevant Links Between Binary Rare Attributes in Large DataSet', Comput. Statist. Data Anal., Vol. 52 Issue. 1, pp. 596-613.

First: V. Dehghan is a graduate student at Payame Noor University, Faculty of Engineering, Tehran Center, and an employee of Tehran Municipality, Information and Communication Technology.

Second: Sh. Khadivi, Dr. rer. nat. (Ph.D.) in Computer Science, July 2008. Research area: statistical machine translation.

Third: A. Faraahi, Payame Noor University, Computer Engineering and Information Technology Department.



