clustering of web users using session-based similarity measures


Edith Cowan University

Research Online

ECU Publications

2001

Clustering of Web Users Using Session-based Similarity Measures

Jitian Xiao, Edith Cowan University

Yanchun Zhang, University of Southern Queensland

This conference paper was originally published as: Xiao, J., & Zhang, Y. (2001). Clustering of Web Users Using Session-based Similarity Measures. Proceedings of 2001 International Conference on Computer Networks and Mobile Computing (pp. 223-228). Beijing, China. IEEE. Original article available at http://dx.doi.org/10.1109/ICCNMC.2001.962600

This Conference Proceeding is posted at Research Online.

http://ro.ecu.edu.au/ecuworks/4759



Clustering of Web Users Using Session-based Similarity Measures

Jitian Xiao
School of Computer and Information Science, Edith Cowan University, Mount Lawley, WA 6050, Australia

Yanchun Zhang
Dept. of Mathematics & Computing, University of Southern Queensland, Toowoomba, Qld 4350, Australia

Abstract

1. Introduction

The rapid web development and the increased number of available web searching tools push more and more organizations to put their information on the web and provide web-based services. In the meantime, the continuous growth in the size and use of the Internet is increasing the difficulties in searching for information. Reducing the Internet traffic load and user access cost is therefore particularly important.

One important research point in web usage mining is the clustering of web users based on their common properties. By analyzing the characteristics of the clusters, web designers may understand the users better and may provide more suitable, customized services to the users. One method to cluster web users is to measure the similarity of interests between web users' access patterns and then cluster them based on the similarities obtained. By mining web users' historical access patterns, not only the information about how the web is being used, but also some demographics and behavioral characteristics of web users could be determined [4]. The navigation path of the web users, if available to the server, carries valuable information about the user interests.

There has been an increased demand for understanding of web users due to the Web development and the increased number of web-based applications. Based on different criteria, web users can be clustered and useful knowledge can be extracted from web user access patterns [12]. Many applications can then benefit from the knowledge obtained. For example, the dynamic hypertext link generation among web pages could be suggested after discovering clusters of web users that exhibit similar information needs [13]. This results in a better understanding of how users visit the web site, and leads to an improved organization of the hypertext documents for navigational convenience. Another application is the prefetching of web pages to help users to personalize their needs, reducing their waiting time [6]. Other applications of this kind of knowledge include proxy cache organization [2, 3] and mapping between user navigation paths [13].

There exist quite a few methods of clustering web users in the literature [8, 9, 12]. However, the direct application of these methods on the primitive user access data is very inefficient and may not find interesting clusters, because a web server may contain thousands or even millions of pages, and web users may access web pages with a diversity of interests [9].

In our previous work [15], levels of similarities of web users are defined to capture different users' web-accessing interests. The definition of the similarity is application dependent. The similarity function could be based on visiting the same or similar pages, or the frequency of access to a page [1, 13], or even on the visiting orders of links (i.e., users' navigation paths). In the latter case, two users that access the same pages could be mapped into different groups of interest similarities if they access pages in distinct visiting orders. A matrix-based algorithm is then developed to cluster web users such that the users in the same cluster are closely related with respect to the similarity measure. However, the performance of the clustering method becomes worse with the increase of the number of users when a threshold number is reached. Moreover, a user may visit the same web site many times using the same or different paths, which makes it hard to measure the visiting-order based similarity. To deal with these problems, we make two extensions to the previous work in this paper: Firstly, we propose to cluster web users using a multilevel clustering method, which is more suitable for clustering a large number of web users. Secondly, the session-based similarity is adopted and the related clustering techniques are updated.

The rest of this paper is organized as follows: Section 2 defines the problem and some necessary concepts. In Section 3, we review the matrix-based clustering method proposed in [15], and then present a multilevel scheme for a large number of web users. Simulation results are presented in Section 4, and Section 5 concludes the paper.

0-7695-1381-6/01 $10.00 © 2001 IEEE

2. Problem Definitions

For simplicity, we limit our concerns to the part of web users' navigation paths inside a particular web site. From the Internet browsing logs, we could gather the following information about a web user: the frequency of a hyper-page usage, the lists of links she/he selected, the elapsed time between two links, and the order of pages accessed by individual web users.

2.1. User-Session identification

The task of identifying unique users is rather complicated by the existence of local caches, corporate firewalls, and proxy servers. Therefore some heuristics are commonly used to help identify unique users [8]. As the web access logs span a long period of time, it is likely that different users may use the same computer to access the web sites. Thus, simply using the machine's IP address to identify unique users is quite problematic because multiple users may share one computer, e.g., many students may share a computer in an IT laboratory.

We differentiate the log entries into user-sessions, or simply sessions, through a session timeout. A session refers to the unit of interaction between a user and a web server [9]. It consists of pages accessed by a user in a certain amount of time. If the time between page requests exceeds a certain limit, it is assumed that there is another user-session, even though the IP address is the same. A web user's historical access pattern may have more than one session because he/she may visit a web site from time to time and spend an arbitrary amount of time between consecutive visits.

Web user clusters are found based on sessions instead of the users' entire histories. Clustering sessions instead of users can be justified because our goal is to understand the usage of the web, and different sessions of a user may correspond to visits with different purposes in mind. In addition, multiple users on a shared computer can be represented by different sessions. In this paper, we will use session and user interchangeably.

The sessions are identified by grouping consecutive pages requested by the same user together. The data in a web server log is preprocessed to form a set of sessions in the form of (session-id, {page-id, time}), where session-id is a unique ID assigned to the session. A session (s-id, p0, 20, p1, 30, p2, 58) tells that a user spent 20 seconds on page p0, 30 seconds on page p1, and 58 seconds on page p2.
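As a rough illustration of how a log could be turned into sessions of this form, the sketch below groups requests by IP address and closes a session once the gap between two consecutive requests exceeds an idle timeout (the paper sets its own parameter to 30 minutes). The function name `sessionize`, the `(ip, timestamp, page)` input tuples, and the choice to record 0 seconds for the last page of a session (whose viewing time cannot be read from the log) are assumptions made for this example, not part of the paper.

```python
from collections import defaultdict

MAX_IDLE_SECONDS = 30 * 60  # idle timeout; the paper uses 30 minutes


def sessionize(requests, max_idle=MAX_IDLE_SECONDS, last_page_time=0):
    """Group (ip, timestamp_seconds, page) log entries into sessions.

    Returns a list of (session_id, [(page, seconds_on_page), ...]).
    The time on a page is taken as the gap to the next request; the last
    page of a session gets `last_page_time` (an assumption of this sketch).
    """
    by_ip = defaultdict(list)
    for ip, ts, page in sorted(requests, key=lambda r: r[1]):
        by_ip[ip].append((ts, page))

    sessions = []
    for ip, hits in by_ip.items():
        current = []
        for idx, (ts, page) in enumerate(hits):
            has_next = idx + 1 < len(hits)
            gap = hits[idx + 1][0] - ts if has_next else None
            if gap is not None and gap <= max_idle:
                # The next request arrives in time: stay in the same session.
                current.append((page, gap))
            else:
                # Timeout exceeded or end of log: close the current session.
                current.append((page, last_page_time))
                sessions.append((f"s{len(sessions) + 1}", current))
                current = []
    return sessions
```

For example, three requests from one address at 0, 20 and 50 seconds followed by a fourth more than half an hour later would yield two sessions, the first holding three pages and the second holding one.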

The web server's log is scanned to identify sessions. A session is created when a new IP address is met in the log. Subsequent requests from the same IP address are added to the session as long as the elapse of time between two consecutive requests does not exceed a predefined parameter max_idle_time, which is set as 30 minutes in our work. Otherwise, the current session is closed and a new session is created.

2.2. Session-based Similarity measures

Suppose that, for a given web site, there are $m$ sessions $S = \{s_1, s_2, \ldots, s_m\}$ accessing $n$ different web pages $P = \{p_1, p_2, \ldots, p_n\}$ in some time interval. For each page $p_i$ and each session $s_j$, we associate a usage value, denoted as $use(p_i, s_j)$, and defined as

$$use(p_i, s_j) = \begin{cases} 1 & \text{if } p_i \text{ is accessed by } s_j \\ 0 & \text{otherwise} \end{cases}$$

The $use(*, *)$ vector can be obtained by retrieving the access logs of the site. If two users accessed the same pages in sessions, they might have some similar interests in the sense that they are interested in the same information (e.g., news, electrical products etc.). The number of common pages they accessed can measure this similarity. The measure is defined by

$$Sim_1(s_i, s_j) = \frac{\sum_k \left( use(p_k, s_i) * use(p_k, s_j) \right)}{\sqrt{\sum_k use(p_k, s_i) \times \sum_k use(p_k, s_j)}} \qquad (1)$$

where $\sum_k use(p_k, s_i)$ is the total number of pages that were accessed by the user of session $s_i$, and $\sum_k \left( use(p_k, s_i) * use(p_k, s_j) \right)$ is the number of common pages accessed by both $s_i$ and $s_j$. If two users access the exact same pages, their similarity will be 1. The similarity measure defined this way is called usage based (UB) measure.

Generally, the similarity between two users can be measured by counting the number of times they access the common pages at all sites. In this case, the measure is defined by

$$Sim_2(s_i, s_j) = \frac{\sum_k \sum_w \left( a_w(p_k, s_i) * a_w(p_k, s_j) \right)}{\sqrt{\sum_k \sum_w \left( a_w(p_k, s_i) \right)^2 \times \sum_k \sum_w \left( a_w(p_k, s_j) \right)^2}} \qquad (2)$$

where $a_w(p_k, s_i)$ is the total number of times that the user of session $s_i$ accesses the page $p_k$ at site $w$. (2) is called frequency based (FB) measure.

The similarity between two users can be measured more precisely by taking into account the actual time the users spent on viewing each web page. Let $t(p_k, s_j)$ be the time the user of session $s_j$ spent on viewing page $p_k$ (assume that $t(p_k, s_j) = 0$ if $s_j$ does not include page $p_k$). In this case, the similarity between users can be expressed by

$$Sim_3(s_i, s_j) = \frac{\sum_k \left( t(p_k, s_i) * t(p_k, s_j) \right)}{\sqrt{\sum_k \left( t(p_k, s_i) \right)^2 \times \sum_k \left( t(p_k, s_j) \right)^2}} \qquad (3)$$

where $\sum_k \left( t(p_k, s_i) \right)^2$ is the square sum of the time the user of session $s_i$ spent on viewing pages at the site, and $\sum_k \left( t(p_k, s_i) * t(p_k, s_j) \right)$ is the inner product over time spent on viewing the common pages by users of $s_i$ and $s_j$. Even if two users access the exact same pages, their similarity might be less than 1 in this case, if they view a page for different amounts of time. (3) is called viewing-time based (VTB) measure.

In some applications, the accessing order of pages by a user is more important than the time spent on viewing each page. In this case, two users (or sessions) are considered as having the same interests only when they access a sequence of web pages in the exact same order. The similarity between users, in such a situation, can be measured by checking the access orders of web pages in their navigation paths. Let $Q = (q_1, q_2, \ldots, q_r)$ be a navigation path, where $q_i$, $1 \le i \le r$, stands for the page accessed in order. We call $Q$ an $r$-hop path. Define $Q_l$ as the set of all possible $l$-hop subpaths ($l \le r$) of $Q$, i.e., $Q_l = \{(q_i, q_{i+1}, \ldots, q_{i+l-1}) \mid i = 1, 2, \ldots, r-l+1\}$. It is obvious that $Q_1$ contains all pages in $Q$. We call $f(Q) = \bigcup_{l=1}^{r} Q_l$ the feature space of path $Q$. Note that a cyclic path may include some of its subpaths more than once, and $Q \subset f(Q)$.

Now let $Q^i$ and $Q^j$ be the navigation paths accessed by users in session $s_i$ and $s_j$, respectively. The similarity between $s_i$ and $s_j$ can be defined using the natural angle between paths $Q^i$ and $Q^j$ (i.e., $\cos(\theta_{Q^i, Q^j})$), which is defined as:

$$Sim_4(s_i, s_j) = \frac{\langle Q^i, Q^j \rangle_l}{\sqrt{\langle Q^i, Q^i \rangle_l \, \langle Q^j, Q^j \rangle_l}} \qquad (4)$$

where $l = \min(length(Q^i), length(Q^j))$, and $\langle Q^i, Q^j \rangle_l$ is the inner product over the feature spaces of paths $Q^i$ and $Q^j$, which is the same for all $l \ge \min(length(Q^i), length(Q^j))$. We call (4) the visiting-order based (VOB) measure.

The similarity between web users is application-dependent. If for some reason a more complicated similarity measure is needed, an applicable one could be defined for the individual application. For example, consider the similarity among the navigation paths between web users described by Shahabi et al. [12]. The navigation paths described in that paper are reproduced here as sessions (the names in the paths are the titles of web pages):

(s1, Main, 20, Movies, 15, News, 43, Box-Office, 52, News, 31, Evita, 44);
(s2, Music, 11, Box-Office, 12, Crucible, 13, Books, 19);
(s3, Main, 33, Movies, 21, Box-Office, 44, News, 53, Box-Office, 61, Evita, 31);
(s4, Main, 19, Movies, 21, News, 38, Box-Office, 61, News, 24, Evita, 31, News, 19, Evita, 39);
(s5, Movies, 32, Box-Office, 17, News, 64, Box-Office, 19, Evita, 50);
(s6, Main, 17, Box-Office, 33, News, 41, Box-Office, 54, Evita, 56, News, 47).

The computation of similarity among web users' sessions results in an $m \times m$ matrix, called users' session-based similarity matrix (SM). Assume that the above six sessions, identified by s1, s2, ..., s6, are the access traces of six users. By using formula (1), the similarity between them is SM1 as shown below. The first and the third users visited the exact same pages, thus the similarity between them is 1 (i.e., SM1(1, 3) = 1). On the other hand, the similarity between the first and the second user is 0.224 (i.e., SM1(1, 2) = 0.224) because only one common page (i.e., Box-Office) is accessed, although they visited five and four pages, respectively.

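$$SM_1 = \begin{pmatrix}
1 & .224 & 1 & 1 & .894 & .894 \\
.224 & 1 & .224 & .224 & .25 & .25 \\
1 & .224 & 1 & 1 & .894 & .894 \\
1 & .224 & 1 & 1 & .894 & .894 \\
.894 & .25 & .894 & .894 & 1 & .75 \\
.894 & .25 & .894 & .894 & .75 & 1
\end{pmatrix}$$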
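To make the measures concrete, the following sketch computes the UB measure of formula (1) for the six example sessions; rounded to three decimals it should give the SM1 entries quoted in the text, such as SM1(1, 2) = 0.224. It also shows how FB and VTB vectors and the feature space f(Q) might be built. The helper names, the single-site simplification for FB, and the choice to sum viewing times over repeat visits to the same page are assumptions of this sketch; the inner product used by the VOB measure is not implemented here.

```python
import math
from collections import Counter

# The six example sessions from the text, as (page, seconds) pairs in visiting order.
SESSIONS = [
    [("Main", 20), ("Movies", 15), ("News", 43), ("Box-Office", 52), ("News", 31), ("Evita", 44)],
    [("Music", 11), ("Box-Office", 12), ("Crucible", 13), ("Books", 19)],
    [("Main", 33), ("Movies", 21), ("Box-Office", 44), ("News", 53), ("Box-Office", 61), ("Evita", 31)],
    [("Main", 19), ("Movies", 21), ("News", 38), ("Box-Office", 61), ("News", 24), ("Evita", 31),
     ("News", 19), ("Evita", 39)],
    [("Movies", 32), ("Box-Office", 17), ("News", 64), ("Box-Office", 19), ("Evita", 50)],
    [("Main", 17), ("Box-Office", 33), ("News", 41), ("Box-Office", 54), ("Evita", 56), ("News", 47)],
]


def usage_vector(session):
    """use(p, s): 1 for every page the session accessed (UB measure)."""
    return {page: 1 for page, _ in session}


def frequency_vector(session):
    """Number of accesses per page (FB measure, assuming a single site)."""
    return dict(Counter(page for page, _ in session))


def viewing_time_vector(session):
    """Total seconds per page (VTB measure); repeat visits are summed (assumption)."""
    totals = Counter()
    for page, seconds in session:
        totals[page] += seconds
    return dict(totals)


def cosine(u, v):
    """Inner product over common pages divided by the root of the squared sums."""
    dot = sum(x * v.get(page, 0) for page, x in u.items())
    norm = math.sqrt(sum(x * x for x in u.values()) * sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


def similarity_matrix(sessions, vectorize):
    vectors = [vectorize(s) for s in sessions]
    return [[cosine(u, v) for v in vectors] for u in vectors]


def feature_space(path):
    """f(Q): every l-hop subpath of Q for l = 1..r, returned as a set of tuples."""
    r = len(path)
    return {tuple(path[i:i + l]) for l in range(1, r + 1) for i in range(r - l + 1)}


SM1 = [[round(x, 3) for x in row] for row in similarity_matrix(SESSIONS, usage_vector)]
```

Because use(p, s) only takes the values 0 and 1, the squared sum in `cosine` equals the number of pages accessed, so the same function serves formula (1) and formula (3).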

2.3. Data preprocessing

Sessions are extracted from web user access logs. The user access logs provide accurate, active and objective information about the web usage of the users. Moreover, most web servers contain such information in their log of page requests. Each record of the web server's log represents a page request from a web user. A typical record contains the user's IP address, the date and time the request is received, the URL of the page requested, the protocol of the request, the return code of the server indicating the status of the request handling, and the size of the page if the request is successful. From such a web server log, user access patterns can be extracted, which consist of the pages the user visited and the time she/he spent on them. Sessions can then be produced.

The data in the SM matrix also need to be preprocessed. In most cases, we are interested in those users (sessions) among whom higher similar interests are shown with respect to a particular similarity measure. For this purpose, we could determine a similarity threshold, t, to split the user clusters. Users with similarity measure greater than or equal to t are considered to be in the same cluster. For instance, if t = 0.2, SM4 becomes the following SM4' after preprocessing.

$$SM_4 = \begin{pmatrix}
1 & .01 & .006 & .618 & .096 & .08 \\
.01 & 1 & .02 & .006 & .027 & .02 \\
.006 & .02 & 1 & .063 & .735 & .271 \\
.618 & .006 & .063 & 1 & .066 & .069 \\
.096 & .027 & .735 & .066 & 1 & .362 \\
.08 & .02 & .271 & .069 & .362 & 1
\end{pmatrix}$$
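The thresholding step can be sketched as follows, using the SM4 values with the first row filled in by symmetry. Entries below t are set to 0 and the rest are kept. The function names are illustrative, and reading "considered in a same cluster" as connected components of the thresholded matrix is only one possible interpretation, not the paper's matrix-based algorithm.

```python
SM4 = [
    [1.0, 0.01, 0.006, 0.618, 0.096, 0.08],
    [0.01, 1.0, 0.02, 0.006, 0.027, 0.02],
    [0.006, 0.02, 1.0, 0.063, 0.735, 0.271],
    [0.618, 0.006, 0.063, 1.0, 0.066, 0.069],
    [0.096, 0.027, 0.735, 0.066, 1.0, 0.362],
    [0.08, 0.02, 0.271, 0.069, 0.362, 1.0],
]


def apply_threshold(sm, t):
    """Keep entries with similarity >= t and zero out the rest."""
    return [[v if v >= t else 0.0 for v in row] for row in sm]


def groups(sm_thresholded):
    """Connected components of the thresholded matrix (one reading of 'same cluster')."""
    n = len(sm_thresholded)
    seen, result = set(), []
    for start in range(n):
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], []
        while stack:
            i = stack.pop()
            component.append(i)
            for j in range(n):
                if j not in seen and sm_thresholded[i][j] > 0:
                    seen.add(j)
                    stack.append(j)
        result.append(sorted(component))
    return result


SM4_PRIME = apply_threshold(SM4, 0.2)
GROUPS = groups(SM4_PRIME)  # zero-based session indices
```

With t = 0.2 this groups sessions s1 and s4 together, leaves s2 on its own, and groups s3, s5 and s6.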

$$SM_4' = \begin{pmatrix}
1 & 0 & 0 & .618 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & .735 & .271 \\
.618 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & .735 & 0 & 1 & .362 \\
0 & 0 & .271 & 0 & .362 & 1
\end{pmatrix}$$

The similarity measures can be useful for various web-based applications, even for improving the Web site performance. For example, based on the SM matrix, we are able to cluster web users into clusters such that the users in the same cluster are closely related with respect to the interest similarity measures. This clustering result can then be used for many applications [12].

3. A Matrix-based clustering algorithm

In this section, we will first review the matrix-based web user clustering algorithm proposed in [15], briefly evaluate its performance, and then propose a multilevel scheme for clustering a large number of web users.

3.1. The matrix-based clustering algorithm

$SM(i, j)$ represents the similarity measure between sessions (or users) $s_i$ and $s_j$ ($1 \le i, j \le m$). The greater the value of $SM(i, j)$, the closer the two users $s_i$ and $s_j$ are related. Note that SM is a symmetric matrix and elements along the main diagonal are all the same (i.e., $SM(i, i) = 1$, $\forall\, 1 \le i \le m$). Thus only those elements in the upper triangular matrix need to be stored in implementation.

Clustering web users into groups is equivalent to decomposing their SM matrix into sub-matrices. Our goal of clustering users (sessions) is achieved by two steps: (1) permute rows and columns of the matrix such that those "closely related" elements are located closely in the matrix; and (2) find the dividing point that decomposes the matrix into sub-matrices. The clustering algorithm is detailed in [15]. The complexity of the algorithm is $O(m^2 \log_2 m)$, where $m$ is the number of sessions (or users).

3.2. Multilevel scheme for clustering large amount of web users

The matrix-based clustering algorithm works well for a small number of users (or sessions), say $m < 100$. With the increase of the number of users (or sessions), its performance becomes worse. We now propose a multilevel scheme for clustering a large number of web users.

The idea of the multilevel clustering scheme is as follows. For a large number of sessions, a session similarity graph (SG) is created whose node set consists of all sessions. If $SM(s_i, s_j) \ne 0$, an edge $(s_i, s_j)$ is in the edge set with an edge weight of $w(s_i, s_j) = SM(s_i, s_j)$. The SG graph is first coarsened down to a threshold (say a hundred) number of nodes; a partition phase of this much smaller graph is applied; then the partition is projected back towards the original graph (finer graph); and finally each part of the finer graph is mapped to a similarity matrix that can be clustered using the matrix-based clustering method. Formally, for a weighted graph $G_0 = (V_0, E_0)$, with weights both on nodes and edges, the multilevel scheme consists of four phases.

Phase 1 (Coarsening): $G_0$ is transformed into a sequence of smaller graphs $G_1, G_2, \ldots, G_k$ such that $|V_0| > |V_1| > \ldots > |V_k|$, with $|V_k|$ less than a pre-determined threshold $t$ (say $t = 100$).

Phase 2 (Partitioning): A partition $P_k$ of $G_k$ is computed that divides $V_k$ into $q$ parts, each containing about $|V_0|/q$ nodes of $G_0$.

Phase 3 (Uncoarsening): The partition $P_k$ of $G_k$ is projected back to $G_0$ by going through intermediate partitions $P_{k-1}, P_{k-2}, \ldots, P_1, P_0$.

Phase 4 (Fine Partitioning): Each of $V_{01}, V_{02}, \ldots, V_{0q}$ is further partitioned (say by mapping to an SM matrix and then using the matrix-based clustering method).
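Phases 1 and 3 can be illustrated with a small sketch that builds the similarity graph from an SM matrix, performs one coarsening level, and projects a coarse partition back without refinement. The merging strategy used here, a greedy heavy-edge matching, is an assumption made for the example (the paper only notes that coarsening can be achieved in various ways [14]), and all function names are hypothetical. Node weights of merged nodes are summed, and parallel edges to the same multinode have their weights added up, so the edge-cut of a partition is the same on the coarse and the fine graph.

```python
def build_similarity_graph(sm):
    """Adjacency dict {i: {j: weight}} with an edge wherever SM(i, j) != 0."""
    n = len(sm)
    graph = {i: {} for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if sm[i][j] != 0:
                graph[i][j] = sm[i][j]
                graph[j][i] = sm[i][j]
    return graph


def coarsen_once(graph, node_weight):
    """One coarsening level using greedy heavy-edge matching (assumed strategy).

    Returns (coarse_graph, coarse_node_weight, mapping), where mapping[u] is
    the multinode of the coarser graph that fine node u was merged into.
    """
    mapping = {}
    next_id = 0
    for u in sorted(graph):
        if u in mapping:
            continue
        # Candidate partners: neighbours that have not been matched yet.
        candidates = [(w, v) for v, w in graph[u].items() if v not in mapping]
        mapping[u] = next_id
        if candidates:
            _, v = max(candidates)  # heaviest incident edge wins
            mapping[v] = next_id
        next_id += 1

    coarse = {c: {} for c in range(next_id)}
    coarse_weight = {c: 0 for c in range(next_id)}
    for u, c in mapping.items():
        coarse_weight[c] += node_weight[u]
        for v, w in graph[u].items():
            d = mapping[v]
            if d != c:
                # Edges from one multinode to the same neighbour are summed.
                coarse[c][d] = coarse[c].get(d, 0) + w
    return coarse, coarse_weight, mapping


def project_partition(coarse_partition, mapping):
    """Non-refinement uncoarsening: each fine node inherits its multinode's part."""
    return {u: coarse_partition[c] for u, c in mapping.items()}


def edge_cut(graph, partition):
    """Total weight of edges whose endpoints lie in different parts."""
    return sum(w for u in graph for v, w in graph[u].items()
               if u < v and partition[u] != partition[v])
```

Repeating `coarsen_once` until the node count drops below the threshold gives the sequence of coarser graphs, and applying `project_partition` level by level in reverse order walks the partition back to the original graph.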

Phase 2 and Phase 4 can be implemented using the matrix-based algorithm discussed in previous sections, if the number of nodes of the input graph is no more than the predetermined threshold. Furthermore, if the number of nodes of the input graph in Phase 4 is still very large (say, greater than $t$), then the finer graph can be treated as an input graph of Phase 1 and a recursive procedure applies. We focus on Phase 1 and Phase 3 in the rest of this section.

During the coarsening phase, a sequence of smaller graphs, each with fewer nodes, is constructed. Graph coarsening can be achieved in various ways [14].

At the coarsening phase, a set of nodes of $G_i$ is combined to form a single node of the next level coarser graph $G_{i+1}$. Let $V_i^v$ be the set of nodes of $G_i$ combined to form node $v$ of $G_{i+1}$. We refer to node $v$ as a multinode. The weight of node $v$ is recomputed according to the weights of the nodes in $V_i^v$. Also, in order to preserve the connectivity information in the coarser graph, the edges of $v$ are the union of the edges of the nodes in $V_i^v$. In the case that more than one node of $V_i^v$ have edges to the same node $u$, the weight of the edge of $v$ is equal to the sum of the weights of these edges. This is useful when we evaluate the quality of a partition at a coarser graph. The weight of the edge-cut (1) of the partition in a coarser graph will be equal to that of the edge-cut of the same partition in the finer graph.

(1) An edge-cut of a graph is a set of edges whose removal will disconnect the graph.

During the uncoarsening phase, the partition $P_k$ of the coarser graph $G_k$ is projected back to the original graph by going through the graphs $G_{k-1}, G_{k-2}, \ldots, G_1, G_0$. Since each node of $G_{i+1}$ contains a distinct subset of nodes of $G_i$, obtaining $P_i$ from $P_{i+1}$ is done by simply assigning the set of nodes $V_i^v$ collapsed to $v \in V_{i+1}$ to the partition $P_{i+1}[v]$ (i.e., $P_i[u] = P_{i+1}[v]$, $\forall u \in V_i^v$). We refer to this scheme as non-refinement uncoarsening. In order to get better projected partitions, refinement during the uncoarsening phase is usually employed. Our refinement algorithm to uncoarsen coarser graphs to finer ones is based on the Kernighan-Lin (KL) partition algorithm [11].

4. Simulations

The simulation is to demonstrate the capability of our clustering method for clustering Web users with similar interests. Using the obtained knowledge, we conducted another simulation to demonstrate the latency reduction of web document pre-fetching between caching proxies and browser users in our work (due to space limits, results of the second simulation are omitted).

Fig 1. Clustering of 600 user-sessions (plot of cluster counts against the similarity threshold t).

In our simulation, we cluster web users using the multilevel scheme. The SG graph is produced using the session-based similarity measures

computed from target data sets. Four similarity measures (i.e., UB, FB, VTB and VOB) are used and compared with each other. In the simulations, some data sets are generated, while others are extracted from actual Internet access log files. For instance, we extracted some trace data from the BU-Web-Client trace, which can be freely downloaded from the Internet [5]. The traces contain records of the HTTP requests and user behavior of a set of Mosaic clients running in the Boston University Computer Science Department, spanning the timeframe of 21 Nov. 1994 through 8 May 1995. There are in total 1,143,839 requests of data transfer, from a population of 762 different users. Due to the memory limitation of our experimental setup, we extract some 600 sessions as our sampling data space.

During the simulations, we run 10 times at each simulation point. The value on the y-axis is the mean of all runs. Fig. 1 shows the results of clustering 600 user sessions. The similarity threshold t changes from 0.1 to 0.9. For each t value, a different number of clusters is produced. On average, the number of clusters produced for the VOB-based measure is always greater than that of the others, suggesting a finer granularity of clusters of users with similar interests. The number of clusters produced for the UB-based measure is always the smallest among the four measures.

5. Conclusions

We presented a multilevel scheme for clustering web users using session-based similarities in order to capture the common interests among web users, which are characterized using different similarity measures. As a web user may visit a web site from time to time and spend an arbitrary amount of time between consecutive visits, the web user clusters are found based on sessions instead of the user's entire histories. For some popular sites, the web servers may contain thousands or even millions of pages, and web users may access web pages with a diversity of interests. Whenever the number of sessions is greater than a threshold, the sessions, and thus the related users, are clustered through a multilevel clustering scheme; otherwise they are clustered by the matrix-based clustering methods. Experiments have been conducted and the results have shown that our method is capable of clustering web users with similar interests.

References

[1] T. Bray, Measuring the Web.
Proc. of the Fifth International World Wide Web Conference, Paris, France, May 1996.
[2] P. Cao and S. Irani, Cost-Aware WWW Proxy Caching Algorithms, Proc. of the 1997 USENIX Symposium on Internet Technology and Systems, Dec 1997.
[3] P. Cao, J. Zhang and K. Beach, Active Cache: Caching Dynamic Contents on the Web, Proc. of IFIP International Conference on Distributed Systems Platforms and Open Distributed Processing (Middleware '98), 1998.
[4] L. D. Catledge and J. E. Pitkow, Characterizing Browsing Strategies in the World Wide Web, Electronic Proc. of the 3rd International WWW Conference, Darmstadt, Germany, April 1995.
[5] C. A. Cunha, A. Bestavros and M. E. Crovella, Characteristics of WWW Client Traces, Technical Report TR-95-010, Boston University Department of Computer Science, April 1995.
[6] C. R. Cunha and C. F. B. Jaccound, Determining WWW User's Next Access and its Application to Prefetching, Proc. of International Symposium on Computers and Communication'97, Alexandria, Egypt, July 1997.
[7] L. Fan, P. Cao, W. Lin and Q. Jacobson, Web Prefetching between Low-Bandwidth Client and Proxies: Potential and Performance, SIGMETRICS'99, 1999.
[8] R. Cooley, B. Mobasher and J. Srivastava, Data Preparation for Mining World Wide Web Browsing Patterns, Knowledge and Information Systems, No. 1, 1999.
[9] Y. Fu, K. Sandhu and M-Y Shih, Clustering of Web Users Based on Access Patterns, Proc. of WEBKDD'99, San Diego, USA, 1999.
[10] S. D. Gribble, UC Berkeley Home IP HTTP Traces, July 1997, http://www.acm.org/sigcomm/ITA/.
[11] G. Karypis and V. Kumar, Multilevel Graph Partition and Sparse Matrix Ordering, Intl. Conf. on Parallel Processing, 1995.
[12] C. Shahabi, A. M. Zarkesh, J. Adibi and V. Shah, Knowledge Discovery from Users Web Page Navigation, IEEE RIDE'97, 1997.
[13] T. W. Yan, M. Jacobsen, H. G. Molina and U. Dayal, From User Access Patterns to Dynamic Hypertext Linking, Proc. of the Fifth International World Wide Web Conference, Paris, France, May 1996.
[14] J. Xiao, Y. Zhang & X. Jia, A Graph-based Multilevel Scheme for Reducing Disk Access Cost of Spatial Join Processing, Proc. of the International Conference on High Performance Computing (HPC'2000), Beijing, China, May 2000, pp. 823-830.
[15] J. Xiao, Y. Zhang, X. Jia & T. Li, Measuring Similarity of Interests for Clustering Web-Users, Proc. of the 12th Australian Database Conference 2001 (ADC'2001), Gold Coast, Australia, 29 January - 2 February, 2001, pp. 107-114.
