1
From User Access Patterns to Dynamic Hypertext Linking
Patrick Farrell, Siddharth Gudka, Mike Oxley, Simon Phillips
A Research Directions In Computing Presentation
T. Yan, M. Jacobsen, H. Garcia-Molina, U. Dayal
2
Agenda
• Introduction
• Some theory
• The paper
• A short critique
• After the paper
  – Academic research
  – The Authors’ work
• The technology in use today
• Conclusion
• Questions
3
Introduction
Hypothesis
That hyperlinks to unvisited and indirectly linked pages can be offered based upon pages the user has already visited
Experiment
a) to analyse log files to form clusters of commonly co-accessed pages
b) to categorize online users into the correct categories and offer appropriate links
4
Mass customisation
• Concept of adapting things to each user – on a large scale
• Economic benefit in adding value
• Satisfied shoppers also more likely to return
• What’s new?
  – In the physical world, customisation doesn’t scale.
  – Using technology and intelligent algorithms, it can.
5
Adaptive Web Sites
• Sites that automatically improve their organisation and presentation based on visitor access patterns
• We can cluster pages on a site together based on their co-occurrence frequency
  – Likelihood that user will visit page P having visited Q
• For a user browsing the site, use session history to predict which pages a user may want to access – and so adapt site
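The co-occurrence idea can be illustrated with a minimal sketch. It assumes sessions are already available as lists of page paths; the function names and the sample paths are hypothetical and do not come from the paper.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(sessions):
    """Count how many sessions contain each page and each unordered page pair."""
    page_counts = Counter()
    pair_counts = Counter()
    for session in sessions:
        pages = sorted(set(session))  # ignore repeat visits within one session
        page_counts.update(pages)
        pair_counts.update(combinations(pages, 2))
    return page_counts, pair_counts


def prob_visit(p, q, page_counts, pair_counts):
    """Estimate the likelihood that a user visits p, given a visit to q."""
    if page_counts[q] == 0:
        return 0.0
    pair = tuple(sorted((p, q)))  # pairs are stored in sorted order
    return pair_counts[pair] / page_counts[q]


# Hypothetical sessions, each a list of page paths
sessions = [
    ["/home", "/courses", "/staff"],
    ["/home", "/courses"],
    ["/home", "/research"],
    ["/courses", "/staff"],
]
page_counts, pair_counts = cooccurrence_counts(sessions)
print(prob_visit("/staff", "/courses", page_counts, pair_counts))
```

Pages with a high estimated likelihood for the pages already in the session are the natural candidates for suggested links.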
6
The Paper
• Yan et al. implement an adaptive web site, based on user access logs.
• Paper discusses different approaches to clustering and implementation
• Experimental data is presented
  – validating the concept of clustering on an academic site
  – showing the value added by an adaptive website using their technique
• The log analysis software used is published
7
The paper - Justification
• Use the metaphor of a shopper browsing an online shop
• Adaptive site can provide links to similar items to those being browsed
  – e.g. “Male Yuppie” browsing executive toys
  – Might also be interested in sportswear
• As site grows, static links to ‘related’ content more of a challenge - dynamic is much better
• Many practical examples today – but not 10 years ago!
8
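The Paper – System Design
[Figure: system design diagram. Offline components: Access logs, Preprocess, Cluster, User Categories. Online components: End user, Web Server, Link Generator, HTML Documents; the end user requests a URL and receives HTML with suggestions.]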
9
The paper - Preprocessing
• For each user session
  – form an n-dimensional vector of the pages visited
  – can weight vector elements using a metric
    • Number of hits to page
    • Estimate of time spent on page (possibly normalised)
• ‘Close’ session vectors in n-dimensional space form a cluster
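A minimal sketch of the preprocessing step, assuming each session is a list of (page, seconds on page) pairs and that the site's pages are known in advance. The function name, the `weight` parameter and the choice to normalise time by the session total are illustrative assumptions, not details taken from the paper.

```python
def session_vector(visits, pages, weight="hits"):
    """Build an n-dimensional vector for one session.

    visits: list of (page, seconds_on_page) tuples in visit order
    pages:  ordered list of all n pages on the site
    weight: "hits" counts visits; "time" sums seconds, normalised by the session total
    """
    index = {page: i for i, page in enumerate(pages)}
    vector = [0.0] * len(pages)
    for page, seconds in visits:
        if page in index:  # skip pages that are not part of the site map
            vector[index[page]] += 1.0 if weight == "hits" else float(seconds)
    if weight == "time":
        total = sum(vector)
        if total > 0:
            vector = [v / total for v in vector]
    return vector


# Hypothetical three-page site and one session
pages = ["/a", "/b", "/c"]
visits = [("/a", 30), ("/b", 10), ("/a", 20)]
print(session_vector(visits, pages))           # weighted by hits
print(session_vector(visits, pages, "time"))   # weighted by share of session time
```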
10
The paper - Clustering
• Different algorithms to cluster vectors by ‘closeness’
• Paper uses Leader algorithm – with additional constraints
  – Constraint: Minimum hits in a valid session
  – Constraint: Minimum cluster size
• Algorithm fast and memory efficient
  – But not order invariant
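The following is a sketch of the textbook single-pass Leader algorithm with the two constraints listed above. It is not necessarily the exact variant used in the paper; the Euclidean distance, the `threshold` parameter and the way the constraints are applied are assumptions made for illustration.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def leader_cluster(vectors, threshold, min_hits=1, min_cluster_size=1):
    """Single-pass Leader clustering.

    Each vector joins the first cluster whose leader lies within `threshold`;
    otherwise it becomes the leader of a new cluster.
    """
    clusters = []  # each cluster is a list of vectors; element 0 is the leader
    for v in vectors:
        if sum(v) < min_hits:  # constraint: skip sessions with too few hits
            continue
        for cluster in clusters:
            if euclidean(v, cluster[0]) <= threshold:
                cluster.append(v)
                break
        else:
            clusters.append([v])  # no leader was close enough
    # constraint: discard clusters that are too small
    return [c for c in clusters if len(c) >= min_cluster_size]


# The same three vectors in two different orders
in_order = leader_cluster([[1.0], [2.0], [3.0]], threshold=1.0)
reordered = leader_cluster([[2.0], [1.0], [3.0]], threshold=1.0)
print(len(in_order), len(reordered))
```

Only one pass over the data is needed and only the leaders must be compared against, which is why the algorithm is fast and light on memory. The two calls at the end give different numbers of clusters for the same vectors, which is the order dependence noted on the slide.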
11
Dynamic Link Generation
• Use session history to track pages a user has visited
  – Authors buffered logs in memory using a database
  – Sessions part of most web servers now
• Match partial vector of session with pre-calculated categories to build list of appropriate pages
  – Partial vector, so Euclidean distance not necessarily appropriate
  – May be better to simply count matching categories
• Filter the suggestion list to remove pages visited - and possibly any already adjacent in navigation tree
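A minimal sketch of the online step, assuming the offline clustering has produced a dictionary of categories, each an ordered list of pages. Matching here simply counts the pages a session shares with each category rather than using a distance measure; the function, its parameters and the sample data are hypothetical.

```python
def suggest_links(visited, categories, adjacent=(), max_links=3):
    """Suggest unvisited pages from the category that best matches the session.

    visited:    pages seen so far in this session (the partial vector)
    categories: dict of category name -> list of pages, most relevant first
    adjacent:   pages already linked from the current page, filtered out
    """
    if not categories:
        return []
    visited = set(visited)
    # Score each category by how many of its pages the user has already visited
    best = max(categories, key=lambda name: len(visited & set(categories[name])))
    if not visited & set(categories[best]):
        return []  # the session overlaps with no category
    exclude = visited | set(adjacent)
    return [p for p in categories[best] if p not in exclude][:max_links]


# Hypothetical categories produced by the offline clustering
categories = {
    "executive": ["/toys/desk-golf", "/toys/newton", "/sport/shoes"],
    "garden": ["/garden/spade", "/garden/hose"],
}
print(suggest_links(["/toys/desk-golf"], categories))
print(suggest_links(["/toys/desk-golf"], categories, adjacent=["/toys/newton"]))
```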
12
Paper – Experimental results
• Time spent on particular pages follows Zipfian distribution – not useful for page weight
• The authors present a number of experimental results about clustering algorithm parameters, e.g. min. cluster size
• Found clusters on academic website that were not evident from hypertext layout – so clustering serves purpose.
13
Critique
• Paper presents new concept of clustering web accesses – but essentially draws together existing work in other fields
• Makes key simplifications
  – Ignores any web caching, proxies, etc.
  – Considering all pages in a session as being in a category is naïve – e.g. navigation pages, indexes, etc.
• Weakness in experiments
  – Authors invented nominal ‘sessions’ based on unique end-user addresses as server didn’t support sessions
  – Only present data for one site
    • 2,709 sessions – of which 50% were in the same cluster!
14
Further Work
• Garcia-Molina
  – Beyond Document Similarity: Understanding Value-Based Search and Browsing Technologies (2000)
    • Discusses judging value of web documents based on user behaviour
• Dayal
  – Knowledge-Based Support Services: Monitoring and Adaptation (2000)
    • Discusses a Knowledge-Based Service deployed within HP to deliver customer support services.
    • System adapts based on observed user patterns and evolving needs
15
Related Work
• Web Prefetching (Jiang & Kleinrock, 1998)
  – Addresses slow access speeds of World Wide Web
    • PREDICTION MODULE: Computes access probabilities.
    • THRESHOLD MODULE: Computes prefetch thresholds.
  – Uses clustering to divide users into categories by access probability
• Restoring Meaningful Episodes in a Proxy Log (Lou et al. 2001)
  – Extracting user’s activity information from proxy logs
  – Classifies individual requests into meaningful semantic elements
  – Semantics-based CUT-AND-PICK approach
16
Related Work
• SUGGEST (Baraglia et al. 2002, 2004)
  – No off-line component
  – Quality metric to estimate effectiveness of suggestions
• Media Agents (Wenyin et al. 2003)
  – Automatic collection of semantic indices of multimedia data
  – Semantic descriptions from content of documents
  – User’s interaction refines semantic indices and suggests other multimedia data
17
Applications & The Paper
• Custom application - Analog
• Means: uses clustering tech to analyse log files
• End: to dynamically generate possibly interesting links
• Successful (to an extent)
18
1996-2005 Technology Directions
• Clustering Documents
  – Vivisimo
  – Google Labs
• Collaborative Filtering
  – Amazon
  – Flickr
  – Tivo
19
Amazon.com
• Uses recommendation algorithm
  – person who bought ‘x’ also bought ‘y’
• Item-to-item collaborative filtering
  – provides recommendations based on grouped items, not customers
Essence:
For each item in product catalog, I1
  For each customer C who purchased I1
    For each item I2 purchased by customer C
      Record that a customer purchased I1 and I2
  For each item I2
    Compute the similarity between I1 and I2
20
Amazon.com
• Creates vectors where each vector is an item with M dimensions (customers)
• Similarity between two items computed by measuring cosine of angle between two vectors.
• Offline computation theoretically expensive: O(N²M) for N items and M customers
• In practice only O(NM) as most customers have few purchases.
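A minimal sketch of the item-to-item idea, assuming purchase history is a dictionary mapping each customer to the items they bought. Each item is treated as a binary vector over customers, stored sparsely as a set of buyers, so the cosine reduces to the number of shared buyers divided by the square root of the product of the buyer counts. The function names and sample data are hypothetical; this is not Amazon's implementation.

```python
import math
from collections import defaultdict


def item_vectors(purchases):
    """Map each item to the set of customers who bought it (a sparse binary vector)."""
    buyers = defaultdict(set)
    for customer, items in purchases.items():
        for item in items:
            buyers[item].add(customer)
    return buyers


def cosine(buyers_a, buyers_b):
    """Cosine of the angle between two binary customer vectors."""
    if not buyers_a or not buyers_b:
        return 0.0
    return len(buyers_a & buyers_b) / math.sqrt(len(buyers_a) * len(buyers_b))


def similar_items(purchases):
    """Build the item-to-item similarity table, visiting only co-purchased pairs."""
    buyers = item_vectors(purchases)
    table = defaultdict(dict)
    for i1 in buyers:
        for customer in buyers[i1]:
            for i2 in purchases[customer]:
                if i2 != i1 and i2 not in table[i1]:
                    table[i1][i2] = cosine(buyers[i1], buyers[i2])
    return table


# Hypothetical purchase history
purchases = {
    "alice": ["book", "lamp"],
    "bob": ["book", "lamp", "mug"],
    "carol": ["book"],
}
table = similar_items(purchases)
print(sorted(table["book"].items(), key=lambda kv: -kv[1]))
```

Because the loops only touch items that some customer actually bought together, the work stays far below the worst case when most customers have few purchases.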
21
Conclusion
• The paper was on the right track
• Appreciated applicability of clustering to e-commerce
• Hypothesis proved by experiment
• Failed to address or even predict scalability issues
22
References
• Authors’ Work
  – Yan, T., Jacobsen, M., Garcia-Molina, H., Dayal, U., ‘From User Access Patterns to Dynamic Hypertext Linking,’ In: Fifth International World Wide Web Conference, 1996 (Paris, France)
  – Paepcke, A., Garcia-Molina, H., Rodriguez, G. and Cho, J., ‘Beyond Document Similarity: Understanding Value-Based Search and Browsing Technologies’, In: Stanford University Technical Report, 2000
  – Delic, K. A. and Dayal, U., ‘Knowledge-Based Support Services: Monitoring and Adaptation,’ In: Proceedings of the 11th International Workshop on Database and Expert Systems Applications, IEEE Computer Society, 2000
23
References
• Related Work
  – Baraglia, R., Silvestri, F., Palmerini, P., ‘On-line Generation of Suggestions for Web Users’, In: Proceedings of IEEE International Conference on Information Technology: Coding and Computing, April 2004
  – Baraglia, R., Palmerini, P., ‘A web usage mining system’, In: Proceedings of IEEE International Conference on Information Technology: Coding and Computing, April 2002
  – Wenyin, L., Chen, Z., Lin, F., Zhang, H., Ma, W., ‘Ubiquitous Media Agents: A framework for managing personally accumulated multimedia files,’ 9th ACM international conference on multimedia, 2003 (Toronto, Canada)
  – Jiang, Z., Kleinrock, L., ‘Web prefetching in a mobile environment’, IEEE Personal Communications 5(5): 25–34, October 1998
24
References
  – Lou, W., Lu, H., Liu, G., Yiang, Q., ‘Restoring Meaningful Episodes in a Proxy Log’, 2001.
  – Ungar, L., Foster, D., ‘Clustering Methods For Collaborative Filtering’, In: AAAI Workshop On Recommendation Systems, 1998.
  – Linden, G., Smith, B., York, J., ‘Amazon.com Recommendations: Item-to-Item Collaborative Filtering’, In: IEEE Internet Computing, Vol. 7, No. 1, Jan 2003.