1

From User Access Patterns to Dynamic Hypertext Linking

Patrick Farrell, Siddharth Gudka, Mike Oxley, Simon Phillips

A Research Directions In Computing Presentation

T. Yan, M. Jacobsen, H. Garcia-Molina, U. Dayal

2

Agenda

• Introduction
• Some theory
• The paper
• A short critique
• After the paper
– Academic research
– The Authors’ work
• The technology in use today
• Conclusion
• Questions

3

Introduction

Hypothesis
That hyperlinks to unvisited and indirectly linked pages can be offered based upon pages the user has already visited

Experiment
a) to analyse log files to form clusters of commonly co-accessed pages
b) to categorize online users into the correct categories and offer appropriate links

4

Mass customisation

• Concept of adapting things to each user – on a large scale
• Economic benefit in adding value
• Satisfied shoppers also more likely to return
• What’s new?
– In the physical world, customisation doesn’t scale.
– Using technology and intelligent algorithms, it can.

5

Adaptive Web Sites

• Sites that automatically improve their organisation and presentation based on visitor access patterns
• We can cluster pages on a site together based on their co-occurrence frequency
– Likelihood that user will visit page P having visited Q
• For a user browsing the site, use session history to predict which pages a user may want to access – and so adapt site
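
The co-occurrence idea above can be sketched in a few lines. This is a minimal illustration, not the paper's method: it assumes sessions are already available as lists of page paths, and the function name `co_visit_likelihood` and the sample URLs are hypothetical.

```python
from collections import Counter
from itertools import permutations


def co_visit_likelihood(sessions):
    """Estimate P(visit p | visited q) as the share of sessions
    containing q that also contain p."""
    page_count = Counter()   # number of sessions in which each page appears
    pair_count = Counter()   # number of sessions containing both q and p
    for session in sessions:
        pages = set(session)                 # ignore repeat hits within a session
        page_count.update(pages)
        pair_count.update(permutations(pages, 2))  # ordered pairs (q, p), q != p
    return {(q, p): n / page_count[q] for (q, p), n in pair_count.items()}


# Hypothetical sessions from an academic site.
sessions = [
    ["/home", "/courses", "/exams"],
    ["/home", "/courses"],
    ["/home", "/staff"],
    ["/courses", "/exams"],
]
likelihood = co_visit_likelihood(sessions)
print(likelihood[("/exams", "/courses")])   # everyone who saw /exams also saw /courses
```

The estimate is asymmetric: a page visited by almost everyone (such as a home page) scores highly given any other page, which is one reason navigation pages need special handling.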

6

The Paper

• Yan et al. implement an adaptive web site, based on user access logs.
• Paper discusses different approaches to clustering and implementation
• Experimental data is presented
– validating the concept of clustering on an academic site
– showing the value added by an adaptive website using their technique
• The log analysis software used is published

7

The paper - Justification

• Use the metaphor of a shopper browsing an online shop

• Adaptive site can provide links to similar items to those being browsed
– e.g. “Male Yuppie” browsing executive toys
– Might also be interested in sportswear

• As site grows, static links to ‘related’ content more of a challenge - dynamic is much better

• Many practical examples today – but not 10 years ago!

8

The Paper – System Design

[Diagram: system architecture]
Offline components: Access logs, Preprocess, Cluster, User Categories
Online components: End user, URL, Web Server, Link Generator, HTML Documents, HTML with suggestions

9

The paper - Preprocessing

• For each user session
– form an n-dimensional vector of the pages visited
– can weight vector elements using a metric
   • Number of hits to page
   • Estimate of time spent on page (possibly normalised)
• ‘Close’ session vectors in n-dimensional space form a cluster
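
A minimal sketch of this preprocessing step, assuming hit count as the weighting metric and plain Euclidean distance as the measure of 'closeness'. The helper names (`session_vector`, `distance`), the page list, and the sample sessions are illustrative assumptions, not taken from the paper.

```python
import math


def session_vector(hits, pages, normalise=False):
    """Turn one session's page hits into an n-dimensional vector.

    `pages` fixes the dimension order; each element is the hit count
    for that page, optionally normalised so the vector sums to 1.
    """
    counts = {}
    for page in hits:
        counts[page] = counts.get(page, 0) + 1
    vec = [float(counts.get(page, 0)) for page in pages]
    if normalise:
        total = sum(vec)
        if total:
            vec = [v / total for v in vec]
    return vec


def distance(a, b):
    """Euclidean distance between two session vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical site with four pages and three sessions.
pages = ["/home", "/courses", "/exams", "/staff"]
s1 = session_vector(["/home", "/courses", "/courses", "/exams"], pages)
s2 = session_vector(["/home", "/courses", "/exams"], pages)
s3 = session_vector(["/home", "/staff", "/staff"], pages)

print(distance(s1, s2), distance(s1, s3))  # s1 is closer to s2 than to s3
```

Swapping hit counts for an estimate of time spent per page only changes how the vector elements are filled in; the distance computation stays the same.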

10

The paper - Clustering

• Different algorithms to cluster vectors by ‘closeness’
• Paper uses Leader algorithm – with additional constraints
– Constraint: Minimum hits in a valid session
– Constraint: Minimum cluster size
• Algorithm fast and memory efficient
– But not order invariant

11

Dynamic Link Generation

• Use session history to track pages a user has visited
– Authors buffered logs in memory using a database
– Sessions part of most web servers now
• Match partial vector of session with pre-calculated categories to build list of appropriate pages
– Partial vector, so Euclidean distance not necessarily appropriate
– May be better to simply count matching categories
• Filter the suggestion list to remove pages visited - and possibly any already adjacent in navigation tree
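
A minimal sketch of the online matching step, using the simple counting approach mentioned above rather than Euclidean distance. It assumes each pre-calculated category is represented as a set of pages; the function name `suggest_links`, its parameters, and the category data are hypothetical.

```python
def suggest_links(visited, categories, adjacent=(), max_links=3):
    """Pick the category sharing the most pages with the session so far,
    then suggest its pages that are neither visited nor already adjacent."""
    visited = set(visited)
    # Score each category by how many of its pages the user has already seen.
    scored = [(len(visited & pages), name) for name, pages in categories.items()]
    best_score, best_name = max(scored)   # ties fall to the later name alphabetically
    if best_score == 0:
        return []                         # no evidence yet, so offer nothing
    candidates = categories[best_name] - visited - set(adjacent)
    return sorted(candidates)[:max_links]


# Hypothetical categories produced by the offline clustering step.
categories = {
    "students": {"/courses", "/exams", "/timetable", "/library"},
    "staff": {"/staff", "/research", "/grants"},
}
links = suggest_links(
    visited=["/home", "/courses", "/exams"],
    categories=categories,
    adjacent={"/library"},   # already linked from the current page
)
print(links)
```

Counting overlaps works on a partial session because it never penalises the user for pages they have not reached yet, which a distance over the full vector would do.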

12

Paper – Experimental results

• Time spent on particular pages follows Zipfian distribution – not useful for page weight

• The authors present a number of experimental results about clustering algorithm parameters, e.g. min. cluster size

• Found clusters on academic website that were not evident from hypertext layout – so clustering serves purpose.

13

Critique

• Paper presents new concept of clustering web accesses – but essentially draws together existing work in other fields
• Makes key simplifications
– Ignores any web caching, proxies, etc
– Considering all pages in a session as being in a category is naïve – e.g. navigation pages, indexes, etc
• Weakness in experiments
– Authors invented nominal ‘sessions’ based on unique end-user addresses as server didn’t support sessions
– Only present data for one site
   • 2,709 sessions – of which 50% were in the same cluster!

14

Further Work

• Garcia-Molina
– Beyond Document Similarity: Understanding Value-Based Search and Browsing Technologies (2000)
   • Discusses judging value of web documents based on user behaviour
• Dayal
– Knowledge-Based Support Services: Monitoring and Adaptation (2000)
   • Discusses a Knowledge-Based Service deployed within HP to deliver customer support services.
   • System adapts based on observed user patterns and evolving needs

15

Related Work

• Web Prefetching (Jiang & Kleinrock, 1998)
– Addresses slow access speeds of World Wide Web
   • PREDICTION MODULE: Computes access probabilities.
   • THRESHOLD MODULE: Computes prefetch thresholds.
– Uses clustering to divide users into categories by access probability
• Restoring Meaningful Episodes in a Proxy Log (Lou et al. 2001)
– Extracting user’s activity information from proxy logs
– Classifies individual requests into meaningful semantic elements
– Semantics-based CUT-AND-PICK approach

16

Related Work

• SUGGEST (Baraglia et al. 2002, 2004)
– No off-line component
– Quality metric to estimate effectiveness of suggestions
• Media Agents (Wenyin et al. 2003)
– Automatic collection of semantic indices of multimedia data
– Semantic descriptions from content of documents
– User’s interaction refines semantic indices and suggests other multimedia data

17

Applications & The Paper

[Diagram: means and end]
Means: Uses clustering tech to analyse log files
End: To dynamically generate possibly interesting links
Custom application - Analog
Successful (to an extent)

18

1996-2005 Technology Directions

[Diagram: technologies grouped by approach]
• Clustering Documents: Vivisimo, Google Labs
• Collaborative Filtering: Amazon, Flickr, Tivo

19

Amazon.com

• Uses recommendation algorithm
– person who bought ‘x’ also bought ‘y’
• Item-to-item collaborative filtering
– provides recommendations based on grouped items, not customers

Essence:

For each item in product catalog, I1
    For each customer C who purchased I1
        For each item I2 purchased by customer C
            Record that a customer purchased I1 and I2
    For each item I2
        Compute the similarity between I1 and I2

20

Amazon.com

• Creates vectors where each vector is an item with M dimensions (customers)

• Similarity between two items computed by measuring cosine of angle between two vectors.

• Offline computation theoretically expensive: O(N²M) for N items and M customers

• In practice only O(NM) as most customers have few purchases.
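
A small sketch of item-to-item similarity as described on these slides: each item is a binary vector over customers, and similarity is the cosine of the angle between two such vectors. For binary vectors this reduces to (number of shared buyers) / sqrt(buyers of I1 × buyers of I2). The function name `item_similarities` and the purchase data are illustrative assumptions, not Amazon's implementation.

```python
import math
from collections import defaultdict


def item_similarities(purchases):
    """purchases: dict mapping customer -> set of items bought.
    Returns dict mapping (item1, item2) -> cosine similarity."""
    # Invert the data: which customers bought each item?
    buyers = defaultdict(set)
    for customer, items in purchases.items():
        for item in items:
            buyers[item].add(customer)

    # For each item I1, walk its customers and record every other item I2
    # they bought. Only pairs with at least one common buyer are touched.
    co_bought = defaultdict(int)
    for i1, customers in buyers.items():
        for customer in customers:
            for i2 in purchases[customer]:
                if i2 != i1:
                    co_bought[(i1, i2)] += 1

    # Cosine similarity of two binary customer vectors.
    return {
        (i1, i2): n / math.sqrt(len(buyers[i1]) * len(buyers[i2]))
        for (i1, i2), n in co_bought.items()
    }


# Hypothetical purchase history.
purchases = {
    "ann": {"book", "lamp"},
    "bob": {"book", "lamp", "pen"},
    "cat": {"book"},
}
sims = item_similarities(purchases)
print(round(sims[("book", "lamp")], 3))
```

Because the inner loops only visit items that some customer actually bought together, the work done tracks the number of purchases rather than every possible item pair, which is the intuition behind the practical cost being closer to O(NM).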

21

Conclusion

• The paper was on the right track

• Appreciated applicability of clustering to e-commerce

• Hypothesis proved by experiment

• Failed to address or even predict scalability issues

22

References

• Authors’ Work
– Yan, T., Jacobsen, M., Garcia-Molina, H., Dayal, U., ‘From User Access Patterns to Dynamic Hypertext Linking,’ In: Fifth International World Wide Web Conference, 1996 (Paris, France)
– Paepcke, A., Garcia-Molina, H., Rodriquez, G. and Cho, J., ‘Beyond Document Similarity: Understanding Value-Based Search and Browsing Technologies’, In: Stanford University Technical Report, 2000
– Delic, K. A. and Dayal, U., ‘Knowledge-Based Support Services: Monitoring and Adaptation,’ In: Proceedings of the 11th International Workshop on Database and Expert Systems Applications, IEEE Computer Society, 2000

23

References

• Related Work
– Baraglia, R., Silvestri, F., Palmerini, P., ‘On-line Generation of Suggestions for Web Users’, In: Proceedings of IEEE International Conference on Information Technology: Coding and Computing, April 2004
– Baraglia, R., Palmerini, P., ‘A web usage mining system’, In: Proceedings of IEEE International Conference on Information Technology: Coding and Computing, April 2002
– Wenyin, L., Chen, Z., Lin, F., Zhang, H., Ma, W., ‘Ubiquitous Media Agents: A framework for managing personally accumulated multimedia files,’ 9th ACM international conference on multimedia, 2003 (Toronto, Canada)
– Jiang, Z., Kleinrock, L., ‘Web prefetching in a mobile environment’, IEEE Personal Communications 5(5): 25–34, October 1998

24

References

– Lou, W., Lu, H., Liu, G., Yiang, Q., ‘Restoring Meaningful Episodes in a Proxy Log’, 2001.

– Ungar, L., Foster, D., ‘Clustering Methods For Collaborative Filtering’, In: AAAI Workshop On Recommendation Systems, 1998.

– Linden, G., Smith, B., York, J., ‘Amazon.com Recommendations: Item-to-Item Collaborative Filtering’, In: IEEE Internet Computing, Vol. 7, No. 1, Jan 2003.
