Chapter 7 Web Usage Mining Part I L. Malak Bagais

Post on 17-Feb-2016


Page 1: Chapter 7 Web Usage  Mining Part I

Chapter 7 Web Usage Mining

Part I

L. Malak Bagais

Page 2: Chapter 7 Web Usage  Mining Part I

Its main goal is to discover usage patterns from web data in order to understand and better serve the needs of web-based applications.

Web Usage Mining

Page 3: Chapter 7 Web Usage  Mining Part I

Web usage mining consists of three phases:
- Preprocessing
- Pattern discovery
- Pattern analysis

Web Usage Mining

Page 4: Chapter 7 Web Usage  Mining Part I

Generated by users’ interaction with the Web, data sources include:
- web-server access logs
- proxy-server logs
- browser logs
- user profiles
- registration data
- user sessions and transactions
- cookies
- user queries
- bookmark data
- mouse clicks and scrolls

Web-Usage Mining

Page 5: Chapter 7 Web Usage  Mining Part I

A server log:
- is a set of files recording the details of the activity performed by a server
- is created and maintained automatically by the server

The World Wide Web Consortium (W3C) has specified a standard format for web-server log files. There are also other, proprietary formats for web-server logs.

Web-Log Processing

Page 6: Chapter 7 Web Usage  Mining Part I

Most web logs contain:
- IP address of the client making the request
- date and time of the request
- URL of the requested page
- number of bytes sent to serve the request
- user agent (such as a web browser or web crawler)
- referrer (the URL that triggered the request)

Logs can all be stored in one file. A better alternative is to separate them into:
- access log
- error log
- referrer log

Web-Log Processing

Page 7: Chapter 7 Web Usage  Mining Part I

Common log format (http://www.w3.org/Daemon/User/Config/Logging.html#common-logfile-format)

Format of Web Logs

Page 8: Chapter 7 Web Usage  Mining Part I

140.14.6.11 - pawan [06/Sep/2001:10:46:07 -0300] "GET /s.htm HTTP/1.0" 200 2267

140.14.7.18 - raj [06/Sep/2001:11:23:53 -0300] "POST /s.cgi HTTP/1.0" 200 499

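The first entry is a GET request that retrieves the file s.htm. The second entry is a POST request that sends data to the program s.cgi.

Fields (values from the first entry):
- client machine’s IP address (140.14.6.11)
- RFC 1413 identity of the client, missing here (-)
- user ID of the authenticated user (pawan)
- date and time of the request, with time-zone offset
- request line (method, URL, protocol)
- status code (200 indicates success)
- number of bytes transferred (2267)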

Examples of Common Log Format
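
The field layout above can be sketched as a small parser. This is a minimal illustration, not a reference implementation: the regular expression, the function name parse_clf_line and the dictionary keys are assumptions made for this example, and the sample line is the first entry shown on this slide.

```python
import re
from datetime import datetime

# One Common Log Format line:
# host ident authuser [date] "request" status bytes
# (pattern and group names are assumptions made for this sketch)
CLF_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-)'
)


def parse_clf_line(line):
    """Parse one Common Log Format line into a dict, or return None if it does not match."""
    match = CLF_PATTERN.match(line.strip())
    if match is None:
        return None
    entry = match.groupdict()
    # "-" marks a missing value, e.g. no RFC 1413 identity.
    for key in ("ident", "user"):
        if entry[key] == "-":
            entry[key] = None
    entry["time"] = datetime.strptime(entry["time"], "%d/%b/%Y:%H:%M:%S %z")
    entry["status"] = int(entry["status"])
    entry["bytes"] = 0 if entry["bytes"] == "-" else int(entry["bytes"])
    # The request line has the form "METHOD URL PROTOCOL".
    parts = entry["request"].split()
    entry["method"] = parts[0] if parts else None
    entry["url"] = parts[1] if len(parts) > 1 else None
    return entry


sample = '140.14.6.11 - pawan [06/Sep/2001:10:46:07 -0300] "GET /s.htm HTTP/1.0" 200 2267'
entry = parse_clf_line(sample)
print(entry["host"], entry["user"], entry["method"], entry["url"], entry["status"], entry["bytes"])
```

Lines that do not match the pattern return None rather than raising, so a caller can count and skip malformed entries.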

Page 9: Chapter 7 Web Usage  Mining Part I

An example of a log file in extended format

Examples of Common Log Format

Page 10: Chapter 7 Web Usage  Mining Part I

#Version: version of the extended log file format used
#Fields: fields recorded in the log
#Software: software that generated the log
#Start-Date: date and time at which the log was started
#End-Date: date and time at which the log was finished
#Date: date and time at which the entry was added
#Remark: comments that are ignored by analysis tools

Format of Web Logs

Page 11: Chapter 7 Web Usage  Mining Part I

The directives #Version and #Fields are mandatory and must appear before all the entries.

Each field in the #Fields directive can be specified in one of the following ways:
- an identifier; e.g., time
- an identifier with a prefix, separated by a hyphen; e.g., cs-method
- a prefix followed by a header in parentheses; e.g., sc(Content-type)

Format of Web Logs
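
A rough sketch of how a reader of the extended format might use the directives: collect the # lines, take the field names from #Fields, and refuse entries that arrive before #Version and #Fields. The function name parse_extended_log and the sample log text are made up for illustration, and splitting entries on whitespace is a simplification that ignores quoted strings containing spaces.

```python
def parse_extended_log(text):
    """Return (directives, entries) for an extended-format log given as a string."""
    directives = {}
    fields = None
    entries = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("#"):
            # Directive lines look like "#Name: value".
            name, _, value = line[1:].partition(":")
            name, value = name.strip(), value.strip()
            directives[name] = value
            if name == "Fields":
                fields = value.split()
            continue
        # Both mandatory directives must precede the first entry.
        if "Version" not in directives or fields is None:
            raise ValueError("#Version and #Fields must appear before entries")
        # Simplification: values are assumed to contain no spaces.
        entries.append(dict(zip(fields, line.split())))
    return directives, entries


# Illustrative sample, not taken from a real server.
sample_log = """#Version: 1.0
#Date: 12-Jan-1996 00:00:00
#Fields: time cs-method cs-uri
00:34:23 GET /foo/bar.html
12:21:16 GET /foo/bar.html
"""

directives, entries = parse_extended_log(sample_log)
print(directives["Version"], len(entries), entries[0]["cs-uri"])
```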

Page 12: Chapter 7 Web Usage  Mining Part I

No prefixes for: date, time, time-taken, bytes, cached

Prefixes for: ip, dns, status, comment, method, uri, uri-stem, uri-query, host

Prefixes can be:
- cs: client to server
- sc: server to client
- sr: server to remote server (this prefix is used by proxies)
- rs: remote server to server (this prefix is used by proxies)
- x: application-specific identifier

Format of Web Logs
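
The three ways of writing a field, together with the prefix table, could be interpreted along these lines. This is only a sketch: the names PREFIXES, NO_PREFIX and describe_field are hypothetical, and the prefix table covers only the prefixes listed on this slide.

```python
import re

# Prefix meanings as listed on the slide.
PREFIXES = {
    "cs": "client to server",
    "sc": "server to client",
    "sr": "server to remote server",
    "rs": "remote server to server",
    "x": "application-specific identifier",
}

# Identifiers that are written without a prefix.
NO_PREFIX = {"date", "time", "time-taken", "bytes", "cached"}

# Matches the prefix(header) form, e.g. sc(Content-type).
HEADER_FORM = re.compile(r"^(?P<prefix>[a-z]+)\((?P<header>[^)]+)\)$")


def describe_field(field):
    """Split one #Fields entry into (prefix, name, kind)."""
    # Checked first, because "time-taken" contains a hyphen but has no prefix.
    if field in NO_PREFIX:
        return (None, field, "identifier")
    header = HEADER_FORM.match(field)
    if header:
        return (header.group("prefix"), header.group("header"), "header")
    prefix, sep, name = field.partition("-")
    if sep and prefix in PREFIXES:
        return (prefix, name, "prefixed identifier")
    return (None, field, "unknown")


for field in ["time", "time-taken", "cs-method", "cs-uri-stem", "sc(Content-type)"]:
    print(field, describe_field(field))
```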

Page 13: Chapter 7 Web Usage  Mining Part I

Analyzing Web logs

Page 14: Chapter 7 Web Usage  Mining Part I

General Summary from Analog

Analyzing Web Logs

Page 15: Chapter 7 Web Usage  Mining Part I

Monthly report from Analog

Analyzing Web Logs

Page 16: Chapter 7 Web Usage  Mining Part I

Daily summary from Analog

Analyzing Web Logs

Page 17: Chapter 7 Web Usage  Mining Part I

Hourly summary from Analog

Page 18: Chapter 7 Web Usage  Mining Part I

Analyzing Web Logs

Page 19: Chapter 7 Web Usage  Mining Part I

Organization report from Analog

Page 20: Chapter 7 Web Usage  Mining Part I

Search-word report from Analog

Page 21: Chapter 7 Web Usage  Mining Part I

Operating-system report from Analog

Page 22: Chapter 7 Web Usage  Mining Part I

Status-code report from Analog

Page 23: Chapter 7 Web Usage  Mining Part I

File size report from Analog

Page 24: Chapter 7 Web Usage  Mining Part I

File type report from Analog

Page 25: Chapter 7 Web Usage  Mining Part I

Directory report from Analog

Page 26: Chapter 7 Web Usage  Mining Part I

Request report from Analog

Page 27: Chapter 7 Web Usage  Mining Part I

Analysis of Clickstream: Studying Navigation Paths

Page 28: Chapter 7 Web Usage  Mining Part I

Clickstream using Pathalizer with a seven-link specification

Analysis of Clickstream: Studying Navigation Paths

Page 29: Chapter 7 Web Usage  Mining Part I

Clickstream using Pathalizer with a twenty-link specification

Analysis of Clickstream: Studying Navigation Paths

Page 30: Chapter 7 Web Usage  Mining Part I

A brief on-campus session identified by StatViz that browses the bulletin board

Visualizing Individual User Sessions

Page 31: Chapter 7 Web Usage  Mining Part I

A brief off-campus session identified by StatViz with three distinct activities

Visualizing Individual User Sessions

Page 32: Chapter 7 Web Usage  Mining Part I

A long on-campus session identified by StatViz with multiple activities

Visualizing Individual User Sessions

Page 33: Chapter 7 Web Usage  Mining Part I

Requests may not always reach the server, as they may be served from a proxy server’s cache.

You do not really know:
- Identity of readers
- Number of visitors
- Number of visits
- User’s navigation path through the site
- Entry point and referral
- How users left the site or where they went next
- How long people spent reading each page
- How long people spent on the site

Caution in Interpreting Web-Access Logs
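
To make the caution concrete, the sketch below counts requests and distinct client IP addresses per URL. The function name requests_and_clients and the sample data are assumptions for illustration. Neither number is a count of readers: cached copies never reach the log, and one IP address may stand for a proxy, a shared machine, or several addresses used by one person.

```python
from collections import defaultdict


def requests_and_clients(entries):
    """Per URL, return (number of requests, number of distinct client IPs).

    entries is an iterable of (ip, url) pairs taken from an access log.
    """
    requests = defaultdict(int)
    clients = defaultdict(set)
    for ip, url in entries:
        requests[url] += 1
        clients[url].add(ip)
    return {url: (requests[url], len(clients[url])) for url in requests}


# Made-up sample: the same client requests /s.htm twice.
sample_entries = [
    ("140.14.6.11", "/s.htm"),
    ("140.14.6.11", "/s.htm"),
    ("140.14.7.18", "/s.htm"),
    ("140.14.7.18", "/s.cgi"),
]

summary = requests_and_clients(sample_entries)
for url, (n_requests, n_clients) in summary.items():
    print(url, "requests:", n_requests, "distinct IPs:", n_clients)
```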

Page 34: Chapter 7 Web Usage  Mining Part I

I’ve presented a somewhat negative view here, emphasizing what you can’t find out. Web statistics are still informative: it’s just important not to slip from “this page has received 30,000 requests” to “30,000 people have read this page.” In some sense these problems are not really new to the web—they are present just as much in print media too. For example, you only know how many magazines you’ve sold, not how many people have read them. In print media we have learnt to live with these issues, using the data which are available, and it would be better if we did on the Web too, rather than making up spurious numbers.

Turner (2004)