
Page 1: Web Mining

Page 1

By

Rami Shawkat Hatem Al-Salman

Advisor

Dr.Natheer Khasawneh

Co-Advisor

Dr. Ahmad Al-Hammouri

MINING CLIENT SIDE PARADATA FOR ADAPTIVE WEBPAGES

Page 2: Web Mining

Page 2

Contents

Introduction.

Server log data.

Client data.

Framework for collecting and mining client side data.

Three case studies.

Results and Discussions.

Conclusions.

Future Work.

Page 3: Web Mining

Page 3

Introduction

In recent years, a large number of websites have been published.

Current web applications aim to interact with users through rich and dynamic content.

JavaScript has evolved to interact not only with the client side but also with the server side; thus, Asynchronous JavaScript and XML (AJAX) was introduced.

Web personalization is applied by many websites.

Page 4: Web Mining

Page 4

Web personalization

Web personalization aims to tailor a website to each user's specific needs and domain.

Many websites use recommender systems to support web personalization.

Webpages are personalized based on client preferences (e.g., interests, country, gender, etc.).

Page 5: Web Mining

Page 5

AMAZON & Web personalization

Amazon uses a recommender system that relies on the collaborative filtering technique to produce personal recommendations.

Personal (client) recommendations are generated by computing the similarity between the client's preferences and those of other clients.

The collaborative filtering technique consists of three steps (a minimal sketch follows this list):

Record the preferences of a group of clients.

Choose the group of clients whose preferences are similar to the target client, using a similarity metric.

Recommend options (i.e., products) to the target client.
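The sketch below illustrates these three steps with user-based collaborative filtering and cosine similarity; the data layout, function names, and the choice of cosine similarity are assumptions for illustration only, not details of Amazon's actual system.

```python
# Hypothetical sketch of user-based collaborative filtering (illustrative only).
import math

def cosine_similarity(a, b):
    # a, b: dicts mapping product id -> preference score
    common = set(a) & set(b)
    num = sum(a[p] * b[p] for p in common)
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

def recommend(target, all_clients, top_k=3):
    # Step 1: preferences are already recorded in all_clients (client id -> ratings).
    # Step 2: choose the clients most similar to the target.
    neighbors = sorted(all_clients.items(),
                       key=lambda kv: cosine_similarity(target, kv[1]),
                       reverse=True)[:top_k]
    # Step 3: recommend products the neighbors liked but the target has not seen.
    scores = {}
    for _, prefs in neighbors:
        for product, rating in prefs.items():
            if product not in target:
                scores[product] = scores.get(product, 0.0) + rating
    return sorted(scores, key=scores.get, reverse=True)
```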

Page 6: Web Mining

Page 6

AMAZON as a real example

Recommendations based on browsing history.

Recommendations based on the preferences of people with a similar profile.

Page 7: Web Mining

Page 7

AMAZON as a real example

Recommendations based on the most recently viewed items.

Page 8: Web Mining

Page 8

Server log data

A server log is a file that contains records (vectors of data) written by the web server.

Analyzing server logs helps in understanding clients' behavior (e.g., the pages with the most and least traffic). A small parsing sketch follows the example entry below.

Entry name Server Log Info

IP-Address 178.77.146.157

date [03/Jan/2011:15:20:06 -0800]

request "GET/default.ASPX HTTP/1.0"

status 200

bytes 8788

referrer http://www.just.edu.jo

agent "Mozilla/3.0WebTV/1.2 (compatible; MSIE 2.0)"

Page 9: Web Mining

Page 9

Apache server access.log

Page 10: Web Mining

Page 10

Client data

Client data is data recorded from the client's navigation over the elements of the visited webpage.

Client data captures the interactions between the client and the elements of the visited webpage.

For example, the name, value, and time spent on a specific webpage element can be recorded.

Entry name Client Info

Element name DIV1

Element value Yes

Spent time 156.77 seconds

IP-Address 178.77.146.157

date [03/Jan/2011:15:20:06 -0800]

request "GET/default.ASPX HTTP/1.0"

status 200

bytes 8788

referrer http://www.just.edu.jo

agent "Mozilla/3.0WebTV/1.2 (compatible; MSIE 2.0)"

Page 11: Web Mining

Page 11

Client data example

Page 12: Web Mining

Page 12

Problem statement

Most previous studies have focused on server log data.

These studies used Web Usage Mining (WUM) techniques to extract knowledge from this data.

Some tools and systems have been proposed for tracking client data.

However, the previous studies related to client data have not shown its usefulness.

Unfortunately, until now there has been no complete framework that can record and mine client log data.

Page 13: Web Mining

Page 13

Motivations

Some entries can be extracted from the client's mouse movements over the visited webpage.

Extracting useful knowledge from client data helps us understand clients' behaviors and attitudes in a better way.

It supports clients with appropriate recommendations.

Understanding clients' behaviors and needs will improve the advertising of products on the WWW.

Page 14: Web Mining

Page 14

Contributions

Until now there has been no complete framework that can record and mine client data.

Thus, the main contribution of this thesis is building a complete framework that records clients' events and applies WUM techniques to this data.

We mainly show the usefulness of the client data.

• We customize the client data and then apply WUM techniques to it.

• We build three different web applications and integrate our framework with them.

• We build a recommendation engine that is able to discover clients' patterns.

• We extract useful information from the client data.

We generate a client data model based on client data statistics.

Page 15: Web Mining

Page 15

Framework for collecting and mining client side data

We propose a framework to record and mine client-side data.

Our framework consists of five phases:

Session identification.

Events identification and catching.

Events storing.

Merging and exporting events.

Web mining.

Page 16: Web Mining

Page 16

Framework for collecting and mining client side data

Page 17: Web Mining

Page 17

Session identification

Once a client requests a webpage, a session id is assigned to that client.

The session id is the number of milliseconds since midnight, January 1, 1970; in this way, the session id assigned to each client is practically unique (a minimal sketch follows).

The generated session id is used to identify all recorded events that belong to the same user.

The client's session is ended by a designated (target) button or link.
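A minimal sketch of the idea, assuming the id is taken from the system clock in milliseconds; the thesis presumably generates this value in JavaScript on the client, so this Python version and its function name are purely illustrative.

```python
# Illustrative sketch: a session id derived from milliseconds since the Unix epoch.
import time

def new_session_id() -> int:
    # Equivalent in spirit to JavaScript's Date.now(); collisions are unlikely
    # but not impossible if two clients load the page in the same millisecond.
    return int(time.time() * 1000)
```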

Page 18: Web Mining

Page 18

Events identification and recording

We identify the web elements and their associated events.

The client data, together with the session id, is transferred via an XMLHttpRequest (AJAX) call.

With AJAX, transferring the data is a lightweight operation (clients do not notice when data is transferred to the server).

Seven values are recorded: name, value, item time, session id, date, total mouse clicks, and Personalized (a sketch of the stored record follows).

Personalized represents the web element that finishes the session.
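The sketch below shows one plausible shape for the stored record and a server-side insert; the table name, column names, and use of SQLite are assumptions (the thesis application appears to run on ASP.NET), so this is purely illustrative.

```python
# Illustrative sketch only: one plausible shape for a stored client event record.
import sqlite3
from dataclasses import dataclass

@dataclass
class ClientEvent:
    name: str          # web element name, e.g. "DIV1"
    value: str         # element value, e.g. "Yes"
    item_time: float   # time spent on the element (time-based events), in seconds
    session_id: int    # milliseconds-since-epoch id assigned on page load
    date: str          # request timestamp
    total_clicks: int  # running count of mouse clicks in the session
    personalized: str  # name of the element that ended the session, if any

def store_event(db: sqlite3.Connection, e: ClientEvent) -> None:
    # assumes a client_events table with matching columns already exists
    db.execute(
        "INSERT INTO client_events "
        "(name, value, item_time, session_id, date, total_clicks, personalized) "
        "VALUES (?, ?, ?, ?, ?, ?, ?)",
        (e.name, e.value, e.item_time, e.session_id, e.date,
         e.total_clicks, e.personalized),
    )
    db.commit()
```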

Page 19: Web Mining

Page 19

Events identification and recording (cont.)

Our events are classified into two categories:

Clickstream-based.

Time-based.

In the clickstream-based category, the name and value of the clicked element are transferred.

In the time-based category, the name, the value, and the time spent on the web element are transferred.

Page 20: Web Mining

Page 20

Snapshot of clickstream-based data (Events storing)

Page 21: Web Mining

Page 21

Snapshot of time-based data (Events storing)

Page 22: Web Mining

Page 22

Merging and Exporting data

The records are grouped per client session (session id).

Our merging algorithm works as follows (see the sketch after this list):

1. Load the list of session ids.

2. For each session id:

i. If the data is clickstream-based, accumulate the sequence of clicks.

ii. If the data is time-based, accumulate the time spent over each element.

The merged data is exported to another database table.

The output of this phase is the input for the web mining phase.
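A minimal sketch of the merging step, assuming each stored event is a dict with "session_id", "name", and "item_time" keys (these names are illustrative, not the thesis's actual schema).

```python
# Minimal sketch of the merging phase described above.
from collections import defaultdict

def merge_events(events, mode="clickstream"):
    """events: iterable of dicts with keys 'session_id', 'name', 'item_time'."""
    merged = defaultdict(list) if mode == "clickstream" else defaultdict(dict)
    for e in events:
        sid = e["session_id"]
        if mode == "clickstream":
            # accumulate the ordered sequence of clicked element names
            merged[sid].append(e["name"])
        else:
            # accumulate the total time spent over each element
            merged[sid][e["name"]] = merged[sid].get(e["name"], 0.0) + e["item_time"]
    return dict(merged)
```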

Page 23: Web Mining

Page 23

Snapshot of merging data in clickstream-based

Page 24: Web Mining

Page 24

Snapshot of merging data in time-based

Page 25: Web Mining

Page 25

Web Mining

As in every data mining task, the process of Web Usage Mining consists

of three steps:

• Data preprocessing.

• Pattern discovery and web mining.

• Information and Pattern analysis.

Page 26: Web Mining

Page 26

Data preprocessing

The preprocessing (data cleaning) step aims to remove irrelevant data and keep only consistent data.

The preprocessing is carried out based on thresholds.

We mainly use two thresholds (a filtering sketch follows this list):

– The total session time.

– The total number of visited elements.
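An illustrative threshold filter; the threshold values and the direction of the comparison (keeping sessions that meet the minimums) are assumptions made for the sketch.

```python
# Illustrative threshold-based preprocessing; values and comparison direction assumed.
def preprocess(sessions, min_total_time=200.0, min_elements=10):
    # sessions: dict of session_id -> {"total_time": float, "elements": list}
    return {
        sid: s
        for sid, s in sessions.items()
        if s["total_time"] >= min_total_time and len(s["elements"]) >= min_elements
    }
```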

Page 27: Web Mining

Page 27

Pattern discovery and web mining

Page 28: Web Mining

Page 28

Information and Pattern analysis

In most cases, analyzing the generated patterns and information allows us to understand clients' behavior more deeply.

The output of this step can be formulated in many forms.

One of the most important forms is a generated model, which is usually extracted from statistics (e.g., frequencies).

Page 29: Web Mining

Page 29

Three case studies

To validate the proposed framework, we integrated it with three different web applications.

The three web applications are:

1. A web-based editor control (TinyMCE).

2. An E-commerce web application.

3. An E-survey web application.

The three web applications are hosted online.

Page 30: Web Mining

Page 30

TinyMCE

TinyMCE is a platform-independent, web-based JavaScript HTML editor control.

We modified the TinyMCE source code to integrate the proposed framework with it.

The events of TinyMCE belong to general data (i.e., clickstream-based data).

We applied data mining to cluster and discover the clients' sequence patterns.

Finally, we classify the clustered output.

Page 31: Web Mining

Page 31

Snapshot of TinyMCE

Page 32: Web Mining

Page 32

Data Collection

As a source of data, 60 students from JUST in the CPE 411 and CPE 311 classes were asked to use our system.

We asked the students to use TinyMCE to write an advertisement about JUST encouraging students from European Union (EU) countries to study at JUST.

The click events are recorded.

The events are merged in the general data mode.

The merged data is the input for the data preprocessing step.

Page 33: Web Mining

Page 33

Snapshot of merged data

Page 34: Web Mining

Page 34

Data Preprocessing

The collected data was preprocessed by removing invalid sequences.

The invalid sequences were determined based on two thresholds:

1. The number of clicked controls.

2. The total session time spent in the sequence.

Heuristically, we used 10 clicks as the first threshold and 200 seconds as the second threshold.

The data preprocessing step reduced the total number of sequences to 36 (24 sequences were removed).

Page 35: Web Mining

Page 35

Clustering

We separated the students' sequences into clusters of similar clickstream sequences.

We applied the K-means clustering technique with heuristic numbers of clusters equal to two, three, and four.

We used edit distance as the distance measure to calculate the similarity or dissimilarity between any two objects relative to the mean point.

The main goal of clustering is to label the students' sequences (a clustering sketch follows).

The points in the figure represent the students' sequences.
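A rough sketch of the idea, assuming sequences are lists of element names. Note that K-means with an edit-distance metric is usually realized as a k-medoids-style procedure, so this is an approximation of the step described above, not the exact code used in the thesis.

```python
# Rough sketch: cluster clickstream sequences using edit distance (k-medoids style).
import random

def edit_distance(a, b):
    # classic dynamic-programming Levenshtein distance over element names
    dp = [[i + j if i * j == 0 else 0 for j in range(len(b) + 1)]
          for i in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            dp[i][j] = min(dp[i - 1][j] + 1, dp[i][j - 1] + 1,
                           dp[i - 1][j - 1] + (a[i - 1] != b[j - 1]))
    return dp[len(a)][len(b)]

def cluster_sequences(seqs, k=3, iters=10, seed=0):
    random.seed(seed)
    medoids = random.sample(seqs, k)
    labels = [0] * len(seqs)
    for _ in range(iters):
        # assign each sequence to its nearest medoid
        labels = [min(range(k), key=lambda c: edit_distance(s, medoids[c]))
                  for s in seqs]
        # re-pick each medoid as the member minimizing total distance to its cluster
        for c in range(k):
            members = [s for s, l in zip(seqs, labels) if l == c]
            if members:
                medoids[c] = min(members,
                                 key=lambda m: sum(edit_distance(m, x) for x in members))
    return labels
```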

Page 36: Web Mining

Page 36

Pattern discovery

The clustered sequences are used as input to the pattern discovery algorithm.

We applied the Generalized Sequential Pattern (GSP) algorithm to extract the patterns from each cluster (a simplified sketch follows).

GSP not only discovers the pattern sequences but also preserves the order of these patterns.

The output of GSP is the top ten patterns for each cluster.

These patterns are used later in the classification step.
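A heavily simplified sketch of support counting for ordered patterns; real GSP grows candidates level by level with pruning, so this brute-force version is illustrative only, and the function name and parameters are assumptions.

```python
# Simplified support counting for ordered patterns, in the spirit of GSP.
# Only practical for short sequences and small max_len; real GSP prunes candidates.
from itertools import combinations
from collections import Counter

def top_patterns(sequences, max_len=3, top_n=10, min_support=2):
    counts = Counter()
    for seq in sequences:
        seen = set()
        for length in range(1, max_len + 1):
            # combinations preserve the original click order within each sequence
            for pat in combinations(seq, length):
                seen.add(pat)
        counts.update(seen)  # each pattern counted at most once per sequence
    frequent = [(p, c) for p, c in counts.items() if c >= min_support]
    return sorted(frequent, key=lambda pc: -pc[1])[:top_n]
```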

Page 37: Web Mining

Page 37

Classification

The output of the clustering step was used as input to the classification models.

The total session time, the number of controls, and the clickstream sequence are used as the three features for our classification models.

The classification models are trained based on these features and data (a sketch follows).

We use two classifiers: Naive Bayes and Support Vector Machines.

After the training phase, our classifiers were able to classify new clients into one of the two, three, or four classes.
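A minimal scikit-learn sketch of this step; the dictionary keys and the reduction of the clickstream sequence to its length are simplifying assumptions, not the thesis's exact feature encoding.

```python
# Minimal scikit-learn sketch of the classification step (feature encoding assumed).
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

def build_features(sessions):
    # sessions: list of dicts with "total_time", "n_controls", "sequence", "cluster"
    X = [[s["total_time"], s["n_controls"], len(s["sequence"])] for s in sessions]
    y = [s["cluster"] for s in sessions]  # cluster label from the clustering step
    return X, y

def train_classifiers(sessions):
    X, y = build_features(sessions)
    return {
        "naive_bayes": GaussianNB().fit(X, y),
        "svm": SVC().fit(X, y),
    }  # models["svm"].predict(new_X) assigns new clients to one of the classes
```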

Page 38: Web Mining

Page 38

E-commerce system

In the second case study, an E-commerce web application was built from scratch.

We integrated our framework with it.

Our E-commerce system offers two categories of products: cameras and mobiles.

The main goal of this web application is to show that similar clients can be classified easily and directly.

Each product has seven features.

Page 39: Web Mining

Page 39

Snapshot of the E-commerce system for mobiles

Page 40: Web Mining

Page 40

Snapshot of the E-commerce system for cameras

Page 41: Web Mining

Page 41

Data Collection

As a source of data we depended on three groups:

• Students from JUST.

• Students from Heinrich Heine University Düsseldorf (Germany).

• Social network websites (Facebook, Myspace, etc.).

We record the events.

The events are merged in the time-based mode.

In the time-based mode, the times spent over any cell within a specific user session are aggregated.

Based on our database statistics, 58 clients bought cameras and 54 clients bought mobiles.

Page 42: Web Mining

Page 42

Snapshot of merged data in time-based mode

Page 43: Web Mining

Page 43

Data Preprocessing

The total session time and the number of visited features are used as two

thresholds.

Based on our experiments, we set total session time to be 20 and number

of visited features to be 7.

Based on these thresholds:

– For Cameras data, 40 clients transactions are pruned, and the remaining

clients transactions were 18.

– For Mobiles data, 35 clients transactions are pruned, and the remaining

clients transactions were 20.

Page 44: Web Mining

Page 44

Classification

In the time-based data mode, classification models can be applied directly to the preprocessed data.

Each client transaction is labeled by the "buy product" button that was clicked (e.g., the client who bought camera #1).

The aggregated times spent over 28 features (4 products × 7 features) are used as the main features.

Our classification models are trained on the preprocessed time-based data.

We use three classifiers: Naive Bayes, Support Vector Machines, and a Decision Tree (the C4.5 algorithm).

Page 45: Web Mining

Page 45

E-survey

In the third case study, an E-survey web application was built from scratch.

We integrated our framework with it.

The E-survey is a simple web application that allows students to assess lecturers through both multiple-choice and essay questions.

The main goal of the E-survey is to understand students' attitudes and behavior.

The E-survey webpage consists of twelve questions (eleven multiple-choice questions and one essay question).

Each multiple-choice question consists of four options (cannot do it at all, weak, good, and very good).

Page 46: Web Mining

Page 46

Snapshot of E-Survey

Page 47: Web Mining

Page 47

Data Collection

As a source of data we depended on three groups:

• Students from the Yarmouk University accounting class.

• Students from the Jadara University computer skills class.

• Students from the Philadelphia University design class.

We record the events.

The events are merged in the time-based mode.

In the time-based mode, the times spent over any question within a specific user session are aggregated.

Based on our database statistics, 101 students assessed their lecturers:

– 37 students from Yarmouk University, 38 students from Philadelphia University, and 26 students from Jadara University.

Page 48: Web Mining

Page 48

Data Preprocessing

The total session time and the number of visited questions are used as two thresholds.

Based on our experiments, we set the total session time threshold to 25 and the number of visited questions to 12.

Based on these thresholds, 11 student transactions were discarded from the student database.

– The remaining transactions are 90.

Page 49: Web Mining

Page 49

Snapshot of preprocessed data

Page 50: Web Mining

Page 50

Classification

The aggregated times spent over the 12 questions are used as the 12 main features.

In the E-survey, the recorded transactions are not labeled directly.

Labeling is done by a flag question.

Our classification models are trained on the preprocessed time-based data.

We use three classifiers: Naive Bayes, Support Vector Machines, and a Decision Tree (the C4.5 algorithm).

Page 51: Web Mining

Page 51

The students' data model (exponential)

Figure: Questions-Freq plot; x-axis: time in seconds (1-58), y-axis: number of questions (0-450).

Page 52: Web Mining

Page 52

Evaluation

For evaluation purposes, we use three well-known measures that are widely used in information retrieval: 1. Precision, 2. Recall, 3. F-measure.

The False Positive (FP) and False Negative (FN) rates are used for evaluating the errors of the classification models.

For testing purposes, the classifiers are tested in two modes (a sketch follows this list):

– The training dataset method.

– The 5-fold cross-validation method.

The training dataset method uses the same dataset for both training and testing.

The 5-fold cross-validation method divides the dataset into five subsets; one of them is used for testing and the remaining subsets for training.
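An illustrative scikit-learn sketch of the two testing modes; macro averaging of the metrics is an assumption, not necessarily the averaging used in the thesis.

```python
# Illustrative sketch of the two evaluation modes (metric averaging assumed).
from sklearn.model_selection import cross_validate
from sklearn.metrics import precision_recall_fscore_support

def five_fold_scores(clf, X, y):
    res = cross_validate(clf, X, y, cv=5,
                         scoring=("precision_macro", "recall_macro", "f1_macro"))
    return {k: v.mean() for k, v in res.items() if k.startswith("test_")}

def training_dataset_scores(clf, X, y):
    # "training dataset method": train and test on the same data
    clf.fit(X, y)
    p, r, f, _ = precision_recall_fscore_support(y, clf.predict(X), average="macro")
    return {"precision": p, "recall": r, "f_measure": f}
```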

Page 53: Web Mining

Page 53

5 folds cross-validation method

Green: training subsets. Red: testing subset.

Page 54: Web Mining

Page 54

Results-TinyMCE

Figure: Precision, Recall, and F-Measure values for NB and DT with 2, 3, and 4 clusters, using 5-fold cross-validation.

Page 55: Web Mining

Page 55

Results-TinyMCE

Figure: False Negative (FN) and False Positive (FP) values for NB and DT with 2, 3, and 4 clusters, using 5-fold cross-validation.

Page 56: Web Mining

Page 56

Results E-Survey

Figure: Precision, Recall, and F-Measure values for DT, Naive Bayes, and SVM, using the training dataset and 5-fold cross-validation.

Page 57: Web Mining

Page 57

Results E-Survey

Figure: FN and FP values for DT, Naive Bayes, and SVM, using the training dataset and 5-fold cross-validation.

Page 58: Web Mining

Page 58

Conclusions

Client data is very useful.

Client data is flexible to mine.

Client data can take multiple forms.

Clustering can be used to label unlabeled client transactions.

Classification is very practical on client data.

Our complete framework helps to improve clients' experiences.

Our classification models show the ability to classify with a high accuracy rate.

Page 59: Web Mining

Page 59

Future Work

We look forward to dealing with more client data, such as the x and y coordinates of the mouse.

We aim to develop new clustering and classification techniques that can deal efficiently with client data.

We will extract more knowledge from client data.

Page 60: Web Mining

Page 60

Thank You

Page 61: Web Mining

Page 61

Results for E-commerce cameras

Figure: Precision, Recall, and F-Measure values (left) and FN and FP values (right) for DT, Naive Bayes, and SVM on the cameras data.

Page 62: Web Mining

Page 62

Snapshot of the generated tree from the decision tree model for the cameras category

Page 63: Web Mining

Page 63

Results for E-commerce mobiles

Figure: Precision, Recall, and F-Measure values (left) and FN and FP values (right) for DT, Naive Bayes, and SVM on the mobiles data.

Page 64: Web Mining

Page 64

Snapshot of the generated tree from the decision tree model for the mobiles category

Page 65: Web Mining

Page 65

Web applications links

http://web-engineering.orgfree.com/

http://easyshoping.orgfree.com/

http://questions.orgfree.com/

Page 66: Web Mining

Page 66

Machine learning Algorithms

Naïve Bayes is a probabilistic model based on Bayes' theorem.

$$ P(C \mid \vec{F}) = \frac{P(\vec{F} \mid C)\, P(C)}{P(\vec{F})} $$

Page 67: Web Mining

Page 67

Machine learning Algorithms

C4.5 is a supervised machine learning algorithm that was originally developed from the ID3 algorithm.

C4.5 generates decision trees from a set of training data based on the concept of information entropy (the standard definitions are given below).
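For reference, these are the standard textbook definitions behind that concept (general background, not notation taken from the thesis): the entropy of a labeled set S and the information gain of splitting on an attribute A, which C4.5 further normalizes into a gain ratio.

$$ H(S) = -\sum_{i} p_i \log_2 p_i, \qquad \mathrm{Gain}(S, A) = H(S) - \sum_{v \in \mathrm{values}(A)} \frac{|S_v|}{|S|}\, H(S_v) $$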

Page 68: Web Mining

Page 68

Machine learning Algorithms

SVM is a supervised machine learning algorithm. The main idea is to find a separating boundary called a hyperplane.

The hyperplane separates the n-dimensional data completely into its two (or more) classes.