
Web Mining Research: A Survey

Authors: Raymond Kosala and Hendrik Blockeel

ACM SIGKDD, July 2000

Presented by Shan Huang, 4/24/2007

Revised and presented by Fan Min, 4/22/2009

Revised and presented by Nima [Poornima Shetty]

Date: 12/06/2011

Course: Data Mining [CS332]

Computer Science Department
University of Vermont

Outline

Introduction
Web Mining
  Web Content Mining
  Web Structure Mining
  Web Usage Mining
Conclusion & Exam Questions


Introduction

With the huge amount of information available online, the World Wide Web is a fertile area for data mining research.

WWW is a popular and interactive medium to circulate information today.

The Web is huge, diverse, and dynamic, which raises issues of scalability, multimedia data, and temporal change, respectively.


Four Problems

Finding relevant information
  Low precision and unindexed information
Creating new knowledge out of available information on the web
  A data-triggered process
Personalizing the information
  Personal preference in content and presentation of the information
Learning about the consumers
  What does the customer want to do?


Other Approaches

Web mining is NOT the only approach
  Database approach (DB)
  Information retrieval (IR)
  Natural language processing (NLP)
  Web document community


Direct vs. Indirect Web Mining

Web mining techniques can be used to solve the information overload problems:
Directly
  Address the problem with web mining techniques
  E.g. a newsgroup agent classifies whether a news item is relevant
Indirectly
  Used as part of a bigger application that addresses the problems
  E.g. used to create index terms for a web search service


The Research

Converging research from: Database, information retrieval, and artificial intelligence (specifically NLP and machine learning)

The survey attempts to organize the existing research in a structured way from the machine learning point of view




Web Mining: Definition

“Web mining refers to the overall process of discovering potentially useful and previously unknown information or knowledge from the Web data.”

Can be viewed as four subtasks


Web Mining: Subtasks

Resource finding
  Retrieving intended web documents
Information selection and pre-processing
  Select and pre-process specific information from the retrieved documents
  A kind of transformation of the original data retrieved in the IR process
  This transformation could be a kind of pre-processing
Generalization
  Discover general patterns within and across web sites
Analysis
  Validation and/or interpretation of mined patterns


Web Mining and Information Retrieval

Information retrieval (IR) is the automatic retrieval of all relevant documents while at the same time retrieving as few of the non-relevant documents as possible

Goal: Indexing text and searching for useful documents in a collection.

Research in IR: modeling, document classification and categorization, user interfaces, data visualization, filtering etc.

Web document classification, which is a Web Mining task, could be part of an IR system (e.g. indexing for a search engine)

Viewed in this respect, Web mining is part of the (Web) IR process.
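The IR goal stated on this slide (retrieve all relevant documents, and as few non-relevant ones as possible) is usually measured as recall and precision. A minimal sketch; the document ids and the function name `precision_recall` are made up for illustration:

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a collection of retrieved document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: the engine returned 4 pages, 3 pages are truly relevant.
precision, recall = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d3", "d7"])
print(precision, recall)
```

Precision drops when non-relevant pages are returned; recall drops when relevant pages (for example, unindexed ones) are missed.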


Web Mining and Information Extraction

Information Extraction (IE): Transforming a collection of documents into information that is more easily understood and analyzed.

Building IE systems manually for the general Web is not feasible
  Most IE systems focus on specific Web sites or content to extract


Compare IR and IE

IR aims to select relevant documents
  IE aims to extract the relevant facts from given documents
IR views the text in a document just as a bag of unordered words
  IE is interested in the structure or representation of a document
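The contrast can be illustrated with a toy example. This is only a sketch: the two documents, the helper names `ir_select` and `ie_extract`, and the choice of e-mail addresses as the "fact" to extract are all assumptions made for illustration.

```python
import re

# Hypothetical mini-collection.
docs = {
    "d1": "Contact the sales office at sales@example.com for prices.",
    "d2": "The weather in Vermont is cold in December.",
}


def ir_select(query, docs):
    """IR view: treat each document as an unordered bag of words and
    return the ids of documents sharing at least one word with the query."""
    query_words = set(query.lower().split())
    return [doc_id for doc_id, text in docs.items()
            if query_words & set(re.findall(r"\w+", text.lower()))]


def ie_extract(text):
    """IE view: use the structure of the text to pull out specific facts
    (here, e-mail addresses) from a given document."""
    return re.findall(r"[\w.]+@\w+(?:\.\w+)+", text)


selected = ir_select("sales prices", docs)
facts = ie_extract(docs["d1"])
```

IR returns whole documents (`selected`), while IE returns the facts found inside a document (`facts`).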


Web Mining and The Agent Paradigm

Web mining is often viewed from or implemented within an agent paradigm. Web mining has a close relationship with Intelligent Agents.

User Interface Agents
  Information retrieval agents, information filtering agents, & personal assistant agents.
Distributed Agents
  Concerned with problem solving by a group of agents.
  Distributed agents for knowledge discovery or data mining.
Mobile Agents


Web Mining and The Agent Paradigm (contd.)

Two frequently used approaches for developing intelligent agents:

Content-based approach
  The system searches for items that match based on an analysis of the content using the user preferences.
Collaborative approach
  The system tries to find users with similar interests to give recommendations to.
  Analyze the user profiles and sessions or transactions.
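The two approaches can be sketched side by side. This is an illustrative sketch only: cosine similarity is one common choice of matching function (the slides do not prescribe one), and the profile, items, and ratings are invented.

```python
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Content-based: match item content against the user's own preferences.
user_profile = {"mining": 1.0, "web": 1.0}
items = {
    "page_a": {"web": 2.0, "mining": 1.0},
    "page_b": {"cooking": 3.0, "recipes": 1.0},
}
best_item = max(items, key=lambda i: cosine(user_profile, items[i]))

# Collaborative: find the user whose ratings look most like the target user's.
ratings = {
    "alice": {"page_a": 5, "page_c": 4},
    "bob": {"page_a": 4, "page_c": 5, "page_d": 5},
    "carol": {"page_b": 5},
}
target = "alice"
neighbour = max((u for u in ratings if u != target),
                key=lambda u: cosine(ratings[target], ratings[u]))
# Recommend what the most similar user rated but the target has not seen.
recommended = sorted(set(ratings[neighbour]) - set(ratings[target]))
```

The content-based part never looks at other users; the collaborative part never looks at page content.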



Web Mining Categories

Web Content Mining
  Discovering useful information from web page contents/data/documents.
Web Structure Mining
  Discovering the model underlying link structures (topology) on the Web.
  E.g. discovering authorities and hubs
Web Usage Mining
  Extraction of interesting knowledge from logging information produced by web servers.
  Usage data from logs, user profiles, user sessions, cookies, user queries, bookmarks, mouse clicks and scrolls, etc.



Web Content Data Structure

Web content consists of several types of data
  Text, image, audio, video, hyperlinks.
Unstructured – free text
Semi-structured – HTML
More structured – data in tables or database-generated HTML pages
Note: much of the Web content data is unstructured text data.


Web Content Mining: IR View

Unstructured Documents
  Bag of words to represent unstructured documents
    Takes single word as feature
    Ignores the sequence in which words occur
  Features could be
    Boolean: word either occurs or does not occur in a document
    Frequency based: frequency of the word in a document
  Variations of the feature selection include
    Removing the case, punctuation, infrequent words and stop words
  Features can be reduced using different feature selection techniques:
    Information gain, mutual information, cross entropy.
    Stemming, which reduces words to their morphological roots.
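A minimal sketch of the bag-of-words representation with Boolean and frequency-based features. The stop-word list, the vocabulary, and the sample sentence are illustrative assumptions; stemming and the feature selection measures named above are not shown.

```python
import re
from collections import Counter

STOP_WORDS = {"the", "is", "a", "of", "and", "to", "in"}  # tiny illustrative list


def tokenize(text):
    """Remove case and punctuation, then drop stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]


def bag_of_words(text, vocabulary, boolean=False):
    """Map a document to a feature vector over a fixed vocabulary.
    Word order is ignored; each single word is one feature."""
    counts = Counter(tokenize(text))
    return [int(counts[w] > 0) if boolean else counts[w] for w in vocabulary]


vocabulary = ["web", "mining", "data", "agent"]
doc = "The Web is huge. Web mining is the mining of Web data."
freq_vector = bag_of_words(doc, vocabulary)                # frequency based
bool_vector = bag_of_words(doc, vocabulary, boolean=True)  # Boolean
print(freq_vector, bool_vector)
```

Both vectors would be unchanged if the words of `doc` were shuffled, which is exactly what "ignores the sequence in which words occur" means.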


Web Content Mining: IR View

Semi-Structured Documents
  Uses richer representations for features
    Due to the additional structural information in the hypertext document (typically HTML and hyperlinks)
  Uses common data mining methods (whereas unstructured might use more text mining methods)

Application: Hypertext classification or categorization and clustering, learning relations between web documents, learning extraction patterns or rules, and finding patterns in semi-structured data.
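As a sketch of what "richer representations" can mean, the structural parts of an HTML page (title, link targets, anchor text) can be collected as separate features with Python's standard `html.parser` module. The class name `HypertextFeatures` and the sample page are assumptions for illustration.

```python
from html.parser import HTMLParser


class HypertextFeatures(HTMLParser):
    """Collect title text, link targets, and anchor text from an HTML page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self.anchor_text = []
        self._open = []  # stack of currently open tags

    def handle_starttag(self, tag, attrs):
        self._open.append(tag)
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed tags in between.
        if tag in self._open:
            while self._open.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if "title" in self._open:
            self.title += text
        elif "a" in self._open:
            self.anchor_text.append(text)


page = ('<html><head><title>Web Mining</title></head><body>'
        '<p>See <a href="http://example.org/hubs">hubs and authorities</a>.</p>'
        '</body></html>')
parser = HypertextFeatures()
parser.feed(page)
```

A classifier could then weight words from `parser.title` or `parser.anchor_text` differently from words in the body text.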


Web Content Mining: DB View

The database techniques on the Web are related to the problems of managing and querying the information on the Web.

DB view tries to infer the structure of a Web site or transform a Web site to become a database

  Better information management
  Better querying on the Web

Can be achieved by:
  Finding the schema of Web documents
  Building a Web warehouse
  Building a Web knowledge base
  Building a virtual database


Web Content Mining: DB View

DB view mainly uses the Object Exchange Model (OEM)
  Represents semi-structured data by a labeled graph
  The data in the OEM is viewed as a graph, with objects as the vertices and labels on the edges
  Each object is identified by an object identifier [oid], and its value is either atomic or complex
Process typically starts with manual selection of Web sites for doing Web content mining
Main application:
  The task of finding frequent substructures in semi-structured data
  The task of creating a multi-layered database
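A minimal sketch of an OEM-style labeled graph, assuming a simple in-memory encoding: each object has an oid and is either atomic (holds a value) or complex (holds labeled edges to other objects). The class `OEMObject`, the helper `children`, and the `&n` oid strings are illustrative choices, not a standard API.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple, Union


@dataclass
class OEMObject:
    oid: str
    # Atomic objects carry a plain value and no edges.
    value: Union[str, int, float, None] = None
    # Complex objects carry (label, target oid) edges.
    edges: List[Tuple[str, str]] = field(default_factory=list)

    @property
    def is_atomic(self) -> bool:
        return not self.edges


def children(db: Dict[str, OEMObject], oid: str, label: str) -> List[OEMObject]:
    """Follow every edge with the given label out of the object `oid`."""
    return [db[target] for (edge_label, target) in db[oid].edges if edge_label == label]


db = {
    "&1": OEMObject("&1", edges=[("title", "&2"), ("author", "&3"), ("author", "&4")]),
    "&2": OEMObject("&2", value="Web Mining Research: A Survey"),
    "&3": OEMObject("&3", value="Raymond Kosala"),
    "&4": OEMObject("&4", value="Hendrik Blockeel"),
}
authors = [obj.value for obj in children(db, "&1", "author")]
```

Because labels sit on the edges rather than in a fixed schema, another object could have three authors or none, which is what makes the data semi-structured.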



Web Structure Mining

Interested in the structure of the hyperlinks within the Web

Inspired by the study of social networks and citation analysis
  Can discover specific types of pages (such as hubs, authorities, etc.) based on the incoming and outgoing links.

Application: Discovering micro-communities in the Web, measuring the “completeness” of a Web site
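One well-known way to score hubs and authorities from incoming and outgoing links is Kleinberg's HITS iteration. The slides do not name a specific algorithm, so the sketch below is an assumption about how the idea could be realized; the four-page link graph is invented.

```python
def hits(links, iterations=20):
    """Compute hub and authority scores.
    `links` maps each page to the list of pages it points to."""
    pages = set(links) | {q for targets in links.values() for q in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of the hub scores of pages linking in.
        auth = {p: 0.0 for p in pages}
        for p, targets in links.items():
            for q in targets:
                auth[q] += hub[p]
        # Hub score: sum of the authority scores of pages linked to.
        hub = {p: sum(auth[q] for q in links.get(p, [])) for p in pages}
        # Normalize both vectors so the scores stay bounded.
        a_norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        h_norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}
    return hub, auth


# Hypothetical link graph.
links = {"portal": ["paper1", "paper2"], "blog": ["paper1"], "paper1": [], "paper2": []}
hub, auth = hits(links)
top_hub = max(hub, key=hub.get)
top_authority = max(auth, key=auth.get)
```

A good hub points to many good authorities, and a good authority is pointed to by many good hubs; the loop repeats that mutual reinforcement until the scores settle.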



Web Usage Mining

Tries to predict user behavior from interaction with the Web

Wide range of data (logs)
  Web client data
  Proxy server data
  Web server data
Two common approaches
  Map the usage data of the Web server into relational tables before applying adapted data mining techniques
  Use the log data directly by utilizing special pre-processing techniques
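The first approach (mapping server usage data into relational tables) might look like the sketch below. It assumes the server writes Common Log Format lines; the sample lines, the table name `hits`, and its columns are invented for illustration.

```python
import re
import sqlite3

# Common Log Format: host ident user [time] "method url protocol" status bytes
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\S+)'
)

# Hypothetical log lines.
log_lines = [
    '10.0.0.1 - - [06/Dec/2011:10:00:00 -0500] "GET /index.html HTTP/1.0" 200 1043',
    '10.0.0.1 - - [06/Dec/2011:10:00:30 -0500] "GET /courses.html HTTP/1.0" 200 2210',
    '10.0.0.2 - - [06/Dec/2011:10:01:00 -0500] "GET /index.html HTTP/1.0" 200 1043',
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hits (host TEXT, time TEXT, url TEXT, status INTEGER)")
for line in log_lines:
    m = LOG_PATTERN.match(line)
    if m:  # skip lines that do not match the expected format
        conn.execute("INSERT INTO hits VALUES (?, ?, ?, ?)",
                     (m["host"], m["time"], m["url"], int(m["status"])))

# Once the data is relational, ordinary queries can serve as a first mining step.
page_counts = conn.execute(
    "SELECT url, COUNT(*) FROM hits GROUP BY url ORDER BY COUNT(*) DESC, url"
).fetchall()
```

The second approach on the slide would skip the table and run specialized pre-processing directly on the log lines.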


Web Usage Mining

Typical problems: Distinguishing among unique users, server sessions, episodes, etc. in the presence of caching and proxy servers

Often Usage Mining uses some background or domain knowledge
  E.g. site topology, Web content, etc.
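A sketch of one simple heuristic for splitting requests into server sessions: identify a user by host and start a new session after a period of inactivity. Both choices are assumptions made here, not taken from the slides, and the 30-minute timeout is only a conventional default. As the slide warns, proxies and caching make host-based identification unreliable.

```python
from collections import defaultdict


def sessionize(requests, timeout=1800):
    """Group (host, unix_time, url) requests into per-host sessions.
    A new session starts when the gap since the host's previous
    request exceeds `timeout` seconds."""
    by_host = defaultdict(list)
    for host, ts, url in sorted(requests, key=lambda r: (r[0], r[1])):
        by_host[host].append((ts, url))

    sessions = []
    for host, visits in by_host.items():
        current = [visits[0][1]]
        for (prev_ts, _), (ts, url) in zip(visits, visits[1:]):
            if ts - prev_ts > timeout:
                sessions.append((host, current))  # close the previous session
                current = []
            current.append(url)
        sessions.append((host, current))
    return sessions


# Hypothetical requests with timestamps in seconds.
requests = [
    ("10.0.0.1", 0, "/index.html"),
    ("10.0.0.1", 60, "/courses.html"),
    ("10.0.0.2", 90, "/index.html"),
    ("10.0.0.1", 4000, "/index.html"),
]
sessions = sessionize(requests)
```

Background knowledge such as site topology could refine this, for example by splitting a session when a requested page is not reachable from the previous one.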


Web Usage Mining

Applications: Two main categories:

Learning a user profile (personalized)
  Web users would be interested in techniques that learn their needs and preferences automatically
Learning user navigation patterns (impersonalized)
  Information providers would be interested in techniques that improve the effectiveness of their Web site
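As a sketch of the second category, navigation patterns can be approximated by counting how often one page is followed by another across all sessions. The sessions below are invented, and counting page-to-page transitions is just one simple choice of pattern.

```python
from collections import Counter


def transition_counts(sessions):
    """Count page-to-page transitions over a list of sessions,
    where each session is an ordered list of visited URLs."""
    counts = Counter()
    for pages in sessions:
        counts.update(zip(pages, pages[1:]))
    return counts


# Hypothetical sessions.
sessions = [
    ["/index.html", "/courses.html", "/cs332.html"],
    ["/index.html", "/courses.html"],
    ["/index.html", "/people.html"],
]
counts = transition_counts(sessions)
most_common = counts.most_common(1)[0]
```

An information provider could use frequent transitions like `most_common` to decide which links deserve a more prominent place on the site.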



Conclusions

Survey the research in the area of Web mining.
Suggest three Web mining categories
  Content, Structure, and Usage Mining
  And then situate some of the research with respect to these categories
Explored the connection between the Web mining categories and the related agent paradigm


Exam Question #1

Question: Outline the main characteristics of Web information.

Answer: Web information is huge, diverse, and dynamic.


Exam Question #2

Question: Define Web Mining

Answer: Web mining refers to the overall process of discovering potentially useful and previously unknown information or knowledge from the Web data.


Exam Question #3

Question: What are the three main areas of interest for Web mining?

Answer: (1) Web Content

(2) Web Structure

(3) Web Usage


Thank you!
