Web Community Mining and Web Log Mining: Commodity Cluster Based Execution


Page 1: Web Community Mining and Web Log Mining: Commodity Cluster Based Execution

Mining di Dati Web (Web Data Mining)

Web Community Mining and Web Log Mining: Commodity Cluster Based Execution

Romeo Zitarosa

Page 2: Web Community Mining and Web Log Mining: Commodity Cluster Based Execution

Overview

- Introduction
- Web Community Mining
- Web Log Mining on MIS
- Parallel Data Mining on PC Cluster
- Performance Evaluation
- Conclusion

Page 3: Web Community Mining and Web Log Mining: Commodity Cluster Based Execution

Introduction

Two proposed applications of web mining:

1) Extract web communities

2) Understand the behaviour of mobile Internet users (usage mining)

Page 4: Web Community Mining and Web Log Mining: Commodity Cluster Based Execution

Web Community Mining

Web Community

Def. A web community is a collection of web pages created by individuals or associations that have common interests on a specific topic.

Page 5: Web Community Mining and Web Log Mining: Commodity Cluster Based Execution

Proposed technique

Starts from a set of seed pages

Based on RPA (related page algorithm)

Creates a community chart

Page 6: Web Community Mining and Web Log Mining: Commodity Cluster Based Execution

Authorities and Hubs

Authority: a page with good content on a topic, linked to by many good hub pages.

Hub: a page with a list of hyperlinks to valuable pages on a topic, that is, a page that points to good authorities.

Community core = authorities + hubs

Page 7: Web Community Mining and Web Log Mining: Commodity Cluster Based Execution

Web Community Mining

Algorithm:

1. Start from a seed set.

2. Apply RPA to each seed: build a web subgraph and extract hubs and authorities (using HITS).

3. Investigate how each seed derives other seeds as related pages.
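
Step 2 relies on HITS to score hubs and authorities in the subgraph built around a seed. The following is a minimal sketch of the standard HITS iteration, not the presenter's implementation; the function name `hits`, the fixed iteration count, and the toy edge list are assumptions made for illustration.

```python
from collections import defaultdict


def hits(edges, iterations=50):
    """Return (hub, authority) score dicts for a directed graph of (source, target) links."""
    nodes = {n for edge in edges for n in edge}
    out_links = defaultdict(set)
    in_links = defaultdict(set)
    for src, dst in edges:
        out_links[src].add(dst)
        in_links[dst].add(src)

    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority update: sum of the hub scores of the pages linking in.
        auth = {n: sum(hub[m] for m in in_links[n]) for n in nodes}
        norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        auth = {n: v / norm for n, v in auth.items()}
        # Hub update: sum of the authority scores of the pages linked to.
        hub = {n: sum(auth[m] for m in out_links[n]) for n in nodes}
        norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        hub = {n: v / norm for n, v in hub.items()}
    return hub, auth


# Hypothetical toy subgraph: two hub pages pointing at candidate authorities.
edges = [("h1", "a1"), ("h1", "a2"), ("h2", "a1"), ("h2", "a2"), ("h2", "a3")]
hub, auth = hits(edges)
top_hubs = sorted(hub, key=hub.get, reverse=True)
top_authorities = sorted(auth, key=auth.get, reverse=True)
```

In this toy graph `h2` links to more authorities than `h1`, so it ends up as the strongest hub, and `a1` and `a2` outrank `a3` as authorities because both hubs point to them.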

Page 8: Web Community Mining and Web Log Mining: Commodity Cluster Based Execution

Example

1. Suppose that s derives t as a related page and vice versa.

"s" and "t" are pointed to by similar sets of hubs.

2. Suppose that s derives t as a related page but t does not derive s.

"t" is pointed to by many different hubs, so "t" derives a different set of related pages.

Page 9: Web Community Mining and Web Log Mining: Commodity Cluster Based Execution

Observation

In this way we define a symmetric derivation relationship (s.d.r.) for identifying communities.

Def. Community: a set of pages strongly connected by the s.d.r.

Two communities are related if a member of one community derives a member of the other community.
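
One way to read the definition above is to treat each mutual derivation as an undirected edge and take the connected components as communities. The sketch below follows that reading; it is an assumption about what "strongly connected" means here, and the function name `communities` and the sample data are hypothetical.

```python
def communities(derives):
    """Group pages into communities: connected components over mutual derivations.

    `derives` maps each seed page to the set of pages it derives as related.
    """
    # Keep only the symmetric pairs: s derives t and t derives s.
    mutual = {
        s: {t for t in targets if s in derives.get(t, set())}
        for s, targets in derives.items()
    }

    seen, result = set(), []
    for start in sorted(mutual):
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:
            page = stack.pop()
            if page in component:
                continue
            component.add(page)
            stack.extend(mutual[page] - component)
        seen |= component
        result.append(component)
    return result


# Hypothetical derivations: a and b derive each other, so do c and d;
# a also derives c, but c does not derive a.
derives = {"a": {"b", "c"}, "b": {"a"}, "c": {"d"}, "d": {"c"}}
groups = communities(derives)
```

Here the one-way derivation from `a` to `c` does not merge the two groups, but it is exactly the kind of link that makes the two communities related.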

Page 10: Web Community Mining and Web Log Mining: Commodity Cluster Based Execution

Web Community Chart

Def. A graph that consists of communities as nodes and weighted edges between nodes.

The weight of an edge represents the relevance of the communities it connects.

We need a tool to browse communities.

Page 11: Web Community Mining and Web Log Mining: Commodity Cluster Based Execution

Web Community Chart (2)

Labels are assigned manually.

Box = list of URLs sorted by connectivity score.

Def. Connectivity score: the number of derivation relationships from the node to the other nodes of the community.
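
The connectivity score and the chart edges can be sketched as follows. The score follows the definition above; the edge weight (a count of derivations crossing two communities) is an assumption, since the slides do not say how the weight is computed. All names and data are hypothetical.

```python
from collections import Counter


def connectivity_scores(community, derives):
    """For each member, count its derivation relationships to the other members."""
    return {
        page: len((derives.get(page, set()) & community) - {page})
        for page in community
    }


def chart_edges(groups, derives):
    """Weight an edge between two communities by the derivations crossing them (assumed weighting)."""
    owner = {page: i for i, group in enumerate(groups) for page in group}
    weights = Counter()
    for src, targets in derives.items():
        for dst in targets:
            if src in owner and dst in owner and owner[src] != owner[dst]:
                weights[tuple(sorted((owner[src], owner[dst])))] += 1
    return weights


# Hypothetical derivations and communities.
derives = {"a": {"b", "c", "x"}, "b": {"a"}, "c": {"a", "b"}, "x": {"y"}, "y": {"x"}}
groups = [{"a", "b", "c"}, {"x", "y"}]

scores = connectivity_scores(groups[0], derives)
# The "box" of a community: members sorted by descending connectivity score.
box = sorted(groups[0], key=lambda page: (-scores[page], page))
edges = chart_edges(groups, derives)
```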

Page 12: Web Community Mining and Web Log Mining: Commodity Cluster Based Execution

Example

Page 13: Web Community Mining and Web Log Mining: Commodity Cluster Based Execution

Mobile Info Search (MIS)

NTT Laboratories

Goal: provide location-aware information from the Internet by collecting, structuring, filtering and organizing it.

www.kokono.net

Page 14: Web Community Mining and Web Log Mining: Commodity Cluster Based Execution

kokono

There is a database-type resource between the user and the information sources (online maps, yellow pages, etc.).

Page 15: Web Community Mining and Web Log Mining: Commodity Cluster Based Execution

MIS Functionalities

User Location Acquisition
- GPS, PHS, postal number

Location-Oriented Robot-Based Search (kokono)
- search documents close to a location
- display documents in order of the distance between the location written in the document and the user's position

Location-Oriented Meta Search
- backbone database accessed by CGI programs
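
Ordering documents by distance from the user can be illustrated with a great-circle distance. This is only a sketch under assumed inputs: the slides do not say how kokono measures distance, and the record layout `(url, latitude, longitude)`, the function names, and the sample URLs are hypothetical.

```python
from math import asin, cos, radians, sin, sqrt


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))


def rank_by_distance(user_pos, documents):
    """Sort (url, lat, lon) records by distance from the user's (lat, lon) position."""
    return sorted(documents, key=lambda d: haversine_km(*user_pos, d[1], d[2]))


# Hypothetical user position and documents tagged with the location they mention.
user_pos = (35.68, 139.76)
documents = [
    ("example.org/far", 34.69, 135.50),
    ("example.org/near", 35.69, 139.77),
    ("example.org/mid", 35.44, 139.64),
]
ranked = rank_by_distance(user_pos, documents)
```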

Page 16: Web Community Mining and Web Log Mining: Commodity Cluster Based Execution

Association Rule Mining

Support, confidence

Hierarchy => Taxonomy

The hierarchy makes it possible to find not only rules specific to a location but also rules for the wider area that covers that location.

Identify access patterns of MIS users.

Prefetch information.

Reduce access time.

Spatial information gives valuable information to mobile users.
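
The role of the taxonomy can be shown with a small sketch: each log entry is extended with the ancestors of its items, so that support and confidence can be measured for a wider area as well as for a specific location. The place names, log entries, and function names below are made up for illustration.

```python
def extend_with_ancestors(transaction, parent):
    """Add every ancestor of each item, following the taxonomy's child -> parent links."""
    extended = set(transaction)
    for item in transaction:
        while item in parent:
            item = parent[item]
            extended.add(item)
    return extended


def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Support of antecedent and consequent together, divided by support of the antecedent."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)


# Hypothetical taxonomy (child -> parent) and access log entries.
parent = {"Shibuya": "Tokyo", "Shinjuku": "Tokyo"}
log = [
    {"Shibuya", "restaurant"},
    {"Shinjuku", "restaurant"},
    {"Shibuya", "hotel"},
    {"Osaka", "restaurant"},
]
transactions = [extend_with_ancestors(t, parent) for t in log]

narrow = support({"Shibuya", "restaurant"}, transactions)
wide = support({"Tokyo", "restaurant"}, transactions)
rule_conf = confidence({"Tokyo"}, {"restaurant"}, transactions)
```

In this made-up log the rule for the specific ward has support 0.25, while the rule for the wider area that covers it has support 0.5, which is the effect the hierarchy is meant to expose.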

Page 17: Web Community Mining and Web Log Mining: Commodity Cluster Based Execution

Sequential Rule Mining

Sequential Patterns

Derive how different services are used together.

Example: define the plan after checking the weather:

1. Submit_weather = Weather Forecast
2. submit_shop = Shop Info && shop_web = townpage
3. Submit_kokono = KOKONOSearch
4. Submit_map = MAP
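
The support of a sequential pattern such as the one above can be counted as the fraction of user sessions that contain its steps in order. The sketch below assumes sessions are plain lists of service names; the session data and function names are hypothetical.

```python
def contains_in_order(session, pattern):
    """True if `pattern` occurs in `session` as a (not necessarily contiguous) subsequence."""
    events = iter(session)
    # `step in events` consumes the iterator up to the match, which enforces the order.
    return all(step in events for step in pattern)


def sequence_support(pattern, sessions):
    """Fraction of sessions that contain `pattern` in order."""
    return sum(contains_in_order(s, pattern) for s in sessions) / len(sessions)


# Hypothetical sessions built from the service names on the slide.
sessions = [
    ["submit_weather", "submit_shop", "submit_kokono", "submit_map"],
    ["submit_weather", "submit_map"],
    ["submit_map", "submit_weather"],
    ["submit_shop", "submit_kokono"],
]
pattern = ["submit_weather", "submit_map"]
pattern_support = sequence_support(pattern, sessions)
```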

Page 18: Web Community Mining and Web Log Mining: Commodity Cluster Based Execution

Parallel DM and PC Cluster

Parallel Apriori
- every node keeps all candidate itemsets
- each node scans the dataset independently
- nodes communicate only at the end of the phase

Problem: too much memory is used!

Solution (partial): Hash Partitioned Apriori (HPA)
- candidates are partitioned using a hash function
- each node builds its own candidate itemsets
- a lot of disk I/O when support is small
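
The hash-partitioning idea can be sketched as a single-process simulation: each itemset is owned by exactly one node, chosen by a hash function, so no node has to hold every counter. This is a simplification and not the HPA implementation itself: it counts every 2-itemset instead of pruned candidates, replaces network sends with a list update, and uses assumed names (`owner`, `hpa_count_pairs`, `frequent`).

```python
from collections import Counter
from itertools import combinations
from zlib import crc32


def owner(itemset, n_nodes):
    """Deterministically hash an itemset to the node that holds its counter."""
    return crc32(",".join(itemset).encode()) % n_nodes


def hpa_count_pairs(partitions, n_nodes):
    """Simulate one counting pass for 2-itemsets.

    `partitions[i]` is the list of transactions stored on node i. Each node
    enumerates the 2-itemsets of its local transactions and "sends" each one
    to the node selected by the hash function.
    """
    counters = [Counter() for _ in range(n_nodes)]
    for local_transactions in partitions:
        for transaction in local_transactions:
            for pair in combinations(sorted(transaction), 2):
                counters[owner(pair, n_nodes)][pair] += 1  # stands in for a network send
    return counters


def frequent(counters, min_count):
    """Each node filters its own counters; the union is the global result."""
    return {pair: c for node in counters for pair, c in node.items() if c >= min_count}


# Hypothetical dataset split across two nodes.
partitions = [
    [{"a", "b", "c"}, {"a", "b"}],
    [{"a", "c"}, {"a", "b", "d"}],
]
counters = hpa_count_pairs(partitions, n_nodes=2)
large_pairs = frequent(counters, min_count=2)
```

Because each pair hashes to one node only, the counters are disjoint across nodes and the memory needed per node shrinks as nodes are added, at the price of sending itemsets over the network during the scan.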

Page 19: Web Community Mining and Web Log Mining: Commodity Cluster Based Execution

Parallel Algorithms for Association Rule Mining

Non Partitioned Generalized (NPGM)

Hash Partitioned (HPGM)
- reduces communication

Hierarchical HPGM (H-HPGM)
- candidates whose roots are identical are allocated on the same node

H-HPGM with Fine Grain Duplicates (H-HPGM-FGD)
- uses the remaining free space
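
The H-HPGM allocation rule can be illustrated by hashing the taxonomy roots of a candidate's items instead of the items themselves, so that candidates with identical roots land on the same node. This is a sketch of that single idea under assumed names and a made-up taxonomy, not the algorithm as implemented in the original work.

```python
from zlib import crc32


def root_of(item, parent):
    """Follow child -> parent links up to the top of the taxonomy."""
    while item in parent:
        item = parent[item]
    return item


def allocate(itemset, parent, n_nodes):
    """Choose a node from the roots of the items, so same-root candidates share a node."""
    roots = sorted(root_of(item, parent) for item in itemset)
    return crc32(",".join(roots).encode()) % n_nodes


# Hypothetical taxonomy (child -> parent).
parent = {
    "Shibuya": "Tokyo", "Tokyo": "Japan", "Osaka": "Japan",
    "ramen": "food", "sushi": "food",
}
candidates = [{"Shibuya", "ramen"}, {"Tokyo", "sushi"}, {"Osaka", "food"}]
nodes = [allocate(c, parent, n_nodes=4) for c in candidates]
```

All three candidates have the roots `Japan` and `food`, so they are assigned to the same node, and a transaction's items and their ancestors can be counted there without being sent elsewhere.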

Page 20: Web Community Mining and Web Log Mining: Commodity Cluster Based Execution

Performance Evaluation

Observation: execution time increases as the minimum support becomes smaller.

Page 21: Web Community Mining and Web Log Mining: Commodity Cluster Based Execution

Conclusion

Real web mining applications need high-performance computing systems.

A PC cluster, with its scalable performance (and high costs), is a promising platform…