Web Community Mining and Web Log Mining: Commodity Cluster Based Execution
Web Data Mining (Mining di Dati Web)
Romeo Zitarosa
Overview
- Introduction
- Web Community Mining
- Web Log Mining on MIS
- Parallel Data Mining on PC Cluster
- Performance Evaluation
- Conclusion
Introduction
Two applications of web mining are proposed:
1) Extracting web communities
2) Understanding the behaviour of mobile Internet users (usage mining)
Web Community Mining
Def. A web community is a collection of web pages created by individuals or associations that have a common interest in a specific topic.
Proposed technique
- Starts from a set of seed pages
- Based on RPA (related page algorithm)
- Creates a community chart
Authorities and Hubs
- Authority: a page with good content on a topic, linked to by many good hub pages.
- Hub: a page with a list of hyperlinks to valuable pages on a topic, i.e. one that points to good authorities.
- Community core = authorities + hubs
Web Community Mining
Algorithm:
1. Start from a seed set.
2. Apply RPA to each seed: build a web subgraph around the seed and extract hubs and authorities from it (using HITS).
3. Investigate how each seed derives other seeds as related pages.
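Step 2 relies on HITS to rank hubs and authorities. The following is a minimal Python sketch of the standard HITS iteration on a toy edge list. The function name `hits`, the iteration count and the example graph are assumptions made for illustration; the sketch does not reproduce how RPA builds the subgraph around a seed.

```python
from math import sqrt


def hits(edges, iterations=50):
    """Return (hub, authority) score dicts for a list of (source, target) links."""
    nodes = {n for edge in edges for n in edge}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority score: sum of the hub scores of the pages linking to it.
        auth = {n: 0.0 for n in nodes}
        for src, dst in edges:
            auth[dst] += hub[src]
        # Hub score: sum of the authority scores of the pages it links to.
        hub = {n: 0.0 for n in nodes}
        for src, dst in edges:
            hub[src] += auth[dst]
        # Normalize both vectors so the scores stay bounded.
        for scores in (auth, hub):
            norm = sqrt(sum(v * v for v in scores.values())) or 1.0
            for n in scores:
                scores[n] /= norm
    return hub, auth


# Hypothetical toy graph: three hub pages linking to three content pages.
edges = [("h1", "a"), ("h2", "a"), ("h3", "a"),
         ("h1", "b"), ("h2", "b"), ("h3", "c")]
hub, auth = hits(edges)
top_authority = max(auth, key=auth.get)
```

In this toy graph page `a` is linked by all three hubs, so it ends up with the highest authority score, and `h1` and `h2` outrank `h3` because they point to the stronger authorities.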
Example
1. Suppose that s derives t as a related page and vice versa.
   Then "s" and "t" are pointed to by a similar set of hubs.
2. Suppose that s derives t as a related page but t does not derive s.
   Then "t" is pointed to by many different hubs, so "t" derives a different set of related pages.
Observation
- In this way we define a symmetric derivation relationship (s.d.r.) to identify communities.
- Def. Community: a set of pages strongly connected by s.d.r.
- Two communities are related if a member of one community derives a member of the other community.
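The definitions above can be sketched in Python as follows. This is a simplification under stated assumptions: "strongly connected" is read here as "in the same connected component of the symmetric derivation graph", which may be looser than the criterion the original technique uses. The `derives` mapping and the page names are hypothetical.

```python
def communities(derives):
    """Group pages linked by symmetric derivation relationships.

    derives maps each page to the set of pages it derives as related.
    """
    # Keep only mutual derivations: s derives t and t derives s.
    sym = {page: set() for page in derives}
    for s, related in derives.items():
        for t in related:
            if s in derives.get(t, ()):
                sym[s].add(t)
    # Collect the connected components of the symmetric graph.
    seen, result = set(), []
    for start in sorted(sym):
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:
            page = stack.pop()
            if page in component:
                continue
            component.add(page)
            stack.extend(sym[page] - component)
        seen |= component
        result.append(component)
    return result


# Hypothetical derivations: s <-> t and u <-> v are mutual, t -> u is one-way.
derives = {"s": {"t"}, "t": {"s", "u"}, "u": {"v"}, "v": {"u"}}
found = communities(derives)
```

Here `s` and `t` form one community and `u` and `v` another; the one-way derivation from `t` to `u` does not merge them, but it makes the two communities related.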
Web Community Chart
- Def. A graph that consists of communities as nodes and weighted edges between nodes.
- The weight of an edge represents the relevance between the two communities it connects.
- We need a tool to browse communities.
Web Community Chart (2)
- Labels are assigned manually.
- Box = list of URLs sorted by connectivity score.
- Def. Connectivity score: the number of derivation relationships from a node to the other nodes of the community.
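A small Python sketch of the connectivity score and of the sorted URL list shown in each box. The function names, the tie-breaking by URL and the example URLs are assumptions for illustration only.

```python
def connectivity_scores(community, derives):
    """Count, for each page, its derivations to other members of the community."""
    return {page: len(derives.get(page, set()) & (community - {page}))
            for page in community}


def community_box(community, derives):
    """List the community's URLs by descending connectivity score."""
    scores = connectivity_scores(community, derives)
    # Ties are broken alphabetically so the order is deterministic.
    return sorted(community, key=lambda page: (-scores[page], page))


# Hypothetical community and derivation relationships.
community = {"a.example", "b.example", "c.example"}
derives = {
    "a.example": {"b.example", "c.example"},
    "b.example": {"a.example"},
    "c.example": {"a.example", "x.example"},  # x.example is outside the community
}
scores = connectivity_scores(community, derives)
box = community_box(community, derives)
```

Derivations that point outside the community, such as the one to `x.example`, do not count towards the score.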
Example
Mobile Info Search (MIS)
- NTT Laboratories
- Goal: provide location-aware information from the Internet by collecting, structuring, filtering and organizing it.
- www.kokono.net
kokono
- There is a database-type resource between the user and the information sources (online maps, yellow pages, etc.).
MIS Functionalities
- User Location Acquisition
  - GPS, PHS, postal code
- Location Oriented Robot-Based Search (kokono)
  - searches for documents close to a location
  - displays documents in order of the distance between the location written in the document and the user's position
- Location Oriented Meta Search
  - backbone database accessed by CGI programs
Association Rule Mining
- Support, confidence
- Hierarchy => Taxonomy
- The hierarchy makes it possible to find not only rules specific to a location but also rules for the wider area that covers that location.
- Identify access patterns of MIS users.
- Prefetch information.
- Reduce access time.
- Spatial information gives valuable information to mobile users.
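To illustrate how a location hierarchy lets a rule hold for a wider area, here is a minimal Python sketch of support and confidence over transactions extended with their ancestors. The taxonomy, the place names and the log entries are hypothetical and are not taken from the MIS logs.

```python
def with_ancestors(transaction, parent):
    """Extend a transaction with every ancestor of its items in the taxonomy."""
    items = set(transaction)
    for item in transaction:
        while item in parent:
            item = parent[item]
            items.add(item)
    return items


def support(itemset, transactions):
    """Fraction of transactions that contain every item of the itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(lhs, rhs, transactions):
    """Support of lhs and rhs together, divided by the support of lhs."""
    return support(set(lhs) | set(rhs), transactions) / support(lhs, transactions)


# Hypothetical taxonomy (child -> parent) and access log.
parent = {"Shibuya": "Tokyo", "Shinjuku": "Tokyo"}
log = [{"Shibuya", "restaurant"}, {"Shinjuku", "restaurant"},
       {"Shibuya", "hotel"}, {"Yokohama", "restaurant"}]
extended = [with_ancestors(t, parent) for t in log]

area_support = support({"Tokyo", "restaurant"}, extended)
ward_support = support({"Shibuya", "restaurant"}, extended)
area_confidence = confidence({"Tokyo"}, {"restaurant"}, extended)
```

The rule for the wider area (`Tokyo`) gathers support from both wards, so it can pass a minimum support threshold that the rule for a single ward would miss.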
Sequential Rule Mining
- Sequential patterns
- Derive how different services are used together.
- Example: making a plan after checking the weather:
  submit_weather = Weather Forecast
  submit_shop = Shop Info && shop_web = townpage
  submit_kokono = KOKONOSearch
  submit_map = MAP
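A minimal Python sketch of how such a sequential pattern could be checked against user sessions. Each event is modelled as a set of `attribute=value` strings; this encoding, the function names and the two sessions are assumptions for illustration, and only the attribute names come from the example above.

```python
def contains_sequence(session, pattern):
    """True if the pattern steps occur in the session in order (gaps allowed)."""
    events = iter(session)
    # Each step must match a later event than the previous step did;
    # the shared iterator enforces that ordering.
    return all(any(step <= event for event in events) for step in pattern)


def sequence_support(sessions, pattern):
    """Fraction of sessions that contain the pattern."""
    return sum(contains_sequence(s, pattern) for s in sessions) / len(sessions)


pattern = [
    {"submit_weather=Weather Forecast"},
    {"submit_shop=Shop Info", "shop_web=townpage"},
    {"submit_map=MAP"},
]

# Hypothetical sessions, each an ordered list of events.
sessions = [
    [{"submit_weather=Weather Forecast"},
     {"submit_shop=Shop Info", "shop_web=townpage"},
     {"submit_kokono=KOKONOSearch"},
     {"submit_map=MAP"}],
    [{"submit_map=MAP"}, {"submit_weather=Weather Forecast"}],
]
pattern_support = sequence_support(sessions, pattern)
```

The first session matches even though a KOKONO search sits between the shop lookup and the map request; the second does not, because the map request comes before the weather check.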
Parallel DM and PC Cluster
- Parallel Apriori
  - every node keeps all candidate itemsets
  - nodes scan the dataset independently
  - nodes communicate only at the end of each phase
- Problem: too much memory is used!
- Solution (partial): Hash Partitioned Apriori (HPA)
  - candidates are partitioned among the nodes using a hash function
  - each node builds its own candidate itemsets
  - still a lot of disk I/O when the support is small
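The hash partitioning idea can be sketched in a single Python process that simulates the nodes. This is only an illustration under stated assumptions: the CRC32-based hash, the function names and the toy transactions are hypothetical, and real message passing between nodes is replaced by indexing into a list.

```python
import zlib
from itertools import combinations


def owner(itemset, num_nodes):
    """Pick the node responsible for an itemset with a deterministic hash."""
    key = ",".join(sorted(itemset)).encode()
    return zlib.crc32(key) % num_nodes


def partition_candidates(candidates, num_nodes):
    """Give each node only the candidates that hash to it."""
    parts = [set() for _ in range(num_nodes)]
    for candidate in candidates:
        parts[owner(candidate, num_nodes)].add(frozenset(candidate))
    return parts


def count_on_nodes(transactions, k, parts):
    """Route every k-subset of each transaction to its owner node and count it there."""
    counts = [dict.fromkeys(part, 0) for part in parts]
    for transaction in transactions:
        for subset in combinations(sorted(transaction), k):
            node = owner(subset, len(parts))  # stands in for sending a message
            key = frozenset(subset)
            if key in counts[node]:
                counts[node][key] += 1
    return counts


# Hypothetical data: all 2-itemsets over three items, split across two nodes.
candidates = [set(c) for c in combinations("abc", 2)]
transactions = [{"a", "b"}, {"a", "b", "c"}, {"b", "c"}]
parts = partition_candidates(candidates, num_nodes=2)
counts = count_on_nodes(transactions, 2, parts)
merged = {itemset: n for node in counts for itemset, n in node.items()}
```

Each node stores only its share of the candidates, which is what relieves the memory pressure of keeping every candidate on every node.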
Parallel Algorithms for Association Rule Mining
- Non Partitioned Generalized (NPGM)
- Hash Partitioned (HPGM)
  - reduces communication
- Hierarchical HPGM (H-HPGM)
  - candidates whose roots are identical are allocated to the same node
- H-HPGM with Fine Grain Duplicates (H-HPGM-FGD)
  - uses the remaining free space
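The root-based allocation of H-HPGM can be illustrated with a short Python sketch: an itemset is hashed on the roots of its items rather than on the items themselves, so itemsets that share the same roots land on the same node. The taxonomy, the CRC32-based hash and the function names are assumptions for illustration.

```python
import zlib


def root(item, parent):
    """Follow child -> parent links up to the top of the taxonomy."""
    while item in parent:
        item = parent[item]
    return item


def owner_by_root(itemset, parent, num_nodes):
    """Hash an itemset on the roots of its items instead of the items themselves."""
    roots = sorted(root(item, parent) for item in itemset)
    return zlib.crc32(",".join(roots).encode()) % num_nodes


# Hypothetical taxonomy (child -> parent).
parent = {"Shibuya": "Tokyo", "Shinjuku": "Tokyo", "Tokyo": "Kanto",
          "ramen": "restaurant", "sushi": "restaurant"}

node_leaf = owner_by_root({"Shibuya", "ramen"}, parent, num_nodes=4)
node_sibling = owner_by_root({"Shinjuku", "sushi"}, parent, num_nodes=4)
node_ancestor = owner_by_root({"Tokyo", "restaurant"}, parent, num_nodes=4)
```

All three itemsets have the roots `Kanto` and `restaurant`, so they are assigned to the same node, and an itemset can be counted together with its ancestor itemsets without being sent elsewhere.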
Performance evaluation
- Obs. Execution time increases as the support becomes small.
Conclusion
- Real web mining applications need high performance computing systems.
- A PC cluster, with its scalable performance (and high costs), is a promising platform…