Exploring Traversal Strategy for Web Forum Crawling
Yida Wang, Jiang-Ming Yang, Wei Lai, Rui Cai, Lei Zhang and Wei-Ying Ma
Chinese Academy of Sciences
Microsoft Research, Asia
April 10, 2023
Outline
• Motivation & Challenge
• Our Solution
  – System Overview
  – Traversal Strategy
    • Skeleton link identification
    • Page-flipping link detection
• Evaluation
Why Web Forum
• Web forums are a huge resource of human knowledge
  – Over 20% of search results are from web forums
  – They leverage the power of users and communities
• Forum sites have complex link structures
  – Many shortcut links
  – Links with permission control
  – Page-flipping links
The Limitation of Generic Crawlers
• In general crawling, each page is treated independently and each link is treated indiscriminately
  – Leads to more than 50% useless pages
  – Ignores the relationships between pages from the same thread
• Forum crawling needs a site-level perspective and a careful selection of links
What is Site-Level Perspective?
• Understand the organization structure
• Find an optimal traversal strategy
[Figure: the site-level perspective of "forums.asp.net". Page types shown: List-of-Board, List-of-Thread, Browse-by-Tag, Search Result, Post-of-Thread, Login Portal, Entry, Digest]
Random Sampling
• Randomly sample some pages from a given site
• Adopt a combined strategy of breadth-first and depth-first using a double-ended queue
• Try to cover as many unseen URL patterns as possible
• 1,000 pages are enough
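The sampling idea above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the rule "push links with an unseen URL pattern to the front of the deque (depth-first) and all other links to the back (breadth-first)" is one plausible reading of the combined strategy, and `url_pattern`, `sample_pages`, `get_links`, and the toy `SITE` are hypothetical names.

```python
import re
from collections import deque


def url_pattern(url):
    """Hypothetical pattern function: replace every run of digits with a placeholder."""
    return re.sub(r"\d+", "<n>", url)


def sample_pages(entry, get_links, budget=1000):
    """Sample up to `budget` pages, favouring URL patterns not sampled yet."""
    frontier = deque([entry])
    queued = {entry}
    seen_patterns = set()
    sampled = []
    while frontier and len(sampled) < budget:
        url = frontier.popleft()
        sampled.append(url)
        seen_patterns.add(url_pattern(url))
        for link in get_links(url):
            if link in queued:
                continue
            queued.add(link)
            if url_pattern(link) not in seen_patterns:
                frontier.appendleft(link)  # unseen pattern: visit soon (depth-first)
            else:
                frontier.append(link)      # known pattern: defer (breadth-first)
    return sampled


# Toy in-memory site (assumption, for illustration only).
SITE = {
    "/": ["/board/1", "/board/2"],
    "/board/1": ["/thread/1", "/thread/2"],
    "/board/2": ["/thread/3"],
}
sample = sample_pages("/", lambda u: SITE.get(u, []), budget=3)
patterns = {url_pattern(u) for u in sample}
```

With a budget of only three pages, the sample in this toy site already touches three different URL patterns, whereas a pure breadth-first order would spend two of the three fetches on board pages.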
Sitemap Construction
• A sitemap is a directed graph consisting of a set of vertices and the corresponding links
• Cluster pages with the same page layout into vertices
• Link = its URL pattern + its location
For more details about the first two parts, please refer to our previous work: iRobot: An Intelligent Crawler for Web Forums, in WWW'08
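As a rough sketch of the data structure described above, a sitemap can be held as a set of layout vertices plus a set of typed links. The class and field names below are hypothetical, and representing a link's location as a DOM-path string is an assumption made only for this example.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Link:
    """A link type: URL pattern plus location on the page (both illustrative)."""
    url_pattern: str
    location: str   # assumed here to be a DOM-path-like string
    source: str     # layout vertex the link appears on
    target: str     # layout vertex the link points to


@dataclass
class Sitemap:
    """Directed graph: vertices are page-layout clusters, edges are link types."""
    vertices: set = field(default_factory=set)
    links: set = field(default_factory=set)

    def add_link(self, link):
        self.vertices.update((link.source, link.target))
        self.links.add(link)

    def out_links(self, vertex):
        found = (l for l in self.links if l.source == vertex)
        return sorted(found, key=lambda l: (l.url_pattern, l.location))


sitemap = Sitemap()
# Same URL pattern, different location: treated as two distinct link types.
sitemap.add_link(Link("/t/<n>", "table/td[1]/a", "list_of_thread", "post_of_thread"))
sitemap.add_link(Link("/t/<n>", "table/td[4]/a", "list_of_thread", "post_of_thread"))
# Adding an identical link again does not create a duplicate edge.
sitemap.add_link(Link("/t/<n>", "table/td[1]/a", "list_of_thread", "post_of_thread"))
```

Keeping the location in the link identity is what lets two links with the same URL pattern (for example a thread title and a shortcut in another column) be judged separately later.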
Why Skeleton Links
• Crawlers can reach as many unique pages as possible in a given forum site by following skeleton links
• Skeleton links are the most important links supporting the structure of a forum site
• Skeleton links point to all valuable pages without introducing redundant and valueless pages
Example of skeleton links from forums.asp.net
How to Identify Skeleton Links
• Aim at all unique pages without duplicates
• An optimal set of skeleton links leads to most unique pages and few duplicates
• Search skeleton links for each valuable vertex
  – Level by level: inspired by user browsing behavior
  – Find an optimal combination of links
• The optimal result is found only after exhausting all combinations!
An illustration of the search process of skeleton links
• Pruning while searching for the optimum
  – A link is selected but introduces many duplicate pages
  – A link is rejected but causes coverage to drop significantly
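The search over link combinations for one vertex can be sketched as below. This is only an illustration under assumptions: the slides do not give the scoring function, so the coverage measure, the `max_dup_ratio` threshold, and the names `best_link_combination` and `link_targets` are hypothetical. The sketch enumerates every combination (the exhaustive case) and skips combinations that introduce too many duplicate pages.

```python
from itertools import combinations


def best_link_combination(link_targets, max_dup_ratio=0.2):
    """Pick the link subset with the highest coverage and few duplicates.

    link_targets maps a link type to the list of sampled pages it leads to.
    """
    all_pages = set().union(*link_targets.values())
    best, best_coverage = (), 0.0
    for size in range(1, len(link_targets) + 1):
        for combo in combinations(sorted(link_targets), size):
            hits = [page for link in combo for page in link_targets[link]]
            if not hits:
                continue
            unique = set(hits)
            dup_ratio = 1 - len(unique) / len(hits)
            if dup_ratio > max_dup_ratio:
                continue  # selected links would introduce many duplicate pages
            coverage = len(unique) / len(all_pages)
            if coverage > best_coverage:
                best, best_coverage = combo, coverage
    return best, best_coverage


# Toy data (assumption): "shortcut" only leads to pages that "title" already reaches.
link_targets = {
    "title": [1, 2, 3, 4],
    "shortcut": [1, 2],
    "board_nav": [5, 6],
}
chosen, coverage = best_link_combination(link_targets)
```

In this toy data the shortcut link is left out because it adds only duplicates, while dropping either of the other two links would lower coverage.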
Why Page-Flipping Links
• Crawlers can completely download a long discussion thread divided into several pages by following page-flipping links
• Page-flipping links are a kind of loop-back link in the sitemap. However, not all loop-back links are page-flipping links
Example of page-flipping links from forums.asp.net
How to Detect Page-Flipping Links
• For page-flipping links, if there is a path from page A to B, there must be a path following the same type of links from B to A
• Page-flipping links have a larger connectivity score
An illustration of the characteristics of page-flipping links
Connectivity = 722 / 890 = 0.81
Connectivity = 108 / 1153 = 0.09
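A connectivity score of this kind can be sketched as follows. The slides do not state the exact formula, so the definition used here is an assumption: among all ordered page pairs (A, B) where B is reachable from A through links of one type, the score is the fraction for which A is also reachable from B through the same type. The names `reachable_from` and `connectivity` are hypothetical.

```python
from collections import defaultdict, deque


def reachable_from(start, adjacency):
    """Pages reachable from `start` by following links of a single type."""
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adjacency.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    seen.discard(start)
    return seen


def connectivity(edges):
    """Fraction of reachable pairs (A, B) for which B can also reach A."""
    adjacency = defaultdict(set)
    for source, target in edges:
        adjacency[source].add(target)
    nodes = {node for edge in edges for node in edge}
    reach = {node: reachable_from(node, adjacency) for node in nodes}
    forward = [(a, b) for a in nodes for b in reach[a]]
    if not forward:
        return 0.0
    mutual = sum(1 for a, b in forward if a in reach[b])
    return mutual / len(forward)


# Toy data (assumption): page numbers of one thread link to each other both ways,
# while a one-directional loop-back link type never leads back.
page_flipping = [("p1", "p2"), ("p2", "p1"), ("p2", "p3"), ("p3", "p2")]
one_way = [("t1", "t2"), ("t2", "t3")]
flip_score = connectivity(page_flipping)
one_way_score = connectivity(one_way)
```

Under this assumed definition, a link type whose pages all lead back to each other scores 1.0 and a strictly one-directional link type scores 0.0, which matches the qualitative contrast between the two ratios shown above.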
Crawling
• Start from the given entry page
• Map a new page to an existing layout vertex
• Follow the explored traversal strategy for out-links from that page
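The three crawling steps above can be sketched as a simple loop. This is a minimal sketch under assumptions, not the authors' crawler: `fetch`, `vertex_of`, the `strategy` table, and the toy `SITE` are hypothetical stand-ins for page download, layout matching, and the learned traversal strategy.

```python
from collections import deque


def crawl(entry_url, fetch, vertex_of, strategy):
    """Crawl from the entry page, following only links the strategy allows.

    fetch(url) returns a list of (link_type, target_url) pairs found on the page.
    vertex_of(url) maps a page to its layout vertex in the sitemap.
    strategy maps a vertex to the set of link types to follow from it.
    """
    queue = deque([entry_url])
    seen = {entry_url}
    visited = []
    while queue:
        url = queue.popleft()
        out_links = fetch(url)
        visited.append(url)
        allowed = strategy.get(vertex_of(url), set())
        for link_type, target in out_links:
            if link_type in allowed and target not in seen:
                seen.add(target)
                queue.append(target)
    return visited


# Toy site and strategy (assumptions, for illustration only).
SITE = {
    "/": [("board_title", "/board/1"), ("login", "/login")],
    "/board/1": [("thread_title", "/thread/7"),
                 ("last_post", "/thread/7?last"),
                 ("next_page", "/board/1?page=2")],
    "/board/1?page=2": [("thread_title", "/thread/8")],
    "/thread/7": [("next_page", "/thread/7?page=2"), ("reply", "/reply/7")],
}


def vertex_of(url):
    if url == "/":
        return "entry"
    if url.startswith("/board"):
        return "list_of_thread"
    if url.startswith("/thread"):
        return "post_of_thread"
    return "other"


strategy = {
    "entry": {"board_title"},                        # skeleton link
    "list_of_thread": {"thread_title", "next_page"},  # skeleton + page-flipping
    "post_of_thread": {"next_page"},                  # page-flipping
}
pages = crawl("/", lambda u: SITE.get(u, []), vertex_of, strategy)
```

In this toy run the login page, the reply form, and the "last post" shortcut are never fetched, while every board page and thread page is visited once.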
Experimental Setup
• Comparative experiments on eight forums from diverse categories
  – Mirror pages: crawled by a real commercial crawler
  – Structure-driven: crawled by the structure-driven crawler proposed in SIGIR'06
  – Our method: crawled by a crawler using our traversal strategy
Evaluation Criteria
Coverage
Informativeness
Effectiveness and Efficiency
• Effectiveness
• Efficiency
Evaluation of Page-Flipping Detection
Conclusions
• We propose a complete solution that automatically explores an appropriate traversal strategy for a given target forum site
  – Skeleton link identification
  – Page-flipping link detection
• Future work directions
  – Incremental crawling
  – Forum page segmentation
Thanks!