data mining and information retrieval · cmpt 454: database systems ii –introduction to web...


Post on 03-Apr-2020


Page 1: Data Mining and Information Retrieval · CMPT 454: Database Systems II –Introduction to Web Mining 3 / 23 Web Mining vs. Data Mining Structure (or lack of it) Textual information

Data Mining and Information Retrieval

Introduction to Web Mining


What is Web Mining?

Discovering useful information from the World-Wide Web and its usage patterns.


Web Mining vs. Data Mining

Structure (or lack of it)
  Textual information and linkage structure

Scale
  Data generated per day is comparable to the largest conventional “data warehouses”

Speed
  Often need to react to evolving usage patterns in real time (e.g., merchandising)


Web Mining topics

  Web graph analysis
  Power Laws and The Long Tail
  Structured data extraction
  Web advertising
  Systems Issues


Size of the Web

Number of pages
  Technically, infinite
  Much duplication (30-40%)
  Best estimate of “unique” static HTML pages comes from search engine claims
    Google = 8 billion(?), Yahoo = 20 billion

Number of web sites
  Netcraft survey says 206,675,938 sites (March 2010)
  (http://news.netcraft.com/archives/web_server_survey.html)
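The duplication figure is one reason a count of “unique” pages is hard to pin down. As a minimal sketch (not the method of any particular search engine), exact duplicates can be found by hashing normalized page text. The URLs, page contents, and the `fingerprint` helper below are hypothetical, and the whitespace/case normalization is an assumption.

```python
import hashlib

def fingerprint(text):
    """Hash page text after collapsing whitespace and case (assumed normalization)."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()

def count_unique(pages):
    """Count distinct fingerprints among url -> text pairs."""
    return len({fingerprint(text) for text in pages.values()})

# Hypothetical toy crawl: two URLs serve the same content.
pages = {
    "http://a.example/index.html": "Welcome to  our site",
    "http://a.example/": "welcome to our site",
    "http://b.example/": "Something else entirely",
}
unique = count_unique(pages)
duplicate_fraction = 1 - unique / len(pages)
```

Exact hashing only catches identical text after normalization; near-duplicates (for example, pages differing in a timestamp) would need a similarity-based technique instead.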


Netcraft Survey

http://news.netcraft.com/archives/web_server_survey.html


The Web as a Graph

Pages = nodes, hyperlinks = edges
  Ignore content
  Directed graph

High linkage
  8-10 links/page on average
  Power-law degree distribution
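The pages-as-nodes, hyperlinks-as-edges view can be sketched with two adjacency maps. This is a toy illustration only: the page names and link list are hypothetical, and page content is ignored as the slide suggests.

```python
from collections import defaultdict

# Hypothetical toy web: each pair is (source page, target page).
links = [
    ("a", "b"), ("a", "c"), ("b", "c"),
    ("c", "a"), ("d", "c"),
]

out_links = defaultdict(set)   # page -> pages it links to
in_links = defaultdict(set)    # page -> pages that link to it
for src, dst in links:
    out_links[src].add(dst)
    in_links[dst].add(src)

pages = set(out_links) | set(in_links)
out_degree = {p: len(out_links[p]) for p in pages}
in_degree = {p: len(in_links[p]) for p in pages}
avg_links_per_page = len(links) / len(pages)
```

Keeping both directions makes in-degree and out-degree queries equally cheap, which matters because the graph is directed.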


Structure of Web Graph

Let’s take a closer look at structure
  Broder et al (2000) studied a crawl of 200M pages and other smaller crawls
  Bow-tie structure
    Not a “small world”


Bow-tie Structure

Source: Broder et al, 2000
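The slide names the bow-tie without listing its parts. In the decomposition reported by Broder et al., there is a strongly connected core, an IN set that can reach the core, an OUT set reachable from the core, and remaining pages (tendrils and disconnected pieces). A rough sketch on a toy graph follows; the edge list is hypothetical, and the `seed` page is assumed to lie inside the core.

```python
from collections import defaultdict, deque

def reachable(start, adjacency):
    """Breadth-first search: all nodes reachable from start, including start."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adjacency.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

def bow_tie(edges, seed):
    """Split nodes into core / in / out / other, relative to the seed's component."""
    forward, backward = defaultdict(set), defaultdict(set)
    nodes = set()
    for src, dst in edges:
        forward[src].add(dst)
        backward[dst].add(src)
        nodes.update((src, dst))
    fwd = reachable(seed, forward)    # pages the seed can reach
    bwd = reachable(seed, backward)   # pages that can reach the seed
    core = fwd & bwd                  # strongly connected component of the seed
    return {"core": core, "in": bwd - core, "out": fwd - core,
            "other": nodes - fwd - bwd}

# Hypothetical toy graph: a 3-page core, one IN page, one OUT page, and leftovers.
edges = [("i1", "c1"), ("c1", "c2"), ("c2", "c3"), ("c3", "c1"),
         ("c3", "o1"), ("i1", "t1"), ("x", "y")]
parts = bow_tie(edges, seed="c1")
```

Two breadth-first searches from one core page are enough here; finding the core without a known seed would require a full strongly-connected-components pass.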


What can the graph tell us?

Distinguish “important” pages from unimportant ones
  PageRank

Discover communities of related pages
  Hubs and Authorities

Detect web spam
  TrustRank
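As an illustration of how link structure alone can rank pages, here is a minimal power-iteration sketch of PageRank. The damping factor of 0.85, the fixed iteration count, the handling of pages without out-links, and the toy graph are all assumptions for this example, not details given in the slides.

```python
def pagerank(out_links, damping=0.85, iterations=50):
    """Power iteration; out_links maps page -> list of pages it links to."""
    pages = set(out_links)
    for targets in out_links.values():
        pages.update(targets)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives the "random jump" share first.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = out_links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for t in pages:
                    new_rank[t] += damping * rank[page] / n
        rank = new_rank
    return rank

# Hypothetical toy graph: "c" is linked to by three pages, "d" by none.
toy = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(toy)
most_important = max(ranks, key=ranks.get)
```

In this toy graph the page with the most incoming links ends up with the highest score, while the page nobody links to keeps only the random-jump share.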


Power-law degree distribution

[Figure: degree distribution on a log scale, with the long tail labeled. Source: Broder et al, 2000]
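A power law means the number of pages with degree k is proportional to k raised to a negative exponent, so the distribution appears as a straight line on log-log axes and the slope gives the exponent. The sketch below estimates that exponent with a plain least-squares fit. The synthetic counts (exponent 2, constant 10000) are made up for illustration, and this simple fit can be unreliable on noisy real data.

```python
import math

def fit_power_law_exponent(degree_counts):
    """Least-squares slope of log(count) against log(degree), returned as a positive exponent."""
    points = [(math.log(k), math.log(c))
              for k, c in degree_counts.items() if k > 0 and c > 0]
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in points)
    variance = sum((x - mean_x) ** 2 for x, _ in points)
    return -covariance / variance

# Synthetic data that follows count(k) = 10000 * k**-2 exactly.
degree_counts = {k: 10000 * k ** -2.0 for k in range(1, 101)}
exponent = fit_power_law_exponent(degree_counts)
```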


Power-laws galore

Structure
  In-degrees
  Out-degrees
  Number of pages per site

Usage patterns
  Number of visitors
  Popularity

And much more…


Extracting Structured Data

http://www.simplyhired.com
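The slide gives only a job-listing site as an example. As a hedged sketch of what turning page markup into records could look like, the parser below pulls fields out of an invented HTML snippet using Python's standard `html.parser`. The markup, class names, and `JobListingParser` are hypothetical and do not reflect the structure of any real site.

```python
from html.parser import HTMLParser

class JobListingParser(HTMLParser):
    """Collects {field: text} records from <div class="job"> blocks (invented markup)."""
    FIELDS = {"title", "company", "location"}

    def __init__(self):
        super().__init__()
        self.records = []
        self._current = None   # record currently being filled
        self._field = None     # field whose text we are inside

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if tag == "div" and css_class == "job":
            self._current = {}
        elif tag == "span" and self._current is not None and css_class in self.FIELDS:
            self._field = css_class

    def handle_data(self, data):
        if self._field is not None:
            self._current[self._field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None
        elif tag == "div" and self._current is not None:
            self.records.append(self._current)
            self._current = None

html = """
<div class="job"><span class="title">Data Engineer</span>
  <span class="company">Acme</span><span class="location">Vancouver</span></div>
<div class="job"><span class="title">Analyst</span>
  <span class="company">Initech</span><span class="location">Burnaby</span></div>
"""
parser = JobListingParser()
parser.feed(html)
parser.close()
records = parser.records
```

A hand-written rule set like this breaks whenever the page layout changes, which is part of why structured data extraction is treated as a mining problem rather than a one-off script.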


Searching the Web

[Diagram labels: Content aggregators, The Web, Content consumers]


Ads vs. search results

Search advertising is the revenue model
  Multi-billion-dollar industry
  Advertisers pay for clicks on their ads

Interesting problems
  What ads to show for a search?
  If I’m an advertiser, which search terms should I bid on and how much to bid?
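Because advertisers pay per click, one common heuristic for the first problem is to rank matching ads by bid times estimated click-through rate, that is, by expected revenue per impression. The slides do not say which method is used, so this is only a sketch under that assumption. The `Ad` fields, bids, and click-through estimates are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Ad:
    advertiser: str
    keyword: str
    bid_per_click: float   # amount paid only if the ad is clicked
    estimated_ctr: float   # assumed click-through rate estimate in [0, 1]

def select_ads(query, ads, slots=2):
    """Return up to `slots` ads whose keyword appears in the query,
    ordered by expected revenue per impression (bid * CTR)."""
    words = query.lower().split()
    candidates = [ad for ad in ads if ad.keyword in words]
    candidates.sort(key=lambda ad: ad.bid_per_click * ad.estimated_ctr,
                    reverse=True)
    return candidates[:slots]

# Hypothetical inventory; expected revenue per impression in the comments.
ads = [
    Ad("A", "laptop", 1.00, 0.02),    # 0.020
    Ad("B", "laptop", 0.50, 0.05),    # 0.025
    Ad("C", "laptop", 2.00, 0.005),   # 0.010
    Ad("D", "camera", 3.00, 0.05),    # keyword does not match the query below
]
shown = select_ads("cheap laptop deals", ads)
```

Note that the highest bidder (C) is not shown: a high bid on an ad that is rarely clicked earns less than a modest bid on an ad users actually click.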


Systems architecture

[Diagram labels: CPU, Memory, Disk; Machine Learning, Statistics; “Classical” Data Mining]


Very Large-Scale Data Mining

[Diagram: cluster of commodity nodes, each with its own CPU, Mem, and Disk]


Systems Issues

Web data sets can be very large
  Tens to hundreds of terabytes

Cannot mine on a single server!
  Need large farms of servers

How to organize hardware/software to mine multi-terabyte data sets
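One way to organize such a computation is to partition the data across nodes, let each node compute a partial result on its own shard, and then merge the partial results. The sketch below simulates that pattern in a single process for an in-degree count. The number of nodes, the edge list, and the helper names are hypothetical, and no specific cluster framework from the slides is implied.

```python
import zlib
from collections import Counter

def shard_of(key, num_nodes):
    """Stable hash partitioning (crc32, since Python's built-in hash() of str varies per run)."""
    return zlib.crc32(key.encode("utf-8")) % num_nodes

def partition_edges(edges, num_nodes):
    """Send each edge to the node responsible for its target page."""
    shards = [[] for _ in range(num_nodes)]
    for src, dst in edges:
        shards[shard_of(dst, num_nodes)].append((src, dst))
    return shards

def local_in_degrees(shard):
    """Work done independently on one node: count in-links within its shard."""
    return Counter(dst for _, dst in shard)

def merge(partials):
    """Combine per-node partial counts into a global result."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "a"), ("d", "c")]
shards = partition_edges(edges, num_nodes=3)
in_degree = merge(local_in_degrees(s) for s in shards)
```

Partitioning by target page means all in-links of a page land on the same node, so each partial count is already final for the pages that node owns and the merge step stays cheap.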