Web Mining Overview

Upload: aadhesh-malola-a

Post on 05-Apr-2018


  • 7/31/2019 Web Mining Overview


    CS 345A

    Data Mining: Lecture 1

    Introduction to Web Mining


    What is Web Mining?

    Discovering useful information from the World-Wide Web and its usage patterns


    Web Mining v. Data Mining

    Structure (or lack of it): textual information and linkage structure

    Scale: data generated per day is comparable to the largest conventional data warehouses

    Speed: often need to react to evolving usage patterns in real time (e.g., merchandising)


    Web Mining topics

    Web graph analysis

    Power Laws and The Long Tail

    Structured data extraction

    Web advertising

    Systems Issues


    Size of the Web

    Number of pages: technically, infinite

    Much duplication (30-40%)

    Best estimate of unique static HTML pages comes from search engine claims

    Until last year, Google claimed 8 billion(?), Yahoo claimed 20 billion

    Google recently announced that their index contains 1 trillion pages

    How to explain the discrepancy?


    The web as a graph

    Pages = nodes, hyperlinks = edges

    Ignore content

    Directed graph

    High linkage: 10-20 links/page on average

    Power-law degree distribution
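
As a minimal sketch of the "pages = nodes, hyperlinks = edges" view, the snippet below stores a directed graph as adjacency sets and computes in-degree and out-degree per page. The page names and the `LINKS` list are made up for illustration; a real crawl would supply (source, target) URL pairs.

```python
def build_graph(links):
    """Build out-link and in-link adjacency sets from (source, target) pairs."""
    out_links, in_links = {}, {}
    for src, dst in links:
        # Make sure both endpoints appear in both maps, even with no links.
        for page in (src, dst):
            out_links.setdefault(page, set())
            in_links.setdefault(page, set())
        out_links[src].add(dst)
        in_links[dst].add(src)
    return out_links, in_links


# Hypothetical toy crawl: four pages, five hyperlinks.
LINKS = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "a"), ("d", "c")]

out_links, in_links = build_graph(LINKS)
out_degree = {page: len(targets) for page, targets in out_links.items()}
in_degree = {page: len(sources) for page, sources in in_links.items()}
avg_out_degree = sum(out_degree.values()) / len(out_degree)
```

Page content is ignored entirely, as on the slide; only the link structure is kept.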


    Structure of Web graph

    Let's take a closer look at structure

    Broder et al. (2000) studied a crawl of 200M pages and other smaller crawls

    Bow-tie structure

    Not a "small world"


    Bow-tie Structure

    Source: Broder et al, 2000
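
One way to recover the bow-tie regions on a small graph is sketched below. It assumes a seed page known to sit in the central strongly connected core: pages reachable from the seed and able to reach it form the core, pages that can only reach it form IN, and pages only reachable from it form OUT. The function names, the seed choice, and the toy edge list are illustrative assumptions; everything else (tendrils, disconnected pieces) is lumped into "other".

```python
from collections import deque


def reachable(start, adjacency):
    """Return the set of nodes reachable from start by BFS (start included)."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adjacency.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def bow_tie(edges, seed):
    """Split nodes into core / in / out / other relative to the seed's component."""
    forward, backward, nodes = {}, {}, set()
    for src, dst in edges:
        forward.setdefault(src, set()).add(dst)
        backward.setdefault(dst, set()).add(src)
        nodes.update((src, dst))
    fwd = reachable(seed, forward)    # pages the seed can reach
    bwd = reachable(seed, backward)   # pages that can reach the seed
    core = fwd & bwd
    return {"core": core, "in": bwd - core, "out": fwd - core,
            "other": nodes - fwd - bwd}


# Hypothetical toy graph: i -> (a <-> b) -> o, plus a disconnected pair x -> y.
EDGES = [("i", "a"), ("a", "b"), ("b", "a"), ("b", "o"), ("x", "y")]
regions = bow_tie(EDGES, seed="a")
```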


    What can the graph tell us?

    Distinguish important pages from unimportant ones: PageRank

    Discover communities of related pages: Hubs and Authorities

    Detect web spam: TrustRank
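
The slide names PageRank without defining it. A minimal sketch of one common formulation, power iteration with a damping factor, is given below. The damping value 0.85, the fixed iteration count, the handling of pages with no out-links, and the toy graph are all assumptions for illustration, not details taken from the lecture.

```python
def pagerank(out_links, damping=0.85, iterations=50):
    """Approximate PageRank by power iteration on a dict of page -> targets."""
    nodes = set(out_links)
    for targets in out_links.values():
        nodes.update(targets)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every page receives the same small "teleport" share each round.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = out_links.get(node, ())
            if targets:
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling page (no out-links): spread its rank over all pages.
                share = damping * rank[node] / n
                for target in nodes:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical toy graph: most links point at page "c".
GRAPH = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(GRAPH)
most_important = max(ranks, key=ranks.get)
```

Page "c" collects the most link weight here, and page "d", which nothing links to, keeps only the teleport share.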


    Power-law degree distribution

    Source: Broder et al, 2000


    Power-laws galore

    Structure: in-degrees, out-degrees, number of pages per site

    Usage patterns: number of visitors

    Popularity: e.g., products, movies, music
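
A power law says the number of items with value k is proportional to k raised to a negative exponent, so the counts fall on a straight line in a log-log plot and the slope gives the exponent. The sketch below estimates that exponent with an ordinary least-squares fit on the logs. The function name and the synthetic input are assumptions for illustration; a least-squares fit on log-log data is a rough estimator, and measured degree counts would be much noisier than this synthetic example.

```python
import math


def fit_power_law_exponent(degree_counts):
    """Estimate gamma in count ~ C * k**(-gamma) from a dict of degree -> count."""
    points = [(math.log(k), math.log(c))
              for k, c in degree_counts.items() if k > 0 and c > 0]
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return -slope  # the log-log slope is -gamma


# Synthetic, noise-free counts generated with exponent 2.1 (illustrative only).
SYNTHETIC = {k: 1e6 * k ** -2.1 for k in range(1, 101)}
gamma = fit_power_law_exponent(SYNTHETIC)
```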


    The Long Tail

    Source: Chris Anderson (2004)


    The Long Tail

    Shelf space is a scarce commodity for traditional retailers

    Also: TV networks, movie theaters, ...

    The web enables near-zero-cost dissemination of information about products

    More choice necessitates better filters

    Recommendation engines (e.g., Amazon)

    How Into Thin Air made Touching the Void a bestseller
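
To make "better filters" concrete, here is a minimal sketch of a co-purchase recommender: it counts how often two items appear in the same basket and suggests the most frequent companions. This is only an illustration of the idea, not a description of how Amazon's recommendation engine works, and the baskets (including the "Trail Map" and "Compass" items) are invented.

```python
from collections import Counter
from itertools import combinations


def co_purchase_counts(baskets):
    """Count, for each ordered pair of items, the baskets containing both."""
    counts = Counter()
    for basket in baskets:
        for a, b in combinations(sorted(set(basket)), 2):
            counts[(a, b)] += 1
            counts[(b, a)] += 1
    return counts


def recommend(item, counts, top_n=3):
    """Return up to top_n items most often bought together with item."""
    scored = [(other, c) for (first, other), c in counts.items() if first == item]
    # Highest count first; ties broken alphabetically for a stable result.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return [other for other, _ in scored[:top_n]]


# Hypothetical purchase baskets, made up for illustration.
BASKETS = [
    ["Into Thin Air", "Touching the Void"],
    ["Into Thin Air", "Touching the Void", "Trail Map"],
    ["Touching the Void", "Into Thin Air"],
    ["Into Thin Air", "Trail Map"],
    ["Trail Map", "Compass"],
]
counts = co_purchase_counts(BASKETS)
suggestions = recommend("Into Thin Air", counts, top_n=2)
```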


    Extracting Structured Data

    http://www.simplyhired.com


    Extracting structured data

    http://www.fatlens.com


    Ads vs. search results


    Search advertising is the revenue model

    Multi-billion-dollar industry

    Advertisers pay for clicks on their ads

    Interesting problems

    What ads to show for a search?

    If I'm an advertiser, which search terms should I bid on, and how much should I bid?
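
Because advertisers pay per click, one simple heuristic for "what ads to show" is to rank ads by expected revenue per impression, that is, bid times estimated click-through rate. The sketch below assumes this heuristic; the slides do not say which ranking rule is used, and the `Ad` fields, the advertisers, and the numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Ad:
    advertiser: str
    bid: float      # amount the advertiser pays per click
    est_ctr: float  # estimated click-through rate, between 0 and 1


def rank_ads(ads, slots=2):
    """Pick the ads with the highest expected revenue per impression."""
    return sorted(ads, key=lambda ad: ad.bid * ad.est_ctr, reverse=True)[:slots]


# Hypothetical bids for one search term.
ADS = [
    Ad("A", bid=2.00, est_ctr=0.01),  # expected 0.020 per impression
    Ad("B", bid=1.00, est_ctr=0.05),  # expected 0.050 per impression
    Ad("C", bid=0.50, est_ctr=0.03),  # expected 0.015 per impression
]
shown = [ad.advertiser for ad in rank_ads(ADS)]
```

Advertiser B wins the top slot despite bidding half as much as A, because its ad is estimated to be clicked five times as often.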


    Two Approaches to Analyzing Data

    Machine Learning approach

    Emphasizes sophisticated algorithms, e.g., Support Vector Machines

    Data sets tend to be small and fit in memory

    Data Mining approach

    Emphasizes big data sets (e.g., in the terabytes)

    Data cannot even fit on a single disk!

    Necessarily leads to simpler algorithms


    Philosophy

    In many cases, adding more data leads to better results than improving algorithms

    Netflix

    Google search

    Google ads

    More on my blog: Datawocky (datawocky.com)


    Systems architecture

    [Figure: single-machine architecture with CPU, memory, and disk; labels: Machine Learning, Statistics; Classical Data Mining]


    Very Large-Scale Data Mining

    [Figure: cluster of commodity nodes, each with its own CPU, memory, and disk]


    Systems Issues

    Web data sets can be very large: tens to hundreds of terabytes

    Cannot mine on a single server!

    Need large farms of servers

    How to organize hardware/software to mine multi-terabyte data sets

    Without breaking the bank!
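
MapReduce, which the course infrastructure supports, is one programming model for spreading such work over many servers. The sketch below simulates its three phases (map, shuffle, reduce) in a single Python process to count in-links per page. It is a toy illustration under stated assumptions: the function names and the `CRAWL` records are hypothetical, and no real framework's API is shown.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to every input record, yielding (key, value) pairs."""
    for record in records:
        yield from mapper(record)


def shuffle(pairs):
    """Group values by key, as the framework would do between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Apply the reducer to each key and its list of values."""
    return {key: reducer(key, values) for key, values in groups.items()}


def inlink_mapper(record):
    """Emit (target, 1) for every hyperlink on a crawled page."""
    _page, targets = record
    for target in targets:
        yield target, 1


def count_reducer(_key, values):
    """Sum the 1s emitted for a page to get its in-link count."""
    return sum(values)


# Hypothetical crawl records: (page, list of pages it links to).
CRAWL = [("a", ["b", "c"]), ("b", ["c"]), ("c", ["a"]), ("d", ["c"])]
inlink_counts = reduce_phase(shuffle(map_phase(CRAWL, inlink_mapper)), count_reducer)
```

On a cluster, the map and reduce calls would run on different nodes over separate slices of the data; only the shuffle step moves data between them.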


    Project

    Lots of interesting project ideas

    If you can't think of one, please come discuss with us

    Infrastructure: Aster Data cluster on Amazon EC2

    Supports both MapReduce and SQL

    Data: Netflix, ShareThis, Google, WebBase, TREC