
DD2476 Search Engines and Information Retrieval Systems
Lecture 9: Clustering, Web Search, Web Crawling
Hedvig Kjellström
www.csc.kth.se/DD2476

Today

• Document clustering
  - Motivations
  - Flat clustering (Manning Chapter 16.1-4)
  - Hierarchical clustering (Manning Chapter 17.1-3)

• Web search, web crawling
  - Characteristics of web search (Manning Chapter 19)
  - Web crawling (Manning Chapter 20)

DD2476 Lecture 9, March 28, 2012

Clustering

Recap: Classification

[Figure: labeled data points in three classes: Government, Science, Arts]

• Data points have labels
• Classification task: finding good separators

Sec. 14.1

What is Clustering?

• Clustering: the process of grouping a set of objects into classes of similar objects
  - Documents within a cluster should be similar
  - Documents from different clusters should be dissimilar

• The most common form of unsupervised learning
  - Unsupervised learning = learning from raw unlabeled data

• Important task in IR and other areas

Ch. 16

Clear Cluster Structure

[Figure: data points forming three clearly separated clusters]

• How would you design an algorithm for finding the three clusters in this case?

Ch. 16

Applications of Clustering in IR

• Clustering search results
  - Effective “user recall” will be higher, better navigation of search results

• Scatter-gather
  - Better user interface: “search without typing”

• Visualizing a collection
  - Easier to browse

• Speeding up vector space retrieval
  - Cluster-based retrieval gives faster search

Sec. 16.1

Clustering Search Results (yippy.com)

Sec. 16.1

Scatter-Gather (Cutting, Karger and Pedersen)

Sec. 16.1

Visualizing a Collection (Google News)

Visualizing a Collection (IN-SPIRE, Pacific Northwest National Laboratory)

Speeding up Vector Space Retrieval

• Cluster hypothesis: Documents in the same cluster behave similarly with respect to relevance to information needs

• Therefore, to improve search recall:
  - Cluster docs in corpus a priori
  - When a query matches a doc D, also return other docs in the cluster containing D

• The hope: the query “car” will also return docs containing “automobile”
  - Because clustering earlier grouped together docs containing car with those containing automobile
  - Why might this happen?

Sec. 16.1
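The recall-improvement idea above can be sketched in a few lines of Python. This is a minimal illustration, not the course's implementation: the toy corpus, the cluster assignment `cluster_of` (assumed to come from an earlier clustering step) and the helper `expand_with_clusters` are all hypothetical.

```python
def expand_with_clusters(matches, cluster_of, members):
    """Return the matched docs plus every doc sharing a cluster with a match."""
    expanded = set(matches)
    for doc in matches:
        expanded |= members[cluster_of[doc]]
    return expanded

# Hypothetical toy corpus: doc id -> text
docs = {
    "d1": "car dealer prices",
    "d2": "automobile dealer prices",
    "d3": "banana bread recipe",
}

# Assumed output of an a priori clustering step (not computed here)
cluster_of = {"d1": 0, "d2": 0, "d3": 1}
members = {}
for doc, c in cluster_of.items():
    members.setdefault(c, set()).add(doc)

# Plain term matching finds only d1 for the query "car" ...
matches = {d for d, text in docs.items() if "car" in text.split()}
# ... while cluster expansion also returns d2, which says "automobile"
result = expand_with_clusters(matches, cluster_of, members)
print(sorted(result))
```

Here d1 and d2 would plausibly end up in the same cluster because they share the context words "dealer" and "prices", which is one answer to the question on the slide.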

Issues for Clustering

• Representation for clustering
  - Document representation (Vector space? Normalization?)
  - Need a notion of similarity/distance

• How many clusters?
  - Fixed a priori?
  - Completely data driven?

Sec. 16.2

Notion of Similarity/Distance

• Ideal: semantic similarity
  - Semantically similar documents close
  - Semantically different documents far away

• Practical: term-statistical similarity
  - Cosine similarity
  - Documents are normalized vectors in N dimensions, N >> 1

• As last week, visualize using Euclidean distance
  - For many algorithms, easier to think in terms of a distance (rather than similarity) between docs
  - But real implementations use cosine similarity
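Why thinking in Euclidean distance is harmless here: for length-normalized vectors, squared Euclidean distance and cosine similarity are tied by |a - b|^2 = 2 - 2·cos(a, b), so "closest" and "most cosine-similar" pick the same neighbor. A small sketch with made-up 3-dimensional vectors (the helper names are illustrative only):

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(a, b):
    """Cosine similarity of two already length-normalized vectors (a dot product)."""
    return sum(x * y for x, y in zip(a, b))

def euclidean(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Made-up term-weight vectors for two documents
a = normalize([3.0, 1.0, 0.0])
b = normalize([2.0, 2.0, 1.0])

# For unit vectors: |a - b|^2 = 2 - 2 * cos(a, b)
lhs = euclidean(a, b) ** 2
rhs = 2 - 2 * cosine(a, b)
print(lhs, rhs)
```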

Clustering Algorithms

• Flat algorithms
  - Usually start with a random (partial) partitioning
  - Refine it iteratively
  - K-means clustering
  - (Model based/probabilistic/EM clustering)

• Hierarchical algorithms
  - Bottom-up, agglomerative clustering
  - (Top-down, divisive clustering)

• More about clustering
  - DD2431 Machine Learning (p1)
  - DD2427 Image Based Recognition and Classification (p4)

Flat Clustering (Manning Chapter 16.1-4)

• Partitioning method: Partition a set of n documents into K clusters

• Given:
  - Set of n documents to be clustered
  - Number of clusters K

• Find: K clusters that
  - Globally minimize the intra-cluster distance
  - Globally maximize the inter-cluster distance

• Intractable for many objective functions (NP-hard)
  - Effective heuristic method: K-means

K-Means Example (K=2)

[Figure: K-means iterations on a 2D point set, centroids marked with x]

Pick seeds → Reassign clusters → Compute centroids → Reassign clusters → Compute centroids → Reassign clusters → Converged!

Sec. 16.4
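The loop in the example (pick seeds, reassign clusters, compute centroids, repeat until nothing changes) can be sketched as follows. This is a minimal version under stated assumptions: it uses squared Euclidean distance, takes the seeds as an explicit argument instead of picking them at random, and the six 2D points are invented for illustration.

```python
def kmeans(points, seeds, max_iter=100):
    """Plain K-means: alternate reassignment and centroid recomputation."""
    centroids = [tuple(s) for s in seeds]
    assignment = None
    for _ in range(max_iter):
        # Reassign each point to its nearest centroid (squared Euclidean distance)
        new_assignment = [
            min(range(len(centroids)),
                key=lambda k: sum((p - c) ** 2 for p, c in zip(pt, centroids[k])))
            for pt in points
        ]
        if new_assignment == assignment:
            break  # converged: no point changed cluster
        assignment = new_assignment
        # Recompute each centroid as the mean of its assigned points
        for k in range(len(centroids)):
            cluster = [pt for pt, a in zip(points, assignment) if a == k]
            if cluster:
                centroids[k] = tuple(sum(x) / len(cluster) for x in zip(*cluster))
    return assignment, centroids

# Invented data: two visually obvious groups, K = 2
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
# Deliberately poor seeds: both come from the same group
assignment, centroids = kmeans(points, seeds=[points[0], points[1]])
print(assignment)
```

Even with both seeds taken from the same group, the second centroid is pulled toward the far group after one iteration, and the assignment then stabilizes.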

K-means is great…

• …but what if K = 10,000?

• Basic problem – no structure
  - Which classes are similar to each other?

Hierarchical Clustering (Manning Chapter 17.1-3)

• Build a tree-based hierarchical taxonomy (dendrogram) from a set of documents

• One approach: recursive application of a partitional clustering algorithm

[Figure: example taxonomy. animal → vertebrate (fish, reptile, amphib., mammal) and invertebrate (worm, insect, crustacean)]

Ch. 17

Dendrogram: Hierarchical Clustering

• Clustering obtained by cutting the dendrogram at a desired level
  - Each connected component forms a cluster

Hierarchical Agglomerative Clustering (HAC)

• Method for obtaining dendrogram

• General approach
  - Start with each document in a separate cluster
  - Repeatedly join the closest pair of clusters, until there is only one cluster

• Details
  - How to find the closest pair of clusters

Sec. 17.1

Closest pair of clusters

• Many variants of defining the closest pair of clusters

• Single-link
  - Similarity of the “closest” points, the most cosine-similar

• Complete-link
  - Similarity of the “furthest” points, the least cosine-similar

• Centroid
  - Clusters whose centroids (centers of gravity) are the most cosine-similar

• Average-link
  - Average cosine between pairs of elements

Sec. 17.2
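A naive sketch of HAC with three of the linkage variants above. Assumptions to note: the function names are made up, "average" here averages the cosine over cross-cluster pairs only, the centroid variant is left out, and no attempt is made at the cleverer bookkeeping needed for a better running time.

```python
import math
from itertools import combinations

def cosine(a, b):
    """Cosine similarity of two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def cluster_sim(c1, c2, vectors, linkage):
    """Similarity between two clusters (tuples of document indices)."""
    sims = [cosine(vectors[i], vectors[j]) for i in c1 for j in c2]
    if linkage == "single":
        return max(sims)              # most similar pair
    if linkage == "complete":
        return min(sims)              # least similar pair
    if linkage == "average":
        return sum(sims) / len(sims)  # mean over cross-cluster pairs
    raise ValueError(linkage)

def hac(vectors, linkage="single"):
    """Merge the closest pair of clusters until one remains; return the merge list."""
    clusters = [(i,) for i in range(len(vectors))]
    merges = []
    while len(clusters) > 1:
        a, b = max(combinations(clusters, 2),
                   key=lambda pair: cluster_sim(pair[0], pair[1], vectors, linkage))
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(a + b)
        merges.append((a, b))
    return merges

# Invented 2D document vectors: docs 0 and 1 point one way, docs 2 and 3 another
vectors = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.2, 0.8)]
merges = hac(vectors, linkage="complete")
print(merges)
```

The returned merge list is the dendrogram in sequential form: cutting after the second merge leaves the two clusters (0, 1) and (2, 3).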

Computational Complexity

• In the first iteration, compute the similarity of all pairs of N instances: O(N^2)

• In each of the N-2 merging iterations, compute the distance between the most recently created cluster and all other existing clusters
  - Worst case: O(N^3)
  - Cleverly: O(N^2 log N)

Sec. 17.2.1

What is a Good Clustering?

• Internal criterion

• A good clustering will produce clusters where:
  - The intra-class (intra-cluster) similarity is high
  - The inter-class similarity is low
  - The measured quality of a clustering depends on both the document representation and the similarity measure used

Sec. 16.3

What is a Good Clustering?

• External criterion

• A good clustering is able to:
  - Discover some or all hidden patterns or latent classes in gold standard/benchmark data

• Assessing a clustering with respect to ground truth requires labeled data

• Assume documents with C gold standard classes, while our clustering algorithm produces K clusters, ω1, ω2, …, ωK, where cluster ωi has ni members.

Sec. 16.3

External Evaluation

• Simple measure: purity, the ratio between the size of the dominant class πi in cluster ωi and the size of cluster ωi

$$\mathrm{Purity}(\omega_i) = \frac{1}{n_i} \max_{j \in C} n_{ij}$$

  where nij is the number of members of class j in cluster ωi

• Biased, because having n clusters (one per document) maximizes purity

• Other measures: Rand index, entropy/mutual information of classes in clusters

Purity example

[Figure: 17 points from three classes grouped into Cluster I (6 points), Cluster II (6 points) and Cluster III (5 points)]

Cluster I: Purity = 1/6 · max(5, 1, 0) = 5/6
Cluster II: Purity = 1/6 · max(1, 4, 1) = 4/6
Cluster III: Purity = 1/5 · max(2, 0, 3) = 3/5

Sec. 16.3
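The per-cluster purities in the example can be reproduced with a few lines of Python. The class counts per cluster are taken from the example; the function names are illustrative, and the overall purity (sum of dominant-class counts divided by the total number of documents) is the usual aggregate, added here as an assumption since the slide only shows per-cluster values.

```python
def cluster_purity(counts):
    """Purity of one cluster: size of its dominant class over the cluster size."""
    return max(counts) / sum(counts)

def overall_purity(clusters):
    """Sum of dominant-class counts over all clusters, divided by total documents."""
    total = sum(sum(c) for c in clusters)
    return sum(max(c) for c in clusters) / total

# Class counts per cluster from the example: Cluster I, II, III
clusters = [[5, 1, 0], [1, 4, 1], [2, 0, 3]]

per_cluster = [cluster_purity(c) for c in clusters]
overall = overall_purity(clusters)
print(per_cluster, overall)
```

Putting every document in its own cluster gives `cluster_purity([1]) == 1.0` for all of them, which is the bias mentioned above.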

Web Search, Web Crawling


Brief history of web engines

• Early keyword-based engines ca. 1995-1997
  - Altavista, Excite, Infoseek, Inktomi, Lycos
  - Measured how well documents fitted the query (inverted index)

• Paid search ranking: Goto (morphed into Overture.com → Yahoo!)
  - Human categorization effort
  - Your search ranking depended on how much you paid

• 1998+: Link-based ranking pioneered by Google
  - Measures the authoritativeness of pages (PageRank)
  - Totally dominating today

Web search basics

[Figure: web search engine components: User, Search, Indexes, Ad indexes, Indexer, Web spider, The Web. Shown next to an example result page for the query "miele" (Web Results 1 - 10 of about 7,310,000, 0.12 seconds) with algorithmic results from www.miele.com, www.miele.co.uk, www.miele.de and www.miele.at, and a column of Sponsored Links]

User needs

•  Need [Brod02, RL04]
  –  Informational – want to learn about something (~40% / 65%)
  –  Navigational – want to go to that page (~25% / 15%)
  –  Transactional – want to do something (web-mediated) (~35% / 20%)
    •  Access a service
    •  Downloads
    •  Shop
  –  Gray areas
    •  Find a good hub
    •  Exploratory search “see what’s there”

Example queries: Low hemoglobin, United Airlines, Seattle weather, Mars surface images, Canon S410, Car rental Brasil

How far do people look for results?

(Source: iprospect.com WhitePaper_2006_SearchEngineUserBehavior.pdf)

Users’ empirical evaluation of results

• Quality of pages varies widely
  - Relevance is not enough
  - Other desirable qualities (non IR!!)
    •  Content: Trustworthy, diverse, non-duplicated, well maintained
    •  Web readability: display correctly & fast
    •  No annoyances: pop-ups, etc.

• Precision vs. recall
  - On the web, recall seldom matters

• What matters
  - Precision at 1? Precision above the fold?
  - Comprehensiveness – be able to deal with obscure queries
    •  Recall matters when the number of matches is small

Users’ empirical evaluation of engines

• Relevance and validity of results

• UI – Simple, no clutter, error tolerant

• Trust – Results are objective

• Coverage of topics for polysemic queries

• Pre/Post process tools provided
  - Mitigate user errors (auto spell check, search assist, …)
  - Explicit: Search within results, more like this, refine ...
  - Anticipative: related searches

• Deal with idiosyncrasies
  - Web-specific vocabulary, e.g. :) <3
    •  Impact on stemming, spell-check, etc.
  - Web addresses typed in the search box

The Web document collection

• No design/co-ordination

• Distributed content creation, linking, democratization of publishing

• Content includes truth, lies, obsolete information, contradictions …

• Unstructured (text, html, …), semi-structured (XML, annotated photos), structured (Databases) …

• Scale much larger than previous text collections

• Growth – slowed down from initial “volume doubling every few months” but still expanding

• Content can be dynamically generated

Paid search ads cost money…

• What’s the alternative?

• Search Engine Optimization:
  - “Tuning” your web page to rank highly in the algorithmic search results for select keywords
  - Alternative to paying for placement
  - Thus, intrinsically a marketing function

• Performed by companies, webmasters and consultants (“Search engine optimizers”) for their clients

• Some perfectly legitimate, some very shady

Simplest forms

•  First generation engines relied heavily on tf/idf
  –  The top-ranked pages for the query maui resort were the ones containing the most maui’s and resort’s

•  SEOs responded with dense repetitions of chosen terms
  –  e.g., maui resort maui resort maui resort
  –  Often, the repetitions would be in the same color as the background of the web page
    •  Repeated terms got indexed by crawlers
    •  But not visible to humans on browsers

Pure word density cannot be trusted as an IR signal

Cloaking

[Figure: cloaking flowchart. Is this a Search Engine spider? Y → serve SPAM; N → serve Real Doc]

More spam techniques

• Doorway pages
  - Pages optimized for a single keyword that re-direct to the real target page

• Link spamming
  - Mutual admiration societies, hidden links, awards – more on these later
  - Domain flooding: numerous domains that point or re-direct to a target page

• Robots
  - Fake query stream – rank checking programs
    •  “Curve-fit” ranking programs of search engines

The war against spam

• Quality signals

• Policing of URL submissions

• Limits on meta-keywords

• Robust link analysis

• Spam recognition by machine learning

• Family friendly filters

• Editorial intervention


Web engine

[Figure: web search engine components: The Web, Web spider, Indexer, Indexes, Ad indexes, Search, User, with the example result page for the query "miele"]

Basic crawler operation

• Begin with known “seed” URLs

• Fetch and parse them
  - Extract URLs they point to
  - Place the extracted URLs on a queue

• Fetch each URL on the queue and repeat
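The basic operation can be sketched as a breadth-first traversal over a queue. To keep the example runnable without network access, the "web" below is a hypothetical in-memory dictionary, and `fetch_and_parse` is a stand-in for real fetching and link extraction.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> list of URLs the page links to
TOY_WEB = {
    "http://a.example/": ["http://b.example/", "http://c.example/"],
    "http://b.example/": ["http://a.example/", "http://d.example/"],
    "http://c.example/": ["http://d.example/"],
    "http://d.example/": [],
}

def fetch_and_parse(url):
    """Stand-in for fetching a page and extracting the URLs it points to."""
    return TOY_WEB.get(url, [])

def crawl(seeds):
    """Crawl from the seed URLs; return URLs in the order they were fetched."""
    frontier = deque(seeds)   # queue of URLs still to fetch
    seen = set(seeds)         # every URL ever placed on the queue
    crawled = []
    while frontier:
        url = frontier.popleft()
        crawled.append(url)
        for link in fetch_and_parse(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return crawled

order = crawl(["http://a.example/"])
print(order)
```

The `seen` set is what keeps the loop from fetching a.example again when b.example links back to it.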

Crawling picture

[Figure: the Web divided into seed pages, URLs crawled and parsed, the URL frontier, and the unseen Web]

Crawling complications

• Not feasible with one machine

• Malicious pages
  - Spam pages
  - Spider traps, incl. dynamically generated

• Even non-malicious pages pose challenges
  - Latency/bandwidth to remote servers vary
  - Webmasters’ stipulations
    •  How “deep” should you crawl a site’s URL hierarchy?
  - Site mirrors and duplicate pages

• Politeness – don’t hit a server too often

Crawling – updated picture

[Figure: seed pages, URLs crawled and parsed, URL frontier and unseen Web, now with several crawling threads working from the URL frontier]

robots.txt

• Protocol for giving spiders (“robots”) limited access to a website, originally from 1994
  - www.robotstxt.org/wc/norobots.html

• Website announces its request on what can(not) be crawled
  - For a URL, create a file URL/robots.txt
  - This file specifies access restrictions
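As an illustration of what such a file looks like and how a crawler might consult it, here is a small sketch using Python's standard `urllib.robotparser`. The robots.txt content, the crawler names and the example.com URLs are invented; a real crawler would download the file from the site rather than parse an inline string.

```python
from urllib.robotparser import RobotFileParser

# Invented robots.txt: everyone is kept out of /private/,
# and the crawler calling itself "badbot" is kept out entirely
robots_txt = """\
User-agent: *
Disallow: /private/

User-agent: badbot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# Ask whether a given user agent may fetch a given URL
allowed = parser.can_fetch("mycrawler", "http://www.example.com/index.html")
blocked = parser.can_fetch("mycrawler", "http://www.example.com/private/data.html")
banned = parser.can_fetch("badbot", "http://www.example.com/index.html")
print(allowed, blocked, banned)
```

Note that the file is only a request: nothing technically stops a crawler from ignoring it.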

Processing steps in crawling

• Pick a URL from the frontier

• Fetch the document at the URL

• Parse the document
  - Extract links from it to other docs (URLs)

• Check if the document has content already seen
  - If not, add to indexes

• For each extracted URL
  - Ensure it passes certain URL filter tests (e.g., only crawl .se, obey robots.txt, etc.)
  - Check if it is already in the frontier (duplicate URL elimination)
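The content-seen test, URL filter and duplicate URL elimination can be sketched as one function applied to each fetched document. Everything here is an assumption made for illustration: the function names, the use of an exact SHA-256 hash as document fingerprint (real crawlers typically use near-duplicate detection), the ".se only" filter taken from the example above, and the sample URLs. The robots.txt check is left out.

```python
import hashlib
from urllib.parse import urlparse

def fingerprint(content):
    """Exact-duplicate fingerprint of a document's content."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()

def passes_filter(url, allowed_suffix=".se"):
    """URL filter test: only accept hosts under the given suffix."""
    host = urlparse(url).hostname or ""
    return host.endswith(allowed_suffix)

def process(url, content, links, doc_fps, url_set, frontier, index):
    """Handle one fetched document and its extracted links."""
    # Content seen? If not, remember the fingerprint and index the document
    fp = fingerprint(content)
    if fp not in doc_fps:
        doc_fps.add(fp)
        index.append(url)
    # URL filter, then duplicate URL elimination, then add to the frontier
    for link in links:
        if passes_filter(link) and link not in url_set:
            url_set.add(link)
            frontier.append(link)

doc_fps, url_set, frontier, index = set(), {"http://www.kth.se/"}, [], []

process("http://www.kth.se/", "hello",
        ["http://www.kth.se/en", "http://example.com/", "http://www.kth.se/"],
        doc_fps, url_set, frontier, index)
# A mirror with identical content: not indexed again, and its link is a known URL
process("http://mirror.kth.se/", "hello", ["http://www.kth.se/en"],
        doc_fps, url_set, frontier, index)
print(index, frontier)
```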

Basic crawl architecture

[Figure: basic crawl architecture. Components: WWW, DNS, Fetch, Parse, Content seen? (with Doc FP's), URL filter (with robots filters), Dup URL elim (with URL set), URL Frontier]

Next

• Projects
  - Plan your work autonomously
  - Contact project proposer as soon as possible

• Lecture 10 (April 11, 15.15-17.00)
  - Prof Viggo Kann
  - L44