
DD2476 Search Engines and Information Retrieval Systems
Lecture 9: Clustering, Web Search, Web Crawling
Hedvig Kjellström
www.csc.kth.se/DD2476

Today

• Document clustering
  - Motivations
  - Flat clustering (Manning Chapter 16.1-4)
  - Hierarchical clustering (Manning Chapter 17.1-3)

• Web search, web crawling
  - Characteristics of web search (Manning Chapter 19)
  - Web crawling (Manning Chapter 20)

DD2476 Lecture 9, March 28, 2012

Clustering

Recap: Classification


[Figure: labeled data points in three classes: Government, Science, Arts]

Data points have labels
Classification task: Finding good separators

Sec. 14.1

What is Clustering?

• Clustering: the process of grouping a set of objects into classes of similar objects
  - Documents within a cluster should be similar
  - Documents from different clusters should be dissimilar

• The commonest form of unsupervised learning
  - Unsupervised learning = learning from raw unlabeled data

• Important task in IR and other areas

Ch. 16

Clear Cluster Structure


Ch. 16

• How would you design an algorithm for finding the three clusters in this case?

Applications of Clustering in IR

• Clustering search results
  - Effective “user recall” will be higher, better navigation of search results

• Scatter-gather
  - Better user interface: “search without typing”

• Visualizing a collection
  - Easier to browse

• Speeding up vector space retrieval
  - Cluster-based retrieval gives faster search

Sec. 16.1

Clustering Search Results (yippy.com)


Sec. 16.1

Scatter-Gather (Cutting, Karger and Pedersen)


Sec. 16.1

Visualizing a Collection (Google News)


Visualizing a Collection (IN-SPIRE, Pacific Northwest National Laboratory)


Speeding up Vector Space Retrieval

• Cluster hypothesis: Documents in the same cluster behave similarly with respect to relevance to information needs

• Therefore, to improve search recall:
  - Cluster docs in corpus a priori
  - When a query matches a doc D, also return other docs in the cluster containing D

• Hope if we do this: The query “car” will also return docs containing automobile
  - Because clustering earlier grouped together docs containing car with those containing automobile.

Why might this happen?

Sec. 16.1

Issues for Clustering

• Representation for clustering
  - Document representation (Vector space? Normalization?)
  - Need a notion of similarity/distance

• How many clusters?
  - Fixed a priori?
  - Completely data driven?

Sec. 16.2

Notion of Similarity/Distance

• Ideal: semantic similarity
  - Semantically similar documents close
  - Semantically different documents far away

• Practical: term-statistical similarity
  - Cosine similarity
  - Documents are normalized vectors in N dimensions, N >> 1

• As last week, visualize using Euclidean distance
  - For many algorithms, easier to think in terms of a distance (rather than similarity) between docs
  - But real implementations use cosine similarity
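
One reason the Euclidean picture is a fair stand-in for cosine similarity: for length-normalized vectors, squared Euclidean distance equals 2 - 2·cos, so the two orderings agree. A minimal Python sketch of this, where the three document vectors are made-up illustration values and not data from the lecture:

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(u, v):
    """Cosine similarity of two already length-normalized vectors (a dot product)."""
    return sum(a * b for a, b in zip(u, v))

def euclidean(u, v):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Hypothetical term-weight vectors for three documents.
d1 = normalize([3.0, 1.0, 0.0])
d2 = normalize([2.0, 1.0, 0.5])
d3 = normalize([0.0, 1.0, 4.0])

for name, d in (("d2", d2), ("d3", d3)):
    cos = cosine(d1, d)
    dist = euclidean(d1, d)
    # For unit vectors the two printed values on the right coincide.
    print(name, round(cos, 3), round(dist ** 2, 3), round(2 - 2 * cos, 3))
```

The document that is more cosine-similar to d1 is also the one at the smaller Euclidean distance.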


Clustering Algorithms

• Flat algorithms
  - Usually start with a random (partial) partitioning
  - Refine it iteratively
  - K-means clustering
  - (Model based/probabilistic/EM clustering)

• Hierarchical algorithms
  - Bottom-up, agglomerative clustering
  - (Top-down, divisive clustering)

• More about clustering
  - DD2431 Machine Learning (p1)
  - DD2427 Image Based Recognition and Classification (p4)


Flat Clustering (Manning Chapter 16.1-4)

• Partitioning method: Partition a set of n documents into K clusters

• Given:
  - Set of n documents to be clustered
  - Number of clusters K

• Find: K clusters that
  - Globally minimize the intra-cluster distance
  - Globally maximize the inter-cluster distance

• Intractable for many objective functions (NP-hard)
  - Effective heuristic method: K-means
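
The K-means heuristic can be sketched in a few lines of Python. This is an illustrative sketch, not the course's reference implementation: it uses squared Euclidean distance, picks the seeds at random from the data, and the sample points, the `seed` parameter and the `max_iter` cap are assumptions made for the example.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def k_means(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # pick seeds
    assignment = None
    for _ in range(max_iter):
        # Reassign clusters: each point goes to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:  # nothing changed: converged
            break
        assignment = new_assignment
        # Compute centroids; an empty cluster keeps its old centroid.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = centroid(members)
    return assignment, centroids

# Hypothetical 2-D points forming two well-separated groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
assignment, centroids = k_means(points, k=2)
print(assignment)
```

Which group gets label 0 depends on the random seeds; only the grouping itself is meaningful.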


K-Means Example (K=2)


Sec. 16.4

[Figure: K-means iterations for K=2: pick seeds, reassign clusters, compute centroids (marked x), reassign clusters, compute centroids, reassign clusters, converged]

K-means is great…

• …but what if K = 10,000?

• Basic problem – no structure
  - Which classes are similar to each other?


Hierarchical Clustering (Manning Chapter 17.1-3)

• Build a tree-based hierarchical taxonomy (dendrogram) from a set of documents

• One approach: recursive application of a partitional clustering algorithm


[Figure: example taxonomy tree: animal splits into vertebrate (fish, reptile, amphib., mammal) and invertebrate (worm, insect, crustacean)]

Ch. 17

Dendrogram: Hierarchical Clustering

• Clustering obtained by cutting the dendrogram at a desired level
  - Each connected component forms a cluster


Hierarchical Agglomerative Clustering (HAC)

• Method for obtaining dendrogram

• General approach
  - Start with each document in a separate cluster
  - Repeatedly join the closest pair of clusters, until there is only one cluster

• Details
  - How to find the closest pair of clusters

Sec. 17.1

Closest pair of clusters

• Many variants for defining the closest pair of clusters

• Single-link
  - Similarity of the closest points, the most cosine-similar

• Complete-link
  - Similarity of the “furthest” points, the least cosine-similar

• Centroid
  - Clusters whose centroids (centers of gravity) are the most cosine-similar

• Average-link
  - Average cosine between pairs of elements
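
A naive Python sketch of agglomerative clustering with three of these linkage criteria may make the differences concrete. It recomputes all cluster similarities in every iteration, so it is far from the efficient variants. The function names and sample vectors are made up for illustration, and "average" here averages only over pairs drawn from the two different clusters, which is one possible reading of average-link.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def cluster_similarity(c1, c2, docs, linkage):
    """Similarity between two clusters, given as tuples of document indices."""
    sims = [cosine(docs[i], docs[j]) for i in c1 for j in c2]
    if linkage == "single":      # most cosine-similar pair
        return max(sims)
    if linkage == "complete":    # least cosine-similar pair
        return min(sims)
    if linkage == "average":     # average over cross-cluster pairs (assumption)
        return sum(sims) / len(sims)
    raise ValueError("unknown linkage: " + linkage)

def hac(docs, linkage="single"):
    """Return the list of merges, in the order they were performed."""
    clusters = [(i,) for i in range(len(docs))]  # each doc in its own cluster
    merges = []
    while len(clusters) > 1:
        # Join the closest (most similar) pair of clusters.
        a, b = max(
            combinations(clusters, 2),
            key=lambda pair: cluster_similarity(pair[0], pair[1], docs, linkage),
        )
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(a + b)
        merges.append((a, b))
    return merges

# Hypothetical 2-D document vectors: docs 0 and 1 point one way, 2 and 3 another.
docs = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.2, 0.9)]
merges = hac(docs, linkage="single")
print(merges)
```

The sequence of merges is the dendrogram read bottom-up; cutting it after the second merge leaves two clusters.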


Sec. 17.2

Computational Complexity

• In the first iteration, compute the similarity of all pairs of N instances: O(N^2)

• In each of the subsequent N-2 merging iterations, compute the distance between the most recently created cluster and all other clusters
  - Worst case: O(N^3)
  - Cleverly: O(N^2 log N)

Sec. 17.2.1

What is a Good Clustering?

• Internal criterion
• A good clustering will produce clusters where:
  - The intra-class (intra-cluster) similarity is high
  - The inter-class similarity is low
  - The measured quality of a clustering depends on both the document representation and the similarity measure used

Sec. 16.3

What is a Good Clustering?

• External criterion
• A good clustering is able to:
  - Discover some or all hidden patterns or latent classes in gold standard/benchmark data

• Assessing a clustering with respect to ground truth requires labeled data

• Assume documents with C gold standard classes, while our clustering algorithms produce K clusters, ω_1, ω_2, …, ω_K, where cluster ω_i has n_i members.

Sec. 16.3

External Evaluation

• Simple measure: purity, the ratio between the size of the dominant class π_i in cluster ω_i and the size of the cluster

• Biased because having n clusters (one per document) maximizes purity

• Other measures: Rand index, entropy/mutual information of classes in clusters


$$\mathrm{Purity}(\omega_i) = \frac{1}{n_i} \max_{j} (n_{ij}), \quad j \in C$$

Purity example

[Figure: three clusters of points drawn from three classes: Cluster I (6 points), Cluster II (6 points), Cluster III (5 points)]

Cluster I: Purity = 1/6 (max(5, 1, 0)) = 5/6

Cluster II: Purity = 1/6 (max(1, 4, 1)) = 4/6

Cluster III: Purity = 1/5 (max(2, 0, 3)) = 3/5
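
The per-cluster purities in this example can be checked with a small Python sketch. The function name `purity` and the representation of a cluster as a list of per-class member counts are choices made for this illustration.

```python
from fractions import Fraction

def purity(class_counts):
    """Purity of one cluster, given how many members it has from each gold class."""
    return Fraction(max(class_counts), sum(class_counts))

# Per-class member counts of the three clusters in the example.
clusters = {
    "Cluster I": [5, 1, 0],
    "Cluster II": [1, 4, 1],
    "Cluster III": [2, 0, 3],
}

for name, counts in clusters.items():
    print(name, purity(counts))
```

A cluster holding a single document, such as `[1, 0, 0]`, always has purity 1, which is why purity favors a large number of clusters.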


Sec. 16.3

Web Search, Web Crawling

Brief history of web engines

• Early keyword-based engines ca. 1995-1997
  - Altavista, Excite, Infoseek, Inktomi, Lycos
  - Measured how well documents fitted the query (inverted index)

• Paid search ranking: Goto (morphed into Overture.com → Yahoo!)
  - Human categorization effort
  - Your search ranking depended on how much you paid

• 1998+: Link-based ranking pioneered by Google
  - Measures the authoritativeness of pages (PageRank)
  - Totally dominating today


Web search basics

[Figure: web search engine overview with the components The Web, Web spider, Indexer, Indexes, Ad indexes, Search and User, shown with a sample results page for the query “miele” (Web results 1 - 10 of about 7,310,000, 0.12 seconds) and its sponsored links]


User needs

•  Need [Brod02, RL04]
  –  Informational – want to learn about something (~40% / 65%), e.g. “Low hemoglobin”
  –  Navigational – want to go to that page (~25% / 15%), e.g. “United Airlines”
  –  Transactional – want to do something (web-mediated) (~35% / 20%)
     •  Access a service, e.g. “Seattle weather”
     •  Downloads, e.g. “Mars surface images”
     •  Shop, e.g. “Canon S410”
  –  Gray areas
     •  Find a good hub, e.g. “Car rental Brasil”
     •  Exploratory search “see what’s there”


How far do people look for results?

(Source: iprospect.com WhitePaper_2006_SearchEngineUserBehavior.pdf)


Users’ empirical evaluation of results

• Quality of pages varies widely
  - Relevance is not enough
  - Other desirable qualities (non IR!!)
    •  Content: Trustworthy, diverse, non-duplicated, well maintained
    •  Web readability: display correctly & fast
    •  No annoyances: pop-ups, etc.

• Precision vs. recall
  - On the web, recall seldom matters

• What matters
  - Precision at 1? Precision above the fold?
  - Comprehensiveness – be able to deal with obscure queries
    •  Recall matters when the number of matches is small


Users’ empirical evaluation of engines

• Relevance and validity of results

• UI – Simple, no clutter, error tolerant

• Trust – Results are objective

• Coverage of topics for polysemic queries

• Pre/Post process tools provided
  - Mitigate user errors (auto spell check, search assist,…)
  - Explicit: Search within results, more like this, refine ...
  - Anticipative: related searches

• Deal with idiosyncrasies
  - Web specific vocabulary :) <3
    •  Impact on stemming, spell-check, etc.
  - Web addresses typed in the search box


The Web document collection

• No design/co-ordination

• Distributed content creation, linking, democratization of publishing

• Content includes truth, lies, obsolete information, contradictions …

• Unstructured (text, html, …), semi-structured (XML, annotated photos), structured (Databases)…

• Scale much larger than previous text collections

• Growth – slowed down from initial “volume doubling every few months” but still expanding

• Content can be dynamically generated



Paid search ads cost money…

• What’s the alternative?

• Search Engine Optimization:
  - “Tuning” your web page to rank highly in the algorithmic search results for select keywords
  - Alternative to paying for placement
  - Thus, intrinsically a marketing function

• Performed by companies, webmasters and consultants (“Search engine optimizers”) for their clients

• Some perfectly legitimate, some very shady


Simplest forms

•  First generation engines relied heavily on tf/idf
  –  The top-ranked pages for the query maui resort were the ones containing the most maui’s and resort’s

•  SEOs responded with dense repetitions of chosen terms
  –  e.g., maui resort maui resort maui resort
  –  Often, the repetitions would be in the same color as the background of the web page
     •  Repeated terms got indexed by crawlers
     •  But not visible to humans on browsers

Pure word density cannot be trusted as an IR signal
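
A toy Python sketch of why raw term counts are easy to game. The scoring function is a deliberately simplified stand-in (a sum of raw term frequencies, without idf or length normalization), and both page texts are invented for the example.

```python
from collections import Counter

def tf_score(query, document):
    """Sum of raw term frequencies of the query terms in the document."""
    counts = Counter(document.lower().split())
    return sum(counts[term] for term in query.lower().split())

# Two hypothetical pages: an honest one and a keyword-stuffed one.
honest = "our resort on maui has rooms with an ocean view"
stuffed = "cheap pills " + "maui resort " * 20

query = "maui resort"
ranking = sorted([honest, stuffed], key=lambda d: tf_score(query, d), reverse=True)
print(tf_score(query, honest), tf_score(query, stuffed))
```

Under this score the stuffed page ranks first, although it is the honest page that is actually about a resort on Maui.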

Cloaking


[Figure: cloaking decision. Is this a Search Engine spider? Y: serve SPAM. N: serve Real Doc.]

More spam techniques

• Doorway pages
  - Pages optimized for a single keyword that re-direct to the real target page

• Link spamming
  - Mutual admiration societies, hidden links, awards – more on these later
  - Domain flooding: numerous domains that point or re-direct to a target page

• Robots
  - Fake query stream – rank checking programs
    •  “Curve-fit” ranking programs of search engines


The war against spam

• Quality signals

• Policing of URL submissions

• Limits on meta-keywords

• Robust link analysis

• Spam recognition by machine learning

• Family friendly filters

• Editorial intervention


Web engine

[Figure: web search engine overview, repeated from “Web search basics”: The Web, Web spider, Indexer, Indexes, Ad indexes, Search, User]


Basic crawler operation

• Begin with known “seed” URLs

• Fetch and parse them
  - Extract URLs they point to
  - Place the extracted URLs on a queue

• Fetch each URL on the queue and repeat
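
The basic loop can be sketched in Python as a breadth-first traversal. To stay runnable without network access, the sketch crawls a small in-memory link graph; `FAKE_WEB`, the example URLs and `fetch_and_parse` are hypothetical stand-ins for real fetching and HTML parsing. The `seen` set is an addition not on the slide, needed so the loop terminates when pages link to each other.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> URLs that the page links to.
FAKE_WEB = {
    "http://a.example/": ["http://b.example/", "http://c.example/"],
    "http://b.example/": ["http://a.example/", "http://d.example/"],
    "http://c.example/": ["http://d.example/"],
    "http://d.example/": [],
}

def fetch_and_parse(url):
    """Stand-in for fetching a page and extracting the URLs it points to."""
    return FAKE_WEB.get(url, [])

def crawl(seeds):
    """Crawl from the seed URLs; return the URLs in the order they were fetched."""
    queue = deque(seeds)  # URLs waiting to be fetched
    seen = set(seeds)     # every URL ever placed on the queue
    order = []
    while queue:
        url = queue.popleft()
        order.append(url)
        for link in fetch_and_parse(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order

print(crawl(["http://a.example/"]))
```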


Crawling picture

[Figure: crawling picture of the Web: seed pages, URLs crawled and parsed, URL frontier, unseen Web]


Crawling complications

• Not feasible with one machine

• Malicious pages
  - Spam pages
  - Spider traps – incl. dynamically generated

• Even non-malicious pages pose challenges
  - Latency/bandwidth to remote servers vary
  - Webmasters’ stipulations
    •  How “deep” should you crawl a site’s URL hierarchy?
  - Site mirrors and duplicate pages

• Politeness – don’t hit a server too often


Crawling – updated picture

[Figure: updated crawling picture: seed pages, URLs crawled and parsed, URL frontier, unseen Web, crawling thread]


robots.txt

• Protocol for giving spiders (“robots”) limited access to a website, originally from 1994
  - www.robotstxt.org/wc/norobots.html

• Website announces its request on what can(not) be crawled
  - For a URL, create a file URL/robots.txt
  - This file specifies access restrictions
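
As an illustration, Python's standard library ships a parser for this file format, `urllib.robotparser`. The sketch below feeds it a made-up robots.txt as a list of lines instead of downloading one; the rules, the host `www.example.com` and the crawler name `ExampleBot` are assumptions for the example.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt: all robots are asked to stay out of /private/.
robots_txt = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# A polite crawler asks before fetching each URL.
allowed_home = parser.can_fetch("ExampleBot", "http://www.example.com/index.html")
allowed_private = parser.can_fetch("ExampleBot", "http://www.example.com/private/a.html")
print(allowed_home, allowed_private)
```

In a real crawler the file would be fetched once per host and cached, since it has to be consulted for every URL on that host.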


Processing steps in crawling

• Pick a URL from the frontier

• Fetch the document at the URL

• Parse the document
  - Extract links from it to other docs (URLs)

• Check if the document has content already seen
  - If not, add to indexes

• For each extracted URL
  - Ensure it passes certain URL filter tests (e.g., only crawl .se, obey robots.txt, etc.)
  - Check if it is already in the frontier (duplicate URL elimination)
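
A Python sketch of the last three steps for a single fetched document, under several assumptions: the content-seen test is an exact-match SHA-1 fingerprint, the URL filter only admits hosts under `.se`, links of an already-seen document are still filtered and queued, and fetching, parsing and robots.txt handling are left out. All names and example URLs are illustrative.

```python
import hashlib
from urllib.parse import urlparse

def fingerprint(text):
    """Exact-duplicate fingerprint of a document's content."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()

def passes_url_filter(url):
    """Example URL filter: only crawl hosts under the .se top-level domain."""
    host = urlparse(url).hostname or ""
    return host.endswith(".se")

def process(url, content, links, seen_fingerprints, seen_urls, frontier, index):
    """Handle one fetched document whose links have already been extracted."""
    # Content seen? If not, add the document to the index.
    fp = fingerprint(content)
    if fp not in seen_fingerprints:
        seen_fingerprints.add(fp)
        index[url] = content
    # URL filter and duplicate URL elimination for each extracted link.
    for link in links:
        if passes_url_filter(link) and link not in seen_urls:
            seen_urls.add(link)
            frontier.append(link)

seen_fps, seen_urls, frontier, index = set(), {"http://www.kth.se/"}, [], {}
process(
    "http://www.kth.se/", "hello",
    ["http://www.kth.se/a", "http://example.com/", "http://www.kth.se/a"],
    seen_fps, seen_urls, frontier, index,
)
# A second page with identical content is not indexed again.
process("http://www.kth.se/a", "hello", [], seen_fps, seen_urls, frontier, index)
print(frontier, list(index))
```

Real crawlers typically use near-duplicate detection rather than an exact hash, since mirrored pages often differ in small details.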

Basic crawl architecture

[Figure: basic crawl architecture with the components WWW, DNS, Fetch, Parse, Content seen? (Doc FP’s), URL filter (robots filters), Dup URL elim (URL set), URL Frontier]


Next

• Projects
  - Plan your work autonomously
  - Contact project proposer as soon as possible

• Lecture 10 (April 11, 15.15-17.00)
  - Prof Viggo Kann
  - L44