DD2476 Search Engines and Information Retrieval Systems Lecture 9: Clustering, Web Search, Web Crawling Hedvig Kjellström [email protected] www.csc.kth.se/DD2476
Today
• Document clustering
  - Motivations
  - Flat clustering (Manning Chapter 16.1-4)
  - Hierarchical clustering (Manning Chapter 17.1-3)
• Web search, web crawling
  - Characteristics of web search (Manning Chapter 19)
  - Web crawling (Manning Chapter 20)
DD2476 Lecture 9, March 28, 2012
Clustering
Recap: Classification
Government
Science
Arts
Data points have labels
Classification task: Finding good separators
Sec. 14.1
What is Clustering?
• Clustering: the process of grouping a set of objects into classes of similar objects
  - Documents within a cluster should be similar
  - Documents from different clusters should be dissimilar
• The commonest form of unsupervised learning
  - Unsupervised learning = learning from raw unlabeled data
• Important task in IR and other areas
Ch. 16
Clear Cluster Structure
Ch. 16
• How would you design an algorithm for finding the three clusters in this case?
Applications of Clustering in IR
• Clustering search results
  - Effective “user recall” will be higher, better navigation of search results
• Scatter-gather
  - Better user interface: “search without typing”
• Visualizing a collection
  - Easier to browse
• Speeding up vector space retrieval
  - Cluster-based retrieval gives faster search
Sec. 16.1
Clustering Search Results (yippy.com)
Sec. 16.1
Scatter-Gather (Cutting, Karger and Pedersen)
Sec. 16.1
Visualizing a Collection (Google News)
Visualizing a Collection (IN-SPIRE, Pacific Northwest National Laboratory)
Speeding up Vector Space Retrieval
• Cluster hypothesis: Documents in the same cluster behave similarly with respect to relevance to information needs
• Therefore, to improve search recall:
  - Cluster docs in corpus a priori
  - When a query matches a doc D, also return other docs in the cluster containing D
• Hope if we do this: The query “car” will also return docs containing “automobile”
  - Because clustering earlier grouped together docs containing “car” with those containing “automobile”.
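A minimal sketch of the recall trick above, assuming the corpus was clustered beforehand and the assignment is available as a doc-to-cluster mapping. The document IDs, the toy assignment and the function name `expand_with_cluster` are hypothetical, chosen only for illustration.

```python
from collections import defaultdict

def expand_with_cluster(matches, cluster_of):
    """Return the matched docs plus every doc that shares a cluster with a match."""
    # Invert the doc -> cluster mapping into cluster -> set of docs.
    members = defaultdict(set)
    for doc, cluster in cluster_of.items():
        members[cluster].add(doc)
    expanded = set(matches)
    for doc in matches:
        expanded |= members[cluster_of[doc]]
    return expanded

# Toy data (hypothetical): d1 contains "car", d2 contains "automobile",
# and the a priori clustering put both in cluster 0.
cluster_of = {"d1": 0, "d2": 0, "d3": 1}
car_matches = {"d1"}
print(sorted(expand_with_cluster(car_matches, cluster_of)))
```

The query only matches d1 directly; d2 is returned because it shares a cluster with d1.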
Why might this happen?
Sec. 16.1
Issues for Clustering
• Representation for clustering
  - Document representation (Vector space? Normalization?)
  - Need a notion of similarity/distance
• How many clusters?
  - Fixed a priori?
  - Completely data driven?
Sec. 16.2
Notion of Similarity/Distance
• Ideal: semantic similarity
  - Semantically similar documents close
  - Semantically different documents far away
• Practical: term-statistical similarity
  - Cosine similarity
  - Documents are normalized vectors in N dimensions, N >> 1
• As last week, visualize using Euclidean distance
  - For many algorithms, easier to think in terms of a distance (rather than similarity) between docs
  - But real implementations use cosine similarity
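Thinking in distances while implementing with cosine is consistent for normalized documents: for unit-length vectors a and b, |a - b|^2 = 2 - 2 cos(a, b), so the closest document is also the most cosine-similar one. A small sketch that checks this numerically; the example vectors are arbitrary and not from the lecture.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(a, b):
    # For unit-length vectors the cosine is just the dot product.
    return sum(x * y for x, y in zip(a, b))

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

a = normalize([3.0, 1.0, 0.0])
b = normalize([1.0, 2.0, 2.0])

# The two printed numbers agree up to floating-point rounding.
print(euclidean(a, b) ** 2, 2 - 2 * cosine(a, b))
```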
Clustering Algorithms
• Flat algorithms
  - Usually start with a random (partial) partitioning
  - Refine it iteratively
  - K-means clustering
  - (Model based/probabilistic/EM clustering)
• Hierarchical algorithms
  - Bottom-up, agglomerative clustering
  - (Top-down, divisive clustering)
• More about clustering
  - DD2431 Machine Learning (p1)
  - DD2427 Image Based Recognition and Classification (p4)
Flat Clustering (Manning Chapter 16.1-4)
• Partitioning method: Partition a set of n documents into K clusters
• Given:
  - Set of n documents to be clustered
  - Number of clusters K
• Find: K clusters that
  - Globally minimize the intra-cluster distance
  - Globally maximize the inter-cluster distance
• Intractable for many objective functions (NP-hard)
  - Effective heuristic method: K-means
K-Means Example (K=2)
Sec. 16.4
[Figure: K-means with K=2. Pick seeds, reassign clusters, compute centroids (x), reassign clusters, compute centroids (x), reassign clusters, converged]
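The loop in the example (pick seeds, reassign clusters, compute centroids, repeat until nothing changes) can be sketched as follows. This is a minimal illustration, assuming squared Euclidean distance, random seeds drawn from the data, and a rule of keeping the old centroid when a cluster becomes empty. The toy points and the function name `kmeans` are hypothetical.

```python
import random

def kmeans(points, k, iterations=100, seed=0):
    """Plain K-means on tuples of numbers; returns (assignment, centroids)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # pick seeds
    assignment = None
    for _ in range(iterations):
        # Reassign clusters: each point goes to its nearest centroid.
        new_assignment = [
            min(range(k),
                key=lambda c: sum((p - q) ** 2 for p, q in zip(point, centroids[c])))
            for point in points
        ]
        if new_assignment == assignment:  # converged
            break
        assignment = new_assignment
        # Compute centroids: mean of the members of each cluster.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # assumption: keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(xs) / len(members) for xs in zip(*members))
    return assignment, centroids

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
assignment, centroids = kmeans(points, k=2)
print(assignment, centroids)
```

The result depends on the seeds in general; K-means only finds a local optimum of the objective.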
K-means is great…
• …but what if K = 10,000?
• Basic problem – no structure
  - Which classes are similar to each other?
Hierarchical Clustering (Manning Chapter 17.1-3)
• Build a tree-based hierarchical taxonomy (dendrogram) from a set of documents
• One approach: recursive application of a partitional clustering algorithm
[Figure: example taxonomy. animal splits into vertebrate (fish, reptile, amphib., mammal) and invertebrate (worm, insect, crustacean)]
Ch. 17
Dendrogram: Hierarchical Clustering
• Clustering obtained by cutting the dendrogram at a desired level
  - Each connected component forms a cluster
Hierarchical Agglomerative Clustering (HAC)
• Method for obtaining dendrogram
• General approach
  - Start with each document in a separate cluster
  - Repeatedly join the closest pair of clusters, until there is only one cluster
• Details
  - How to find the closest pair of clusters
Sec. 17.1
Closest pair of clusters
• Many variants for defining the closest pair of clusters
• Single-link
  - Similarity of the “closest” points, the most cosine-similar
• Complete-link
  - Similarity of the “furthest” points, the least cosine-similar
• Centroid
  - Clusters whose centroids (centers of gravity) are the most cosine-similar
• Average-link
  - Average cosine between pairs of elements
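A naive sketch of HAC with three of the linkage variants above, assuming documents are given as plain numeric vectors. The names `hac` and `LINKAGE` are hypothetical. Here average-link averages only the cosines between pairs drawn from the two different clusters, which is one possible reading of the slide, and the centroid variant is left out for brevity.

```python
import math
from itertools import combinations

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# How to turn the pairwise cosines between two clusters into one similarity.
LINKAGE = {
    "single": max,                                  # most cosine-similar pair
    "complete": min,                                # least cosine-similar pair
    "average": lambda sims: sum(sims) / len(sims),  # mean over cross-cluster pairs
}

def hac(docs, linkage="single"):
    """Return the list of merges (pairs of index tuples), in merge order."""
    combine = LINKAGE[linkage]
    clusters = [(i,) for i in range(len(docs))]  # each document starts alone
    merges = []
    while len(clusters) > 1:
        def cluster_sim(pair):
            a, b = pair
            return combine([cosine(docs[i], docs[j]) for i in a for j in b])
        # Join the closest (most similar) pair of clusters.
        a, b = max(combinations(clusters, 2), key=cluster_sim)
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(a + b)
        merges.append((a, b))
    return merges

docs = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.2, 0.8)]
print(hac(docs, "single"))
```

The merge list is a flat encoding of the dendrogram. This version recomputes all cluster similarities in every iteration, so it is the slow, naive variant.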
Sec. 17.2
Computational Complexity
• In the first iteration, compute the similarity of all pairs of N instances: O(N^2)
• In each of the N-2 subsequent merging iterations, compute the distance between the most recently created cluster and all other existing clusters
  - Worst case: O(N^3)
  - Cleverly: O(N^2 log N)
Sec. 17.2.1
What is a Good Clustering?
• Internal criterion
• A good clustering will produce clusters where:
  - The intra-class (intra-cluster) similarity is high
  - The inter-class similarity is low
  - The measured quality of a clustering depends on both the document representation and the similarity measure used
Sec. 16.3
What is a Good Clustering?
• External criterion
• A good clustering is able to:
  - Discover some or all hidden patterns or latent classes in gold standard/benchmark data
• Assessing a clustering with respect to ground truth requires labeled data
• Assume documents with C gold standard classes, while our clustering algorithm produces K clusters ω_1, ω_2, …, ω_K, where cluster ω_i has n_i members.
Sec. 16.3
External Evaluation
• Simple measure: purity, the ratio between the size of the dominant class π_i in cluster ω_i and the size of cluster ω_i
• Biased because having n clusters (one per document) maximizes purity
• Other measures: Rand index, entropy/mutual information of classes in clusters
$$\mathrm{Purity}(\omega_i) = \frac{1}{n_i} \max_{j} (n_{ij}), \quad j \in C$$
[Figure: points from three classes grouped into Cluster I, Cluster II and Cluster III]
Cluster I: Purity = 1/6 (max(5, 1, 0)) = 5/6
Cluster II: Purity = 1/6 (max(1, 4, 1)) = 4/6
Cluster III: Purity = 1/5 (max(2, 0, 3)) = 3/5
Purity example
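The three purity values above can be reproduced with a few lines, taking n_ij to be the number of members of class j in cluster ω_i. The class labels "x", "o" and "d" are hypothetical stand-ins for the three classes in the figure; only the counts come from the example. The aggregate `overall_purity` is not on the slide; it is the usual corpus-level purity (sum of dominant-class counts divided by the total number of documents), added here as an assumption about how the per-cluster values would be combined.

```python
from collections import Counter

def purity(cluster_labels):
    """Purity of one cluster: size of its dominant class / size of the cluster."""
    counts = Counter(cluster_labels)
    return max(counts.values()) / len(cluster_labels)

def overall_purity(clusters):
    """Assumed aggregate: sum of dominant-class counts / total number of docs."""
    dominant = sum(max(Counter(labels).values()) for labels in clusters.values())
    total = sum(len(labels) for labels in clusters.values())
    return dominant / total

# Gold-standard class labels of the members of each cluster (counts from the example).
clusters = {
    "I": ["x"] * 5 + ["o"] * 1,
    "II": ["x"] * 1 + ["o"] * 4 + ["d"] * 1,
    "III": ["x"] * 2 + ["d"] * 3,
}

for name, labels in clusters.items():
    print(name, purity(labels))
print("overall", overall_purity(clusters))
```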
Sec. 16.3
Web Search, Web Crawling
Brief history of web engines
• Early keyword-based engines ca. 1995-1997
  - Altavista, Excite, Infoseek, Inktomi, Lycos
  - Measured how well documents fitted the query (inverted index)
• Paid search ranking: Goto (morphed into Overture.com → Yahoo!)
  - Human categorization effort
  - Your search ranking depended on how much you paid
• 1998+: Link-based ranking pioneered by Google
  - Measures the authoritativeness of pages (PageRank)
  - Totally dominating today
Web search basics
[Figure: search engine overview. Components: the Web, web spider, indexer, indexes, ad indexes, search, user. Example result page: “Web Results 1 - 10 of about 7,310,000 for miele. (0.12 seconds)”, with four organic results (www.miele.com, www.miele.co.uk, www.miele.de, www.miele.at) and three sponsored links]
User needs
• Need [Brod02, RL04]
  – Informational – want to learn about something (~40% / 65%), e.g. “Low hemoglobin”
  – Navigational – want to go to that page (~25% / 15%), e.g. “United Airlines”
  – Transactional – want to do something (web-mediated) (~35% / 20%)
    • Access a service, e.g. “Seattle weather”
    • Downloads, e.g. “Mars surface images”
    • Shop, e.g. “Canon S410”
  – Gray areas
    • Find a good hub, e.g. “Car rental Brasil”
    • Exploratory search “see what’s there”
How far do people look for results?
(Source: iprospect.com WhitePaper_2006_SearchEngineUserBehavior.pdf)
Users’ empirical evaluation of results
• Quality of pages varies widely
  - Relevance is not enough
  - Other desirable qualities (non IR!!)
    • Content: Trustworthy, diverse, non-duplicated, well maintained
    • Web readability: display correctly & fast
    • No annoyances: pop-ups, etc
• Precision vs. recall
  - On the web, recall seldom matters
• What matters
  - Precision at 1? Precision above the fold?
  - Comprehensiveness – be able to deal with obscure queries
    • Recall matters when the number of matches is small
Users’ empirical evaluation of engines
• Relevance and validity of results
• UI – Simple, no clutter, error tolerant
• Trust – Results are objective
• Coverage of topics for polysemic queries
• Pre/Post process tools provided
  - Mitigate user errors (auto spell check, search assist, …)
  - Explicit: Search within results, more like this, refine ...
  - Anticipative: related searches
• Deal with idiosyncrasies
  - Web specific vocabulary :) <3
    • Impact on stemming, spell-check, etc
  - Web addresses typed in the search box
The Web document collection
• No design/co-ordination
• Distributed content creation, linking, democratization of publishing
• Content includes truth, lies, obsolete information, contradictions …
• Unstructured (text, html, …), semi-structured (XML, annotated photos), structured (Databases)…
• Scale much larger than previous text collections
• Growth – slowed down from initial “volume doubling every few months” but still expanding
• Content can be dynamically generated
Paid search ads cost money…
• What’s the alternative?
• Search Engine Optimization:
  - “Tuning” your web page to rank highly in the algorithmic search results for select keywords
  - Alternative to paying for placement
  - Thus, intrinsically a marketing function
• Performed by companies, webmasters and consultants (“Search engine optimizers”) for their clients
• Some perfectly legitimate, some very shady
Simplest forms
• First generation engines relied heavily on tf/idf
  – The top-ranked pages for the query maui resort were the ones containing the most maui’s and resort’s
• SEOs responded with dense repetitions of chosen terms
  – e.g., maui resort maui resort maui resort
  – Often, the repetitions would be in the same color as the background of the web page
    • Repeated terms got indexed by crawlers
    • But not visible to humans on browsers
Pure word density cannot be trusted as an IR signal
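A toy illustration of why term repetition worked against density-based ranking. It scores pages by raw term frequency only, which is a simplification of tf/idf (no idf weighting, no length normalization). The page texts and the function name `tf_score` are made up for the example.

```python
from collections import Counter

def tf_score(query, text):
    """Sum of raw term frequencies of the query terms in the text."""
    counts = Counter(text.lower().split())
    return sum(counts[term] for term in query.lower().split())

pages = {
    "honest": "a quiet resort on maui with ocean views",
    "stuffed": "maui resort maui resort maui resort",
}

query = "maui resort"
ranking = sorted(pages, key=lambda name: tf_score(query, pages[name]), reverse=True)
print(ranking)
```

The stuffed page scores 6 against 2 for the honest page, so it ranks first although it carries no information.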
Cloaking
[Figure: cloaking. Decision “Is this a search engine spider?”: Y → serve SPAM, N → serve Real Doc]
More spam techniques
• Doorway pages
  - Pages optimized for a single keyword that re-direct to the real target page
• Link spamming
  - Mutual admiration societies, hidden links, awards – more on these later
  - Domain flooding: numerous domains that point or re-direct to a target page
• Robots
  - Fake query stream – rank checking programs
    • “Curve-fit” ranking programs of search engines
The war against spam
• Quality signals
• Policing of URL submissions
• Limits on meta-keywords
• Robust link analysis
• Spam recognition by machine learning
• Family friendly filters
• Editorial intervention
Web engine
[Figure: search engine overview, repeated. Components: the Web, web spider, indexer, indexes, ad indexes, search, user]
Basic crawler operation
• Begin with known “seed” URLs
• Fetch and parse them
  - Extract URLs they point to
  - Place the extracted URLs on a queue
• Fetch each URL on the queue and repeat
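The operation above can be sketched as a breadth-first traversal. To keep the example runnable without network access, the "web" here is a hypothetical in-memory link graph (`TOY_WEB`) and `fetch_links` stands in for fetching and parsing a page. The `seen` set is an addition not on the slide; without it the loop would never terminate on a graph with cycles.

```python
from collections import deque

def crawl(seeds, fetch_links, max_pages=100):
    """Crawl from the seed URLs and return the URLs in the order they were fetched."""
    frontier = deque(seeds)   # queue of URLs still to fetch
    seen = set(seeds)         # every URL ever placed on the queue
    crawled = []
    while frontier and len(crawled) < max_pages:
        url = frontier.popleft()
        crawled.append(url)
        for link in fetch_links(url):   # fetch and parse, extract URLs
            if link not in seen:
                seen.add(link)
                frontier.append(link)   # place extracted URLs on the queue
    return crawled

# Hypothetical toy web: URL -> URLs it points to.
TOY_WEB = {
    "a": ["b", "c"],
    "b": ["c", "d"],
    "c": ["a"],
    "d": [],
}

print(crawl(["a"], lambda url: TOY_WEB.get(url, [])))
```

With seed "a" the pages are fetched in the order a, b, c, d.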
Crawling picture
[Figure: the Web, showing seed pages, URLs crawled and parsed, the URL frontier, and the unseen Web]
Crawling complications
• Not feasible with one machine
• Malicious pages
  - Spam pages
  - Spider traps – incl dynamically generated
• Even non-malicious pages pose challenges
  - Latency/bandwidth to remote servers vary
  - Webmasters’ stipulations
    • How “deep” should you crawl a site’s URL hierarchy?
  - Site mirrors and duplicate pages
• Politeness – don’t hit a server too often
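One simple way to implement politeness is to remember, per host, the earliest time the next request is allowed. The sketch below assumes a fixed minimum delay per host and takes the current time as an explicit argument so it can be run without waiting. The class name `PolitenessGate` and the delay value are hypothetical.

```python
from urllib.parse import urlsplit

class PolitenessGate:
    """Tracks, per host, when the next fetch is allowed."""

    def __init__(self, min_delay):
        self.min_delay = min_delay   # seconds between two hits on the same host
        self.next_allowed = {}       # host -> earliest time of the next fetch

    def ready(self, url, now):
        host = urlsplit(url).netloc
        return now >= self.next_allowed.get(host, 0.0)

    def record_fetch(self, url, now):
        host = urlsplit(url).netloc
        self.next_allowed[host] = now + self.min_delay

gate = PolitenessGate(min_delay=2.0)
gate.record_fetch("http://example.com/a", now=0.0)
print(gate.ready("http://example.com/b", now=1.0))   # same host, too soon
print(gate.ready("http://other.org/", now=1.0))      # different host
```

A crawler would skip (or requeue) frontier URLs whose host is not ready yet and fetch from other hosts in the meantime.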
Crawling – updated picture
[Figure: updated crawling picture, showing seed pages, URLs crawled and parsed, URL frontier, crawling thread, and the unseen Web]
robots.txt
• Protocol for giving spiders (“robots”) limited access to a website, originally from 1994
  - www.robotstxt.org/wc/norobots.html
• Website announces its request on what can(not) be crawled
  - For a site at URL, create a file URL/robots.txt
  - This file specifies access restrictions
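A short sketch of how a crawler might check such restrictions, using Python's standard `urllib.robotparser`. The robots.txt content, the bot name "mybot" and the example.com URLs are made up for illustration. The rules are parsed from a string here, so nothing is fetched over the network.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: all robots are kept out of /private/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

print(parser.can_fetch("mybot", "http://example.com/index.html"))
print(parser.can_fetch("mybot", "http://example.com/private/data.html"))
```

The first URL is allowed and the second is not. A real crawler would fetch URL/robots.txt once per site and cache the parsed rules.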
Processing steps in crawling
• Pick a URL from the frontier
• Fetch the document at the URL
• Parse the fetched document
  - Extract links from it to other docs (URLs)
• Check if URL has content already seen
  - If not, add to indexes
• For each extracted URL
  - Ensure it passes certain URL filter tests (e.g., only crawl .se, obey robots.txt, etc.)
  - Check if it is already in the frontier (duplicate URL elimination)
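The processing steps can be put together in one loop. This is a sketch under several assumptions: the "web" is a hypothetical in-memory dictionary so the code runs offline, content-seen detection uses a SHA-1 fingerprint of the exact page text (real systems typically also catch near-duplicates), the URL filter only implements the ".se only" rule, and the robots.txt check is left out. All names (`TOY_WEB`, `url_filter`, `process_frontier`) are illustrative.

```python
import hashlib
from collections import deque
from urllib.parse import urlsplit

# Hypothetical toy web: URL -> (page text, outgoing links).
TOY_WEB = {
    "http://a.se/": ("page A", ["http://b.se/", "http://c.com/", "http://a.se/mirror"]),
    "http://b.se/": ("page B", ["http://a.se/"]),
    "http://a.se/mirror": ("page A", []),   # same content as http://a.se/
}

def url_filter(url):
    """URL filter test: only crawl hosts under .se."""
    host = urlsplit(url).hostname or ""
    return host.endswith(".se")

def process_frontier(seeds, fetch):
    frontier = deque(seeds)
    url_set = set(seeds)      # for duplicate URL elimination
    fingerprints = set()      # document fingerprints for the content-seen test
    indexed = []
    while frontier:
        url = frontier.popleft()                  # pick a URL from the frontier
        text, links = fetch(url)                  # fetch and parse
        fp = hashlib.sha1(text.encode("utf-8")).hexdigest()
        if fp not in fingerprints:                # content seen?
            fingerprints.add(fp)
            indexed.append(url)                   # stands in for "add to indexes"
        for link in links:
            if url_filter(link) and link not in url_set:
                url_set.add(link)
                frontier.append(link)
    return indexed

print(process_frontier(["http://a.se/"], TOY_WEB.__getitem__))
```

Here http://c.com/ is dropped by the URL filter, and the mirror page is fetched but not indexed because its content was already seen.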
Basic crawl architecture
[Figure: crawl architecture. Components: WWW, DNS, Fetch, Parse, Content seen? (with Doc FP’s), URL filter (with robots filters), Dup URL elim (with URL set), URL Frontier]
Next
• Projects
  - Plan your work autonomously
  - Contact project proposer as soon as possible
• Lecture 10 (April 11, 15.15-17.00)
  - Prof Viggo Kann
  - L44