CSM06 Information Retrieval Lecture 4: Web IR part 1, Dr Andrew Salway, a.salway@surrey.ac.uk
Lecture 4: OVERVIEW
• Previously we looked at IR techniques that indexed a document based on the words that occur in the document
• Some of these techniques are applied in web search engines (but VSM may not be appropriate). However, web IR can also exploit a distinctive feature of information on the web – hypertext link structure
– Use of anchor text for indexing web pages
– The PageRank algorithm based on link structure analysis
– Other techniques for ranking web pages
Challenges for IR on the Web
• High volume of information
• Heterogeneous information (multimedia and multilingual)
• Diverse users - hence diverse information needs, and many inexperienced users
• Average query length 2-5 words
• Poorly structured and low quality information
Scale
• Projection of worldwide Internet population in 2005 = 1.07 billion users, www.clickz.com/stats/web_worldwide/
• Early in 2005 Google claimed to index over 8 billion web pages; Yahoo recently claimed 19 billion; now Google claims to index 3 times more than its nearest competitor, http://select.nytimes.com/gst/abstract.html?res=F30610F93E540C748EDDA00894DD404482
• Given the low overlap in search engine results for a given query, it is likely that the total number of webpages is much greater than that indexed by any single web search engine
Requirements of Web Search Engine Users?
• Fast response time
• Some relevant results in first page; maybe less concern with getting all relevant results
• Good coverage of web, at least of ‘important sites’
• Up-to-date links
• Simple and intuitive to use – making queries and understanding results
NB. Some of these requirements contrast with those of expert researchers using specialist information retrieval systems
User Goals (Information Needs)
• Queries are used to express a user’s goal (or information need), but note that the same query might be used for quite different goals
(Rose and Levinson 2004)
User Goals: Rose and Levinson’s classification (2004)
1. Navigational – wanting a specific known website
2. Informational – “my goal is to learn something by reading or viewing web pages” – e.g. closed and open-ended questions, advice
3. Resource – “my goal is to obtain a resource (not information) available on web pages” – e.g. download music, interact with online shopping service
NOTE: prior to web most IR was concerned only with Informational queries
User Goals: Rose and Levinson’s classification (2004)
• The more a search engine understands about a user’s goal, the better the results it can provide
User goals may be deduced not only from the query, but also from:
• The results returned by the search engine
• Results clicked on by the user
• Further searches / actions by the user
Opportunity…
• Web search engines can exploit the fact that information on the web is in the form of hypertext…
Hypertext
• The web is, in some senses at least, hypertextual, i.e. it can be viewed as networks of nodes (e.g. pages) and links (between pages)
Hypertext
• Links suggest – relatedness of topic / perhaps also a recommendation
• Topological information about the hypertext graph gained by link structure analysis can be exploited for ranking
Use of Anchor Text (Brin and Page 1998)
• Words in the anchor text can be used to index the webpage being linked to – the text in an anchor may give a good description of the page it points to, e.g.
<a href="www.bio.com/beckhambio.html">A Biography of David Beckham</a>
• The words in the anchor text might be a better indicator of what the webpage is about than the words in the webpage
• Anchor text is also good for resources like images that cannot be analysed as keywords
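The idea of indexing a page by the anchor text that points to it can be sketched as below. This is only a minimal illustration, not Google's method: the class name `AnchorTextIndexer` and the word-to-URL index layout are assumptions made for this example, and tokenisation is a naive whitespace split.

```python
from collections import defaultdict
from html.parser import HTMLParser


class AnchorTextIndexer(HTMLParser):
    """Index the TARGET of each link under the words of its anchor text."""

    def __init__(self):
        super().__init__()
        self.index = defaultdict(set)  # word -> set of linked-to URLs
        self._href = None              # href of the anchor currently open
        self._text = []                # text collected inside that anchor

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Naive tokenisation: lowercase and split on whitespace.
            for word in " ".join(self._text).lower().split():
                self.index[word].add(self._href)
            self._href = None


indexer = AnchorTextIndexer()
indexer.feed('<p><a href="www.bio.com/beckhambio.html">'
             'A Biography of David Beckham</a></p>')
print(sorted(indexer.index["beckham"]))
```

Note that the linked-to page is indexed under "beckham" without its own content ever being read, which is why the same approach can work for images and other non-text resources.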
PageRank (Brin and Page 1998)
• “Google makes use of both link structure and anchor text”
• “The citation (link) graph of the web is an important resource that has largely gone unused in existing web search engines”
PageRank is “an objective measure of [a web page’s] citation importance that corresponds well with people’s subjective idea of importance”
Calculating PageRank
PR(A) = (1-d) + d*(PR(T1)/C(T1) + … + PR(Tn)/C(Tn))
PR(A) = PageRank of webpage A
C(A) = the number of links out of webpage A
T1…Tn = the webpages that point to webpage A
d = a damping factor set between 0 and 1
In reality, the calculation of PageRank is iterative
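The iterative calculation can be sketched as follows, applying the PR(A) formula above to every page repeatedly. This is an illustrative sketch under stated assumptions: the function name `pagerank`, the dictionary representation of the link graph, the starting value of 1.0 for every page, and the fixed number of iterations are all choices made for this example. The default d = 0.85 is the value Brin and Page (1998) say is usually used.

```python
def pagerank(links, d=0.85, iterations=50):
    """links maps each page to the list of pages it links out to."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)

    # C(T): number of distinct links out of each page.
    out_count = {p: len(set(links.get(p, []))) for p in pages}

    # T1...Tn for each page A: the pages that point to A.
    incoming = {p: [] for p in pages}
    for source, targets in links.items():
        for target in set(targets):
            incoming[target].append(source)

    pr = {p: 1.0 for p in pages}  # assumed starting value
    for _ in range(iterations):
        pr = {
            a: (1 - d) + d * sum(pr[t] / out_count[t] for t in incoming[a])
            for a in pages
        }
    return pr


ranks = pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]})
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this small graph page C is pointed to by both A and B, so it ends up with a higher PageRank than B, which is pointed to only by A. With this form of the formula, and when every page has at least one outgoing link, the PageRank values sum to the number of pages.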
Web-adjacency Analysis (a similar idea to PageRank)
• Kleinberg and colleagues proposed a method for identifying authoritative web-pages
– Identify set of relevant pages (as normal)
– Identify those with a large in-degree, i.e. lots of pages point to them (cf. ‘impact’)
– Ensure that the authorities selected are referred to by a number of the same hubs, i.e. those with a large out-degree
Web-adjacency Analysis
• “Hubs and authorities exhibit what could be called a mutually reinforcing relationship” (Kleinberg 1998)
• Computing authority and hub values for web-pages is an iterative process over a graph, where each node is a web-page
– Two weights are given to each node relating to in-degree and out-degree: total in-degree weights and total out-degree weights are kept constant
– Weights are modified each iteration depending on weights of connected nodes
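The iterative process described above can be sketched as below. This is a rough illustration rather than a faithful reproduction of Kleinberg's algorithm: the function names `hits` and `normalise`, the graph representation and the fixed iteration count are assumptions for this example. Each node gets an authority weight (from pages pointing to it) and a hub weight (from pages it points to), and both sets of weights are rescaled after every iteration so that their squares sum to 1, which is one way of keeping the totals constant.

```python
import math


def normalise(weights):
    """Rescale weights so that the sum of their squares is 1."""
    norm = math.sqrt(sum(w * w for w in weights.values()))
    if norm == 0:
        return dict(weights)
    return {page: w / norm for page, w in weights.items()}


def hits(links, iterations=50):
    """links maps each page to the list of pages it links out to."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)

    auth = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority weight: sum of hub weights of the pages pointing to it.
        new_auth = {p: 0.0 for p in pages}
        for source, targets in links.items():
            for target in set(targets):
                new_auth[target] += hub[source]
        # Hub weight: sum of authority weights of the pages it points to.
        new_hub = {
            p: sum(new_auth[t] for t in set(links.get(p, [])))
            for p in pages
        }
        auth, hub = normalise(new_auth), normalise(new_hub)
    return auth, hub


auth, hub = hits({"h1": ["a1", "a2", "a3"], "h2": ["a1", "a2"]})
print(sorted(auth, key=auth.get, reverse=True))
```

Here a1 and a2 are pointed to by both hubs and so receive higher authority weights than a3, while h1 points to more authorities than h2 and so receives the higher hub weight: the mutually reinforcing relationship in miniature.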
Some other Factors used to rank Web Pages (Hock 2001)
• Popularity of the Page: measured either by how many other web-pages link to it, or by how many people have clicked on it when they had the same query
• Frequency of search terms: need to consider length of the document, and web-page authors’ attempts to affect ranking by deliberate repetition
• Number of query terms matched: but remember many queries are only one or two words
Other Factors (continued…)
• Rarity of terms: rank pages containing rare search terms more highly (cf. TFIDF)
• Weighting by Field: give high ranking to pages including search terms in important fields, e.g. Title
• Proximity of Terms: rank pages more highly if search terms occur near one another
• Order of Query Terms: give priority to pages containing the search term entered first
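To make a few of these factors concrete, here is a toy scoring function. It is not how any real search engine ranks pages: the names `toy_score` and `tokens`, the page representation (a dict with "title" and "body"), the default `title_weight` of 2.0 and the way the factors are multiplied together are all assumptions made for illustration. It combines frequency of search terms normalised by document length, rarity of terms (an IDF-style weight) and weighting by field; proximity and order of query terms are left out.

```python
import math


def tokens(text):
    # Naive whitespace tokenisation; punctuation handling is omitted.
    return text.lower().split()


def toy_score(query, page, collection, title_weight=2.0):
    """Score one page for a query against a small collection of pages."""
    title, body = tokens(page["title"]), tokens(page["body"])
    length = len(title) + len(body)
    if length == 0:
        return 0.0
    score = 0.0
    for term in tokens(query):
        # Weighting by field: a match in the title counts for more.
        weighted_count = body.count(term) + title_weight * title.count(term)
        # Rarity: how many pages in the collection contain the term at all.
        df = sum(
            1 for doc in collection
            if term in tokens(doc["title"] + " " + doc["body"])
        )
        if weighted_count == 0 or df == 0:
            continue
        idf = math.log(1 + len(collection) / df)
        # Frequency is divided by length so long pages gain no advantage.
        score += (weighted_count / length) * idf
    return score


collection = [
    {"title": "David Beckham biography",
     "body": "beckham played football for england"},
    {"title": "Football news",
     "body": "the match report mentions beckham once among many other players"},
    {"title": "Cooking", "body": "recipes for pasta and soup"},
]
for page in collection:
    print(page["title"], round(toy_score("beckham", page, collection), 3))
```

The first page scores highest because the query term appears in its title and in a short body; the third page scores zero because it does not contain the term.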
Set Reading for Lecture 4
• Brin and Page (1998), “The Anatomy of a Large-Scale Hypertextual Web Search Engine”. SECTIONS 1 and 2. Explains Google’s use of anchor text and PageRank.
www-db.stanford.edu/~backrub/google.html
• Hock (2001), The extreme searcher's guide to web search engines, pages 25-31. Gives an overview of some factors used by web search engines to rank webpages. AVAILABLE in Main Library collection and in Library Article Collection.
Exercise
• Explore the idea of PageRank using an online PageRank calculator, e.g.
www.markhorrell.com/seo/pagerank.shtml
OR
www.webworkshop.net/pagerank_calculator.php3
Further Reading
Rose and Levinson (2004), “Understanding User Goals in Web Search”, 13th International WWW Conference, 2004. www.sims.berkeley.edu/courses/is141/f05/readings/rose_www04.pdf
Page, Brin, Motwani and Winograd (1999), “The PageRank Citation Ranking: Bringing Order to the Web.” http://dbpubs.stanford.edu:8090/pub/1999-66
Belew (2000), Finding Out About, pages 195-199 for an overview of Kleinberg’s work on web-adjacency analysis and authorities and hubs.
Kleinberg (1998), ‘Authoritative Sources in a Hyperlinked Environment’, Journal of the ACM. http://citeseer.nj.nec.com/87928.html
Kobayashi and Takeda (2000), “Information Retrieval on the Web”, ACM Computing Surveys 32(2), pp. 144-173. AVAILABLE IN LIBRARY / ARTICLE COLLECTION. This comprehensive article reviews a lot of the ideas covered so far in this module and discusses them in the context of Web IR. NOTE: it is already a little out of date in places because of the rapid changes of the Web.
Lecture 4: LEARNING OUTCOMES
After this lecture you should be able to:
• Explain how the challenges of web IR are different than those facing the developers of traditional IR systems
• Explain how web search engines can exploit the hypertext structure of the web to index and rank web pages, e.g. using Anchor Text, and PageRank
• Explain how PageRank is calculated
• Discuss and critique a range of factors used by web search engines to rank web pages
Reading ahead for LECTURE 5
If you want to read about next week’s lecture topics, see:
Dean and Henzinger (1999), ‘Finding Related Pages in the World Wide Web’. Pages 1-10.
http://citeseer.ist.psu.edu/dean99finding.html
Agichtein, Lawrence and Gravano (2001), ‘Learning Search Engine Specific Query Transformations for Question Answering’, Procs. 10th International WWW Conference. **Section 1 and Section 3**
www.cs.columbia.edu/~eugene/papers/www10.pdf
Oppenheim, Morris and McKnight (2000), ‘The Evaluation of WWW Search Engines’, Journal of Documentation, 56(2). Pages 194-205. In Library Article Collection.