Autumn 2011
Web Information Retrieval (Web IR)
Handout #1: Web characteristics
Ali Mohammad Zareh Bidoki
ECE Department, Yazd University
alizareh@yaduni.ac.ir
Outline
• Web challenges
• SE & Web IR challenges
• Web Structure (Graph)
• Web characteristics
• Zipf law
Web Challenges
• Huge size of information
  – 11.5 billion pages (2005)
  – 64 billion pages (5 June 2008)
• Proliferation and dynamic nature
  – New pages are created at a rate of 8% per week
  – Only 20% of the current pages will be accessible after one year
  – New links are created at a rate of 25% per week
• Heterogeneous content
  – HTML/Text/Audio/…
• The number of Web users is growing exponentially
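To make the churn figures concrete, the sketch below converts the "only 20% accessible after one year" figure into an implied weekly disappearance rate. It assumes that pages vanish at a constant weekly rate and that a year has 52 weeks; both are simplifying assumptions for illustration, and the function name is hypothetical.

```python
def weekly_loss_rate(yearly_survival: float, weeks_per_year: int = 52) -> float:
    """Weekly disappearance rate implied by a yearly survival fraction.

    Assumes a constant weekly survival factor s such that
    s ** weeks_per_year == yearly_survival.
    """
    weekly_survival = yearly_survival ** (1.0 / weeks_per_year)
    return 1.0 - weekly_survival


if __name__ == "__main__":
    rate = weekly_loss_rate(0.20)
    print(f"Implied weekly loss rate: {rate:.2%}")
```

Under these assumptions roughly 3% of pages disappear each week, which is below the 8% weekly creation rate quoted above.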
What is the reason for the Web's success?
• A distributed system
• A simple protocol
• Producing and publishing content is very simple
Information Retrieval Definition
• IR deals with the representation, storage, organization of, and access to information items (relevant to user query)
• Information retrieval (IR) is the science of searching for documents, for information within documents, and for metadata about documents
• An information retrieval process begins when a user enters a query into the system. Queries are formal statements of information needs, for example search strings in web search engines.
• In information retrieval a query does not uniquely identify a single object in the collection. Instead, several objects may match the query, perhaps with different degrees of relevancy.
Web Retrieval
[Diagram labels: User Space; Information Space; Matching; Retrieval; Browsing; Index terms; Full text; Full text + Structure (e.g. hypertext); Search Engine]
A search engine is an IR system!
IR vs Data Retrieval
• Data retrieval (DR) aims at retrieving all objects that satisfy clearly defined conditions, such as those in a regular expression
• DR does not solve the problem of retrieving information about a subject or object
Comparing IR to databases (vs data retrieval)

| | Databases | IR |
|---|---|---|
| Data | Structured | Unstructured |
| Fields | Clear semantics (SSN, age) | No fields (other than text) |
| Queries | Defined (relational algebra, SQL) | Free text (“natural language”), Boolean |
| Query specification | Complete | Incomplete |
| Matching | Exact (results are always “correct”) | Imprecise (need to measure effectiveness) |
| Error response | Sensitive | Insensitive |
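The difference between exact and imprecise matching can be sketched on a toy collection. The documents, function names, and scoring rule (fraction of query terms present) are assumptions made for illustration; real IR systems use far richer relevance models.

```python
import re
from typing import List, Tuple

# Hypothetical three-document collection.
DOCS = {
    "d1": "web search engines crawl and index pages",
    "d2": "relational databases answer SQL queries exactly",
    "d3": "a search engine ranks web pages by relevance",
}


def data_retrieval(pattern: str) -> List[str]:
    """Exact matching: a document either satisfies the condition or it does not."""
    return sorted(doc_id for doc_id, text in DOCS.items() if re.search(pattern, text))


def information_retrieval(query: str) -> List[Tuple[str, float]]:
    """Imprecise matching: score each document by the fraction of query terms it contains."""
    terms = query.lower().split()
    scored = []
    for doc_id, text in DOCS.items():
        words = set(text.lower().split())
        score = sum(term in words for term in terms) / len(terms)
        if score > 0:
            scored.append((doc_id, score))
    # Highest score first; ties broken by document id for determinism.
    return sorted(scored, key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    print(data_retrieval(r"\bSQL\b"))
    print(information_retrieval("web search engine"))
```

The first call returns a set of documents that all satisfy the condition; the second returns a ranked list in which a partial match still appears, only lower down.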
Main points in IR
• What is the definition of relevancy?
• Evaluation!
  – Subjective (unlike the evaluation of hardware or networks)
Web IR (SE) Challenges (1)
• The definition of relevancy
• The connectivity with content in the Web
  – A huge graph
• Different types of queries
  – Narrow
    • Needle in a haystack
  – Wide
    • Overlapping with many areas
• Users have little patience: they commonly browse through the first ten results (i.e. one screen), hoping to find the “right” document for their query
Web IR (SE) Challenges (2)
• Spamming phenomenon
  – It is crucial for business sites to be ranked highly by the major search engines.
  – Quite a few companies sell this kind of expertise (also known as “search engine optimization”). They actively research the ranking algorithms and heuristics of search engines, and know how many keywords to place (and where) in a Web page so as to improve the page’s ranking
  – SEO books
• Content & connectivity spamming
• Anti-spamming solutions
Web IR (SE) Challenges (3)
• Rich-get-richer problem
  – It takes a long time for young, high-quality web pages to receive a ranking that reflects their quality
  – Unfairness
  – Web content grows in a bad direction
Web IR (SE) Challenges (4)
• Crawling challenges
  – Huge size of information with dynamic nature
  – Freshness & coverage
    • Google covers only 70% of the Web
  – A suitable scheduling policy
  – Hidden web (600 times bigger)
• Using meta search engines to increase coverage
  – Merging and ranking problem
Web IR (SE) Challenges (5)
• User evaluation is subjective and changes over time
  – Relevancy between a query and a document depends on the user and the time
  – Two users with the same query may expect different results
Web IR (SE) Challenges (6)
• Query ambiguity
  – Python (one word, several meanings)
  – Car & automobile (different words, same meaning)
Web Dynamics
• For each page p and each visit, the following information is available:
  – The access time-stamp of the page: visit_p.
  – The last-modified time-stamp (given by most Web servers; about 80%-90% of the requests in practice): modified_p.
  – The text of the page, which can be compared to an older copy to detect changes, especially if modified_p is not provided.
• The following information can be estimated if the re-visiting period is short:
  – The time at which the page first appeared: created_p.
  – The time at which the page was no longer reachable: deleted_p.
• In all cases, the results are only an estimation of the actual values
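The per-visit information listed above can be held in a small record, with a text comparison as the fallback when no last-modified time-stamp is available. This is a minimal sketch: the class, field, and function names are hypothetical, and trusting the server's time-stamp when it is present is an assumption.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass
class Visit:
    """One observation of a page p (field names are illustrative)."""
    visit_time: float               # access time-stamp of the page
    modified_time: Optional[float]  # last-modified time-stamp, if the server sent one
    text: str                       # page text fetched at this visit


def content_digest(text: str) -> str:
    """Fingerprint of the page text, so copies can be compared cheaply."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def has_changed(previous: Visit, current: Visit) -> bool:
    """Decide whether the page changed between two visits."""
    if previous.modified_time is not None and current.modified_time is not None:
        # Assumption: the server-reported time-stamp is trusted when present.
        return current.modified_time > previous.modified_time
    # Otherwise compare the stored copy with the newly fetched text.
    return content_digest(previous.text) != content_digest(current.text)
```

Storing a digest instead of the full older copy is a design choice that saves space, at the cost of only detecting that a change happened, not what changed.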
Estimating freshness and age
• The probability that a copy of p is up-to-date at time t, u_p(t), decreases with time if the page is not re-visited.
• When page changes are modeled as a Poisson process with change rate λ_p, if t units of time have passed since the last visit, then: $u_p(t) \propto e^{-\lambda_p t}$
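Under the Poisson assumption, the probability that no change has occurred during t time units is exp(−λt) for a change rate λ. The sketch below evaluates that expression; the function and parameter names are hypothetical, and the rate would have to be estimated from past visits.

```python
import math


def prob_up_to_date(change_rate: float, elapsed: float) -> float:
    """Probability that no change occurred in `elapsed` time units
    when changes follow a Poisson process with rate `change_rate`."""
    if change_rate < 0 or elapsed < 0:
        raise ValueError("change_rate and elapsed must be non-negative")
    return math.exp(-change_rate * elapsed)


if __name__ == "__main__":
    # A page that changes on average once every 10 days, last visited 7 days ago.
    print(round(prob_up_to_date(0.1, 7.0), 4))
```

For a page that changes on average once every 10 days, the local copy has about a 50% chance of still being current one week after the last visit.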
Characterization of Web page changes
• Age: visit_p − modified_p.
• Lifespan: deleted_p − created_p.
• Number of changes during the lifespan: changes_p.
• Average change interval: lifespan_p / changes_p.
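These four quantities translate directly into code. The sketch below assumes all time-stamps share one unit (for example days); the class and function names are hypothetical, and returning `None` for a page with no observed change is a choice made here to avoid dividing by zero.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PageHistory:
    """Estimated time-stamps for one page p, all in the same unit (e.g. days)."""
    created: float
    deleted: float
    changes: int  # number of changes observed during the lifespan


def age(visit_time: float, modified_time: float) -> float:
    """Age of the page at a visit: visit time minus last-modified time."""
    return visit_time - modified_time


def lifespan(history: PageHistory) -> float:
    """Lifespan: deletion time minus creation time."""
    return history.deleted - history.created


def average_change_interval(history: PageHistory) -> Optional[float]:
    """Lifespan divided by the number of changes, or None if no change was observed."""
    if history.changes == 0:
        return None
    return lifespan(history) / history.changes
```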
Freshness and Age
The Web as a Scale-Free Network
• A scale-free network is characterized by a few highly linked nodes that act as “hubs” connecting several nodes to the network.
• Its degree distribution follows a power law
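One standard way to see how hubs can emerge is a preferential-attachment growth process (in the style of the Barabási-Albert model): each new node links to existing nodes with probability proportional to their current degree. The slides do not say which generative model is intended, so this is an illustrative sketch; the function name, parameters, and seed are assumptions.

```python
import random
from collections import Counter


def preferential_attachment(n_nodes: int, links_per_node: int = 2, seed: int = 0) -> Counter:
    """Grow a graph in which new nodes prefer well-connected targets.

    Returns a Counter mapping node -> degree.
    """
    rng = random.Random(seed)
    m = links_per_node
    degree = Counter()
    # `endpoints` lists each node once per incident edge, so a uniform
    # draw from it is a draw proportional to degree.
    endpoints = []
    # Start from a small complete graph on m + 1 nodes.
    for u in range(m + 1):
        for v in range(u + 1, m + 1):
            degree[u] += 1
            degree[v] += 1
            endpoints.extend((u, v))
    for new in range(m + 1, n_nodes):
        targets = set()
        while len(targets) < m:
            targets.add(rng.choice(endpoints))
        for target in targets:
            degree[new] += 1
            degree[target] += 1
            endpoints.extend((new, target))
    return degree


if __name__ == "__main__":
    degrees = preferential_attachment(2000)
    ordered = sorted(degrees.values())
    print("median degree:", ordered[len(ordered) // 2], "max degree:", ordered[-1])
```

Most nodes end up with a degree close to the minimum while a handful of early nodes accumulate many links, which is the "few hubs" pattern described above.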
Random Vs Scale-Free
Distribution of Web Graph: Power-Law
Power-Law and Zipf Law
$P(x) \propto x^{-\alpha}$
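A power law of the form P(x) ∝ x^(−α) appears as a straight line of slope −α on log-log axes, and Zipf's law is the case where the frequency of the r-th most common item is roughly proportional to 1/r. The sketch below fits that slope by least squares on synthetic data. The symbol α, the function name, and the data are assumptions for illustration, and a log-log line fit is only a rough diagnostic rather than a rigorous estimator.

```python
import math
from typing import Sequence


def fit_power_law_exponent(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Least-squares slope of log(y) against log(x), returned as alpha = -slope.

    For y = C * x ** (-alpha) the log-log slope is exactly -alpha.
    """
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(log_x, log_y))
    var = sum((a - mean_x) ** 2 for a in log_x)
    return -cov / var


# Synthetic Zipf-like data: the r-th most common item has frequency C / r.
ranks = list(range(1, 101))
freqs = [1000.0 / r for r in ranks]
alpha = fit_power_law_exponent(ranks, freqs)

if __name__ == "__main__":
    print(f"estimated exponent: {alpha:.3f}")
```

On real word counts or in-degree counts the points scatter around the line, especially in the tail, so the fitted exponent should be read as approximate.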
Zipf Law for Content
Macroscopic Structure of Web
User Sessions
• User sessions on the Web are usually characterized through models of random surfers
• The most used sources of data about the browsing activities of users are the access log files of Web servers, proxies, and SEs
  – Caching
• Modeling user behavior
• Eye tracking
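A random-surfer model can be simulated in a few lines. The slides do not specify which variant is meant, so the sketch below assumes a common one: at each step the surfer follows a random out-link, or jumps to a random page with a fixed probability or when the page has no out-links. The toy graph, the jump probability of 0.15, and all names are hypothetical.

```python
import random
from collections import Counter
from typing import Dict, List


def random_surfer(graph: Dict[str, List[str]], steps: int,
                  jump_prob: float = 0.15, seed: int = 0) -> Counter:
    """Simulate one long browsing session and count visits per page."""
    rng = random.Random(seed)
    pages = sorted(graph)
    current = rng.choice(pages)
    visits = Counter()
    for _ in range(steps):
        visits[current] += 1
        out_links = graph[current]
        if not out_links or rng.random() < jump_prob:
            # Random jump: the surfer types a new address.
            current = rng.choice(pages)
        else:
            # Otherwise follow one of the links on the current page.
            current = rng.choice(out_links)
    return visits


# Toy web: every link target is also a key of the dictionary.
toy_web = {
    "home": ["news", "about"],
    "news": ["home"],
    "about": ["home"],
    "orphan": ["home"],
}
visits = random_surfer(toy_web, steps=10_000)

if __name__ == "__main__":
    for page, count in visits.most_common():
        print(page, count)
```

Pages that many others link to collect most of the visits, while a page with no in-links is reached only through random jumps. Access logs could be used to check how well such a model matches observed sessions.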
Next Lecture
• Information Retrieval Models
  – Boolean
  – Vector Space
  – Realistic