web search and p2p advisor: dr sushil prasad presented by: dm rasanjalee himali

WEB SEARCH and P2P

Advisor: Dr Sushil Prasad

Presented By: DM Rasanjalee Himali

OUTLINE

Introduction to web search engines

- What is a web search engine?
- Web search engine architecture
- How does a web search engine work?
- Relevance and ranking

Limitations in current web search engines

P2P web search engines

- YouSearch
- Coopeer
- ODISSEA

Conclusion

What is a web search engine?

A Web search engine is a search engine designed to search for information on the World Wide Web. Information may consist of web pages, images and other types of files. Some search engines also mine data available in newsgroups, databases, or open directories

History

Before there were search engines, there was a complete list of all webservers.

The very first tool used for searching on the Internet was Archie

- It downloaded directory listings of files on FTP sites.
- It did not index the contents of these sites.

Soon after, many search engines appeared

Excite, Infoseek, Northern Light, AltaVista, Yahoo!, Google, MSN Search

Company        Searches (millions)   Relative market share
Google         28,454                46.47%
Yahoo!         10,505                17.16%
Baidu          8,428                 13.76%
Microsoft      7,880                 12.87%
NHN            2,882                 4.71%
eBay           2,428                 3.9%
Time Warner    1,062                 1.6%
Ask.com        728                   1.1%
Yandex         566                   0.9%
Alibaba.com    531                   0.8%
Total          61,221                100.0%

How a Web Search Engine Works

A search engine operates in the following order:

1. Web crawling
2. Indexing
3. Searching

Web Crawling

A web crawler is a program which browses the World Wide Web in a methodical, automated manner.

- It is a means of providing up-to-date data.
- It creates a copy of all the visited pages for later processing by a search engine.
- It starts with a list of URLs to visit, called the seeds.
- As the crawler visits these URLs, it identifies all the hyperlinks in each page and adds them to the list of URLs to visit, called the crawl frontier.
- URLs from the frontier are recursively visited according to a set of policies.
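The seed and frontier mechanics described above can be sketched in a few lines. This is a minimal illustration, not a production crawler: the `crawl` function, the `fetch_links` callback and the `TOY_WEB` link graph are all hypothetical names introduced here, and the only "policy" assumed is first-in, first-out order with a page limit.

```python
from collections import deque


def crawl(seeds, fetch_links, max_pages=100):
    """Visit pages breadth-first, starting from the seed URLs.

    fetch_links(url) is an assumed callback returning the hyperlinks
    found on that page; here it stands in for real HTTP fetching.
    """
    frontier = deque(dict.fromkeys(seeds))  # crawl frontier, duplicates removed
    seen = set(frontier)                    # URLs already queued or visited
    visited = []                            # pages in the order they were crawled
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        visited.append(url)
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited


# Hypothetical four-page web used instead of real network access.
TOY_WEB = {
    "http://a.example/": ["http://b.example/", "http://c.example/"],
    "http://b.example/": ["http://c.example/", "http://d.example/"],
    "http://c.example/": ["http://a.example/"],
    "http://d.example/": [],
}

order = crawl(["http://a.example/"], lambda url: TOY_WEB.get(url, []))
```

Swapping the `deque` for a priority queue would be one way to express a different visiting policy.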

Robot Exclusion Protocol

Also known as the robots.txt protocol, this is a convention to prevent cooperating web robots from accessing all or part of a website which is otherwise publicly viewable.

    User-agent: *
    Disallow: /cgi-bin/
    Disallow: /images/
    Disallow: /tmp/
    Disallow: /private/
    Sitemap: http://www.example.com/sitemap.xml.gz
    Crawl-delay: 10
    Allow: /folder1/myfile.html
    Request-rate: 1/5        # maximum rate is one page every 5 seconds
    Visit-time: 0600-0845    # only visit between 06:00 and 08:45 UTC (GMT)

It relies on the cooperation of the web robot, so that marking an area of a site out of bounds with robots.txt does not guarantee privacy.
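As a sketch of what "cooperation" means in practice, a well-behaved robot can check each URL against the rules before fetching it. The example below uses Python's standard `urllib.robotparser`; the agent name `ExampleBot`, the `allowed` helper and the trimmed rule set are assumptions made for illustration.

```python
from urllib.robotparser import RobotFileParser

# A subset of the rules from the slide, supplied as text rather than
# fetched from a real site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
Disallow: /tmp/
Disallow: /private/
Crawl-delay: 10
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(url, agent="ExampleBot"):
    """Return True if the (hypothetical) robot may fetch this URL."""
    return parser.can_fetch(agent, url)


public_ok = allowed("http://www.example.com/index.html")
private_ok = allowed("http://www.example.com/private/data.html")
delay = parser.crawl_delay("ExampleBot")  # seconds to wait between requests
```

Nothing stops a non-cooperating robot from skipping this check, which is why robots.txt is not a privacy mechanism.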

The standard complements Sitemaps, a robot inclusion standard for websites.

Sitemap Protocol

The Sitemaps protocol allows a webmaster to inform search engines about URLs on a website that are available for crawling. A Sitemap is an XML file that lists the URLs for a site. It allows webmasters to include additional information about each URL:

- when it was last updated,
- how often it changes, and
- how important it is in relation to other URLs in the site.

This allows search engines to crawl the site more intelligently. Sitemaps are a URL inclusion protocol and complement robots.txt, which is a URL exclusion protocol.

    <?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <url>
        <loc>http://www.example.com/</loc>
        <lastmod>2005-01-01</lastmod>
        <changefreq>monthly</changefreq>
        <priority>0.8</priority>
      </url>
    </urlset>
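A crawler could read such a file with any XML parser. The sketch below uses the standard library's `xml.etree.ElementTree` on the same sample document; the `read_sitemap` helper and the dictionary layout it returns are illustrative choices, not part of the Sitemaps protocol.

```python
import xml.etree.ElementTree as ET

SITEMAP_XML = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.com/</loc>
    <lastmod>2005-01-01</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.8</priority>
  </url>
</urlset>
"""

# Sitemap elements live in this XML namespace, so lookups must name it.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def read_sitemap(xml_text):
    """Return one dict per <url> entry with its optional hints."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    entries = []
    for url in root.findall("sm:url", NS):
        priority = url.findtext("sm:priority", namespaces=NS)
        entries.append({
            "loc": url.findtext("sm:loc", namespaces=NS),
            "lastmod": url.findtext("sm:lastmod", namespaces=NS),
            "changefreq": url.findtext("sm:changefreq", namespaces=NS),
            "priority": float(priority) if priority is not None else None,
        })
    return entries


entries = read_sitemap(SITEMAP_XML)
```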

Indexing

- The purpose of storing an index is to optimize speed and performance in finding relevant documents for a search query.
- Search engine indexing collects, parses, and stores data to facilitate fast and accurate information retrieval.
- The contents of each page are analyzed to determine how it should be indexed. For example, words are extracted from the titles, headings, or special fields called meta tags.
- Meta search engines reuse the indices of other services and do not store a local index.

Inverted Indices

An inverted index stores a list of the documents containing each word. The search engine can use direct access to find the documents associated with each word in the query, and so retrieve the matching documents quickly.

Word    Documents
the     Document 1, Document 3, Document 4, Document 5
cow     Document 2, Document 3, Document 4
says    Document 5
moo     Document 7
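A minimal sketch of the idea, assuming whitespace tokenization and a query that must match every word (an AND query). The three toy documents and the function names are made up for this example and are not the ones in the table above.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each word to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return the IDs of documents containing every word of the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)


docs = {
    1: "the cow says moo",
    2: "the cat and the hat",
    3: "the dish ran away with the spoon",
}
index = build_inverted_index(docs)
hits = search(index, "the cow")
```

Each query word costs one dictionary lookup, which is the "direct access" the slide refers to.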

Searching

A web search query is a query that a user enters into a web search engine to satisfy his or her information needs.

- It is distinctive in that it is unstructured and often ambiguous.
- Web search queries vary greatly from standard query languages, which are governed by strict syntax rules.

Web search engine architecture

From “The Anatomy of a Large-Scale Hypertextual Web Search Engine”, Sergey Brin and Lawrence Page

[Figure: search engine architecture. Labels recovered from the diagram:
- URL list; fetched pages; compress and store
- Read repository; uncompress and parse docs to hit lists; distribute hit lists to barrels by docID; parse out links and store them in the anchors file
- Partially sorted forward index
- Links; anchors file: read relative URLs, convert to absolute URLs, then to docIDs; anchor text to docIDs
- Calculate PR of all docs
- Re-sort barrels by wordID; lexicon; inverted index
- PR; answer queries]

Relevance and Ranking

Exactly how a particular search engine's algorithm works is a closely kept trade secret. However, all major search engines follow the general rules below.

Location, Location, Location... and Frequency

Location: Search engines check whether the search keywords appear near the top of a web page, such as in the headline or in the first few paragraphs of text. They assume that any page relevant to the topic will mention those words right from the beginning.

Frequency: A search engine will analyze how often keywords appear in relation to other words in a web page. Pages with a higher keyword frequency are often deemed more relevant than other web pages.

Precision and Recall

Precision and recall are two widely used measures for evaluating the quality of results in Information Retrieval.

Precision is the fraction of the documents retrieved that are relevant to the user's information need:

\[ \text{Precision} = \frac{\text{number of relevant documents retrieved by a search}}{\text{total number of documents retrieved by that search}} \]

Recall is the fraction of the documents relevant to the query that are successfully retrieved:

\[ \text{Recall} = \frac{\text{number of relevant documents retrieved by a search}}{\text{total number of existing relevant documents which should have been retrieved}} \]

Often, there is an inverse relationship between Precision and Recall
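A small worked example may make the two ratios concrete. The document IDs below are invented, and returning 0.0 for an empty denominator is a convention chosen for this sketch rather than part of the definitions.

```python
def precision_recall(retrieved, relevant):
    """Compute (precision, recall) for one search result.

    retrieved: IDs returned by the search.
    relevant:  IDs that are actually relevant to the query.
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Four documents returned, five relevant documents exist, two overlap.
precision, recall = precision_recall({1, 2, 3, 4}, {2, 4, 5, 6, 7})
```

Here precision is 2/4 and recall is 2/5. Returning more documents can only keep or raise recall, but it usually lowers precision, which is the inverse relationship noted above.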

Relevance and Ranking

- Webmasters constantly rewrite their web pages in an attempt to gain better rankings.
- Some sophisticated webmasters may even "reverse engineer" the location/frequency systems used by a particular search engine.
- Because of this, all major search engines now also make use of "off the page" ranking criteria.

Relevance and Ranking

Off the page factors are those that a webmaster cannot easily influence.

Link analysis

- The search engine analyzes how pages link to each other.
- This helps to determine what a page is about and whether that page is "important" and thus deserving of a ranking boost.

Click-through measurement

- The search engine watches which results someone selects for a particular search.
- It eventually drops high-ranking pages that aren't attracting clicks and promotes lower-ranking pages that do pull in visitors.
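One well-known form of link analysis is PageRank, the "PR" of the Brin and Page architecture. The slides do not give its formula, so the sketch below is an assumption: a standard power iteration with a damping factor of 0.85, run on a hypothetical three-page link graph, with dangling pages spreading their rank evenly.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Approximate PageRank by repeated redistribution of rank.

    links maps each page to the list of pages it links to.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page receives a small base share regardless of its links.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page without outgoing links spreads its rank evenly.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Pages "a" and "b" both link to "c"; "c" links back to "a".
ranks = pagerank({"a": ["c"], "b": ["c"], "c": ["a"]})
most_important = max(ranks, key=ranks.get)
```

Page "c" ends up with the highest score because it has the most incoming links, and "a" outranks "b" because it is linked from the important page "c".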

Limitations in current web search engines

Centralized search engines have limited scalability.

Crawler-based indices are stale and incomplete.

Fundamental issue: how much of the web is ‘crawlable’?

- If you follow the rules, many sites say “robots get lost”.
- What about dynamic content (the Deep Web)? The deep web is estimated to be around 500 times larger than the surface web. These deep web resources mainly include data held by databases, which can be accessed only through queries. Since crawlers discover resources only through links, they cannot discover these resources.
- There is no guarantee that current search engines index, or even crawl, the whole surface web.

Limitations in current web search engines

Single point of failure

Ambiguous ‘words’

- Polysemy: words with multiple meanings, e.g. “train car” vs. “train neural network”.
- Synonymy: multiple words with the same meaning, e.g. “neural network is trained as follows” vs. “neural network learns as follows”.

What about ‘phrases’? Searches are not just a ‘bag of words’.

- Positional information?
- Structural information (throw out case and punctuation)?

Non-text content is also data worth storing.

Most web search engines today crawl only the surface web.

P2P Web Search

The last few years have seen an explosion of activity in the area of peer-to-peer (P2P) systems.

Since an increasing amount of content now resides in P2P networks, it becomes necessary to provide search facilities within P2P networks.

The significant computing resources provided by a P2P system could also be used to implement search and data mining functions for content located outside the system

e.g., for search and mining tasks across large intranets or global enterprises, or even to build a P2P-based alternative to the current major search engines.

P2P Web Search

The following characteristics distinguish P2P systems from previous technologies:

- low maintenance overhead
- improved scalability
- improved reliability
- synergistic performance
- increased autonomy and privacy
- dynamism

P2P Web Search Engines

- YouSearch
- Coopeer
- ODISSEA

YouSearch

YouSearch is a distributed search application for personal webservers operating within a shared context.

It allows peers to aggregate into groups and users to search over specific groups.

Goal: Provide fast, fresh and complete results to users

YouSearch

System Overview

Participants in YouSearch:

- Peer nodes: run YouSearch-enabled clients.
- Browsers: search YouSearch-enabled content through their web browsers.
- Registrar: a centralized, light-weight service that acts like a “blackboard” on which peer nodes store and look up (summarized) network state.

YouSearch

System Overview

Search system:

- Each peer node closely monitors its own content to maintain a fresh local index.
- A bloom filter content summary is created by each peer and pushed to the registrar.
- When a browser issues a search query at a peer p, the peer p first queries the summaries at the registrar to obtain a set of peers R in the network that are hosting relevant documents.
- The peers in R are then contacted directly with the query to obtain the URLs for its results.
- To quickly satisfy any subsequently issued queries with identical terms, the results from each query issued at a peer p are cached for a limited time at p.

YouSearch

Indexing

Indexing is periodically executed at every peer node. Inspector examines each shared file for its last modification date and time. If the file is new or the file has changed, the file is passed to the Indexer. The Indexer maintains a disk-based inverted-index over the shared content. The name and path information of the file are indexed as well.

YouSearch

Indexing

Summarizer:

- The Summarizer obtains a list of terms T from the Indexer and creates a bloom filter from them in the following way.
- A bit vector V of length L is created with each bit set to 0.
- A specified hash function H with range {1,...,L} is used to hash each term t in T, and the bit at position H(t) in V is set to 1.
- YouSearch uses k independent hash functions H1, H2, ..., Hk and constructs k different bloom filters, one for each hash function.
- In YouSearch, the length of each bloom filter is L = 64 Kbits and the number of bloom filters k is set to 3.
- The Summary Manager at the registrar aggregates these bloom filters into a structure that maps each bit position to the set of peers whose bloom filters have the corresponding bit set.
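The summary scheme can be sketched as follows. This is an illustration under stated assumptions, not YouSearch's implementation: the salted SHA-256 hash stands in for the unspecified hash functions H1..Hk, positions run from 0 to L-1 rather than 1 to L, each filter is stored sparsely as set bit positions, and the `Registrar` class, its method names and the peer addresses are hypothetical.

```python
import hashlib
from collections import defaultdict

L = 64 * 1024  # bits per bloom filter (64 Kbits, as in the slides)
K = 3          # number of bloom filters, one per hash function


def bit_position(term, i):
    """Assumed i-th hash function: salted SHA-256 reduced to [0, L)."""
    digest = hashlib.sha256(f"{i}:{term.lower()}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % L


class Registrar:
    """Maps (filter number, bit position) to the peers that set that bit."""

    def __init__(self):
        self.bit_to_peers = defaultdict(set)

    def publish(self, peer, terms):
        """Record a peer's content summary for its indexed terms."""
        for term in terms:
            for i in range(K):
                self.bit_to_peers[(i, bit_position(term, i))].add(peer)

    def candidate_peers(self, keywords):
        """Peers whose summaries have every bit of every keyword set."""
        result = None
        for term in keywords:
            for i in range(K):
                peers = self.bit_to_peers.get((i, bit_position(term, i)), set())
                result = set(peers) if result is None else result & peers
        return result or set()


registrar = Registrar()
registrar.publish("10.0.0.1", ["p2p", "search"])
registrar.publish("10.0.0.2", ["bloom", "filter"])
candidates = registrar.candidate_peers(["p2p", "search"])
```

A peer that really holds the terms is always returned. A peer that does not may occasionally appear as well, which is the false positive case listed under the limitations of YouSearch.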

YouSearch

Querying

[Figure: query flow. The browser sends the query keywords to a peer; the peer computes the hash of the keywords to determine the corresponding bits in each of the k bloom filters; the registrar looks up its bit-position-to-IP-address mapping and returns the intersection of the peer IPs; the peer contacts each of the peers in the list and obtains a list of URLs for matching documents, which are returned as the results.]

YouSearch

Caching

Every time a global query is answered that returns non-zero results, the querying peer temporarily caches the result set of URLs U.

The peer then informs the registrar of this fact.

The registrar adds a mapping from the query to the IP address of the caching peer in its cache table.

YouSearch

Limitations:

- False positive results: 17.38%
- Central registrar, which is a single point of failure
- No extensive phrase search
- No attention has been given to query ranking
- No human user collaboration

Coopeer

Coopeer is a P2P web search engine where each user's computer stores a part of the web model used for indexing and retrieving web resources in response to queries.

Goal: complement centralized search engines (CSEs) to provide more humanized and personalized results by utilizing users’ collaboration.

Coopeer

(a) Collaboration

One may look for interesting web pages in the P2P knowledge repository, which consists of shared web pages. A novel collaborative filtering technique called PeerRank is presented to rank pages in proportion to the votes from relevant peers.

(b) Humanization

Coopeer uses a query-based representation for documents. The relevant words are not directly extracted from page content but are introduced by human users with a high proficiency in their expertise domains.

(c) Personalization

Similar users are self-organized according to the semantic content of their search sessions. Thus, a requestor peer can extend routing paths along its neighbors, rather than just taking a blind shot. User-customized results can be obtained along personal routing paths, in contrast with CSEs.

Coopeer

System Overview

- The requestor forwards the query based on semantic routing. Peers maintain a local index about the semantic content of remote peers.
- On receiving a query message from a remote peer, the current peer checks it against its local store. To facilitate this work, a novel query-based representation of documents is introduced.
- Based on the query representation, the cosine similarity between the new query and the documents can be computed. The documents are relevant enough if the similarity exceeds a certain threshold. These results are then returned to the requestor.
- On receiving the returned results, the requestor peer needs to rank them in terms of the preference of its human owner, using the PeerRank method.

Coopeer

The Coopeer client consists of four main software agents:

1. The User Agent is responsible for interacting with the users. It provides a friendly user interface, so that users can conveniently manage and manipulate the whole search session.

2. The Web-searcher Agent is the resource of the P2P knowledge repository. It performs the user’s individual searching with several search engines from the Internet.

3. The Collaborator Agent is the key component for performing users’ real-time collaborative searching. It facilitates maintaining the P2P knowledge repository, such as information sharing, searching, and fusion.

4. The Manager Agent is the key component of Coopeer, which coordinates and manages the other types of agents. It is also responsible for updating and maintaining data.

Coopeer

PeerRank

- All the users are taken as a “Referrer Network”.
- PeerRank determines a page’s relevance by examining a radiating network of “referrers”.
- Documents with more referrers gain higher ranks.
- It obtains a better rank order, as collaborative evaluation by human users is much more precise than a description based on term frequency or link count.
- It prevents spam, since it is difficult to fake evaluations from human users.

Coopeer

PeerRank

For a given search session, we first compute the similarity between the requestor’s favorite list and the referrer’s; the similarity is then used as the baseline of the recommending degree of the referrer. First, as shown in equation (1), the similarity of the local list and the recommended list is given by the Kendall measure. Second, we convert the rank of a given URL in its recommended list to a moderate score.

- $R(e)$: weight of URL $e$
- $C(e)$: the set of $e$’s referrers
- $Z$: a constant greater than 1
- $p$: the local peer
- $P_i$: a remote peer
- $L_p$, $L_{P_i}$: the lists of $p$ and $P_i$ respectively
- $K^{(r)}(L_p, L_{P_i})$: Kendall function measuring the distance between the local list and the recommended list
- $r$: decay factor
- $S_{L_{P_i}}(e)$: score of $e$ in the recommended list
- $R_e$: rank of $e$
- $R_{Max}$: highest rank in the list of $P_i$, equal to the length of the list

Coopeer

Kendall Measure

The Kendall measure is used to measure the distance between two lists of the same length. The paper extends it to measure two lists of different lengths. In the Kendall function:

- $\tau_1$ and $\tau_2$: two lists composed of URLs
- $K^{(r)}(\tau_1, \tau_2)$: the distance between $\tau_1$ and $\tau_2$
- $r$: a fixed parameter with $0 \le r \le 1$
- $C_{2L}^{2}$: used for normalization; it is the maximum possible distance
- $U(\tau_1, \tau_2)$: the set of all the URLs in $\tau_1$ and $\tau_2$
- $K'^{(r)}_{i,j}(\tau_1, \tau_2)$: the penalty of the URL pair $(i, j)$

Coopeer

Query Based Representation

- A novel type of representation based on the relevant words introduced by human users with a high proficiency in their expertise domains.
- It is efficient on the P2P platform, as the user’s evaluation can be utilized easily through the client application.
- It is used to represent and organize the local documents for responding to remote queries.

Coopeer

Each peer maintains:

- an inverted index table, which represents local documents for responding to remote queries
- the IDs of the documents that were returned for each query

The keys of the inverted index are terms extracted from previous queries.

Example: peer j issues two queries, “P2P Overlay” and “P2P Routing”, and obtains two sets of documents, {d1, d2, d3} and {d3, d4} respectively.

The retrieved documents are then updated with their corresponding query terms.

When any other peer issues a query about “Overlay Routing Algorithm”, peer j looks up relevant documents in the inverted index, using VSM cosine similarity as the ranking algorithm, and d3 gains the highest ranking.
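The example above can be reproduced with a short sketch. Several choices here are assumptions, since the slides do not give the weighting: term weights are raw counts of how often a term appeared in the queries that returned a document, the index is kept as a per-document bag of terms for brevity rather than a true inverted index, and the function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def build_query_index(answered_queries):
    """Describe each document by the terms of the queries that returned it."""
    doc_terms = defaultdict(Counter)
    for query, doc_ids in answered_queries:
        terms = query.lower().split()
        for doc_id in doc_ids:
            doc_terms[doc_id].update(terms)
    return doc_terms


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank(doc_terms, query):
    """Return (document, score) pairs, best match first."""
    q = Counter(query.lower().split())
    scores = {doc: cosine(q, terms) for doc, terms in doc_terms.items()}
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Peer j's two earlier queries and the documents they returned.
history = [
    ("P2P Overlay", ["d1", "d2", "d3"]),
    ("P2P Routing", ["d3", "d4"]),
]
doc_terms = build_query_index(history)
ranking = rank(doc_terms, "Overlay Routing Algorithm")
best_doc = ranking[0][0]
```

Only d3 was returned for both earlier queries, so it is the only document described by both "overlay" and "routing", and it scores highest for the new query.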

Coopeer

Semantic Routing Algorithm

- Each Coopeer client maintains a local Topic Neighbor Index.
- The index records the observed performance of remote peers which have similar topics to the local peer.
- The queries of these search sessions are used to represent the peers’ semantic content.

In the example index, session 1 is the local peer, which has two topics (queries). The other sessions denote remote peers that the local peer is interested in for some aspect. Sessions 2 and 3 are relevant to the “P2P Routing” topic of the local peer, while the others are about “Pattern Recognition”. Peers on the same topic are kept in descending order of rate, so peers providing more interesting resources move to the top of an individual’s local index.

Coopeer

- With the query-based inverted index, the precision of matching results of different subjects was almost 100%.
- The system uses information coming from centralized search engines, so it is not aimed at replacing CSEs, but at complementing them.

Coopeer

- Query-based representation is efficient in P2P because the user’s evaluation can be utilized easily through the client application.
- It is inefficient in CSEs because gaining user evaluation through a web browser is inefficient, and it is impractical to store and index documents for every user’s query.
- It prevents spam, since it is difficult to fake evaluations from human users.
- It uses human searching experience to obtain better results.

ODISSEA

A distributed global indexing and query execution service.

- It maintains a global index structure under document insertions and updates, and under node joins and failures.
- The inverted index for a particular term (word) is located at a single node, or partitioned over a small number of nodes in some hybrid organizations.
- It assumes a two-tier architecture.
- The system is implemented on top of an underlying global address space provided by a DHT structure.

ODISSEA

The system provides the lower tier of the two-tier architecture. In the upper tier, there are two classes of clients that interact with this P2P-based lower tier:

Update clients insert new or updated documents into the system, which stores and indexes them. An update client could be a crawler inserting crawled pages, a web server pushing documents into the index, or a node in a file sharing system.

Query clients design optimized query execution plans, based on statistics about term frequencies and correlations, and issue them to the lower tier.

ODISSEA

ODISSEA

Global Index

An inverted index for a document collection is a data structure that contains, for each word in the collection, a list of all its occurrences, or a list of postings. Each posting contains the document ID of the occurrence of the word, its position inside the document, and other information (is it in the title? in bold face?).

In ODISSEA, each node holds a complete global postings list for a subset of the words, as determined by a hash function.

ODISSEA

Query Processing

A ranking function is a function F that, given a query consisting of a set of search terms q_0, q_1, …, q_{m-1}, assigns to each document d a score F(d, q_0, q_1, …, q_{m-1}). The top-k ranking problem is then the problem of identifying the k documents in the collection with the highest scores.

ODISSEA

We focus on two families of ranking functions.

The first family includes the common term-based ranking functions used in IR, where we add up the scores of each document with respect to all words in the query.

The second family adds a query-independent value g(d) to the score of each page.
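The formulas for the two families did not survive in the slide text. One reading that is consistent with the description above and with the per-term score f(d, q_i) used in the discussion of Fagin's algorithm is the following; treat it as a reconstruction, not a quotation:

```latex
% Family 1: sum of per-term scores
\[ F(d, q_0, \ldots, q_{m-1}) = \sum_{i=0}^{m-1} f(d, q_i) \]

% Family 2: the same sum plus a query-independent score g(d)
\[ F(d, q_0, \ldots, q_{m-1}) = g(d) + \sum_{i=0}^{m-1} f(d, q_i) \]
```

A link-based importance score such as PageRank would be a typical choice for g(d), since it depends on the page and not on the query.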

ODISSEA

Fagin’s Algorithm

Consider the inverted lists for a search query with two terms q_0 and q_1. Assume they are located on the same machine, and that the postings in the lists are pairs (d, f(d, q_i)), i ∈ {0, 1}, where d is an integer identifying the document and f(d, q_i) is real valued. Assume each inverted list is sorted by the second attribute, so that documents with the largest f(d, q_i) are at the start of the list. Then the following algorithm, called FA, computes the top-k results:

ODISSEA

FA:

(1) Scan both lists from the beginning, by reading one element from each list in every step, until there are k documents that have each been encountered in both of the lists.

(2) Compute the scores of these documents. Also, for each document that was encountered in only one of the lists, perform a lookup into the other list to determine the score of the document. Return the k documents with the highest scores.
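The two phases of FA can be sketched as below for the two-term case. The sketch makes some assumptions beyond the slides: the total score is the sum of the per-term scores, scores are non-negative, a document missing from a list contributes 0 for that term, and a dictionary stands in for the lookup into the other list. The function name and the sample postings are hypothetical.

```python
import heapq


def fagin_top_k(list0, list1, k):
    """Top-k documents for two postings lists sorted by descending score.

    Each list holds (doc_id, score) pairs.
    """
    lists = [list0, list1]
    lookup = [dict(list0), dict(list1)]  # stands in for random access
    seen = [set(), set()]
    depth = 0
    longest = max(len(list0), len(list1))

    # Phase 1: read one posting from each list per step until k documents
    # have been encountered in both lists (or the lists are exhausted).
    while depth < longest and len(seen[0] & seen[1]) < k:
        for i in (0, 1):
            if depth < len(lists[i]):
                seen[i].add(lists[i][depth][0])
        depth += 1

    # Phase 2: score every document seen so far, looking up the score
    # in the other list where needed.
    candidates = seen[0] | seen[1]
    scores = {d: lookup[0].get(d, 0.0) + lookup[1].get(d, 0.0) for d in candidates}
    return heapq.nsmallest(k, scores.items(), key=lambda kv: (-kv[1], kv[0]))


postings_q0 = [(1, 0.9), (2, 0.8), (3, 0.3), (4, 0.1)]
postings_q1 = [(3, 0.7), (2, 0.6), (1, 0.2), (5, 0.1)]
top2 = fagin_top_k(postings_q0, postings_q1, k=2)
top2_ids = [doc for doc, _ in top2]
```

In this run the scan stops after three steps, so documents 4 and 5 are never read. That early stop is the point of the algorithm, and it matters even more when the lists live on different nodes.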

Conclusion

Still, no P2P web search engine has outperformed Google!

- (+) Lots of resources for complex data mining tasks and for crawling the whole surface web
- (+) The emergence of semantic communities also has a positive impact on P2P web search performance
- (-) Lack of global knowledge
- (-) Smart crawling strategies beyond BFS are hard to implement in a P2P environment without a centralized scheduler

Some Open Problems

How can one uniformly sample web pages on a web site without an exhaustive list of these pages?

Bar-Yossef et al. converted the web graph into an undirected, connected, and regular graph. The equilibrium of a random walk on this graph is the uniform distribution. It is not clear how many steps such a walk needs to perform. A more significant problem, however, is that there is no reliable way of converting the web graph into an undirected graph.

Some Open Problems

Data Streams

The query logs of a web search engine contain all the queries issued at this search engine. The most frequent queries change only slowly over time. However, the queries with the largest increase or decrease from one time period to the next show interesting trends in user interests. We call them the top gainers and losers.

Since the number of queries is huge, the top gainers and losers need to be computed by making only one pass over the query logs. This leads to a data stream problem over the two sequences of queries. Another interesting variant is to find all items above a certain frequency whose relative increase (i.e., their increase divided by their frequency in the first sequence) is the largest.

References

- Sergey Brin and Lawrence Page. The anatomy of a large-scale hypertextual Web search engine. Computer Networks and ISDN Systems, Volume 30, Issue 1-7, 1998.
- Make it fresh, make it quick: searching a network of personal … International World Wide Web Conference, Budapest, Hungary, 2003.
- Jin Zhou, Kai Li and Li Tang. Towards a Fully Distributed P2P Web Search Engine. Proceedings of the 10th IEEE International Workshop on Future Trends of Distributed Computing Systems, 2004.
- T. Suel, C. Mathur, J. Wu, J. Zhang, A. Delis, M. Kharrazi, X. Long, K. Shanmugasunderam. Odissea: A peer-to-peer architecture for scalable web search and information retrieval. 2003.
- B. Bloom. Space/time Trade-offs in Hash Coding with Allowable Errors. Communications of the ACM, volume 13(7), pages 422–426, 1970.
- www.en.wikipedia.org

Extra Slides

Bloom Filters

A Bloom filter is a space-efficient probabilistic data structure that is used to test whether an element is a member of a set. False positives are possible, but false negatives are not. The more elements that are added to the set, the larger the probability of false positives.

Bloom Filters

- An empty Bloom filter is a bit array of m bits, all set to 0. There must also be k different hash functions defined, each of which maps a key value to one of the m array positions.
- To add an element, feed it to each of the k hash functions to get k array positions. Set the bits at all these positions to 1.
- To query for an element (test whether it is in the set), feed it to each of the k hash functions to get k array positions.
- If any of the bits at these positions is 0, the element is not in the set; if it were, then all the bits would have been set to 1 when it was inserted. If all are 1, then either the element is in the set, or the bits have been set to 1 during the insertion of other elements.
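The growth of the false positive probability can be made concrete with the standard approximation below. It assumes the k hash functions are independent and uniform over the m positions, and it introduces n, the number of elements inserted so far, which the slides do not name:

```latex
% Probability that one given bit is still 0 after n insertions
\[ \left(1 - \frac{1}{m}\right)^{kn} \approx e^{-kn/m} \]

% Probability that all k bits checked for a non-member are 1
\[ p \approx \left(1 - e^{-kn/m}\right)^{k} \]
```

For fixed m and k, p increases with n, which matches the statement that adding more elements raises the probability of false positives.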

Bloom Filters

An example of a Bloom filter, representing the set {x, y, z}. The colored arrows show the positions in the bit array that each set element is mapped to. The element w, not in the set, is detected as a nonmember as it is mapped to a position containing a 0.

Bloom Filters

Bit position:  0 1 2 3 4 5 6 7 8
Bit value:     1 1 0 1 0 0 0 1 1

Width (w): 9 bits in this example

Hash(“Uncle John’s Band”) = {0, 3, 7}

Hash(“Box of Rain”) = {1, 3, 8}
