CS401 PROJECT
Ranking in WHOWEDA
Project Members
Ritesh Sagi
Srikanth Bolledula
Mohammed Abdul Baseer
Ranking in WHOWEDA
Abstract:
The rate of growth of data on the WWW is more than exponential, and a huge amount of data on any single topic is available on the web. Almost all of the search engines available today are based on keyword search, and the result sets they return are huge. A user is not interested in looking at all of the search results; he would like to find the document of interest within the top few results. Hence, there is a need to rank search results. Keyword search techniques, however, give the user little freedom to specify his constraints. WHOWEDA is a data model that searches for user-interested documents based on a query graph. The results obtained from the query graph search satisfy the constraints specified in the query graph, and each result resembles the others in structure. These results are called web tuples and are stored in a web table. This paper discusses the issues involved in ranking these tuples and presents an algorithm to rank them. It also includes a prototype implementation of one of the ranking cases.
Introduction:
The World Wide Web has become one of the fastest growing applications on the Internet. It provides a powerful and easy-to-set-up medium for almost any user on the Internet to disseminate information. More and more information has become available on-line through the WWW, from personal data to scientific reports to up-to-the-minute satellite images. The information growth rate is so alarming that it has become very difficult to find the information that is relevant. This information explosion leads to a problem commonly known as the resource discovery problem: in order to find interesting WWW pages, a user has to browse through many WWW sites, which is a very time-consuming process.
Currently, information on the web is discovered primarily by two mechanisms: browsers and search engines. Existing search engines such as Yahoo, AltaVista and Google service millions of queries a day, yet it is clear that they are less than ideal for retrieving the ever-growing body of information on the web. Whenever a user searches for some required data, the search engine provides a list of URLs that contain relevant results. The results may number in the thousands, and the user does not have the patience or time to look into each of them.
Thus there is a growing need to rank web results. Almost all of the search engines existing today perform their search using keywords. This method requires the user to know keywords that might be present in the document of his interest, and the user does not always have such knowledge. He might instead want to pose an SQL-like query with additional constraints on the results: for example, he might only be interested in results from a particular website, or in web pages that have some kind of relationship between them, and so on. Present-day search engines are not capable of such a search. The Warehouse of Web Data (WHOWEDA) model tries to fulfill these needs. It allows the user to pose an SQL-like query and also to place extra constraints on the links between pages of his interest. The complete details of the WHOWEDA model are beyond the scope of this paper; we introduce only the basic concepts of the data model.
Review of other related work:
Because of the growing demand for ranking search results, a lot of work has been done and various algorithms have been developed to tackle this problem. Information retrieval techniques such as the Boolean retrieval model, the vector space model and the probabilistic retrieval model have existed for a long time. Latent semantic indexing (LSI), a somewhat different approach, addresses some drawbacks of these techniques. Boolean retrieval is the simplest of the retrieval methods and relies on the use of Boolean operators: the terms in a query are linked together with AND, OR and NOT [1]. This method is often used in search engines on the Internet because it is fast and can therefore be used online. The vector space model procedure can be divided into three stages. The first stage is document indexing, where content-bearing terms are extracted from the document text. The second stage is the weighting of the indexed terms to enhance retrieval of documents relevant to the user. The last stage ranks the documents with respect to the query according to a similarity measure. The vector space model has been criticized for being ad hoc [2]. In the probabilistic retrieval method [3], the probability that a specific document will be judged relevant to a specific query is based on the assumption that the terms are distributed differently in relevant and non-relevant documents. The probability formula is usually derived from Bayes' theorem [4]. Extensions of the probabilistic retrieval model incorporate relationships of the document descriptors [5,6].
Retrieval methods suffer from two well-known language-related problems called synonymy and polysemy [7]. Synonymy means that an object can be referred to in many ways, i.e., people use different words to search for the same object; examples are the words car and automobile. Polysemy is the problem of words having more than one specific meaning; an example is the word jaguar, which could mean a well-known car make or an animal. Latent Semantic Indexing (LSI) [7] offers a dampening or weakening of synonymy. By applying a Singular Value Decomposition (SVD) to a term-by-document matrix of term frequencies, the dimension of the transformed space is reduced by selecting the highest singular values, where most of the variance of the original space lies. Using SVD, the major associative patterns are extracted from the document space and the small patterns are ignored. The query terms can also be transformed into this subspace, and can lie close to documents in which the terms do not appear. The advantage of LSI is that it is fully automatic and uses no language expertise; a positive side effect is that the document vector becomes much shorter. All the searching and ranking techniques discussed above base their search on keyword input from the user. As discussed earlier, the user might not always be able to provide keywords that appear in the document of his interest, and he may want to specify more constraints on his query than just keywords.
The most recent work in this field has been done by the web warehousing group at Nanyang Technological University, which investigates issues in Web databases and Web information mining. The Whoweda project aims to design and implement a warehousing capability for an organization that materializes and manages useful information from the World Wide Web (WWW) in order to support strategic decision-making. It aims to build a data warehouse containing strategic information derived from the WWW that may also interoperate with conventional data warehouses. Some interesting work on ranking of web tuples (search results, in WHOWEDA terminology) has been done by Saurov Bhowmick in his paper titled "Ranking of web pages using a global ranking operator", which uses WHOWEDA's concept of global coupling [9].
WHOWEDA model:
Although most users obtain WWW information using a combination of search engines and browsers, these two types of retrieval mechanisms do not necessarily address all of a user's information needs. Present-day search engines fail to fulfill users' needs because they are purely resource locators, with no capability to reliably suggest the contents of the websites they return in response to a query. Furthermore, the task of information retrieval still burdens the user, who has to manually sift through 'potential' sites to discover the relevant information. The presence of mirror sites can also make the task of finding the document of interest tedious.
To overcome the limitations of search engines and provide the user with a powerful and friendly query mechanism for accessing information on the web, the critical problem is to find effective ways to build web data models of the information of interest, and to provide a mechanism to manipulate this information to derive additional useful information. Until now, knowledge discovery on the WWW has been limited to mining path traversal patterns by analyzing server access logs (web log mining) and to extraction of semi-structured information from HTML documents. This leaves much to be desired in deriving interesting, non-explicit patterns from the web information base.
Whoweda (Warehouse of Web Data), a project started by the web warehousing group [ ] at Nanyang Technological University, addresses this field. The objective of the group is to design and implement a Web warehouse that materializes and manages useful information from the Web to support strategic decision-making. It aims to build a Web warehouse containing strategic information coupled from the Web that may also inter-operate with conventional data warehouses.
Architecture overview of Whoweda
Whoweda is a meta-data repository of useful, relevant web information, available for querying and analysis. As relevant information becomes available on the WWW, it is coupled from various sources, translated into a common Web data model (the Web Information Coupling Model), and integrated with existing data in Whoweda. At the warehouse, queries can be answered and Web data analysis can be performed quickly and efficiently, since the information is directly available. Accessing data at the warehouse does not incur the costs that may be associated with accessing data from information sources scattered at different geographical locations, and in a Web warehouse data is available even when the WWW sources are inaccessible. It is not possible to discuss the complete WHOWEDA model in this paper, as we are concerned with the ranking of the search results obtained by WHOWEDA.
The model consists of a collection of web objects known as nodes and links. Nodes represent web pages and links represent the hyperlinks connecting them. A node is characterized by the attributes {url, title, format, size, date, text} and a link by the attributes {source-url, target-url, label, link-type}. The format of a node represents the file type of the node, which can be HTML, Postscript, PDF, etc. The link-type indicates whether the link between pages is an interior, local or exterior link. A web tuple is a connected, directed graph consisting of nodes and links. A collection of web tuples described by a web schema is called a web table.
An example web schema written as a file is shown in Fig. 1, and the same schema is represented graphically in Fig. 2.

[nodes]
'a' URL Equals "http://www.umr.edu/"
'b' title contains "computer science"
'c' title contains "madria"

[links]
'e' label contains "department"
'f' label contains "faculty"

Fig 1. An example web schema

[Fig 2. Graphical representation of the web schema: node a (URL http://www.umr.edu) is connected by link e (label "department") to node b (title containing "computer science"), which is connected by link f (label "faculty") to node c (title containing "madria").]
The types of links present between the nodes (web pages) are:
Interior Link: A link to a different portion of the same web page; the URLs of the source and target documents are the same.
Local Link: The target URL is on the same server as the source URL.
Exterior Link: The target URL is completely external to the source URL; in other words, the target URL is not on the same server as the source URL.
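The three link types can be distinguished purely from the source and target URLs. A minimal sketch in Python (illustrative only; the paper does not show WHOWEDA's actual classification code, and the function name is ours):

```python
from urllib.parse import urlparse

def classify_link(source_url: str, target_url: str) -> str:
    """Classify a hyperlink as interior, local, or exterior,
    following the definitions given above."""
    src, tgt = urlparse(source_url), urlparse(target_url)
    # Interior link: same document, target points into a portion of it.
    if (src.netloc, src.path) == (tgt.netloc, tgt.path):
        return "interior"
    # Local link: different document on the same server.
    if src.netloc == tgt.netloc:
        return "local"
    # Exterior link: target lives on a different server.
    return "exterior"
```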
Formally, a web tuple tw is a triplet,
tw = (Nw, Lw, Ew)
Nw is a set of nodes,
Lw is a set of links,
Ew is the set of connectivities in web tuple tw.
Formally, web schema M is represented by,
M = (Xn, Xl, C, P)
Xn – set of node variables
Xl – set of link variables
C – set of connectivities
P – set of predicates
The connectivities and predicates present in the WHOWEDA model are beyond the scope of this paper.
The query from the user resembles the web schema shown in Fig. 2 and is called the query graph. In Fig. 2, a constraint is specified on the starting node: its URL should be http://www.umr.edu. A link from that page with the label "department" points to a web page containing "computer science" in its title, and so on. Observe that the user is given more flexibility to specify constraints in order to obtain results of his interest. The results obtained from this query graph are stored in a web table as web tuples. In this project we are not concerned with searching for the results of the query graph; we assume we are provided with the results of the search that match the query graph, and we are concerned only with ranking those results.
Problem Statement and Analysis:
The number of possible matches for a search query increases with time as more and more documents are put on the web. The user's ability to look at documents does not scale up in this fashion: people are still only willing to look at the first few tens of results. It is necessary to extract the "most relevant" documents in the web tuples for the user. Our notion of "relevant" web tuples includes only the very best tuples (hot tuples), since there may be a large number of "slightly relevant" tuples. It is necessary to rank the web tuples based on user-specified conditions; this enables the user to browse web tables efficiently. There is a subtle difference between ranking and searching. We adopt an index-and-ranking (IR) system for ranking web tuples. Indexing is the first task an index-ranking (IR) system has to accomplish; a document can only be searched and retrieved if it has been indexed into a database. Each word (index) is represented by a set of coordinates that describes where the word is located: in what document, sentence, paragraph, heading, and so on. The resolution of the index controls whether retrieved pointers point to whole documents, subsections within documents, or exact word locations.
Ultimately, IR systems analyze queries to see if the words in the queries are in the indexes. If
they are, the system displays the matching documents and ranks or scores them according to
various scoring methods/algorithms. The documents that receive the highest score represent the
best match, and the system presents them at the top of the list of retrieved documents.
As specified earlier, in our project we are concerned only with ranking the web tuples obtained by a query. It is assumed that they have been obtained from an index server that performs the indexing described above. Indexing in an IR system is performed according to the following model:
[Figure: Indexing model of an IR system. A user client sends a query to an index server and receives results; the index server may redirect the query to other index servers, and each index server indexes a set of WWW sites.]
The various factors to be considered while ranking web pages are:
1. The existence of the query terms in the document or web page.
2. The number of occurrences of the query terms in the document.
3. The total size of the document.
4. The frequency of occurrence of the query terms in the document.
5. The number of links between pages containing the relevant document, i.e., whether a document is referred to by, or refers to, some other page.
6. The importance of the place of occurrence of the query terms: in the URL, in the title, or inside the web page.
7. The same object being referred to in different ways by different people, i.e., synonyms of the query words.
8. Duplicated web pages, i.e., replicas of the same web page.
9. Domain restriction, i.e., documents to be found within a particular domain of servers.
Ranking web pages so that the user finds his data of interest in minimum time has been a recent and growing area of interest in the Web warehousing environment. Ranking is performed using a scoring function.
Scoring Function: A scoring function assigns weights to components of a query based on the factors specified above. Each factor assigns a weight to a web page, and the resulting score, used for ranking, is the sum of these weights. The values of the weights differ among document types and languages. When weighting, words that are indexed or used in a query are assigned a numerical value, usually based on the frequency of the words in the stored documents.
Ranking models can be probabilistic. By using information about word frequency, IR systems
can assign a probability of relevance to all the documents that can be retrieved. These
documents are then ranked according to this probability. Ranking is very useful because
document databases tend to be very large, and large return sets can be daunting if not sorted.
Approach/Model:
The ranking algorithm uses a scoring function to assign different ranks to the web pages resulting from the query search and sorts the pages in decreasing order of rank. The user may supply different criteria for ranking the pages; we consider some of them below.
Advanced Search Criteria:
a. Standard Condition:
A user may be interested in a subset of the original results, as search engines provide a very broad category of results. For example, if a user wants to search for SATURN that is not the planet, the query can be written as:
Saturn - Planet
Assumption: if the rank turns out to be zero, the tuple is eliminated.
Let k1, k2, ... be the keywords the user wants to have and a1, a2, ... be the keywords the user wants to avoid. The condition can then be specified as
k1, k2 - a1, a2, a3
Once the tuples are obtained, the ranks need to be defined; the rank of a web page u is
R(u) = c . f(k1) . f(k2) ... q(a1) . q(a2) . q(a3) ...
where c is a constant, f(k1) is the frequency of the word k1 in the document, and q(a1) is a boolean function that helps eliminate pages containing the keywords to be avoided:
q(a1) = { 1 if a1 does not occur in the document
          0 if a1 occurs once or more in the document }
So if any of the keywords a1, a2, ... occurs even once, the rank turns out to be zero.
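The standard-condition rank above can be sketched in a few lines of Python. This is an illustrative sketch, not the paper's implementation: keyword matching here is naive lowercase word counting, and the function name is ours.

```python
def rank_standard(text: str, wanted, avoided, c: float = 1.0) -> float:
    """R(u) = c * f(k1) * f(k2) * ... * q(a1) * q(a2) * ...

    f(k) is the frequency of keyword k in the document; q(a) is 1 if the
    avoided keyword a is absent and 0 otherwise, so a rank of zero means
    the tuple is eliminated, per the assumption stated above.
    """
    words = text.lower().split()
    rank = c
    for k in wanted:
        rank *= words.count(k.lower())        # f(k): zero if k never occurs
    for a in avoided:
        rank *= 0 if a.lower() in words else 1  # q(a)
    return rank
```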
b. Location Criteria:
A user may also specify the position of the keywords he is looking for, such as occurrences in the title, URL or body of the document. This can be done as follows. Suppose a user enters a query containing keywords k1, k2, ..., km. We define the following expressions:
Wtitle: the weight assigned to a keyword present in the title of the document.
Wtext: the weight assigned to a keyword present in the text of the document.
Wurl: the weight assigned to a keyword present in the URL of the document.
Ftitle(k): the frequency of the keyword k in the title of the document.
Ftext(k): the frequency of the keyword k in the text of the document.
Furl(k): the frequency of the keyword k in the URL of the document.
The rank can then be approximately assigned as
Rank = Wtitle(kn) * Ftitle(kn) + Wurl(kn) * Furl(kn) + Wtext(kn) * Ftext(kn)
where Wtitle(kn), Wurl(kn) and Wtext(kn) are constant weights assigned depending on the occurrences of keyword kn in the title, URL and text of the document respectively. We assign these values such that
Wurl > Wtitle > Wtext.
This is the default criterion we use in ranking web pages. If a user is interested in criteria other than the default, we can change the default by changing the weights. For example, if a user is interested only in keywords occurring in the titles of pages, then
Wurl(kn) = 0; Wtext(kn) = 0;
and the rank will be calculated only from the keywords in the titles: the page with the highest number of keyword occurrences in the title gets the highest rank.
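The location-criteria rank can be sketched as follows. The particular weight values are illustrative assumptions; the paper only requires Wurl > Wtitle > Wtext, and frequency counting here is naive substring counting.

```python
# Assumed default weights, chosen only to satisfy Wurl > Wtitle > Wtext.
W_URL, W_TITLE, W_TEXT = 3.0, 2.0, 1.0

def rank_location(url: str, title: str, text: str, keywords) -> float:
    """Location-criteria rank: weighted keyword frequencies in the url,
    title and text of a page, summed over all query keywords."""
    def freq(s, k):
        return s.lower().count(k.lower())
    return sum(
        W_URL * freq(url, k) + W_TITLE * freq(title, k) + W_TEXT * freq(text, k)
        for k in keywords
    )
```

Setting W_URL and W_TEXT to zero recovers the title-only variant described above.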
c. Domain Restriction:
The user may be interested in looking only at web pages from a particular server. For example, suppose the user is interested only in UMR-related web pages; in this case the constraint is on the url node variable. Let the keyword be k1. Then
Rank = c . Wurl(k1)
where Wurl is a weight assigned as a function of the frequency of the keyword in the url:
Wurl = { 0 if the keyword is not present,
         w(f1) if the keyword is present }
and f1 is the frequency, i.e., the number of times the keyword appears in the url.
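A minimal sketch of the domain-restriction rank, assuming the simple choice w(f1) = f1 (the paper leaves w unspecified):

```python
def rank_domain(url: str, keyword: str, c: float = 1.0) -> float:
    """Domain-restriction rank: zero unless the keyword (e.g. a server
    name such as 'umr') occurs in the url; otherwise a weight that
    grows with its frequency f1 in the url."""
    f1 = url.lower().count(keyword.lower())
    return 0.0 if f1 == 0 else c * f1
```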
Web Tuple, Web Table and Query Graph:
A web tuple is a directed connected graph, and a web table is a collection of web tuples. In the figures, boxes and directed lines correspond to node and link variables. The keywords express the content of the web documents or the labels of the hyperlinks between them. The web warehouse model consists of a hierarchy of web objects: nodes correspond to HTML or plain-text documents, and links correspond to the hyperlinks between web pages. These objects consist of a set of attributes:
Node = [url, title, format, size, date, text]
Link = [source-url, target-url, label, link-type]
For the Node object, the attributes are the url of a node instance and its title, document format, size (in bytes), date of last modification and textual contents. For the Link object, the attributes are the urls of the source and target documents, the anchor or label of the link, and the type of link.
A web tuple is thus a directed connected graph consisting of a set of nodes and links which are instances of Node and Link respectively.
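The Node, Link and WebTuple objects above map directly onto simple data classes. A sketch in Python (field names follow the attribute lists above; the class layout itself is our assumption, not WHOWEDA's code):

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """A web page with the attributes listed above."""
    url: str
    title: str = ""
    format: str = "HTML"   # HTML, Postscript, PDF, ...
    size: int = 0          # size in bytes
    date: str = ""         # date of last modification
    text: str = ""         # textual contents

@dataclass
class Link:
    """A hyperlink between two pages."""
    source_url: str
    target_url: str
    label: str = ""        # anchor text of the link
    link_type: str = ""    # interior, local or exterior

@dataclass
class WebTuple:
    """A directed connected graph of Node and Link instances."""
    nodes: list = field(default_factory=list)
    links: list = field(default_factory=list)
```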
A query graph is defined as a 5-tuple G = (Xn, Xl, C, P, Q) where
Xn is a set of node variables,
Xl is a set of link variables,
C is a set of connectivities,
P is a set of predicates over the node and link variables, and
Q is a set of predicates on the complete query graph.
Example: Below is an example of a query graph; the global web coupling operator can be applied to retrieve the sets of related documents that match it. Observe that some nodes and links have keywords imposed on them. These keywords express the content of the web document or the label of the hyperlink between the web documents.
[Query graph: node a contains the keyword "news"; a link labeled "sports" leads from a to node b, and a link labeled "soccer" leads from b to node c.]
Example of Web Tuples:

Tuple 1: cnn.com (title: cnn.com; news: 5 times) --sports--> sportsillustrated.cnn.com (title: cnnsi.com from cnn and sportsillustrated; sports: 11 times) --soccer--> sportsillustrated.cnn.com/sports/soccer (title: cnnsi.com from cnn and sportsillustrated; soccer: 7 times)

Tuple 2: abcnews.com (title: abcnews.com; home; news: 11 times) --sports--> abcnews.go.com/section/sports/ (title: abcnews.com sports index; sports: 4 times) --soccer--> espn.go.com/soccer/index.html (title: espn.com - soccer - index; soccer: 3 times)

Tuple 3: foxnews.com (title: foxnews.com; news: 7 times) --sports--> foxsports.com (title: foxsports.com; sports: 11 times) --more sports--> foxsports.com/more/home/index.cfm... (title: foxsports.com more sports) --soccer--> foxsports.com/mls/home/index.cfm (title: foxsports.com - soccer; soccer: 3 times)

Tuple 4: msnbc.com (title: msnbc; cover; news: 8 times) --sports--> msnbc.com/news/spt_front.asp (title: msnbc sports; sports: 5 times) --more sports--> msnbc.com/news/xmoresports-front.asp (title: other sports front page...) --soccer star helping launch new womens pro league--> msnbc.com/news/561402.asp (title: Mia Hamm: selling her dream; soccer: 9 times)

Tuple 5: usatoday.com/news/nfront.htm (title: usatoday.com; news: 12 times) --sports--> usatoday.com/sports/sfront.htm (title: usatoday.com; sports: 10 times) --soccer--> usatoday.com/sports/soccer/sos.htm (title: usatoday.com; soccer: 4 times)

Tuple 6: nytimes.com (title: new york times on the web; news: 13 times) --sports--> nytimes.com/pages/sports/index/html (title: the new york times sports; sports: 9 times) --soccer--> nytimes.com/pages/sports/soccer/index.html (title: The new york times soccer; soccer: 11 times)
The tuples most relevant to the user must be ranked high, where "most relevant" means the tuples the user finds most interesting.
Ranking Operator and Ranking Function:
Let ρ be a ranking operator, W the web table consisting of web tuples, and WR the ranked set of web tuples:
WR = ρ(W, condition(s))
The operator takes the web table to be ranked together with user-specified node variables and keyword conditions on which ranking is performed, and outputs a ranked set of web tuples.
Specification of Ranking Condition:
Type 1:
The user specifies a keyword condition for a single node variable. The keywords specified by the user can be present in the url, the title or the text of the web page. The query can be expressed as
RANK web_table_name
WHERE (node_variable, keyword)
The WHERE clause is specified by the user, and ranking is performed based on these conditions. Assume the user enters keywords k1, k2, ..., km and consider a web page u. With Wtitle, Wurl and the frequency functions as explained previously, the rank can be approximately assigned as
Rank(u) = Wtitle(kn) * Ftitle(kn) + Wurl(kn) * Furl(kn) + Wtext(kn) * Ftext(kn)
For easier terminology, we use a weight factor w1 equal to Rank(u). Wtitle is a constant weight depending on the user options. Suppose the user selects an option to prefer pages that have the keyword kn in the title, and assume the values 1 and 2 are small compared to Wtitle; then we set
Wurl(kn) = 1; Wtext(kn) = 2;
This is done so that if two tuples result in the same rank, these small weights help in looking at the url and the text to decide the rank of the tuple. The rank is then calculated such that the pages with the highest number of keyword occurrences in the title get the highest rank.
If two or more web tuples have nodes with the same number of keyword occurrences, the following considerations are used to rank them.
Web pages in which the keywords appear in the url or title are considered more accurate than others.
Since we are looking at only one node and its conditions, the concept of incoming links does not arise here.
Since the user specifies only a single node variable and the keyword condition imposed on it, we do not care about the hyperlinks and their labels.
Specification of Ranking Condition:
Type 2:
Here the user specifies the start node variable and the end node variable; he may also enter a set of keywords with conditions on which nodes they should appear in and where. The primary idea is to calculate the ranks of all the nodes between the start and end nodes, inclusive, and sum them up to get the total rank of the web tuple. The query can be expressed as
RANK web_table_name
WHERE (start_node_variable, end_node_variable, keyword)
Unlike the first condition, which was the simplest case, we now have to take into consideration the interrelationship between the different nodes as depicted by the query graph. So we assign four scores to each node and sum them up to get the actual rank of the node; these four scores also take into account the interconnection between the tuples, so that the most relevant tuples are brought up in the ranking.
Algorithm:
For i = 1 to m
{
    Webtuple(i) = weight(start node) + weight(next node) + ... + weight(end node);
    // Webtuple(i) indicates the ranking score of the ith web tuple.
}
// m is the number of web tuples in the web table.
Rank the web tuples in descending order of ranking score.
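The loop above can be sketched in Python. The per-node weight is passed in as a callable, since its exact composition is described separately in the text; everything else (names, list-of-lists representation) is our illustrative assumption.

```python
def rank_web_tuples(web_table, node_weight):
    """Score each web tuple as the sum of the weights of its nodes from
    start to end, then sort the tuples in descending order of score."""
    scored = [(sum(node_weight(n) for n in tuple_nodes), tuple_nodes)
              for tuple_nodes in web_table]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored
```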
The weight of a single node is calculated as
Weight(node) = w1 + w2 + w3 + w4;
Weight w1 has been explained previously; we proceed to explain the other three weight factors.
Weight W2:
It should be pointed out that the ranking scores are assigned in such a way that the web tuple that satisfies the user query graph, has the highest occurrences of the keywords in the locations specified by the user, and has the shortest path satisfying the query graph will get the highest rank. That means the other weights, based on various other factors, make a difference only when two web tuples get almost the same ranking score.
W2 = constant * f(number of occurrences of the keyword specified for the previous node in the current node)
where f is a function of the frequency of the label on the incoming link. An example will make this clear.
Suppose the user query graph is:
[Query graph: node a with URL panacea.org; a link labeled "drug list" leads to node b; a link labeled "issues" leads to node c; a link labeled "side effects" leads to node d.]
For convenience, we consider only part of the query graph. Here the user wants a web page containing the keyword "drug list" followed by a web page containing the keyword "issues". It is clear that the user is looking for issues related to the drug list, but when we carefully look at web tuple 2, it is quite possible that its second page addresses other health issues not necessarily related to drug lists. This is because node c is unbound, in the sense that it is not defined precisely by any predicate. In the first web tuple, however, the second page also contains the "drug list" keyword in addition to the "issues" keyword.

[Example web tuples: tuple 1 contains a page "The Drug list for the disease myopis ..." followed by a page "Issues concerning the drugs in the drug list, ..."; tuple 2 contains a page "coronary failure requires the following drugs specified in the Drug list" followed by a page "The primary health issues to be concerned about ....".]

We can conclude that the first web tuple more accurately reflects the information the user is trying to find, so web tuple 1 will be ranked higher.
Weight W3:
If you look at the query graph, it does not explicitly say anything about the number of nodes to traverse to get to the end node variable. But if a user specifies a query graph and a set of keyword conditions for the nodes, it is clear that the user is not interested in going through a long path, in which case he loses interest and may move on to other pages. Consider one of the previous examples, the query graph in which node a contains "news", a link labeled "sports" leads to node b, and a link labeled "soccer" leads to node c: the user is interested in news and wants to browse through the sports page to find some news on soccer. As you can observe from the web tuples, cnn has a direct link to soccer from its sports page, whereas the foxnews and msnbc pages do not: soccer is listed on a "more sports" page, from which there is a link to the soccer page. This clearly indicates that these pages do not give as much importance to soccer, so they should appear lower in the ranked set of tuples relative to tuples with direct links.
Let Pq be the least path length satisfying all the keyword conditions given in the query graph, and let Pw be the path length, i.e., the number of nodes traversed to reach the end node from the start node. Then
W3 = constant * f(path difference)
where f is a function of the path difference, defined as
f = 1 if Pw = Pq
    1/(Pw - Pq)^2 if Pw > Pq
This pulls longer paths down in the ranked set of web tuples.
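The W3 factor above is a one-liner; a sketch (function name and the constant's default are ours):

```python
def w3(pq: int, pw: int, c: float = 1.0) -> float:
    """W3 = c * f(path difference). pq is the least path length that
    satisfies the query graph; pw is the tuple's actual path length.
    Longer paths are penalized quadratically."""
    f = 1.0 if pw == pq else 1.0 / (pw - pq) ** 2
    return c * f
```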
Weight W4:
This factor primarily takes into account the importance of links to the next nodes.
For example, in the previous query graph, suppose the user explicitly specifies that the label
must contain the keyword side effects. Given two tuples containing the keyword side effects in
their labels, the one that points to a node whose url also contains this keyword will be ranked
higher. This is because a web page with the words side effects in its url is more likely to
describe the side effects accurately than a page that does not have the keyword in its url.

W4 = some constant * (frequency of the keyword in the url of the next node)
If we observe the sports-related tuples, we see that foxnews does not have the keyword
explicitly in its url, while other pages such as cnn.com and nytimes.com have the keyword
soccer in their url, which gives them a definite advantage over other pages.
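The W4 factor can be sketched directly from its definition; the constant C4 and the case-insensitive match are assumptions, since the paper does not fix either:

```python
import re

C4 = 1.0  # assumed tuning constant

def keyword_freq(keyword: str, text: str) -> int:
    """Number of case-insensitive occurrences of the keyword in the text."""
    return len(re.findall(re.escape(keyword.lower()), text.lower()))

def w4(keyword: str, next_node_url: str) -> float:
    """W4 = constant * frequency of the keyword in the url of the next node."""
    return C4 * keyword_freq(keyword, next_node_url)

# The cnn-style url containing "soccer" scores above one that does not.
print(w4("soccer", "http://www.cnn.com/sports/soccer/index.html"))  # 1.0
print(w4("soccer", "http://www.foxnews.com/sports/more.html"))      # 0.0
```

The urls here are illustrative stand-ins for the pages discussed above, not the actual tuples from the web table.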
Weight W5 :
This factor takes into account the fact that if many web tuples pass through a particular path,
that piece of the path could be quite important, so all tuples passing through it should be
given higher importance. If there are two or more such paths, the question arises which should
be given more importance; in that case we look at the query graph and the keywords, and assign
the paths relative importance based on the location and frequency of the keywords.

W5 = some constant * (relative weight of a particular path)

The weight of a particular path can be calculated using the first weight factor.
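A simple sketch of the "shared path" idea: score each tuple by the fraction of tuples in the web table that follow the same node path. Treating the whole path as the shared unit, and the constant C5, are simplifying assumptions; the paper allows sub-paths and keyword-based tie-breaking as well:

```python
from collections import Counter

C5 = 1.0  # assumed tuning constant

def w5_scores(web_tuples):
    """web_tuples: one node path (list of node URLs) per web tuple.
    A tuple's W5 grows with the fraction of tuples sharing its exact path."""
    counts = Counter(tuple(t) for t in web_tuples)
    total = len(web_tuples)
    return [C5 * counts[tuple(t)] / total for t in web_tuples]

# Two tuples share the path a -> b, so they score above the lone a -> c tuple.
paths = [["a", "b"], ["a", "b"], ["a", "c"]]
print(w5_scores(paths))
```

In this sketch the first two tuples each receive 2/3 and the third 1/3, matching the intuition that a heavily travelled path signals importance.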
Specification of Ranking condition :
Type 3 :
In this case the user specifies only the depth of the tuple: he would like to browse through a
specified number of pages starting from a particular node, and the tuples that interest him
most within that depth must be ranked high. The query can be expressed as:

RANK web_table_name
WHERE (start_node_variable, depth, keyword)

In this case we do not take the weight W3 into account, because the user explicitly specifies
the depth, so it is not necessary to consider the path length required to traverse the path.
Algorithm :

For (i = 1 to m)
{
  Endnode = startnode + depth ;
  Webtuple(i) = weight(start node) + weight(next node) + ... + weight(end node)
  // Webtuple(i) indicates the ranking score of the ith web tuple.
}
// where m is the number of web tuples in the web table.
Rank the web tuples in descending order of ranking scores.

The weight of a single node is calculated as follows:
Weight(node) = w1 + w2 + w4 + w5 ;
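The loop above can be sketched as a small Python function. Here `node_weight` stands in for the per-node sum w1 + w2 + w4 + w5 and is supplied by the caller, since computing the individual factors is covered separately:

```python
def rank_web_tuples(web_table, node_weight, depth):
    """Sketch of the Type-3 ranking loop.

    web_table: list of web tuples, each a list of nodes beginning at the
    start node. Only the start node plus `depth` further nodes are scored,
    mirroring Endnode = startnode + depth in the pseudocode."""
    scored = []
    for tup in web_table:
        nodes = tup[: depth + 1]  # start node + `depth` further nodes
        score = sum(node_weight(n) for n in nodes)
        scored.append((score, tup))
    scored.sort(key=lambda st: st[0], reverse=True)  # descending scores
    return [tup for _, tup in scored]
```

For example, with `node_weight=len` on string-labelled nodes and `depth=1`, the tuple whose first two labels are longest comes out first; any real deployment would plug in the actual w1 + w2 + w4 + w5 computation.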
Repetitive Paths :
In some web tuples a particular path may form a repetitive loop, but this is taken care of by
the weight factor W5.
One such example is shown below:

Graph 1: [sanjay madria] -> [webdb]
  (http://www.informatik.uni-trier.de/~ley/db/conf/webdb/index.html)
Graph 2: [webdb] -> [sanjay madria]
  (http://www.informatik.uni-trier.de/~ley/db/indices/a-tree/m/Madria:Sanjay_Kumar.html)
Suppose the first graph is the user query graph. The user does not specify the end node
variable, only the depth, with the condition that the first page must contain the keyword
madria and must point to some page containing the keyword webdb. If we look carefully at the
path, we observe that there is a link to webdb from the home page of Dr. Madria, and when we
browse through the webdb pages, one of them has a link back to Dr. Madria's home page: a
repetitive pattern of the user query graph. This repetitiveness indicates a strong correlation
between the web pages.
Specification of System Defined Condition :
In this case there are no explicit criteria for the location of the keywords, so the default
values of the weights must be taken into account; the system has to rank the tuples based only
on some keyword conditions. The query can be expressed as:

RANK web_table_name
WHERE (startnode specification, keyword conditions)

Weight(node) = w1 + w2 + w3 + w4 + w5 ;
w1 = Wtitle(kn)**Ftitle(kn) + Wurl(kn)**Furl(kn) + Wtext(kn)**Ftext(kn) + Wolink(kn)**Folink(kn)

Wolink indicates the weight assigned to a keyword present in an outgoing link. Generally it has
a small value, since a keyword in an outgoing link does not accurately describe the page it
appears in. The weights can then be selected as:

Wurl > Wtitle > Wtext > Wolink

In the default case, keywords in the url are given the most importance, next the title, and
then the content of the document.
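A sketch of the default w1 computation follows. Two assumptions are made explicit here: the `**` in the formula is read as a product of the field weight and the keyword frequency in that field, and the numeric weight values are illustrative placeholders chosen only to respect the stated ordering Wurl > Wtitle > Wtext > Wolink:

```python
# Illustrative default field weights; the paper fixes only their ordering.
WEIGHTS = {"url": 4.0, "title": 3.0, "text": 2.0, "olink": 1.0}

def w1(freq):
    """freq: mapping from field name ('url', 'title', 'text', 'olink')
    to the keyword's frequency in that field. Missing fields count as 0."""
    return sum(WEIGHTS[field] * freq.get(field, 0) for field in WEIGHTS)

# One hit in the url, two in the title, five in the body text:
print(w1({"url": 1, "title": 2, "text": 5}))  # 4 + 6 + 10 = 20.0
```

Because the url weight dominates, a single url hit outweighs a title hit, matching the default preference described above.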
All other factors remain the same in this case.
Conclusions:
We have discussed ranking of web tuples containing single web pages separately from ranking of
tuples containing a bunch of related web pages; that is, ranking of web pages is treated
differently from ranking of web graphs. We expect the algorithms and their implementation to
provide a basis for future improvements, such as combining both ranking cases into one, and to
lay a foundation for better ranking algorithms. Here we implement already existing algorithms
for different situations and examine how they actually work.
Implementation:
The implementation of the ranking algorithm involves the storage structure for the nodes and
the links, as well as the language and platform on which they were implemented. We have taken
a dummy example consisting of the query graph given above and stored the results obtained from
it. All the nodes obtained in the resultant web table are stored in one table and all the links
in another. We maintained a third table consisting of the tupleID and the NodeID to represent
the relationships between the different nodes connected by links in the actual web tuple
obtained from the search results. We stored these tables in MS Access and then implemented
the default-user ranking of the web table in ASP.
The user-defined ranking was not implemented. It is similar to the default-user ranking, with
only a few additional conditions.
GLOSSARY :
WHOWEDA : Warehouse of web data
Node : A web page is considered a node variable
Link : A hyperlink from one page to another
Query Graph : A coupling framework specified by the user
Unbound Node : A node which is not defined by any predicate
Web Tuple : A connected directed graph which satisfies a given query graph
Web Table : A set of web tuples materialized in a table
Web Schema : Schema containing meta-information that binds a set of web tuples in a web table
Expressions :
Wtitle(k) : Weight assigned to a keyword k present in the title
Wurl(k) : Weight assigned to a keyword k present in the url
Wtext(k) : Weight assigned to a keyword k present in the text
Furl(k) : Frequency of a keyword k in the url
Ftitle(k) : Frequency of a keyword k in the title
Ftext(k) : Frequency of a keyword k in the text
Pq : The least path length satisfying all the keyword conditions given in the query graph
Pw : The path length, i.e. the number of nodes traversed to reach the end node from the
start node