World Wide Web Information Retrieval Using Web
Connectivity Information
Jiafei Sun
Directed by Wen-Chen Hu
Committee Members
Gerry V. Dozier
Dean Hendrix
A Project Report
Submitted to the Advisory Committee
In Partial Fulfillment of the
Requirements for the
Degree of
Master of Computer Science and Software Engineering
Auburn University
May 12, 2001
PROJECT ABSTRACT
Gathering, processing and distributing information from the World Wide Web
will be a vital technology for the next century. Web search techniques have played a
critical role in the development of information systems. Due to the diverse nature of web
documents, traditional search techniques must be improved. Hyperlink structure based
methods have proved to be powerful ways of exploring the relationships between web
documents.
In this project, a prototype web search engine was developed to exploit the link
structure of web documents, based on the use of the Companion algorithm. The prototype
consists of a web spider, local database, and search software.
The system was written using the Java programming language. Our spider crawls
and downloads web pages using Lynx, then saves the hyperlinks into an Oracle database.
JDBC is used to implement the database processing. The search software builds a vicinity
graph for the query URL and returns the most closely related pages after calculating the hub and
authority weights. Finally, HTML web pages provide user interfaces and communicate
with CGI using the Perl language.
ACKNOWLEDGMENTS
The author would like to express thanks to all of the members of his M.S.
committee for their useful comments on the thesis, assistance in scheduling the defense
date and kind help during the final defense period. The author would like to express his
deepest appreciation to Dr. Wen-Chen Hu, his thesis mentor, for the depth of the training
and the appropriate guidance he has provided. The author would also like to
acknowledge the Department of Computer Science and Software Engineering of Auburn
University for financial support. Finally, thanks especially go to the author’s wife
Qifang, his son, Alex, and his father and mother for their support and love.
TABLE OF CONTENTS
LIST OF FIGURES …………………………………………………………………… vi
LIST OF TABLES ………………………………………………………………….... viii
CHAPTER 1 INTRODUCTION …………………………………………………….. 1
1.1 World Wide Web and Web Searches .………………………………… 1
1.2 Disadvantages of the Current Search Engines …………………………..2
1.3 Project Outline.…………………………………………………………..3
1.4 Report Organization ….………………………………………………... 5
CHAPTER 2 RELATED TECHNOLOGIES USED IN THIS PROJECT ………….. 6
2.1 Web Search Services ………………………………………………….. 6
2.2 Web Spiders …………………………………………………………… 8
2.3 Database and Web-database Connectivity …………………………… 10
2.4 Data Mining ……………………..…………………………………… . 14
CHAPTER 3 SYSTEM DEVELOPMENT …………………………………………. 17
3.1 System Overview ……………….…………………………………….. 17
3.2 The Spider ………………………………………………….…………. 19
3.3 Indexing …………….. ………………….…………………………… . 23
3.4 System Interface………………………………………………………. 25
CHAPTER 4 RELATED-PAGE ALGORITHMS …………………………………. 30
4.1 Building the Vicinity Graph ………………………………………….. 31
4.2 Duplicate Elimination …..………………………………………………34
4.3 Edge Weight Assignment ……………….……………………………..36
4.4 Hub and Authority Score Computation ………………………………..37
4.5 An Example ………………………. …………………………………..40
CHAPTER 5. EXPERIMENTAL RESULTS …………………….………………….. 44
5.1 The Spider ………………………………………………………………44
5.2 Checking the Downloads ……………………………………………….47
5.3 Searching and Ranking ………………………………………………...50
CHAPTER 6. CONCLUSIONS ………………………………..…………………….. 57
REFERENCES ……………………………………………………………………….. 58
LIST OF FIGURES
Figure 2.1 Search engine structure ……………………..…………………………… 7
Figure 2.2 JDBC – Architecture overview ………………………………………… 11
Figure 3.1 Program architecture …………………………………………………… 18
Figure 3.2 The pictures of class node and node1 ………………………………..… 22
Figure 3.3 Project interface ………………………………………………………… 26
Figure 3.4 The spider interface …………………………………………………. …. 27
Figure 3.5 The interface for checking the downloads …..………………………….. 28
Figure 3.6 The interface for the search engine ………………………………………. 29
Figure 4.1 A vicinity graph example ………………………………………………... 33
Figure 4.2 Search results of the query: http://www.eng.auburn.edu/hu/ .………..…… 42
Figure 5.1 The interface of the spider with seed URL http://www.eng.auburn.edu/ .…45
Figure 5.2 The crawling results of the spider from the seed URL
http://www.eng.auburn.edu/ ………………………………………….46
Figure 5.3 The interface for checking the download …………………………….……48
Figure 5.4 Checking the downloads …………………………………………………..49
Figure 5.5 Search results of the query: URL = http://www.oasis.auburn.edu/,
B=500, BF=500, F=500, FB=500………………………………………51
Figure 5.6 Search results of the query: URL = http://www.lib.auburn.edu/,
B=500, BF=500, F=500, and FB=500………………………………….52
Figure 5.7 Search results for the query http://www.eng.auburn.edu/ ………………….54
Figure 5.8 Search results for the query http://aos.auburn.edu/ ………………………...55
LIST OF TABLES
Table 1.1 Output of the Companion algorithm given http://www.auburn.edu/ ………4
Table 4.1 The authority and hub values for the vicinity graph shown in Fig 4.1 …… 33
Table 4.2 URLs and their parent and child URLs………… ……………………..… 41
Chapter 1 Introduction
The gathering, processing and distribution of information from the World Wide
Web will be a key technology for the next century. As the Internet has tens of millions of
sites, web search techniques play a critical role in the development of information
systems. This chapter introduces the web and the techniques needed to search it, looks at
the problems involved, and gives an overview of this project.
1.1 The World Wide Web and Web Searches
The web is a system with universally accepted standards for storing, retrieving,
formatting, and displaying information via a client/server architecture. The web handles
all types of digital information, including text, hypermedia, graphics and sound, by way
of graphical user interfaces, making it very easy to use.
The web is based on the standard Hypertext Markup Language (HTML), which formats
documents and incorporates dynamic hypertext links to other documents stored on the
same or different computers. Clicking on a hypertext link transports the user to another
document. Users are able to navigate around the web with no restrictions, free to follow
their own logic, needs, or interests, by specifying a uniform resource locator (URL)
which points to the address of a specific resource on the web.
The WWW hosts a plethora of information with millions of web pages. The
publicly indexable Web contained an estimated 800 million pages as of February 1999,
encompassing about 15 terabytes of information or about 6 terabytes of text after
removing HTML tags, comments, and extra white space. This huge number of web pages
has been increasing by a million pages a day [1].
1.2 Disadvantages of the Current Search Engines
As there are so many web pages and so much information, search engines have
been developed to help retrieve desired information. Search engines are programs that
return a list of web sites or pages (designated by URLs) that match some user-selected
criteria. To use one of the publicly available search sites, the user navigates to a search
engine site and types in the subject of their search.
When the web first became available to the public, the only way to navigate it was
by surfing with the browser. Users began with a known page and traveled via its
hypertext links, browsing until they found the information they were seeking. When there
were only a few hundred HTTP servers on the Internet, this method was adequate for
most users. Most of these machines contained centrally managed collections of
documents maintained by students and professional researchers who were aware of the major
repositories of information in their fields. However, the rapid growth of the web, both in
terms of the number of pages and the kinds of resources available, resulted in a need for
better ways of organizing information on the web.
Although search engines greatly help people to find some interesting web sites, at
present they suffer from several disadvantages. One is that users have to formulate
queries that specify the information they need, which is prone to errors. Another is that
all search engines do keyword searches against a database, but various factors influence
the results obtained from each. The size of the database, frequency of updates, search
capability and design, and speed may lead to significantly different results.
1.3 Project Outline
Recent work in information retrieval on the web has recognized that the hyperlink
structure can be very valuable for locating information [2,3,4,5]. This assumes that if
there is a link from page v to page w, then the author of v is recommending page w, and
links often connect related pages. In this project, the hyperlink structure contained in a
web page is used to identify related pages, overcoming the above disadvantages.
A related web page is one that addresses the same topic as the original page, but is not
necessarily semantically identical. For example, given http://www.nytimes.com/, the tool
should find other newspapers and news organizations in the web. Traditional web search
engines take a query as input and produce a set of (hopefully) relevant pages that match
the query terms. In this project, the input to the search process is not a set of query terms,
but the URL of a page, and the output is a set of related web pages. Of course, in contrast
to search engines, this approach requires that the user has already found a page of
interest.
This project is based on the use of the Companion algorithm, which uses only the
hyperlink structure of the web to identify related web pages. For example, Table 1.1
shows the output of the Companion algorithm when given http://www.auburn.edu/ as
input.
Table 1.1 Output of the Companion algorithm given http://www.auburn.edu/
URL Hub Authority Rank(A)
http://www.auburn.edu/ 7.0 2.0 7.0
http://www.lib.auburn.edu/ 3.0 1.0 4.0
http://www.auburn.edu/student_info/bulletin/ 0.0 1.0 4.0
http://www.auburn.edu/main/calendar.html 0.0 4.0 4.0
http://oasis.auburn.edu/ 0.0 1.0 4.0
http://aos.auburn.edu/ 0.0 1.0 4.0
http://www.auburn.edu/main/au_campus_directory.html 0.0 1.0 4.0
http://www.eng.auburn.edu/ 4.0 0.0 0.0
This project was designed to use a spider to exploit the hyperlink structure of the web
pages and save their URLs to the Oracle database, then find related pages by applying the
Companion algorithm. This algorithm uses only information about the links that appear
on each page and the order in which those links appear. It examines neither the content
of pages nor patterns of how users tend to navigate among pages.
The Companion algorithm is derived from the HITS algorithm proposed by Kleinberg
for ranking search engine queries [6,7]. Kleinberg suggested that the HITS algorithm
could also be used for finding related pages, and provided anecdotal evidence that it
might work well. In this project, the algorithm was extended to exploit not only links but
also their order on a page.
1.4 Report Organization
Chapter 2 explains some related technologies used in this project. Chapter 3
describes the system structure. It covers the project architecture, the design and
development details of various components such as the spider, database, and common
gateway interface (CGI). Chapter 4 presents the Companion Algorithm and how to
implement each step of the algorithm in detail. Chapter 5 presents the system interface
and the results obtained by the new search engine for the related pages. Finally, Chapter 6
gives our conclusions as to the effectiveness of this approach.
Chapter 2. Related Technologies Used in this Project
To build a search engine, we need several different components. This chapter
introduces some related technologies we will use in this project.
2.1 Web Search Services
The Internet contains millions of sites at this point; growth is exponential and
bibliographic control does not exist. To find the proverbial needle in this immense
haystack, it is necessary to use web search techniques. Basically, there are currently two
different ways to search the web, namely search directories and search engines. Search
directories, such as Yahoo, present information to users in a hierarchically structured
format. There are numerous broad categories from which a user can select, each of which
is further divided into subcategories. Directories are the best choice for general
browsing. However, to locate specific information, it is preferable to specify the search
terms and use a search engine. In this project, a search engine is used to find related web
pages.
There are three main parts to a search engine as shown in Figure 2.1:
Figure 2.1 Search engine structure.
1. The crawler, also known as a “spider” or “robot,” is a program that automatically
scans various web sites. Crawlers follow the links on a site to find other relevant
pages.
2. The index is created to organize URLs, keywords, links, and text. Indexing builds
a data structure that will allow quick searching of the text.
3. The search software goes through the index to find web pages with key word
matches or other matching techniques and ranks the pages in terms of relevance
when a user submits a search query.
The search engine sends out "robots" or "spiders" into the World Wide Web to locate
new pages for its database. As a spider visits a web site, it finds the hyperlinks it
contains, leading it to other pages within the web site and to related pages on the wider
web. The search engine will then index the subpages or important product pages. In this
chapter, these different components will be discussed in detail.
2.2 Web Spiders
A search engine does not actually scan the web when it receives a request from a
user. Rather, it looks up the keywords, phrases or links in its own database, which has
been compiled using spiders that continuously travel across the web looking for new
pages, updating existing entries and creating links to them.
There are two algorithms which may be used to direct the way a spider conducts a
search: one is depth-first and the other is breadth-first. A depth-first search is a
generalization of a preorder tree traversal, while a breadth-first search is like a
level-order traversal of the tree. Depth-first searches create relatively comprehensive databases on
a few subjects, while breadth-first searches build databases that touch more lightly on a
wider variety of documents. For example, to create a subject specific index, it is better to
select a known relevant group of pages to explore, rather than one initial document, thus
employing a depth-first strategy with an upper bound on the degree of depth. This would
prevent the search from becoming lost in inevitable digressions when exploring a largely
unknown space in some depth. On the other hand, if an index needs to be built for
sampling a wide variety of the web, it is better to choose a breadth-first strategy. The
breadth-first algorithm is used in this project because related pages are spread over a wide
variety of web documents.
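To make the breadth-first choice concrete, the following is a minimal, hypothetical crawl loop in Java. It only illustrates the frontier/visited bookkeeping that gives breadth-first order; it is not the project's MySpider class, and extractLinks() is a placeholder for whatever mechanism (here, Lynx) fetches a page and returns its out-links.

import java.util.*;

// A minimal breadth-first crawl skeleton (illustration only).
public class BfsCrawlSketch {

    // Placeholder for fetching a page and extracting its hyperlinks,
    // e.g. by dumping the HTML source with Lynx and parsing out the URLs.
    static List<String> extractLinks(String url) {
        return Collections.emptyList();
    }

    public static void main(String[] args) {
        String seed = args.length > 0 ? args[0] : "http://www.auburn.edu/";
        int limit = 100;                              // number of pages to visit

        Queue<String> frontier = new LinkedList<>();  // FIFO queue gives breadth-first order
        Set<String> visited = new HashSet<>();
        frontier.add(seed);

        while (!frontier.isEmpty() && visited.size() < limit) {
            String url = frontier.poll();
            if (!visited.add(url)) continue;          // skip URLs that were already crawled
            for (String child : extractLinks(url)) {
                if (!visited.contains(child)) frontier.add(child);
            }
        }
    }
}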
A spider scans various web sites and extracts URLs, keywords, links, text, etc. It is
also able to follow the links on a web site to find other relevant pages. In this project, the
link structure is used to explore the relations between web pages, while the pages that
are possibly related to the user’s query are simultaneously collected to form a vicinity
graph of the web. The query term is first submitted to a text based search engine, then the
spider is used to find all the links on each returned page. Thus, the spider is used here to
crawl the web pages and obtain the vicinity related page collection that is used for the
Companion algorithm, which will be explained in detail in Chapter 4.
The spider can be implemented in several languages, including C, C++, Perl, and
Java. Java is an innovative programming language that enables programs called applets to
be embedded in Internet web pages. Java also permits the use of application programs
that can run normally on any computer that supports the language. A Java program does
not execute directly on any computer, but runs on a standardized hypothetical computer
called the Java virtual machine, which is emulated inside the computer by a program. The
Java source code is converted by a Java compiler to a binary program consisting of byte
codes. Byte codes are machine instructions for the Java virtual machine. When a Java
program is executed, a program called the Java interpreter inspects and deciphers the byte
codes for it, checks to ensure that it has not been tampered with and is safe to execute,
and then executes the actions that the byte codes specify within the Java virtual machine.
A Java interpreter can run stand-alone, or it can be part of a web browser such as
Netscape Navigator or Microsoft Internet Explorer, where it can be invoked
automatically to run applets in a web page.
2.3 Database and Web-database Connectivity
Another advantage to using the Java language is that JavaSoft’s database connectivity
specification (JDBC) can then be used to connect to the Oracle database, in which the
information collected from the visited web pages is saved. The JDBC
application programming interface (API) is a Java API which can be used to access
virtually any kind of tabular data. It consists of a set of classes and interfaces written in
the Java programming language that provide a standard API for tool/database developers
and makes it possible to write industrial strength database applications using an all-Java
API. The JDBC API makes it easy to send Structured Query Language (SQL) statements to
relational database systems and supports all dialects of SQL. The value of the JDBC API
is that an application can access virtually any data source and run on any platform that
supports the Java Virtual Machine. The JDBC API is used to invoke SQL commands
directly. A JDBC driver makes it possible to:
1. Establish a connection with a data source.
2. Send queries and update statements to the data source.
3. Process the results.
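As a rough illustration of these three steps, the sketch below opens a connection, runs a query against the address table used later in this project, and walks the result set. The connection string and account are placeholders borrowed from the examples in Chapter 3, not a guaranteed configuration.

import java.sql.*;

// A minimal sketch of the three JDBC steps listed above (illustration only).
public class JdbcSketch {
    public static void main(String[] args) throws Exception {
        // Load the Oracle JDBC driver class.
        Class.forName("oracle.jdbc.driver.OracleDriver");

        // 1. Establish a connection with a data source.
        Connection con = DriverManager.getConnection(
                "jdbc:oracle:oci8:@csedb", "username", "password");

        // 2. Send queries and update statements to the data source.
        Statement stmt = con.createStatement();
        ResultSet rs = stmt.executeQuery("SELECT parents, url, children FROM address");

        // 3. Process the results.
        while (rs.next()) {
            System.out.println(rs.getString("url"));
        }
        rs.close();
        stmt.close();
        con.close();
    }
}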
Figure 2.2 is an overview of the architecture of JDBC.
Fig 2.2 JDBC - Architecture Overview
The JDBC API works very well in this capacity and is easier to use than any other
database connectivity APIs. However, it is also designed to be a base upon which to build
alternate interfaces and tools. Since the JDBC API is complete and powerful enough to
be used as a base, it has had various kinds of alternate APIs developed on top of it,
including the following:
1. An embedded SQL for Java. JDBC requires that SQL statements be passed as
uninterpreted strings to Java. An embedded SQL preprocessor provides type
checking at the time of compiling and allows a programmer to mix SQL
statements with Java programming language statements.
2. A direct mapping of relational database tables to Java classes. In this object-
relational mapping, a table becomes a class, each row of the table becomes an
instance of that class, and each column value corresponds to an attribute of that
instance. The programmer can then operate directly on Java objects.
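As a small, hypothetical illustration of this mapping, the address table defined in Chapter 3 (columns parents, url, children) could correspond to a Java class along the following lines; this is a sketch of the idea rather than code from the project.

// Hypothetical mapping of the address table: the table becomes a class,
// each row becomes an instance, and each column value becomes an attribute.
public class Address {
    private String parents;    // column PARENTS
    private String url;        // column URL
    private String children;   // column CHILDREN

    public Address(String parents, String url, String children) {
        this.parents = parents;
        this.url = url;
        this.children = children;
    }

    public String getUrl() { return url; }
}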
A database is an organized collection of related data, designed, built and populated
with data for a specific purpose, and represents some aspect of the real world. A database
management system (DBMS) is a collection of programs that enables users to create and
maintain a database. A database system provides a number of benefits:
1. Controlling redundancy
2. Restricting unauthorized access
3. Providing persistent storage for objects and data structure
4. Providing up-to-date information to all users
5. Providing multiple user interfaces and sharing data among multiple transactions
6. Representing complex relationships among data
7. Enforcing integrity constraints
8. Providing backup and recovery
Defining a database involves specifying the data types, structure, and constraints for
the data stored in the database. Constructing the database is a process of storing the data
themselves in some storage medium that is controlled by the DBMS. Manipulating a
database includes functions such as querying the database to retrieve specific data,
updating the database to reflect changes in the database and generating reports from the
database.
Database systems can be based on several different approaches: hierarchical, network,
relational, and object-relational models. Previously, relational databases (RDBMS)
dominated the marketplace, with vendors such as Oracle, Sybase, Informix, and Ingres.
In this model, data are represented as rows in tables and operators are provided to
generate new tables from the old ones. Recently, Object-Oriented (OO) DBMSs have
come to the fore and play more and more important roles. In order to compete with
OODBMS, RDBMS vendors provide hybrid products, such as Universal Servers,
offering OO/multimedia capabilities and Oracle 8’s user-defined data types.
Oracle 8i is one of the latest products from Oracle Corporation. It is an extremely
powerful, portable and scalable database for developing, deploying and managing
applications within the enterprise and on the Internet. It was designed to support very
large and high volume transaction processing and data warehousing applications.
The major new feature of Oracle 8i is the inclusion of Java and Internet capabilities in
the database. Developers can now create an application in Java, PL/SQL, C or C++.
Another major feature of Oracle 8i is the Internet File System (IFS), which enables the
database to store and manage the relational and nonrelational data files. Oracle 8i comes
with Oracle interMedia, which can manage and access multimedia data, including
audio, video, image, text, and spatial information. All these allow the database to be used
to host Internet applications and content, in addition to serving as repositories for
traditional relational data in workgroup and enterprise files.
2.4 Data Mining
The success of computerized data management has resulted in the accumulation of
huge amounts of data in several organizations. There is a growing perception that
analyses of these large databases can turn this “passive data” into useful “actionable
information”. The recent emergence of data mining, which is also referred to as
knowledge discovery in databases, is a testimony to this trend.
The objective of data mining is to identify the topics of unstructured Web documents.
In addition to answering specific queries (information retrieval), data mining techniques
also support unsupervised or supervised categorization that classifies a document for
inclusion in an indexed text database. Data mining methods can be divided into two
categories: the probabilistic approach and the deterministic approach.
The probabilistic (non-deterministic) models are based on the assumption that web documents are
better described with statistical models because of the diversity and unstructured
character of hypertext documents. Statistical methods usually deal with only the text
portion of the document. The applications of the statistical models appear in both
supervised learning (also known as classification), and unsupervised learning (also
known as clustering). In the first situation, the learner first receives training data where
each term is tagged with a label from a finite discrete set. The algorithm is first trained
using this data set and then exposed to unlabeled data for a guess of possible labels. In the
case of unsupervised learning, on the other hand, the algorithm does not receive any
training data but instead is presented with a set of unlabeled documents. It is expected to
discover a hierarchy among these documents based on a measure of similarity between
documents. Using the tree analogy, related documents will appear close to each other
near the leaf level of the hierarchy, while dissimilar documents do not merge until they
are close to the root.
The strength of the probabilistic methods lies in the fact they can systematically
model the text information in a document and they are more effective in analyzing larger
documents, where the noise effect is lower. The drawback to probabilistic approaches,
however, lies in their inability to model meta-data such as hyperlinks. Also, the
complexity of the models sometimes degrades their execution speed. Finally, since a
particular web page can be authored in any language, dialect, or style by an individual
with any background, culture, motivation, interest or level of education, accurate
statistical models are very difficult to establish and sometimes a compromise must be
made between speed and complexity.
Deterministic methods in data mining take a different approach. Here, more effort is
spent exploring the structure of the documents, including the link structure. The
hyperlinks of the web offer rich latent information that does not usually appear in a text
document. An important application which takes advantage of this is the Hyperlink
Induced Topic Search (HITS) [8,9,10].
The Hyperlink Induced Topic Search, or HITS, does not crawl the web but relies on a
search engine. The data that the HITS method processes comes from a search engine and
the ranking is performed on a subgraph of the web that includes pages that match the
query from a search engine and pages citing or cited by these pages. The iteration process
of the HITS algorithm has been shown to produce equivalent results to those obtained
through a power series iteration. The main drawback of the basic HITS algorithm is that
all the edges of the graph are of the same weight and this can lead to topic contamination
during the expansion of the subgraph with cited and citing pages of the results from the
search engine query. An improved method, called the Automatic Resource Compilation
(or ARC) [11], incorporates a variable edge weight with an add-on penalty function that
favors occurrence of the keywords near the anchor text. Since the HITS and ARC
algorithms depend on a search engine query result, they are slower than the Google search
engine, which simulates a random walk on the Web graph with the purpose of estimating
the page ranking of the pages in the Web graph related to a specific query. However,
strategies such as caching a pre-crawled graph can greatly improve the performance. The
Related-Page Algorithm is developed from HITS, but is different in several important
ways. This algorithm will be discussed in detail in Chapter 4.
Chapter 3 System Development
In this chapter, implementation of the project will be described. There are three main
parts of the project. The first is the spider, the second involves saving objects to the
database, and the final step includes making the vicinity graph, calculating the hub and
authority values and returning the results list of the best URLs.
3.1 System Overview
Figure 3.1 shows an architectural diagram of the project. The first phase is mainly
concerned with searching and obtaining hyperlinks from related web pages and saving
these URLs and their parents and children information into the database. For their
research, Jeffrey Dean and Monika Henzinger [5] were fortunate to have access to
Compaq’s Connectivity Server. The Connectivity Server provides high-speed access to
the graph structure of 180 million URLs (nodes) and the links (edges) that connect them.
The entire graph structure is stored on a Compaq AlphaServer 4100 with 8 GB of main
memory and dual 466 MHz Alpha processors. Because neither this kind of server nor the
large amount of memory needed to save the graph structure is available for this project,
Figure 3.1 Program architecture (components: user interface, spider, downloading, database, vicinity graph, ranking, related pages, users, query interface)
we were only able to simulate this process. We created a small web source and saved the
information the spider retrieved to a local database. To do this, our spider was first used
to crawl web pages from a seed URL. To get a sufficient number of related URLs, we
sent several web addresses, which tend to refer to each other, such as
http://www.auburn.edu/ and http://www.lib.auburn.edu/, as seeds from which the spider
collected several hundred related URLs. We saved this set to the local database. It
should be mentioned that it is possible to crawl and save URL information from the entire
Web into a database, if there is enough memory and time available.
The second phase involves retrieval of the information stored in the local database
and subsequently applies the Companion algorithm. A query is submitted to a Web
interface, which is then sent to the database through CGI/Perl script and a JDBC
program. The web URLs that are nearby to the query are retrieved and the vicinity graph
is built according to the different layers and the parent and child relations. After assigning
edge weights and calculating the Hub and Authority scores, the most highly related pages
are returned depending on the sorting results. These steps are examined in more detail in
the following sections.
3.2 The Spider
A spider, also known as a crawler, is a program that automatically scans various web
sites and extracts URLs, keywords, links, text, etc. A spider also follows the links on a
site to find other relevant pages. It is used to crawl web pages to collect a number of
URLs, which are then saved into a database. Our spider was used only to extract URLs
from web pages and collect their parent and child information. These sources were
saved to a local database and are used by the Companion algorithm.
The spider starts from a seed URL given by the user. This seed URL is also the first
URL in the crawling list. The spider checks the URL, making sure that the URL has not
already been visited, and then inserts it into the crawling list and analyzes the Web
document. The number of hyperlinks from web pages that can be inserted into the
crawling list is specified via the “Number of Web pages downloaded” field.
The spider is implemented by using the Java programming language and Lynx.
The program is launched using the command:
java MySpider seed file1 numberOfWebPages
Lynx is a general-purpose command line/terminal type distributed information browser.
It can be used to interactively (through a keyboard) access information on the WWW and
display HTML documents containing links to files residing on the local system with a
variety of options. For example:
lynx -dump -source <url> > filename
Option -dump: stores the formatted output of the document to either standard output or
to a file specified in the command line.
Option -source: works the same as -dump but outputs the HTML source file instead of
formatted text.
The MySpider class has a no-argument constructor and four member functions. The
main function is quite simple. Its purpose is to create a MySpider object and take user
inputs, which are URL variables, and specify the number of sites to search. This part of
the main function code is:
MySpider Spider = new MySpider();
try {
    …
    for (int k = 0; k < Num; k++) {
        Spider.Search(k, URL);
        Spider.SetNode1();
        Spider.inSertToDB();
    }
    …
}
…
finally {
    Spider.r.exit(0);
}
The constructor creates three vector objects. One vector, URLT, stores the URLs
(crawling list) retrieved from the seed URL and its following pages. Vector Vnode stores
the objects, which include parents and children of the seed URL, from the class node.
public class node{
protected Vector parents;
protected String url;
protected Vector children;
…
}
The third vector, Vnode1, stores the objects of node1.
public class node1{
protected String parent;
protected String url;
protected String child;
…
}
Figure 3.2 shows the different structures of node and node1.
[Diagram: a node object holds one URL together with vectors of all its parents (p1, p2, p3) and children (c1, c2); a node1 object holds a single parent-url-child triple.]
Figure 3.2 The pictures of class node and node1.
It also creates a Runtime object. The major function is the search function. It takes the
URL and the number of sites, an integer number, as input, reads the whole web
document, captures the URL names, and stores the data into vectors. It uses lynx
commands and the seed URL to dump the HTML source files. The code is
command = "lynx -dump -source " + url;
p = r.exec (command);
After storing URLs in the vector URLT, the search function also creates node class
objects, sets parents and children, and saves the objects into vector Vnode.
The setNode1() function changes the objects of class node to objects of class node1,
which are easily inserted into the database. The inSertToDB() function uses the database
utility function DButility.execQuery(query) to insert the information (parent, URL, child)
into the database.
3.3 Indexing
Our database is designed to be simple and illustrative. One table, address, is created to
store the hyperlink information. Each record has three fields: the URL, its parent, and its
child.
The address table was created by DropCreateTable.class, which includes the code:
stmt.execute ("CREATE TABLE address (parents VARCHAR2(300), url
VARCHAR2(300), children VARCHAR2(300))");
In this project, JDBC is used to implement the database processing. First, after the
related pages have been crawled, the URLs and their parents and children are stored in
the database. Second, it is necessary to check the pages downloaded on the web interface.
Finally, the parent and children information of the URL is retrieved from the database, a
vicinity graph is built, and the Companion algorithm used to find the related pages.
JDBC is used to connect Java and the database. The system is implemented by using
the Java programming language and JDBC. To access the correct database, the JDBC
program needs an appropriate driver to connect to the DBMS. The DriverManager is the
class responsible for selecting database drivers and creating a new database connection.
The following code makes it possible to set up a connection using the Oracle JDBC
driver:
DriverManager.registerDriver(new oracle.jdbc.driver.OracleDriver());
OracleConnection con = (OracleConnection) DriverManager.
getConnection("jdbc:oracle:oci8:@csedb", "ops$username", "password");
To execute an SQL statement, a statement object has to be created. The connection
object obtained from the call to DriverManager.getConnection can be used to
create statement objects.
Statement stmt = con.createStatement();
There are two functions in the class DButility. The execUpdate() function first
connects to the database using the above statements. Then the function takes the query as
its argument, for example: insert statement query or other query. The statement (query)
is executed to update the database. The recordCount is returned after updating. The
other function is execQuery (), which also takes the query as its argument. After
retrieving information from the database, the results are saved in an object of the form
ResultSet (rs). Finally, the function returns a two dimensional array String[][] which
contains all the information of parent, URL, and child. These two functions are used in
the Spider and Search classes to communicate with the database.
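The sketch below illustrates, under the assumptions noted in the comments, what these two functions might look like; it is not the project's actual DButility code. The connection string is the placeholder used earlier, and the address table is assumed to have the columns parents, url, and children.

import java.sql.*;
import java.util.*;

// Illustrative sketch of the two DButility functions described above.
public class DButilitySketch {

    private static Connection connect() throws SQLException {
        return DriverManager.getConnection(
                "jdbc:oracle:oci8:@csedb", "username", "password");
    }

    // Executes an INSERT, UPDATE, or DELETE statement and returns the record count.
    public static int execUpdate(String query) throws SQLException {
        try (Connection con = connect(); Statement stmt = con.createStatement()) {
            return stmt.executeUpdate(query);
        }
    }

    // Executes a SELECT statement and returns all rows as a String[][]
    // holding the parent, URL, and child of each record.
    public static String[][] execQuery(String query) throws SQLException {
        List<String[]> rows = new ArrayList<>();
        try (Connection con = connect();
             Statement stmt = con.createStatement();
             ResultSet rs = stmt.executeQuery(query)) {
            while (rs.next()) {
                rows.add(new String[] { rs.getString("parents"),
                                        rs.getString("url"),
                                        rs.getString("children") });
            }
        }
        return rows.toArray(new String[0][]);
    }
}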
The Companion algorithm is used to find the related pages of the query URL. We
will discuss this part in detail in Chapter 4.
3.4 System Interface
The project interface is shown in Fig 3.3. There are three parts in this interface:
(i) My Spider, (ii) Checking the download, and (iii) My Search Engine.
Figure 3.4 is the interface of My Spider. Users first enter the seed URL and the
number of web pages to collect, then press the StartCrawling button. The spider software
will search the web pages starting from the seed URL and save the URLs and their
parents and children into the database. At the same time the spider also prints the search
results on the screen.
Figure 3.5 is the interface used to monitor the download. It is used to check the
information saved in the database. The number of URLs to be checked is entered in the
“URL number” box, then the “search” button is clicked. The software will retrieve the
URLs from the database and print the results in the table shown in the window.
Figure 3.6 is the interface for the Search Engine. The query URL is entered in the
“Starting URL” box, and the number of related URLs entered in the “Number of results”
box, then the “search” button is clicked. The system will retrieve data from the database,
make a vicinity graph for the seed URL, then return the number of related URLs after
applying the Companion algorithm.
Chapter 4 Related Page Algorithms
This chapter describes a Related Page Algorithm known as the Companion algorithm
[5]. This algorithm only exploits the hyperlink structure (i.e. graph connectivity) and does
not examine information about the content or usage of pages. In this chapter, the terms
“parent” and “child” are used to describe relationships between web pages. If there is a
hyperlink from page w to page v, we say that w is a parent of v and that v is a child of w.
The Companion algorithm takes as input a starting URL u and consists of four steps:
1. Build a vicinity graph for u.
2. Eliminate duplicates and near-duplicates from this graph.
3. Compute edge weights based on host to host connections.
4. Compute a hub score and an authority score for each node in the graph and
return the top ranked authority nodes. This phase of the algorithm uses a
modified version of the HITS algorithm originally proposed by Kleinberg
[6,7].
These steps are described in more detail in the subsections below.
4.1 Building the Vicinity Graph
Given a query URL u, it is possible to build a directed graph of nodes that are near to
u in the web graph. Graph nodes correspond to web pages and graph edges correspond to
hyperlinks. The graph consists of the following nodes (and the edges between these
nodes):
• u,
• up to B parents of u, and for each parent up to BF of its children different from u,
• up to F children of u, and for each child up to FB of its parents different from u.
Here is how we choose these nodes in detail:
Back (B), and back-forward (BF) nodes: If u has more than B parents, add B random
parents to the graph; otherwise add all of u’s parents. If a parent x of u has more than BF
+1 children, add up to BF/2 children pointed to by the BF/2 links on x immediately
preceding the link to u and up to BF/2 children pointed to by the BF/2 links on x
immediately succeeding the link to u (ignoring duplicate links). If page x has fewer than
BF children, we add all of its children to the graph. Note that this exploits the order of
links on page x. Thus, a rough graph of this step when B=3 and BF =4 looks like the
following:
[Sketch of the back and back-forward nodes: A, C, D, T, W, V, X, Y, u, Z, G]
Here A, C and D are the parents of u and T, W, V, X, Y, Z, and G are forward children of
these parents.
Forward (F), and forward-back (FB) nodes: If u has more than F children, add the
children pointed to by the first F links of u; otherwise, add all of u’s children. If a child
of u has more than FB parents, add FB of its parents; otherwise, add all of the child’s
parents. For the case where F=3 and FB=4, the graph looks like the following:
P Q R S u E N T
C H I L M
The graph here shows that C, H, I, L, and M are the children of u. In addition to u, P, Q,
R, and S are the FB parents of C, and E, N, and T are the FB parents of L. u is the only
parent of H, I, and M. If there is a hyperlink from a page represented by node v in the
graph to a page represented by node w, and v and w do not belong to the same host, then
there is a directed edge from v to w in the graph. Edges between nodes on the same host
are omitted.
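The link-order rule for the back-forward (BF) nodes can be illustrated with the following sketch, which picks up to BF/2 links immediately before and after the link to u from one parent's ordered list of out-links. It is a hypothetical helper shown for illustration, not the project's setNode() code.

import java.util.*;

// Choose up to bf "sibling" nodes of u from one parent's children,
// exploiting the order in which the links appear on the parent page.
public class SiblingSelectionSketch {

    static List<String> pickSiblings(List<String> childrenInOrder, String u, int bf) {
        int pos = childrenInOrder.indexOf(u);
        if (pos < 0) return Collections.emptyList();   // the parent does not link to u
        Set<String> picked = new LinkedHashSet<>();    // set ignores duplicate links
        int half = bf / 2;
        for (int i = Math.max(0, pos - half); i < pos; i++)          // links preceding u
            picked.add(childrenInOrder.get(i));
        int last = Math.min(childrenInOrder.size() - 1, pos + half); // links succeeding u
        for (int i = pos + 1; i <= last; i++)
            picked.add(childrenInOrder.get(i));
        picked.remove(u);
        return new ArrayList<>(picked);
    }

    public static void main(String[] args) {
        List<String> links = Arrays.asList("T", "W", "V", "u", "X", "Y", "Z");
        System.out.println(pickSiblings(links, "u", 4));             // prints [W, V, X, Y]
    }
}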
Example: Fig 4.1 shows an example of a vicinity graph construction.
[Sketch of a vicinity graph with nodes B1, B2, B3, C1, C2, C3, D1, D2, and D3]
Fig 4.1 A vicinity graph example
As many different situations as possible were included in the vicinity graph. The relations
shown in Table 4.1 were used to substitute these nodes in the graph.
Table 4.1 The authority and hub values for the vicinity graph shown in Fig 4.1
B1: http://www.usa.com a=1 h=2
B2: http://www.china.com a=1 h=2
B3: http://www.eng.auburn.edu a=1 h=1
C1: http://www.voa.com/special a=2 h=1
C2: http://www.eng.auburn.edu/hu a=2 h=2
C3: http://www.auburn.edu/sun a=2 h=1
D1: http://www.auburn.edu a=2 h=2
D2: http://www.abc.com a=2 h=1
D3: http://www.cnn.com a=2 h=2
The arrow points to the node from its parent. We will discuss the program used to make
the vicinity graph in section 4.2.
4.2 Duplicate Elimination
After building this graph, the near-duplicates are combined. Two nodes are
considered to be near-duplicates if (a) they each have more than 10 links and (b) they
have at least 95% of their links in common. When two near-duplicates are combined,
their two nodes are replaced by a node whose links are the union of the links of the two
near-duplicates. This duplicate elimination phase is important because many pages are
duplicated across hosts (for example mirror sites or different aliases for the same page),
and allowing them to remain separate can greatly distort the search results obtained.
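A minimal sketch of this rule, assuming each node simply carries a set of its link targets, is shown below; it is an illustration of the test and of the merge step, not the merging code used in makeGraph().

import java.util.*;

// Illustration of the near-duplicate test (more than 10 links each, at least
// 95% of links in common) and of merging two near-duplicates into one node
// whose links are the union of the two link sets.
public class DuplicateEliminationSketch {

    static class Node {
        final String url;
        final Set<String> links;
        Node(String url, Collection<String> links) {
            this.url = url;
            this.links = new LinkedHashSet<>(links);
        }
    }

    static boolean nearDuplicates(Node a, Node b) {
        if (a.links.size() <= 10 || b.links.size() <= 10) return false;
        Set<String> common = new HashSet<>(a.links);
        common.retainAll(b.links);
        return common.size() >= 0.95 * a.links.size()
            && common.size() >= 0.95 * b.links.size();
    }

    static Node merge(Node a, Node b) {
        Set<String> union = new LinkedHashSet<>(a.links);
        union.addAll(b.links);
        return new Node(a.url, union);      // keep one URL, take the union of the links
    }
}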
Given a query URL u, a directed graph is built of nodes that are near to u in the web
graph. The graph consists of the node u, B parents of u and for each parent up to BF of its
children other than u, up to F children of u, and for each child up to FB of its parents
other than u. Before making the vicinity graph, it is necessary to determine which nodes
are near u. Here, the Search class provides several functions. The function setNode()
takes the query URL u, and the number of B, F, BF, and FB as arguments. After
retrieving the results [][] from the database, we declare a vector array:
Vector [] Varray = new Vector [5];
Each vector of Varray element is used to save one layer of nodes near u. Varray[0] is for
B, Varray[1] for the URL itself, Varray[2] for F, Varray[3] for BF, and Varray[4] for FB.
The following code was used for the situation when u has more than B parents, as
mentioned in Section 4.1. B random parents are added to the graph.
Random R = new Random();
while (Url.getParentSize() > B){
    int i = R.nextInt(Url.getParentSize());   // random index within the current parent list
    Url.removeParentAt(i);
}
Each object of node in every Vector of Varray thus saves the information concerning its
own URL, parents, and children retrieved from the database.
The vicinity graph was made with vector Gp which contains linklist. Each linklist
saves the URL as its head, while other nodes are used to save children of this URL. The
class Linklist implements a linkable interface, and includes constructor() and several
insert(), remove(), and length() member functions.
The makeGraph (Vector [] data) function searches all vector arrays, which are the
nodes near to u. Then the function saves all the information on the children of each URL
into each linkedlist headed by this URL, and returns a vector. The following is an
example of a piece of code for the makeGraph function.
for (int i = 0; i < Gp1.length; i++){
    for (int j = 0; j < (Gp1[i]).size(); j++){
        node n = (node)((Gp1[i]).elementAt(j));
        LinkedList list = new LinkedList();
        String url = n.getUrl();
        Linkable head = new LinkableString(url);
        list.insertAtHead(head);
        int m = 0;
        while (m < n.getChildrenSize()){
            String child = n.getChild(m);
            Linkable lt = new LinkableString(child);
            list.insertAtTail(lt);
            m++;
        }
        Gp.addElement(list);
    }
}
The remaining part of the makeGraph function performs the duplicate elimination.
4.3 Edge Weight Assignment
In this stage, a weight is assigned to each edge, using the edge weighting scheme of
Bharat and Henzinger [12]. An edge between two nodes on the same host (the host can be
determined from the URL-string) has weight 0. If there are k edges from documents on a
first host to a single document on a second host, each edge is given an authority weight of
1/k. This weight is used when computing the authority score of the document on the
second host. If there are l edges from a single document on the first host to a set of
documents on a second host, each edge has a hub weight of 1/l. This scaling prevents a
single host from having too much influence on the computation.
The resulting weighted graph is referred to as the vicinity graph of u.
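For example, if three documents on one host all point to a single page on another host, each of those edges receives an authority weight of 1/3. The sketch below shows one way this counting could be done; it is an illustration under the stated assumptions, not the project's assignWeight() function.

import java.util.*;

// Assign authority and hub weights to a list of directed edges following the
// scheme described above: edges within a host get weight 0; otherwise an edge
// gets authority weight 1/k (k = edges from one host to one document) and hub
// weight 1/l (l = edges from one document to documents on one host).
public class EdgeWeightSketch {

    static class Edge {
        String fromHost, fromUrl, toHost, toUrl;
        double authorityWeight, hubWeight;
        Edge(String fromHost, String fromUrl, String toHost, String toUrl) {
            this.fromHost = fromHost; this.fromUrl = fromUrl;
            this.toHost = toHost;     this.toUrl = toUrl;
        }
    }

    static void assignWeights(List<Edge> edges) {
        Map<String, Integer> k = new HashMap<>();   // key: fromHost -> toUrl
        Map<String, Integer> l = new HashMap<>();   // key: fromUrl  -> toHost
        for (Edge e : edges) {
            if (e.fromHost.equals(e.toHost)) continue;
            k.merge(e.fromHost + "->" + e.toUrl, 1, Integer::sum);
            l.merge(e.fromUrl + "->" + e.toHost, 1, Integer::sum);
        }
        for (Edge e : edges) {
            if (e.fromHost.equals(e.toHost)) {
                e.authorityWeight = 0.0;            // same-host edges carry no weight
                e.hubWeight = 0.0;
            } else {
                e.authorityWeight = 1.0 / k.get(e.fromHost + "->" + e.toUrl);
                e.hubWeight = 1.0 / l.get(e.fromUrl + "->" + e.toHost);
            }
        }
    }
}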
The assignWeight() function inside the Search class takes the Gp vector as argument,
then calculates the Authority Weight and Hub Weight for each edge. Inside this function,
the host is retrieved from each owner. By using several loops, the problem of k edges
from documents on a first host to a single document on a second host, as discussed above,
can be solved. Finally the information in the vector returned by the assignWeight()
function represents the vicinity graph of u.
4.4 Hub and Authority Score Computation
In this step, the imp algorithm [12] computes hub and authority scores by using the
vicinity graph. The imp algorithm is a straightforward extension of the HITS algorithm
incorporating edge weights.
The reasoning behind the HITS algorithm is that a document that points to many
others is a good hub, and a document that many documents point to is a good authority. It
therefore follows that a document that points to many good authorities is an even better
hub, and similarly a document pointed to by many good hubs is an even better authority.
The HITS computation repeatedly updates hub and authority scores so that documents
with high authority scores are expected to have relevant content, whereas documents with
high hub scores are expected to contain links to sites with relevant content. The
computation of hub and authority scores is performed as follows:
Initialize all elements of the hub vector H to 1.0.
Initialize all elements of the authority vector A to 1.0.
While the vectors H and A have not converged:
For all nodes n in the vicinity graph N,
A[n] := Σ over (n′, n) ∈ edges(N) of H[n′] × authority_weight(n′, n)
For all n in N,
H[n] := Σ over (n, n′) ∈ edges(N) of A[n′] × hub_weight(n, n′)
A worked example of this computation is given in Section 4.5 (a test of the Companion
Algorithm).
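A compact sketch of this iteration, assuming the vicinity graph is given as a set of node names and a list of weighted edges, is shown below. A fixed iteration count stands in for a real convergence test and the score normalization used in practice is omitted; this illustrates the update rules rather than reproducing the project's Search class.

import java.util.*;

// One possible implementation of the weighted hub/authority updates above.
public class HubAuthoritySketch {

    static class WeightedEdge {
        String from, to;
        double hubWeight, authorityWeight;
        WeightedEdge(String from, String to, double hubWeight, double authorityWeight) {
            this.from = from; this.to = to;
            this.hubWeight = hubWeight; this.authorityWeight = authorityWeight;
        }
    }

    // Assumes every edge endpoint appears in the set of nodes.
    static Map<String, Double> authorityScores(Set<String> nodes,
                                               List<WeightedEdge> edges,
                                               int iterations) {
        Map<String, Double> H = new HashMap<>(), A = new HashMap<>();
        for (String n : nodes) { H.put(n, 1.0); A.put(n, 1.0); }   // initialize to 1.0

        for (int it = 0; it < iterations; it++) {
            // A[n] := sum over edges (n', n) of H[n'] * authority_weight(n', n)
            Map<String, Double> newA = new HashMap<>();
            for (String n : nodes) newA.put(n, 0.0);
            for (WeightedEdge e : edges)
                newA.merge(e.to, H.get(e.from) * e.authorityWeight, Double::sum);

            // H[n] := sum over edges (n, n') of A[n'] * hub_weight(n, n')
            Map<String, Double> newH = new HashMap<>();
            for (String n : nodes) newH.put(n, 0.0);
            for (WeightedEdge e : edges)
                newH.merge(e.from, newA.get(e.to) * e.hubWeight, Double::sum);

            A = newA;
            H = newH;
        }
        return A;    // nodes with the highest authority score are the related pages
    }
}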
Note that the algorithm does not claim to find all relevant pages, since there may be
some that have good content but have not been linked to by many authors of other sites.
The Companion algorithm then returns the nodes with the highest authority scores
(excluding u itself) as the pages that are most related to the start page u.
In the software, before computing the Hub and Authority scores, we declare class
nodeWeight:
public class nodeWeight{
private String host;
private String owner;
private double auth;
private double hub;
public Vector Hhub;
public double A;
}
For each URL in the vicinity graph, a nodeW object is created of class nodeWeight (host,
owner). The auth in the object is used to save authority weight, hub is used to save Hub
weight, and the vector Hhub is used to save all the hosts pointing to the URL. These hosts
in the Hhub vector are used to calculate A. Here, A is the A[n] value referred to in
Section 4.4. The Companion algorithm is then used. Inside the assignWeight function,
the two vectors Relav and Auth are used to save Σ H and Σ A for each URL.
• Calculation of Hub: In the vicinity graph (vector Gp), each URL of a head node in
the linkedlist is checked to see how many URLs in the list are pointed to by this
head, and the result saved to vector Relav after removing sites with the same host,
both pointing to it and pointed to by it. Because all elements of the hub vector are
initialized to 1.0, the number of elements in the vector Relav is the H value for
this URL.
• Calculation of Auth: In the vicinity graph, a loop is included to check each URL
inside the linkedlist to see how many heads point to this URL, then the vector
Auth is used to save this value after removing the sites with the same host which
point to it. Because all elements of the Auth vector are initialized to 1.0, the
number of the elements in the vector Auth is the A value for this URL. The key
point of the Companion algorithm is that a URL which points to more URLs is a
good hub, while a URL which is pointed to by more URLs is a good authority.
These two factors can be combined by function calA().
The cA value is A[n], which is Σ over (n′, n) ∈ edges(N) of H[n′] × authority_weight(n′, n). The
Companion algorithm then returns the nodes with higher authority scores as the pages
that are most closely related to the start page u.
The other two functions in the Search class are used to rank the related pages. The
VtoArray () function converts the final vector to an array for sorting, and the
Bubble_Sort() function sorts the array depending on the A value in the nodeWeight
object.
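A minimal sketch of such a sort, assuming the nodeWeight class declared above with its public field A, might look like the following; it illustrates the ranking step rather than reproducing the project's Bubble_Sort() function.

// Sort an array of nodeWeight objects into descending order of their A value,
// so that the most closely related pages appear first.
static void bubbleSortByA(nodeWeight[] nodes) {
    for (int i = 0; i < nodes.length - 1; i++) {
        for (int j = 0; j < nodes.length - 1 - i; j++) {
            if (nodes[j].A < nodes[j + 1].A) {          // larger A should come earlier
                nodeWeight tmp = nodes[j];
                nodes[j] = nodes[j + 1];
                nodes[j + 1] = tmp;
            }
        }
    }
}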
4.5 An Example
To test the Companion Algorithm, we used the example in Fig 4.1 to calculate the
Hub and Authority scores by running our system. The graph was saved into the database
by inserting the parent and child relationships shown in Table 4.2. Every line (“parent url
child”) represents one relation in the node1 class object.
After saving Table 4.2 into the database, the related pages were retrieved by entering
the following query.
> java Search http://www.eng.auburn.edu/hu/
The program built the graph, calculated the Authority Weight by using the Companion
Algorithm, and then returned the related pages ranked according to their A values. Figure
4.2 shows the results, with the Authority Weight for each node in the vicinity graph.
Table 4.2 URLs and their Parent and child URLs (format: parent URL, space, URL,
space, child URL).
http://www.voa.com/special http://www.usa.com http://www.china.com
none http://www.usa.com http://www.auburn.edu
http://www.usa.com http://www.china.com http://www.eng.auburn.edu
none http://www.china.com http://www.voa.com/special
none http://www.china.com http://www.eng.auburn.edu/hu
http://www.china.com http://www.eng.auburn.edu http://www.auburn.edu/sun
http://www.china.com http://www.voa.com/special http://www.usa.com
http://www.auburn.edu http://www.voa.com/special none
http://www.china.com http://www.eng.auburn.edu/hu http://www.auburn.edu/sun
http://www.voa.com/special http://www.eng.auburn.edu/hu http://www.cnn.com
http://www.abc.com http://www.eng.auburn.edu/hu http://www.auburn.edu
http://www.eng.auburn.edu http://www.auburn.edu/sun http://www.cnn.com
http://www.eng.auburn.edu/hu http://www.auburn.edu/sun none
http://www.cnn.com http://www.auburn.edu/sun none
http://www.usa.com http://www.auburn.edu http://www.voa.com/special
http://www.eng.auburn.edu/hu http://www.auburn.edu http://www.abc.com
http://www.auburn.edu http://www.abc.com http://www.eng.auburn.edu/hu
http://www.cnn.com http://www.abc.com none
http://www.eng.auburn.edu/hu http://www.cnn.com http://www.abc.com
http://www.auburn.edu/sun http://www.cnn.com http://www.auburn.edu/sun
From these test results, we can see:
1. The results (authority weight A) from the program running the Companion
Algorithm are the same as those obtained by manual calculation. For example:
http://www.abc.com/ points to one other node, so its hub =1.0. Two other nodes,
http://www.auburn.edu/ and http://www.cnn.com/, point to it, so the auth value of
http://www.abc.com/ is 2.

Figure 4.2 Search results of the query: http://www.eng.auburn.edu/hu/. The A is the
value A[n] which is used for ranking.

When calculating A[n] for http://www.abc.com/, we need
to combine (sum) the authority value of http://www.auburn.edu/, which is 2, and
http://www.cnn.com/, which is 2. The A[n] for http://www.abc.com/ is calculated by
multiplying hub for http://www.abc.com/ times the authority of http://www.auburn.edu/
plus the hub for http://www.abc.com/ times the authority of http://www.cnn.com/. That
is, 1×2 +1×2 =4. This confirms that our program executes the Algorithm correctly.
2. The program successfully executed the Duplicate Elimination step. For example,
for http://www.eng.auburn.edu/hu, a is 2 rather than 3. If the fact that
http://www.auburn.edu and http://www.auburn.edu/sun share the same host had not been
taken into account, the a value for http://www.eng.auburn.edu would be 3.
3. In this testing process, the B, BF, F and FB values were each set to 10. Thus, the
vicinity graph includes all the nodes listed here. It is not possible to discern the effect of
these values on the Companion Algorithm. This issue will be discussed in detail in the
next chapter.
Chapter 5 Experimental Results
Our experiments were conducted using an Ultra 5 Sun Workstation, which is part of
the engineering network, running Solaris 2.6. Perl5, JDK 1.1.7, and Oracle 8.1.1 were
among the tools used for this project.
5.1 The Spider
In this experiment, real web pages were explored by our spider to first build up a
database, then queries were made to retrieve data from this database and the Companion
algorithm was applied. First, the spider searched the Web and downloaded URLs and
parents and children relations. Fig 5.1 shows the interface of the spider with seed URL
http://www.eng.auburn.edu/, and Fig 5.2 shows the crawling results obtained.
Figure 5.2 The crawling results of the spider from the seed URL
http://www.eng.auburn.edu/
Lacking the resources of other research groups [5], the scale of this research was
necessarily more limited. However, several seed URLs were selected which share some
relationships to build up our database. The following five URLs were selected as seed
URLs for our spider to use to download URLs.
http://www.auburn.edu/
http://www.lib.auburn.edu/
http://www.eng.auburn.edu/
http://www.theplainsman.com
http://www.oasis.edu/
The web page of each URL was searched, along with three further pages the spider
crawled to from it, and the URLs obtained from these pages were saved
to the table named “address”. In this way, about 600 records of URL
relationships retrieved from these pages were saved in the database.
5.2 Checking the Download
After inserting the data into the Oracle database, we were able to check the
content of the database by checking the download interface. By using the “Checking
Download” interface, the user can enter the number of URLs checked (Fig 5.3). Then the
system returns the URLs saved in the database. Figure 5.4 shows the URLs saved in the
database.
5.3 Searching and Ranking
The experimental results were obtained by running the command
> java Search retrievingURL B BF F FB
where B, BF, F, and FB are the number of nodes in a specified layer of the vicinity graph.
They represent “Go back,” “Go back-forward,” “Go forward,” and “Go forward-back.”
Example 1: Figure 5.5 shows the results of the query obtained from the command:
java Search http://www.oasis.auburn.edu 500 500 500 500
With these B, BF, F, FB values, all the nodes inside the database were included. The
search program gives http://www.auburn.edu/ as the closest related page.
Figure 5.5 Search results of the query: URL = http://www.oasis.auburn.edu/, B=500,
BF=500, F=500, FB=500.
Example 2: The second experiment was conducted using the query:
java Search http://www.lib.auburn.edu/ 500 500 500 500
The sorted results are shown in Fig 5.6, with the closest related page identified as
http://www.univrel.edu/.
Figure 5.6 Search results of the query: URL = http://www.lib.auburn.edu/, B=500,
BF=500, F=500, and FB=500.
From these figures, we find that one query sometimes returns several related URLs with the
same value for the authority. For example, the query http://www.oasis.auburn.edu/
returns both http://www.auburn.edu/ and http://www.lib.auburn.edu/ with 12 as the
authority. The reason might be that our database information is limited. When the
number of URLs in the database is increased, this problem is likely to disappear.
To investigate the effects of different B, BF, F, and FB values, 300 and 100 were also
used for these values in the query http://www.oasis.auburn.edu/.
java Search http://www.oasis.auburn.edu 300 300 300 300
java Search http://www.oasis.auburn.edu 100 100 100 100
java Search http://www.oasis.auburn.edu 100 5 100 5
The results for these experiments are the same. This implies that we cannot test the B,
BF, F, FB effect from our limited resources. When a large database is available, these
variables will have an effect on the authority. Dean and Henzinger proposed [5] that
using a large value for B (2000) and a small value for BF (8) works better in practice than
using moderate values for each (say, 50 and 50).
Figure 5.7 shows the search results for the query http://www.eng.auburn.edu/.
The system returns http://aos.auburn.edu/ as the highest-ranked page. To test whether this
relation is symmetric, we sent http://aos.auburn.edu/ as the query. Figure 5.8 shows the
results. Comparing both figures, we can see that the related-page relation is not symmetric.
Because the related pages depend on the links, and different web pages have different links,
these results are reasonable.
Figure 5.7 Search results for the query http://www.eng.auburn.edu/
Given the limitations of our small database, it is difficult to directly compare our
search engine with other search engines. For broad topics such as news or football,
other search engines return several hundred or even thousands of addresses. Even for the
query “Auburn University”, http://www.excite.com/ returns 195,540 web sites. One
page usually leads to a hundred more pages. Checking all of these pages is not feasible with
Figure 5.8 Search results for the query http://aos.auburn.edu/
our limited database. However, this project has successfully demonstrated the use of the
Companion algorithm to rank the URLs which are returned in a search.
The Companion algorithm used the HITS algorithm as a starting point, but improved
on it in several ways. For example, our vicinity graph is structured to exclude grandparent
nodes but to include nodes that share a child with u. These nodes are believed to be more
likely to be related to u than the grandparent nodes. We also exploited the order of the
links on a page to determine which “siblings” of u to include. The other improvement is
the technique whereby nodes in our vicinity graph that have a large number of duplicate
links are merged.
Chapter 6. Conclusions
1. A ranking of related web pages can be obtained using only the hyperlink information from
each web page.
2. This project presented the Companion algorithm for finding related pages in the
World Wide Web. The algorithms developed can be implemented efficiently and
are suitable for use within a large-scale web service providing a related pages
feature.
3. The B, F, BF, and FB values used to build the vicinity graph for a given query
have an effect on the performance of the search engine.
4. The disadvantage of this project is that our database was too small to test more of
the parameters concerned with this algorithm. This should be solved when it is
possible to access a server with a larger database.
References
1. Steve Lawrence and C. Lee Giles, Accessibility of information on the web,
Nature, Vol. 400, pages 107-109, 1999.
2. P. Pirolli, J. Pitkow, and R. Rao. Silk from a sow’s ear: Extracting usable
structures from the web. In Proceedings of the Conference on Human Factors in
Computing Systems (CHI 96), pages 118-125, April 1996.
3. J. Carriere and R. Kazman. Webquery: Searching and visualizing the web
through connectivity. In Proceedings of the Sixth International World Wide Web
Conference, pages 701-711.
4. Loren Terveen and Will Hill. Evaluating emergent collaboration on the web. In
Proceedings of ACM CSCW’98 Conference on Computer-Supported Cooperative
Work, Social Filtering, Social Influences, pages 355-362, 1998.
5. Jeffrey Dean and Monika R. Henzinger. Finding related pages in the World Wide
Web. In Proceedings of the 8th International World Wide Web Conference, Toronto, Canada, 1999.
6. J. Kleinberg. Authoritative sources in a hyperlinked environment. In Proceedings
of the Seventh International World Wide Web Conference, pages 259-270.
7. J. Kleinberg. Authoritative sources in a hyperlinked environment. In
Proceedings of the 9th Annual ACM-SIAM Symposium on Discrete Algorithms,
pages 668-677, January 1998.
8. J. Kleinberg. Authoritative sources in a hyperlinked environment. Proc. 9th ACM-
SIAM Symposium on Discrete Algorithms, 1998. Extended version in Journal of
the ACM 46 (1999).
9. J. Kleinberg, S. R. Kumar, P. Raghavan, S. Rajagopalan, A. Tomkins. The web
as a graph: measurements, models and methods. Invited survey at the
International Conference on Combinatorics and Computing, 1999.
10. S. Chakrabarti, et al. Mining the Web’s link structure. IEEE Computer, (32) 8,
pages 60-67, August 1999.
11. S. Chakrabarti, B. Dom, D. Gibson, J. Kleinberg, P. Raghavan, S. Rajagopalan.
Automatic resource list compilation by analyzing hyperlink structure and
associated text. In Proc. 7th International World Wide Web Conference, 1998.
12. K. Bharat and M. Henzinger. Improved algorithms for topic distillation in
hyperlinked environments. In Proceedings of the 21st International ACM SIGIR
Conference on Research and Development in Information Retrieval (SIGIR ’98),
pages 104-111, 1998.