Analysis of Link Structures on the World Wide Web and Classified Improvements
Greg Nilsen
University of Pittsburgh
April 2003
The Problem

- The web is a complex, unorganized structure.
- Search engines can be fooled: search engine designers vs. advertisers.
- User feedback is rarely used to quantify results.
Outline

- Background: Kleinberg’s Algorithm
- The Idea
- Implementation
- Results and Conclusions
- References
Background – Kleinberg’s Algorithm

Basic idea:
- Create a focused subgraph of the web
- Iteratively compute hub and authority scores
- Filter out the top hubs and authorities

Extended ideas:
- Similar-page queries
- Non-principal eigenvectors
Background – Kleinberg’s Algorithm

Create a focused subgraph of the web (a base set of pages). Why? We need a set that is:
- relatively small,
- rich in relevant pages, and
- contains most of the strongest authorities.
Background – Kleinberg’s Algorithm

Start with a root set: in our case, the data set began with the first 200 results of a text-based search on AltaVista.

Create the base set: add every page that links to, or is linked from, any page in the root set.
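The base-set construction can be sketched as follows. This is an illustrative Python version (the deck’s actual implementation was in C++); the link graph is assumed to be a set of (source, target) pairs over page identifiers.

```python
def base_set(root, links):
    """Expand a root set with every page that links to or from it."""
    base = set(root)
    for src, dst in links:
        if src in root:
            base.add(dst)   # pages the root set points to
        if dst in root:
            base.add(src)   # pages that point into the root set
    return base
```

Kleinberg’s paper additionally caps how many in-linking pages are added per root page (since popular pages can have enormous in-degree); the sketch omits that detail.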
Background – Kleinberg’s Algorithm

[Diagram, animated across several slides: the root set is progressively expanded with its in- and out-links to form the base set.]
Background – Kleinberg’s Algorithm

Now that we have a focused subgraph, we need to compute hub and authority scores. Start by initializing every page’s hub and authority weights to 1, then compute new scores:

Hub score = Σ (authority scores of all pages the hub points to)
Authority score = Σ (hub scores of all pages that point to the authority)
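The update rules above, together with the normalization described on the next slide, can be sketched in a few lines. This is a toy illustration (the original program was in C++); `pages` lists every page in the base set and `links` is a set of (source, target) pairs.

```python
import math

def hits(pages, links, iterations=20):
    """One run of Kleinberg's iterative hub/authority computation."""
    hub = {p: 1.0 for p in pages}    # all weights start at 1
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # authority score = sum of hub scores of pages pointing at it
        auth = {p: sum(hub[s] for s, t in links if t == p) for p in pages}
        # hub score = sum of authority scores of pages it points at
        hub = {p: sum(auth[t] for s, t in links if s == p) for p in pages}
        # normalize each vector so its squared weights sum to one
        for vec in (auth, hub):
            norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
            for p in vec:
                vec[p] /= norm
    return hub, auth
```

On a toy graph where pages a and b both point at c, the iteration immediately settles on c as the sole authority and a, b as equal hubs.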
Background – Kleinberg’s Algorithm

Normalize the new weights (hubs and authorities separately) so that the sum of their squares equals one. Repeat the weight computation and normalization until the scores converge (typically about 20 iterations). Once the scores have converged, take the pages with the top authority scores as the top results.
Background – Kleinberg’s Algorithm

Similar-page queries: once we produce results, a searcher may wish to find pages similar to a given result. To do this, we reuse the algorithm discussed above, but this time we build a root set from the pages that point to the given page. We then grow this into a base set and determine the hubs and authorities for the new set; the top authorities are pages similar to the initial page.
Background – Kleinberg’s Algorithm

Non-principal eigenvectors: each eigenvector corresponds to a densely linked collection of hubs and authorities within the subgraph. Kleinberg’s iterative computation of hub and authority scores converges to the principal eigenvector. However, the principal eigenvector may not contain all of the information the searcher wants.
Background – Kleinberg’s Algorithm

Example: a search for “jaguar” produces three strong eigenvectors, one for each meaning of the word:
1. Jaguar, the car
2. Jaguar, the cat
3. The Jacksonville Jaguars NFL team

Which of these is returned as the principal eigenvector depends heavily on the initial set of pages, and we cannot determine which of the three meanings the searcher intended.
Background – Kleinberg’s Algorithm

Therefore, we can also report results drawn from other “strong” eigenvectors. Even so, we can still miss relevant pages: the search for “WWW conferences”, for example, produces its most pertinent results on the 11th non-principal eigenvector. How to determine which eigenvectors are relevant is still an open research question.
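The hub/authority iteration is power iteration on the matrices A·Aᵀ and Aᵀ·A (A being the adjacency matrix of the base set), so the non-principal eigenvectors discussed above can also be read off directly with a linear-algebra routine. A sketch with NumPy on a hypothetical toy graph containing two disjoint topic clusters:

```python
import numpy as np

# Toy adjacency matrix: pages 0-1 point at page 2 (one topic cluster),
# pages 3-5 point at page 6 (a second, denser cluster).
A = np.zeros((7, 7))
A[0, 2] = A[1, 2] = 1
A[3, 6] = A[4, 6] = A[5, 6] = 1

# Authority scores are eigenvectors of A^T A. eigh returns eigenvalues
# in ascending order, so the last column is the principal eigenvector.
vals, vecs = np.linalg.eigh(A.T @ A)
principal = np.abs(vecs[:, -1])   # authorities of the strongest cluster
second = np.abs(vecs[:, -2])      # a "non-principal" cluster
```

Here the principal eigenvector concentrates on page 6 (three in-links) and the second on page 2 (two in-links), mirroring how distinct meanings of a query surface on distinct eigenvectors.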
The Idea

Kleinberg’s algorithm produces “good” results, but it is subject to a phenomenon known as “topic drift”: the hub weights of sites such as yahoo.com or eBay.com cause irrelevant clusters to be identified as major eigenvectors. So while link structure tells us a great deal about a query, additional information seems necessary.
The Idea

Kleinberg’s algorithm also uses only the top authority scores, but there may be useful pages that rank strongly as hubs. Since web search is an application driven by user satisfaction, we can use user feedback to weight hub and authority scores and then classify the “better” results using support vector machines (SVMs).
The Idea

[Two slides of figures: a scatter plot of hub vs. authority scores, then the same plot with the dividing hyperplane separating useful from non-useful results.]
The Idea

By compiling data from different types of searches, we may be able to generalize this hyperplane so that we pull more of the relevant results out of the output of Kleinberg’s algorithm.
Implementation

Start with data from the University of Toronto’s Link Analysis Ranking Algorithm repository, which contains data for 8 distinct types of searches. (Obtaining raw results from a text-based search engine is difficult now that search engines have become more sophisticated.)
Implementation

Next, implement Kleinberg’s algorithm as a C++ program that reads in the datasets and outputs a web page listing the top 50 hubs and top 50 authorities.

Then compile a survey asking participants whether each result is useful, drawn from a mixture of the top 25 hubs and top 25 authorities for the searches “abortion” (a search that tends to produce two distinct groups) and “genetic” (a search that is more generic in nature).
Implementation

Using the survey results, assign each result a class label (1 or 0). With these labels, perform SVM learning in Matlab, using the hub and authority scores as input and the class label as output.
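The learning step can be sketched as follows. The original learning was done in Matlab; this is a hypothetical re-creation with scikit-learn, and the data values are illustrative placeholders, not the survey data.

```python
from sklearn.svm import LinearSVC

# Each row of X is one result page's (hub score, authority score);
# y holds the survey-derived class labels (1 = participants found it useful).
X = [[0.90, 0.10], [0.80, 0.20], [0.10, 0.90], [0.20, 0.80],   # useful
     [0.10, 0.10], [0.20, 0.10], [0.05, 0.20], [0.10, 0.05]]   # not useful
y = [1, 1, 1, 1, 0, 0, 0, 0]

svm = LinearSVC()          # fits the dividing hyperplane w.x + b = 0
svm.fit(X, y)
w, b = svm.coef_[0], svm.intercept_[0]
```

The learned weight vector `w` and intercept `b` define the dividing hyperplane pictured earlier.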
Implementation

Plug the weights learned by the SVM back into our initial program to compute SVM scores for all web pages. Sort the web pages by SVM score and output the top 50 results to a web page.
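Given the learned hyperplane, this re-ranking step reduces to a linear score and a sort. A sketch (the function name and weight values here are illustrative, not from the original program):

```python
def rank_by_svm_score(pages, w_hub, w_auth, b):
    """pages: list of (url, hub, auth) tuples; returns urls, best first.

    The SVM score of a page is its signed distance from the hyperplane:
    w_hub * hub + w_auth * auth + b.
    """
    scored = [(w_hub * h + w_auth * a + b, url) for url, h, a in pages]
    return [url for _, url in sorted(scored, reverse=True)]
```

Because the score mixes both hub and authority weight, strong hubs can now surface in the top results instead of only the top authorities.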
Results

Mean misclassification rates for the entire dataset (“genetic” and “abortion” terms): training = 0.3922, testing = 0.4063.

Same dataset renormalized against the largest value in the vector: training = 0.3418, testing = 0.3345.
Results

Mean misclassification rates for the “abortion” dataset: training = 0.1862, testing = 0.1888.

Renormalized “abortion” dataset: training = 0.1878, testing = 0.1850.
Results

Mean misclassification rates for the “genetic” dataset: training = 0.3530, testing = 0.3566.

Renormalized “genetic” dataset: training = 0.3548, testing = 0.3463.
Results

Mean misclassification rates for individual participants: [per-participant table shown on the original slide; not captured in the transcript.]
Results

Mean misclassification rates for the overall dataset expanded to include the features Auth/Hub, Auth², Hub², and Auth·Hub: training = 0.2886, testing = 0.2903.

Renormalized “genetic” dataset: training = 0.2848, testing = 0.2861.
Conclusions

While the current results show a significant improvement on some searches in our datasets, on others the improvement is small. This may be because user feedback was limited.
Conclusions

We have shown that generalizing over the entire dataset, or on a per-person basis, does not produce good results: too many factors are combined over too few features. We have also demonstrated that generalizing on a per-search basis, together with expanded feature sets, may allow better learning of user preferences.
References

- J. Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM (JACM), 46, 1999.
- A. Borodin, G. Roberts, J. Rosenthal, and P. Tsaparas. Finding authorities and hubs from link structures on the World Wide Web. Proceedings of the 10th International World Wide Web Conference, 2001.
- S. Brin and L. Page. The anatomy of a large-scale hypertextual Web search engine. Proceedings of the 7th International World Wide Web Conference, 1998.
- R. A. Botafogo, E. Rivlin, and B. Shneiderman. Structural analysis of hypertexts: identifying hierarchies and useful metrics. ACM Transactions on Information Systems, 10(2):142-180, 1992.
- J. Carrière and R. Kazman. WebQuery: searching and visualizing the Web through connectivity. Proceedings of WWW6, Santa Clara, CA, April 1997.
- E. Garfield. Citation analysis as a tool in journal evaluation. Science, 178:471-479, 1972.
- L. Katz. A new status index derived from sociometric analysis. Psychometrika, 18:39-43, 1953.
- G. Pinski and F. Narin. Citation influence for journal aggregates of scientific publications: theory with application to the literature of physics. Information Processing & Management, 12:297-312, 1976.
- C. H. Hubbell. An input-output approach to clique identification. Sociometry, 28:377-399, 1965.