page ranking algorithm


Upload: jav7zaid

Post on 15-Jul-2015


“PAGE RANKING” ALGORITHM

INTRODUCTION

• Finding useful information on the World Wide Web is something many of us take for granted. According to the Internet research firm Netcraft, there are nearly 150,000,000 active Web sites on the Internet today.

• Google's algorithm does the work for you by searching out Web pages that contain the keywords you used to search, then assigning a rank to each page based on several factors, including how many times the keywords appear on the page. Higher ranked pages appear further up in Google's search engine results page (SERP), meaning that the best links relating to your search query are theoretically the first ones Google lists.

• Automated programs called spiders or crawlers travel the Web, moving from link to link and building up an index page that includes certain keywords. Google references this index when a user enters a search query. The search engine lists the pages that contain the same keywords that were in the user's search terms.

• Also like other search engines, Google has a large index of keywords and where those words can be found. What sets Google apart is how it ranks search results, which in turn determines the order Google displays results on its search engine results page (SERP). Google uses a trademarked algorithm called PageRank, which assigns each Web page a relevancy score.

• Keyword placement plays a part in how Google finds sites. Google looks for keywords throughout each Web page, but some sections are more important than others. Including the keyword in the Web page's title is a good idea, for example. Google also searches for keywords in headings.

How does Google decide which pages to select and which to leave out? It does this by asking questions, about 200 of them. A few important ones are:

i. How many times does the keyword appear in the page, i.e. what is the frequency of the word in the page?

ii. Do the words appear in the title, in the URL, directly adjacent to each other, or in a meta tag?

iii. Does the page include synonyms?

iv. Is the page from a high-quality or a low-quality website?

v. What is the page's PageRank?
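The first two questions can be illustrated with a small sketch. This is not Google's actual scoring: the function name `keyword_signals`, the example page, and the choice of signals are all assumptions made for illustration.

```python
import re

def keyword_signals(keyword, title, url, body):
    """Return simple, illustrative signals for one keyword on one page.

    Hypothetical helper: it only counts exact word matches in the body
    and checks whether the keyword occurs in the title and the URL.
    """
    kw = keyword.lower()
    words = re.findall(r"[a-z0-9]+", body.lower())
    return {
        "frequency": words.count(kw),     # question i: how often the word appears
        "in_title": kw in title.lower(),  # question ii: keyword in the title
        "in_url": kw in url.lower(),      # question ii: keyword in the URL
    }

# Made-up page used only to exercise the function.
signals = keyword_signals(
    "pagerank",
    title="How PageRank works",
    url="http://example.com/pagerank-intro",
    body="PageRank counts links. A link is a vote, and PageRank weighs each vote.",
)
print(signals)
```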

PAGERANKING ALGORITHM

• Google’s PageRank algorithm has become one of the most famous in computer science. It was originally designed to rank websites by importance, on the assumption that a site is important if other important sites link to it. This follows a real-life idea:

“A product or an individual becomes popular when other people know about that product or individual.”

In the same way, a page's rank rises when other web pages link to it.

• The algorithm works by counting the links to a website and the importance of the sites these come from. It then uses this to work out the importance of the original site. Through a process of iteration, the algorithm comes up with a ranking.

• PageRank assigns a rank or score to every search result. The higher the page's score, the further up the search results list it will appear.

• Scores are partially determined by the number of other Web pages that link to the target page. Each link is counted as a vote for the target. The logic behind this is that pages with high-quality content will be linked to more often than mediocre pages.

• Not all votes are equal. Votes from a high-ranking Web page count more than votes from low-ranking sites. You can't really boost one Web page's rank by making a bunch of empty Web sites linking back to the target page.

• The more links a Web page sends out, the more diluted its voting power becomes. In other words, if a high-ranking page links to hundreds of other pages, each individual vote won't count as much as it would if the page only linked to a few sites.

• Other factors that might affect scoring include how long the site has been around, the strength of the domain name, how and where the keywords appear on the site, and the age of the links going to and from the site. Google tends to place more value on sites that have been around for a while.

A Web page's PageRank depends on a few factors:

• The frequency and location of keywords within the Web page: If the keyword only appears once within the body of a page, it will receive a low score for that keyword.

• How long the Web page has existed: People create new Web pages every day, and not all of them stick around for long. Google places more value on pages with an established history.

• The number of other Web pages that link to the page in question: Google looks at how many Web pages link to a particular site to determine its relevance.

• Out of these three factors, the third is the most important. It's easier to understand it with an example.

• Let's look at a search for the terms "Planet Earth."

• As more Web pages link to Discovery's Planet Earth page, the Discovery page's rank increases. When Discovery's page ranks higher than other pages, it shows up at the top of the Google search results page.

PageRank description

We assume page A has pages T1...Tn which point to it.

The parameter d is a damping factor which can be set between 0 and 1. We usually set d to 0.85.

The PageRank theory holds that an imaginary surfer who is randomly clicking on links will eventually stop clicking.

The probability, at any step, that the person will continue clicking is the damping factor d.

Various studies have tested different damping factors, but it is generally assumed that the damping factor will be set around 0.85.

Also C(A) is defined as the number of links going out of page A.

The PageRank of a page A is given as follows:

PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))

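The Google paper says that the PageRanks form a probability distribution over web pages, “so the sum of all web pages' PageRanks will be one”. With the formula exactly as written above, and every page having at least one outgoing link, the values instead average out to 1.0 per page (they sum to the number of pages); dividing the (1-d) term by the number of pages makes them sum to one.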
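As a minimal sketch, the formula for a single page can be written as one function. The name `page_rank_of` and the input values below are made up for illustration; only the formula itself comes from the text above.

```python
def page_rank_of(incoming, d=0.85):
    """Apply PR(A) = (1 - d) + d * (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)) once.

    incoming: list of (PR(Ti), C(Ti)) pairs, one per page Ti that links to A.
    """
    return (1 - d) + d * sum(pr / out_links for pr, out_links in incoming)

# Hypothetical inputs: T1 has PR 1.0 and 2 outgoing links,
# T2 has PR 1.0 and 1 outgoing link.
pr_a = page_rank_of([(1.0, 2), (1.0, 1)])
print(pr_a)
```

T1 passes on only half of its vote because it links to two pages, while T2 passes on all of it. This is the dilution of voting power described earlier.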

How is PageRank Calculated?

• The PR of each page depends on the PR of the pages pointing to it. But we won’t know what PR those pages have until the pages pointing to them have their PR calculated and so on… And when you consider that page links can form circles it seems impossible to do this calculation!

• The Google paper says:

“PageRank or PR(A) can be calculated using a simple iterative algorithm, and corresponds to the principal eigenvector of the normalized link matrix of the web.”

What that means to us is that we can just go ahead and calculate a page’s PR without knowing the final value of the PR of the other pages. That seems strange but, basically, each time we run the calculation we’re getting a closer estimate of the final value. So all we need to do is remember each value we calculate and repeat the calculations lots of times until the numbers stop changing much.
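The "repeat until the numbers stop changing much" idea can be sketched as follows. The function name `pagerank`, the stopping threshold `tol`, the cap `max_iter`, and the three-page link structure are assumptions chosen for illustration, not part of the original description.

```python
def pagerank(links, d=0.85, tol=1e-6, max_iter=1000):
    """Iterate the PageRank formula until the values stop changing much.

    links maps each page to the list of pages it links to.
    Assumes every page has at least one outgoing link.
    """
    pr = {page: 1.0 for page in links}  # initial guess for every page
    for _ in range(max_iter):
        new_pr = {}
        for page in links:
            # Sum PR(T)/C(T) over every page T that links to this page.
            incoming = sum(
                pr[src] / len(out) for src, out in links.items() if page in out
            )
            new_pr[page] = (1 - d) + d * incoming
        change = max(abs(new_pr[p] - pr[p]) for p in links)
        pr = new_pr
        if change < tol:  # the numbers have stopped changing much
            break
    return pr

# Hypothetical three-page site: A links to B and C, B links to C, C links to A.
ranks = pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]})
print(ranks)
```

This sketch recomputes all pages from the previous round's values before replacing them. Updating the values one page at a time, as a hand calculation would, settles on the same final numbers.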

Let's take the simplest example network: two pages, A and B, each pointing to the other:

Each page has one outgoing link (the outgoing count is 1, i.e. C(A) = 1 and C(B) = 1).

1. GUESS 1 (d = 0.85)

We don’t know what their PR should be to begin with, so let’s take a guess at 1.0 and do some calculations:

PR(A) = (1 – d) + d(PR(B)/1)

PR(B) = (1 – d) + d(PR(A)/1)

i.e.

PR(A) = 0.15 + 0.85 * 1 = 1

PR(B) = 0.15 + 0.85 * 1 = 1

2. GUESS 2

Ok, let’s start the guess at 0 instead and re-calculate:

PR(A) = 0.15 + 0.85 * 0 = 0.15

PR(B) = 0.15 + 0.85 * 0.15 = 0.2775

And again:

PR(A) = 0.15 + 0.85 * 0.2775 = 0.385875

PR(B) = 0.15 + 0.85 * 0.385875 = 0.47799375

And again:

PR(A) = 0.15 + 0.85 * 0.47799375 = 0.5562946875

PR(B) = 0.15 + 0.85 * 0.5562946875 = 0.622850484375

and so on. The numbers just keep going up. But will the numbers stop increasing when they get to 1.0? What if a calculation over-shoots and goes above 1.0?

3. GUESS 3

Let’s start the guess at 40 each and do a few cycles:

PR(A) = 40, PR(B) = 40

First calculation:

PR(A) = 0.15 + 0.85 * 40 = 34.15

PR(B) = 0.15 + 0.85 * 34.15 = 29.1775

And again:

PR(A) = 0.15 + 0.85 * 29.1775 = 24.950875

PR(B) = 0.15 + 0.85 * 24.950875 = 21.35824375

This time the numbers keep going down, again heading toward 1.0.

• Principle: it doesn’t matter where you start your guess. Once the PageRank calculations have settled down, the “normalized probability distribution” (the average PageRank for all pages) will be 1.0.
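The three guesses can be checked with a short sketch. The function name `two_page_cycles` and the choice of 100 cycles are assumptions made for illustration. As in the hand calculation, each new PR(A) is used immediately when computing PR(B).

```python
def two_page_cycles(start, cycles, d=0.85):
    """Two pages A and B that link only to each other, so C(A) = C(B) = 1."""
    pr_a = pr_b = start
    for _ in range(cycles):
        pr_a = (1 - d) + d * pr_b  # uses the latest PR(B)
        pr_b = (1 - d) + d * pr_a  # uses the PR(A) just computed
    return pr_a, pr_b

# Starting guesses of 1.0, 0 and 40.
for guess in (1.0, 0.0, 40.0):
    print(guess, two_page_cycles(guess, 100))
```

Whatever the starting guess, both values end up very close to 1.0 after enough cycles.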

For a page D with no backlinks, the equation looks like this:

PR(D) = (1-d) + d * (0)

= 0.15

no matter what else is going on or how many times you do it.

Observation: every page has at least a PR of 0.15 to share out.

• Our home page has 2 and a half times as much PR as the child pages! Excellent!

• This is what we’d expect. All the pages have the same number of incoming links, all pages are of equal importance to each other, all pages get the same PR of 1.0 (i.e. the “average” probability).

EXAMPLES

• Because Google looks at links to a Web page as a vote, it's not easy to cheat the system. The best way to make sure your Web page is high up on Google's search results is to provide great content so that people will link back to your page. The more links your page gets, the higher its PageRank score will be. If you attract the attention of sites with a high PageRank score, your score will grow faster.

• Mega-sites, like http://news.bbc.co.uk have tens or hundreds of editors writing new content – i.e. new pages - all day long! Each one of those pages has rich, worthwhile content of its own and a link back to its parent or the home page! That’s why the Home page Toolbar PR of these sites is 9/10 and the rest of us just get pushed lower and lower by comparison…

• Principle: Content Is King! There really is no substitute for lots of good content…

Steps to enhance your PAGERANK

1. Give visitors the information they're looking for

• Provide high-quality content on your pages, especially your homepage. This is the single most important thing to do. If your pages contain useful information, their content will attract many visitors and entice webmasters to link to your site. Think about the words users would type to find your pages and include those words on your site.

2. Make sure that other sites link to yours

• Links help our crawlers find your site and can give your site greater visibility in our search results. When returning results for a search, Google uses sophisticated text-matching techniques to display pages that are both important and relevant to each search. Google interprets a link from page A to page B as a vote by page A for page B.

3. Make your site easily accessible

• Build your site with a logical link structure. Every page should be reachable from at least one static text link.

BIBLIOGRAPHY

• http://www.google.com/googlebot

• www.wikipedia.org

• http://infolab.stanford.edu/~backrub/google.html

THANK YOU