
Upload: owen-carter

Post on 28-Mar-2015


Page 1: Advanced Googolgy Based on a presentation by Patrick Douglas Crispen California State University, Long Beach

Advanced Googolgy

Based on a presentation by Patrick Douglas Crispen
California State University, Long Beach
http://www.youtube.com/watch?v=JFyNzrxQCus
http://netsquirrel.com/crispen/about_crispen.html

Page 2

Our Goals

• Learn how Google really works.
• Discover some Google secrets no one ever tells you.
• Play around with some of Google’s advanced search operators.
• Find out where to get more Google-related help and information.
• DO ALL OF THIS IN ENGLISH!

Page 3

Part One: How Google REALLY Works

Or, at least, how I think Google really works.

Page 4

One Word of Warning

• For obvious reasons, the folks at Google would rather the Wizard of Oz stay behind the curtain, so to speak.

• So, what you are about to see on the next few slides are just plain guesses on my part.

• And, my guesses are probably completely wrong! But they’re pretty. And that’s all that matters.

Page 5

How Google Works - Phrases

• When you search for multiple keywords, Google first searches for all of your keywords as a phrase. I think.

• So, if your keywords are disney fantasyland pirates, any pages on which those words appear as a phrase receive a score of X.

Image source: Google

Source: Google Hacks, p. 21

Page 6

How Google Works - Adjacency

• Google then measures the adjacency between your keywords and gives those pages a score of Y.

• What does this mean in English? Well …

Image source: Google

Source: Google Hacks, p. 21

Page 7

How Adjacency Works

A page that says

“My favorite Disney attraction, outside of Fantasyland, is Pirates of the Caribbean”

will receive a higher adjacency score than a page that says

“Walt Disney was both a genius and a taskmaster. The team at WDI spent many sleepless nights designing Fantasyland. But nothing could compare to the amount of Imagineering work required to create Pirates of the Caribbean.”
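To make the comparison concrete, here is a minimal sketch. It assumes "adjacency" means the shortest run of words that contains every keyword; that definition and the name `smallest_window` are my own illustration, not anything Google has published.

```python
import re

def smallest_window(text, keywords):
    """Length (in words) of the shortest run of words containing every
    keyword at least once, or None if a keyword is missing."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    wanted = {k.lower() for k in keywords}
    best = None
    for start, word in enumerate(words):
        if word not in wanted:
            continue  # a shortest window always starts on a keyword
        seen = set()
        for end in range(start, len(words)):
            if words[end] in wanted:
                seen.add(words[end])
                if seen == wanted:
                    length = end - start + 1
                    if best is None or length < best:
                        best = length
                    break
    return best

keywords = ["disney", "fantasyland", "pirates"]
close_page = ("My favorite Disney attraction, outside of Fantasyland, "
              "is Pirates of the Caribbean")
far_page = ("Walt Disney was both a genius and a taskmaster. The team at WDI "
            "spent many sleepless nights designing Fantasyland. But nothing "
            "could compare to the amount of Imagineering work required to "
            "create Pirates of the Caribbean.")

close_span = smallest_window(close_page, keywords)
far_span = smallest_window(far_page, keywords)
```

A smaller window means the keywords sit closer together, so under this assumption the first page would earn the better adjacency score.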

Page 8

How Google Works - Weights

• Then, Google measures the number of times your keywords appear on the page (the keywords’ “weights”) and gives those pages a score of Z.

• A page that has the word disney four times, fantasyland three times, and pirates seven times would receive a higher weights score than a page that only has those words once.

Source: Google Hacks, p. 21
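A rough sketch of the weights idea, assuming the score is simply the total number of keyword occurrences on the page. The summing rule and the function names are my assumptions; the slide only says that more occurrences score higher.

```python
import re
from collections import Counter

def keyword_weights(text, keywords):
    """Count how many times each keyword appears in the text."""
    counts = Counter(re.findall(r"[a-z0-9']+", text.lower()))
    return {k: counts[k.lower()] for k in keywords}

def weight_score(text, keywords):
    """Assumed scoring rule: add up the individual keyword counts."""
    return sum(keyword_weights(text, keywords).values())

keywords = ["disney", "fantasyland", "pirates"]
busy_page = "disney " * 4 + "fantasyland " * 3 + "pirates " * 7
sparse_page = "disney fantasyland pirates"

busy_score = weight_score(busy_page, keywords)
sparse_score = weight_score(sparse_page, keywords)
```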

Page 9

Putting it All Together

• Google takes
  – The phrase hits (the Xs),
  – The adjacency hits (the Ys),
  – The weights hits (the Zs), and
  – About 100 other secret variables
• Throws out everything but the top 2,000
• Multiplies each remaining page’s individual score by its “PageRank”
• And, finally, displays the top 1,000 in order.

Page 10

PageRank?

• There is a premise in higher education that the importance of a research paper can be judged by the number of citations the paper has from other research papers.

• Google simply applies this premise to the Web: the importance of a Web page can be judged by the number of hyperlinks pointing to it from other pages.

• Or, to put it mathematically [brace yourself – the next slide contains the intimidating-looking equation I warned you about] …

Source: Google Hacks, p. 294

Page 11

The PageRank Algorithm

$$PR(A) = (1 - d) + d \left( \frac{PR(T_1)}{C(T_1)} + \cdots + \frac{PR(T_n)}{C(T_n)} \right)$$

Where

• PR(A) is the PageRank of Page A

• PR(T1) is the PageRank of page T1

• C(T1) is the number of outgoing links from the page T1

• d is a damping factor in the range of 0 < d < 1, usually set to 0.85

Source: Google Hacks, p. 295
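For anyone who would rather read code than an equation, here is a minimal sketch that applies the formula on this slide over and over until the numbers settle. The three-page link graph, the starting value of 1.0, and the iteration count are my own assumptions for illustration.

```python
def pagerank(links, d=0.85, iterations=100):
    """links: dict of page -> list of pages it links to.
    Applies PR(A) = (1 - d) + d * sum(PR(T) / C(T)) repeatedly."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    pr = {page: 1.0 for page in pages}  # assumed starting value
    for _ in range(iterations):
        new_pr = {}
        for a in pages:
            # T1..Tn: every page that links to page a
            incoming = [t for t in pages if a in links.get(t, [])]
            new_pr[a] = (1 - d) + d * sum(pr[t] / len(links[t]) for t in incoming)
        pr = new_pr
    return pr

# Hypothetical web of three pages: A links to B and C, B to C, C to A.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(links)
```

Page C ends up on top because it has two incoming links, and page B ends up last because its only incoming link comes from A, which splits its vote between two outgoing links.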

Page 12

You Can Start Breathing Again

• I promise there are no more equations in this presentation.
• I just wanted to show you that the PageRank of a Web page is built from the pages linking to it: each linking page contributes its own PageRank divided by the number of links on that page, and those contributions are added up.
  – A page with a lot of (incoming) links to it is deemed to be more important than a page with only a few links to it.
  – A link from a page with few (outgoing) links counts for more than a link from a page with links to lots of other pages.

Source: Google Hacks, p. 295

Page 13

Part One: In Summary

• Google first searches for your keywords as a phrase and gives those hits a score of X.

• Google then searches for keyword adjacency and gives those hits a score of Y.

• Google then looks for keyword weights and gives those hits a score of Z.

• Google combines the Xs, the Ys, the Zs, and a whole bunch of unknown variables, and then weeds out all but the top 2,000 scores.

• Finally, Google takes the top 2,000 scores, multiplies each by their respective PageRank, and displays the top 1,000.

• I think.

Page 14

Part Two: More Stuff No One Tells You

Google’s shocking secrets revealed!

Page 15

Many keywords and phrases

• You can search for several keywords.
• To search for phrases, just put your phrase in quotes.
• For example, disney fantasyland “pirates of the caribbean”
  – This would show you all the pages in Google’s index that contain the word disney AND the word fantasyland AND the phrase pirates of the caribbean (without the quotes)
  – And then pages that include 2 (or 1) of these
• By the way, while this search is technically perfect, my choice of keywords contains a (deliberate) factual mistake. Can you spot it?

Source: http://www.google.com/help/refinesearch.html

Page 16

Arr, She Blows!

• Pirates of the Caribbean isn’t in Fantasyland. It’s in Adventureland in Orlando and Paris, and in New Orleans Square in Anaheim.

• So searching for disney AND fantasyland AND “pirates of the caribbean” probably isn’t a good idea.

• YOU have to select sensible search terms!

Image source: http://www.balgavy.at/

Page 17

How Insensitive!

• Google is not case sensitive.
• So, the following searches all yield exactly the same results:
  – disney fantasyland pirates
  – Disney Fantasyland Pirates
  – DISNEY FANTASYLAND PIRATES
  – DiSnEy FaNtAsYlAnD pIrAtEs

Source: http://www.google.com/help/basics.html

Page 18

Google’s 10 Word Limit

• Google didn’t accept more than 10 keywords at a time.
• Any keyword past 10 was ignored.
• Why search for a long list of words?
• Baroni & Bernardini (2004) wanted to find new Italian medical terms. They had a list of 100+ known medical terms and used Google to find docs containing some of them, then examined the new words in these docs…

Page 19

Searching for more than 10

• How could you get around this limit?
  – Stopwords
  – Wildcards
  – BootCat and Google API

Page 20

Stop Words

To enhance the speed and relevancy of your Web search, Google routinely and automatically ignores common words and characters known as “stop words.”

Source: http://www.google.com/press/guide/reviewguide_7.html

Page 21

Stop _ _ Name _ Love

• This is certainly not a canonical list, but here are 28 stop words I know about.

• a, about, an, and, are, as, at, be, by, from, how, i, in, is, it, of, on, or, that, the, this, to, we, what, when, where, which, with

• You can force Google to search for a stop word by putting a + in front of it, for example pirates +of +the caribbean

• Compare knowledge of management with knowledge +of management

Source: 10/23/02 post by Bill Todd to news:google.public.support.general
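The stop-word behavior described above can be mimicked in a few lines. This is only a sketch: it uses the 28 words from this slide (not Google's real list, which I don't have), and the function name is hypothetical.

```python
# The 28 stop words listed on this slide; not a canonical list.
STOP_WORDS = {
    "a", "about", "an", "and", "are", "as", "at", "be", "by", "from",
    "how", "i", "in", "is", "it", "of", "on", "or", "that", "the",
    "this", "to", "we", "what", "when", "where", "which", "with",
}

def effective_keywords(query):
    """Return the keywords that survive: stop words are dropped unless
    they were forced with a leading +."""
    kept = []
    for term in query.lower().split():
        if term.startswith("+"):
            kept.append(term[1:])  # forced: keep it, minus the +
        elif term not in STOP_WORDS:
            kept.append(term)
    return kept

plain = effective_keywords("pirates of the caribbean")
forced = effective_keywords("pirates +of +the caribbean")
```

Under these assumptions the plain query boils down to just pirates and caribbean, while the + version keeps all four words.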

Page 22

Dealing with the 10 Word Limit

• Omit the stop words in your search terms and you’ll probably never run into the 10 word limit.

• 2 other ways around the limit: use wildcards; use BootCat and Google API

Image source: http://www.alloyd.com/

Page 23

Google and Wildcards

• Google doesn’t support stemming.
• Rather, Google offers full-word wildcards.
• For example, if you search Google for it’s +a * world, Google shows you all of the pages in its database that contain the phrase “it’s a small world” … and “it’s a nano world” … and “it’s a Linux world” … and so on.

Source: Google Hacks, p. 37

Page 24

it’s +a * world

• The + before a is required because it is a stop word and would otherwise be ignored.

• Most of the hits are phrases because that’s what Google looks for first.

• Oh, and I defy you to get that song out of your head!

Image source: http://themeparksource.com/
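One way to picture a full-word wildcard is as a regular expression in which each * stands for exactly one word. This is my own analogy, not how Google implements it, and the helper name is hypothetical.

```python
import re

def wildcard_to_regex(phrase):
    """Turn a phrase with full-word * wildcards into a compiled regex.
    Each * matches exactly one word; a leading + is stripped."""
    parts = []
    for term in phrase.split():
        term = term.lstrip("+")
        parts.append(r"\w+" if term == "*" else re.escape(term))
    return re.compile(r"\b" + r"\s+".join(parts) + r"\b", re.IGNORECASE)

pattern = wildcard_to_regex("it's +a * world")

hits = [
    "It's a small world after all",
    "Face it, it's a nano world now",
    "It's a Linux world",
]
miss = "it's a very small world"  # two words where the * allows only one
```

Because the wildcard covers whole words only, it cannot do stemming: there is no way to write something like pirat* and catch both pirate and pirates.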

Page 25

Wildcards and the Word Limit

• Remember when I said that one way to get around the 10 word limit was to use wildcards?

• Google doesn’t count wildcards toward the limit.

• For example, Google thinks that though * mountains divide * * oceans * wide it's * small world after all is exactly 10 words long.

Source: Google Hacks, p. 19
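A quick sketch of the counting rule on this slide, assuming the only thing that matters is that * never counts toward the limit. Stop words are deliberately left out of this sketch, and the function name is hypothetical.

```python
def apply_word_limit(query, limit=10):
    """Split a query into (words that count, words past the limit).
    Wildcards (*) are skipped when counting."""
    counted = [term for term in query.split() if term != "*"]
    return counted[:limit], counted[limit:]

with_wildcards = ("though * mountains divide * * oceans * wide "
                  "it's * small world after all")
without_wildcards = ("though the mountains divide and the oceans are wide "
                     "it's a small world after all")

kept, ignored = apply_word_limit(with_wildcards)
kept_plain, ignored_plain = apply_word_limit(without_wildcards)
```

With the wildcards in place, exactly 10 words count and nothing is ignored. Spelled out in full, the same lyric is 15 words long, so the last five would fall off the end.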

Page 26

BootCat and Google API

• How can I find documents with (some of) a LONG list of keywords?

• E.g. a list of medical terms: docs with (some of) these may include NEW medical terms I didn’t know

• BootCat (Baroni & Bernardini 2004) generates random sub-lists (up to 10 keywords from the list), calls Google API to find docs, downloads to collect a CORPUS (Web-as-Corpus research)
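The sub-list step can be sketched as follows. This is not BootCat's actual code: the function name, the fixed sub-list size, and the term list are assumptions, and the search call and page download are left out entirely.

```python
import random

def random_sublists(terms, n_queries, max_terms=10, seed=None):
    """Draw n_queries random sub-lists from a long keyword list, each with
    at most max_terms distinct keywords (to respect the 10 word limit)."""
    rng = random.Random(seed)
    size = min(max_terms, len(terms))
    return [rng.sample(terms, size) for _ in range(n_queries)]

# Hypothetical stand-ins for a list of 100+ known medical terms.
terms = [f"term{i}" for i in range(120)]
queries = random_sublists(terms, n_queries=5, seed=0)
query_strings = [" ".join(q) for q in queries]
```

Each query string would then be sent to the search engine, and the pages that come back would be downloaded into the corpus.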

Page 27

The Order of Your Keywords Matters

• When you conduct a search at Google, it searches for
  – Phrases, then
  – Adjacency, then
  – Weights.
• Because Google searches for phrases first, the order of your keywords matters.

Image source: Google

Source: Google Hacks, p. 20-22

Page 28

For Example

A search for disney fantasyland pirates yields the same number of hits as a search for fantasyland disney pirates, but the order of those hits – especially the first 10 – is noticeably different.
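Here is a small sketch of why order could matter, assuming the phrase check looks for the keywords side by side in exactly the order typed. That assumption, the function name, and the sample page text are mine.

```python
import re

def phrase_hit(text, keywords):
    """True if the keywords appear consecutively, in the order given."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    target = [k.lower() for k in keywords]
    n = len(target)
    return any(words[i:i + n] == target for i in range(len(words) - n + 1))

# Hypothetical page text.
page = "Disney Fantasyland pirates map and guide"

as_typed = phrase_hit(page, ["disney", "fantasyland", "pirates"])
reordered = phrase_hit(page, ["fantasyland", "disney", "pirates"])
```

The page contains all three words either way, so it would be a hit for both searches, but only the first word order earns it a phrase score. That would explain the same number of hits arriving in a different order.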

Page 29

Part Two: In Summary

• You can search with several keywords.
• Capitalization does not matter.
• Google has a hard limit of 10 keywords PER QUERY.
• Google ignores a BUNCH of common words – STOPLIST.
• Google does support wildcard searches … sort of.
• BootCat takes a LONG list of keywords and sends subsets to Google API, to gather a CORPUS of matching web-pages.
• The order of your keywords matters.

Page 30

Part Three: Advanced Search Operators

Beyond plusses, minuses, ANDs, ORs, quotes, and *s

Page 31

filetype:

• filetype: restricts your results to files ending in ".doc" (or .xls, .ppt, etc.), so you see only files created with the corresponding program.

• There can be no space between filetype: and the file extension.

• The “dot” in the file extension – .doc – is optional.

Source: http://www.google.com/help/faq_filetypes.html

Page 32

filetype:extension

pirates filetype:pdf

pirates -filetype:pdf
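If you build searches in a script, the two rules from the previous slide (no space after the colon, optional dot) are easy to bake in. A minimal sketch: the helper names are hypothetical, and the search URL shape is my assumption rather than something from these slides.

```python
from urllib.parse import urlencode

def filetype_query(keywords, extension, exclude=False):
    """Build a query that keeps (or, with exclude=True, drops) one file type."""
    ext = extension.lstrip(".")  # the leading dot is optional
    operator = "-filetype:" if exclude else "filetype:"
    return f"{keywords} {operator}{ext}"  # no space after the colon

def search_url(query):
    # Assumed URL shape, shown only to illustrate the encoding of the colon.
    return "https://www.google.com/search?" + urlencode({"q": query})

only_pdf = filetype_query("pirates", "pdf")
no_pdf = filetype_query("pirates", ".pdf", exclude=True)
url = search_url(only_pdf)
```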

Page 33

intitle:

• intitle: restricts the results to documents containing a particular word in their titles.

• There can be no space between intitle: and the following word.

• You can also search for phrases. Just put your phrase in quotes.

Source: http://www.google.com/help/operators.html

Page 34

Title?

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<title>
Pirates of the Caribbean
</title>
</head>
<body> ...

Page 35

intitle:terms

intitle:pirates

intitle:"knowledge management"

Page 36

A Quick Question

• What would happen if I searched for intitle:walt disney (without the quotes)?

• Google would look for every page with the word walt in its title AND the word disney somewhere in its body.

• Remember, the quotes are kind of important if you want to search for phrases using intitle:
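The difference the quotes make shows up clearly if you split the query the way a parser might. This sketch leans on Python's `shlex` to handle the quotes; the function name is hypothetical, and it is not a description of Google's real parser.

```python
import shlex

def split_intitle(query):
    """Return (title_terms, other_terms) for a query using intitle:."""
    title_terms, other_terms = [], []
    for token in shlex.split(query):  # keeps a quoted phrase in one token
        if token.lower().startswith("intitle:"):
            title_terms.append(token[len("intitle:"):])
        else:
            other_terms.append(token)
    return title_terms, other_terms

unquoted = split_intitle('intitle:walt disney')
quoted = split_intitle('intitle:"walt disney"')
```

Without the quotes, only walt is tied to the title and disney is left as an ordinary keyword. With the quotes, the whole phrase walt disney has to appear in the title.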

Page 37

site:

• site: restricts the results to pages from a particular website or domain.

• There can be no space between site: and the domain.

Source: http://www.google.com/help/operators.html

Page 38

site:domain

pirates site:disney.com

pirates site:uk

pirates site:cn

pirates site:www.comp.leeds.ac.uk

ramadan site:www.comp.leeds.ac.uk/nora/html
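The examples above go from a whole country code down to one folder on one server. A rough sketch of that matching, assuming site: compares the end of the host name and, if given, the start of the path. The function name and the sample URLs are hypothetical.

```python
from urllib.parse import urlsplit

def matches_site(url, site):
    """True if url falls under the domain (and optional path) in site."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    site_host, _, site_path = site.partition("/")
    site_host = site_host.lower()
    # Match the exact host or any subdomain of it.
    host_ok = host == site_host or host.endswith("." + site_host)
    # If site includes a path, the URL's path must start with it.
    path_ok = not site_path or parts.path.startswith("/" + site_path)
    return host_ok and path_ok

checks = {
    "company domain": matches_site("http://www.disney.com/parks", "disney.com"),
    "country code": matches_site("http://www.comp.leeds.ac.uk/", "uk"),
    "folder": matches_site("http://www.comp.leeds.ac.uk/nora/html/index.html",
                           "www.comp.leeds.ac.uk/nora/html"),
    "other folder": matches_site("http://www.comp.leeds.ac.uk/eric/",
                                 "www.comp.leeds.ac.uk/nora/html"),
    "lookalike": matches_site("http://notdisney.com/", "disney.com"),
}
```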

Page 39

Some more operators

Query modifiers
• daterange:
• filetype:
• inanchor:
• intext:
• intitle:
• inurl:
• site:

Alternative query types
• cache:
• link:
• related:
• info:

Other information needs
• phonebook:
• stocks:

Page 40

More Google Services

http://www.google.com/
And see “more”… e.g.

http://code.google.com/ Open-source code, API (e.g. Leeds FYPs!)
http://labs.google.com/ Latest research; “graduates” include:
http://scholar.google.com/ Find research papers (e.g. background survey)
http://books.google.com/ Find quotes in books
http://maps.google.com/ Maps, routefinder…

Page 41

http://www.google.com/help/

• Google Help Central

• Free guides and FAQs that tell you about Web searching in general and Google’s features in particular.

Page 42

Our Goals

• Learn how Google really works.
• Discover some Google secrets no one ever tells you.
• Play around with some of Google’s advanced search operators.
• Find out where to get more Google-related help and information.
• DO ALL OF THIS IN ENGLISH!