Steve Cassidy Computing at Macquarie No 1
Searching The Web
Steve Cassidy
Centre for Language Technology
Department of Computing
Macquarie University
Steve Cassidy Computing at Macquarie No 2
The First Web Page
Steve Cassidy Computing at Macquarie No 3
What is the Web?
• Documents, text, images, sound
• A web of hyperlinks
  – Link one (text) document to others
• Easy to join
  – Any Internet user can be a publisher
• Anarchic
  – No-one is in charge
• Very big
Steve Cassidy Computing at Macquarie No 4
The Problem
• Much of the information available is text-based
• Text is difficult to process by computers
• The popular use of computers and the Internet has increased the availability of text-based information
• Information Overload
Steve Cassidy Computing at Macquarie No 5
The Solution?
Only one of the top four commercial search engines finds itself
The best navigation should make it easy to find almost anything on the web (once all the data is entered)
The Web, 1997
Steve Cassidy Computing at Macquarie No 6
How do they work?
• Two major steps
  – Build an inverted index
  – Match query terms in the index
• Problems
  – The web is very big
  – Finding relevant documents
  – Avoiding false hits
Steve Cassidy Computing at Macquarie No 7
Inverted Index
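[Figure: three documents (D1, D2, D3) and the inverted index built from them. Each index entry maps a term to the documents that contain it.]
Documents:
D1: computer, software, information, language
D2: computer, library, retrieval
D3: computer, information, retrieval, filtering
Index entries (term → documents), for example:
computer → D1, D2, D3
software → D1
information → D1, D3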
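The inverted index on this slide can be sketched in a few lines of Python. This is a minimal illustration, not how any particular search engine implements it: the `DOCS` texts loosely follow the figure and are assumptions, and `build_inverted_index` is a hypothetical helper name.

```python
from collections import defaultdict

# Toy collection; the texts loosely follow the slide's figure and are
# illustrative assumptions, not data from a real crawl.
DOCS = {
    "D1": "computer software information language",
    "D2": "computer library retrieval",
    "D3": "computer information retrieval filtering",
}


def build_inverted_index(docs):
    """Map each term to the sorted list of document ids that contain it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        # Naive tokenisation: lowercase and split on whitespace.
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


index = build_inverted_index(DOCS)
for term in sorted(index):
    print(term, "->", ", ".join(index[term]))
```

The index is "inverted" because it goes from terms to documents, the reverse of the documents' own term lists.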
Steve Cassidy Computing at Macquarie No 8
Building the Index
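[Figure: the crawl loop]
List of web addresses → Download web page → Parse web page → Index
The parser sends the web page text to the index and adds new links to the list of web addresses.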
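The loop in this diagram (take an address from the list, download the page, parse it, index the text, queue the new links) might look like the sketch below. To keep it runnable without network access, the "web" is a hypothetical in-memory dictionary `FAKE_WEB`; the URLs, page contents, and the names `PageParser` and `crawl` are all assumptions for illustration. Only Python's standard `html.parser` module is used.

```python
from collections import deque
from html.parser import HTMLParser

# Hypothetical in-memory "web" so the sketch runs offline.
FAKE_WEB = {
    "http://a.example/": '<p>computer software <a href="http://b.example/">next</a></p>',
    "http://b.example/": '<p>information retrieval <a href="http://a.example/">back</a> '
                         '<a href="http://c.example/">more</a></p>',
    "http://c.example/": "<p>language technology</p>",
}


class PageParser(HTMLParser):
    """Collect the text words and outgoing links of one page."""

    def __init__(self):
        super().__init__()
        self.words = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_data(self, data):
        self.words.extend(data.lower().split())


def crawl(seed_urls, fetch):
    """Breadth-first crawl; returns an index of word -> set of URLs."""
    index = {}
    queue = deque(seed_urls)      # the "list of web addresses"
    seen = set(seed_urls)
    while queue:
        url = queue.popleft()
        html = fetch(url)         # "download web page"
        if html is None:
            continue
        parser = PageParser()     # "parse web page"
        parser.feed(html)
        parser.close()
        for word in parser.words:             # page text goes to the index
            index.setdefault(word, set()).add(url)
        for link in parser.links:             # new links go back on the list
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return index


crawl_index = crawl(["http://a.example/"], FAKE_WEB.get)
```

A real crawler would also need politeness delays, robots.txt handling, and URL normalisation, none of which are shown here.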
Steve Cassidy Computing at Macquarie No 9
Building the Index
<table width="70%" border="0" cellspacing="0" cellpadding="4"> <tr> <td style="background-color: #f1e1f1"> <a name="works"><b><font face="Arial, sans-serif">How Google Works </font></b></a></td> </tr>
</table>
<p><a name="howGoogleWorks">If you aren't interested in learning how Google creates the index and the database of documents that it accesses when processing a query, skip this description. I adapted the following overview from Chris Sherman and Gary Price's wonderful description of How Search Engines Work in Chapter 2 of <a ref="http://www.amazon.com/exec/obidos/tg/detail/-/091096551X/002-5190375-1505602">The Invisible Web</a> (CyberAge Books, 2001).</a><p><a name="fast"><a name="index">Google consists of three distinct parts, each of which is run on a distributed network of thousands of low-cost computers and can therefore carry out fast parallel processing. Parallel processing is a method of computation in which many calculations can be performed simultaneously, significantly speeding up data processing.</a></a>
Steve Cassidy Computing at Macquarie No 10
Using the Index
Query: computer software information
[Figure: each query term is looked up in the inverted index and its list of documents is retrieved.]
computer → D1, D2, D3
information → D1, D3
software → D1
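Looking up the query on this slide can be sketched as follows. The `INDEX` dictionary copies the three postings shown for the query terms; the two ways of combining them (require every term, or rank by how many terms match) are assumptions for illustration, since the slide does not say which rule is applied. The function names are hypothetical.

```python
# Postings for the three query terms, as shown on the slide.
INDEX = {
    "computer": {"D1", "D2", "D3"},
    "software": {"D1"},
    "information": {"D1", "D3"},
}


def lookup(index, query):
    """Return the posting set for each query term (empty if unknown)."""
    return {term: index.get(term, set()) for term in query.lower().split()}


def match_all(index, query):
    """Documents containing every query term (AND semantics)."""
    postings = list(lookup(index, query).values())
    if not postings:
        return set()
    return set.intersection(*postings)


def rank_by_matches(index, query):
    """Documents ordered by how many query terms they contain."""
    counts = {}
    for docs in lookup(index, query).values():
        for doc in docs:
            counts[doc] = counts.get(doc, 0) + 1
    # Most matching terms first; ties broken by document id.
    return sorted(counts, key=lambda d: (-counts[d], d))


query = "computer software information"
print(match_all(INDEX, query))
print(rank_by_matches(INDEX, query))
```

Only D1 contains all three terms, so it is the sole result under AND semantics and the top result under the counting rule.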
Steve Cassidy Computing at Macquarie No 11
Server Farm
http://www.microsoft.com/technet/archive/windows2000serv/plan/hiavsys.mspx
Over 10,000 computers
Each with a copy of the index
Steve Cassidy Computing at Macquarie No 12
Relevance
• Finding pages with search terms is easy
• Which ones are the best?
• Google:
  – Text in titles, headings is important
  – Text earlier in the page is important
  – Text of links to this page is important
  – Important pages link to other important pages
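The last point, that important pages link to other important pages, is the idea behind link-based ranking such as PageRank. The slide does not name an algorithm, so the following is only a simplified PageRank-style sketch under stated assumptions: the link graph `LINKS` is made up, and the damping factor of 0.85 and the fixed number of iterations are conventional illustrative choices.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by repeated redistribution of rank along links.

    `links` maps each page to the list of pages it links to.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank on equally to the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank evenly.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical link graph: "orphan" links out but nothing links to it.
LINKS = {
    "home": ["about", "news"],
    "about": ["home"],
    "news": ["home", "about"],
    "orphan": ["home"],
}

ranks = pagerank(LINKS)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(f"{page}: {ranks[page]:.3f}")
```

In this toy graph "home" ends up ranked highest because every other page links to it, while "orphan" ranks lowest because no page links to it.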
Steve Cassidy Computing at Macquarie No 13
Making the Most of Search Engines
• Use words likely to appear in the pages you want
• Use more query terms to narrow your result
• Be brief
• Don’t worry about spelling
• Use “words in quotes” to search for phrases
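Phrase search with quotes needs more than a plain inverted index, because the engine must know where each word occurs, not just which pages contain it. One common approach is a positional index; the sketch below assumes that approach for illustration (the slides do not describe how any engine does it), and the `PAGES` texts and function names are hypothetical.

```python
from collections import defaultdict

# Made-up pages; all three contain both "information" and "retrieval".
PAGES = {
    "P1": "information retrieval for the web",
    "P2": "retrieval of information from text",
    "P3": "web information retrieval systems",
}


def build_positional_index(pages):
    """Map term -> page id -> list of word positions in that page."""
    index = defaultdict(lambda: defaultdict(list))
    for page_id, text in pages.items():
        for position, term in enumerate(text.lower().split()):
            index[term][page_id].append(position)
    return index


def phrase_search(index, phrase):
    """Pages where the words of `phrase` appear consecutively, in order."""
    terms = phrase.lower().split()
    if not terms or any(term not in index for term in terms):
        return []
    hits = []
    for page_id, starts in index[terms[0]].items():
        for start in starts:
            # The i-th word of the phrase must sit i positions after the first.
            if all(start + i in index[term].get(page_id, ())
                   for i, term in enumerate(terms)):
                hits.append(page_id)
                break
    return sorted(hits)


positional = build_positional_index(PAGES)
print(phrase_search(positional, "information retrieval"))
```

Without quotes, all three pages match both words; as a phrase, only the pages where "information" is immediately followed by "retrieval" match.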
Steve Cassidy Computing at Macquarie No 14
Other Search Engines
• www.teoma.com
  – Offers ‘refine your search’
  – Subject-specific popularity
• www.ask.com
  – Natural language questions
• search.yahoo.com
Steve Cassidy Computing at Macquarie No 15
The Future
• Information Extraction
  – Find all the details of this conference for my diary
• Question Answering
  – When did Armstrong land on the moon?
• The Semantic Web
  – Exchanging machine-readable data
Steve Cassidy Computing at Macquarie No 16
Language Technology
• SLP148 Language, Logic and Computation
• COMP248 Language Technology
• COMP249 Web Technology
• COMP348 Document Processing and the Semantic Web
• COMP349 Spoken Language Dialogue Systems
Steve Cassidy Computing at Macquarie No 17
Questions?