A Specialised Search Engine for Neuroscience WebPages


Post on 02-Jan-2016


Page 1: A Specialised  Search Engine for Neuroscience WebPages

NeuroSearch

A Specialised Search Engine for Neuroscience WebPages

Fatma Y. ELDRESI (MPhil)
Systems Analysis / Programming Specialist, AGOCO
Part-time lecturer, University of Garyounis

[email protected]

Page 2: A Specialised  Search Engine for Neuroscience WebPages

Contents

• Introduction
• Components of NeuroSearch and its architecture
• Implementation
• Software lifecycle: (1) WebCrawler Engine, (2) Indexer Engine, (3) Query Engine, (4) Re-Crawler Engine (specialised crawler)
• Challenges
• Testing
• Conclusions

Page 3: A Specialised  Search Engine for Neuroscience WebPages

Introduction

What is a search engine?

A server, or a collection of servers, dedicated to indexing internet web pages, storing the results and returning lists of pages that match particular queries.

Conventional search engines generate their indexes in different ways:

• Google uses a spider
• Yahoo uses a directory
• “NeuroSearch” uses a spider together with the advance knowledge

Page 4: A Specialised  Search Engine for Neuroscience WebPages

Introduction (cont.)

Why is a specialised search engine needed?

• The Web has no centralised organisation and holds a huge, mixed collection of information.
• It is updated continuously and has no standard format.
• Pages are extensively linked.

Therefore, establishing standard measures for relevance is a very challenging task.

Defining the problem

In addition:
(1) Users face many challenges in choosing the relevant keywords.
(2) Professionals sometimes fail in their search and get disappointing results, because the retrieved pages are sometimes
    A. not related to, or
    B. different from, what they are looking for.

The objective

• Create a specialised search engine (i.e. one that uses advance knowledge) to read web documents.
• Index and update all the content on the local server.
• Answer the queries from the local database.
• Update the system at regular intervals.

Page 5: A Specialised  Search Engine for Neuroscience WebPages

Components of “NeuroSearch”

It has two components:
1. Search/Crawler Engine
2. Query Engine

Page 6: A Specialised  Search Engine for Neuroscience WebPages

Components explained

• Crawler Engine: Spider, Indexer, Re-crawler
• Query Engine: Retriever (query engine)

Page 7: A Specialised  Search Engine for Neuroscience WebPages

“NeuroSearch” Architecture Model

[Figure: architecture diagram. Components shown: Users, Search Engine Interface, Query Engine, Index, Indexer, Re-Crawler, WebCrawler, World Wide Web (WWW).]

Page 8: A Specialised  Search Engine for Neuroscience WebPages

Implementation and Case Study

• Creating the database as an Access database.

• Implementing all parts of “NeuroSearch” using the Java language and SQL.

Page 9: A Specialised  Search Engine for Neuroscience WebPages

NeuroSearch Database

[Diagram: the NeuroSearch database, with labelled areas for WebCrawler data, Indexer data, Re-crawler data, Query data and Advance Knowledge data.]

Page 10: A Specialised  Search Engine for Neuroscience WebPages

The advance knowledge: case study, Neuroscience (Vision)

Phase 1: NeuroSearch uses advance knowledge about neuroscience (vision) as a case study.

Phase 2: Data mining is then applied to the domain knowledge of vision to construct keywords and the relations between them.

Phase 3: This knowledge is stored in the database and categorised by numbers. Related knowledge is categorised too and stored in the database in network form.
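As a rough illustration of keywords categorised by numbers and linked in a network, the sketch below keeps both structures in memory. The class name AdvanceKnowledge, its methods and the category ids are assumptions made for this example; the slides do not show the actual Access tables.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical in-memory stand-in for the advance knowledge tables.
final class AdvanceKnowledge {
    // keyword -> numeric category id (ids are invented for this sketch)
    private final Map<String, Integer> categories = new HashMap<>();
    // keyword -> directly related keywords (the "network" of relations)
    private final Map<String, Set<String>> related = new HashMap<>();

    void addKeyword(String keyword, int category) {
        categories.put(normalise(keyword), category);
    }

    // Relations are stored in both directions, so the network is undirected.
    void relate(String a, String b) {
        related.computeIfAbsent(normalise(a), k -> new HashSet<>()).add(normalise(b));
        related.computeIfAbsent(normalise(b), k -> new HashSet<>()).add(normalise(a));
    }

    boolean contains(String keyword) {
        return categories.containsKey(normalise(keyword));
    }

    // Returns null when the keyword is unknown.
    Integer categoryOf(String keyword) {
        return categories.get(normalise(keyword));
    }

    Set<String> relatedTo(String keyword) {
        return related.getOrDefault(normalise(keyword), Collections.emptySet());
    }

    private static String normalise(String s) {
        return s.trim().toLowerCase(Locale.ROOT);
    }
}
```

In a database-backed version, the two maps would presumably correspond to a keyword table and a relation table.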

Page 11: A Specialised  Search Engine for Neuroscience WebPages

Software lifecycle

The Crawler Engine consists of:
1. WebCrawler/Spider Engine
2. Indexer Engine
3. Re-Crawler (specialised)

Page 12: A Specialised  Search Engine for Neuroscience WebPages

WebCrawler (Spider)

1) This web crawler is a general one that can download any kind of web page. It does so as follows.

2) It fetches a URL, retrieves all of its web pages and saves them to the local drive.

3) In addition, the WebCrawler has to get through the proxy firewall (e.g. on the Newcastle University LAN) before downloading any website.

4) The crawler performs a breadth-first search, which means it collects a list of all the links on the current page before it follows any of the links to a new page.
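The breadth-first order described here can be sketched with a FIFO queue. This is a minimal illustration, not the NeuroSearch code: the class name, the maxPages limit and the linksOf function (which stands in for downloading a page and extracting its links) are all assumptions.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

final class BreadthFirstCrawler {
    // Visits pages in breadth-first order, starting from the seed URL.
    // linksOf is a hypothetical hook: given a URL, it returns the links found on that page.
    static List<String> crawl(String seed, Function<String, List<String>> linksOf, int maxPages) {
        List<String> visited = new ArrayList<>();
        Set<String> seen = new HashSet<>();
        Deque<String> queue = new ArrayDeque<>();
        queue.add(seed);
        seen.add(seed);
        while (!queue.isEmpty() && visited.size() < maxPages) {
            String url = queue.remove();          // oldest URL first: FIFO gives breadth-first order
            visited.add(url);
            for (String link : linksOf.apply(url)) {
                if (seen.add(link)) {             // add() returns false for URLs already queued
                    queue.add(link);              // all links of this page are queued before any is followed
                }
            }
        }
        return visited;
    }
}
```

Passing the link extractor as a function keeps the traversal testable without network access; a real crawler would plug in the download-and-parse step there.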

Page 13: A Specialised  Search Engine for Neuroscience WebPages

WebCrawler: the real challenge

Challenge 1: Connecting to the WWW and accessing private websites.

Solution 1: The crawler first has to connect its socket to the proxy server.

Challenge 2: Connecting this socket onward to the WWW.

Solution 2: The GET method. A straightforward socket connection uses GET with just the file name. In this case, however, the GET command has to take the full URL.
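A minimal sketch of both solutions, assuming a plain HTTP proxy: the socket is opened to the proxy rather than to the target site, and the GET line carries the full URL. The class and method names, and the choice of HTTP/1.0 with a Host header, are assumptions for illustration; the proxy host and port are supplied by the caller.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.Socket;
import java.net.URI;
import java.nio.charset.StandardCharsets;

final class ProxyFetch {
    // Builds a request whose GET line holds the full URL, as a proxy expects,
    // instead of just the file name used when talking to the web server directly.
    static String proxyRequest(URI target) {
        return "GET " + target.toASCIIString() + " HTTP/1.0\r\n"
             + "Host: " + target.getHost() + "\r\n"
             + "Connection: close\r\n"
             + "\r\n";
    }

    // Connects to the proxy first (Solution 1), then asks it for the target URL (Solution 2).
    // Returns the raw response, headers included.
    static String fetchViaProxy(String proxyHost, int proxyPort, URI target) throws IOException {
        try (Socket socket = new Socket(proxyHost, proxyPort)) {
            OutputStream out = socket.getOutputStream();
            out.write(proxyRequest(target).getBytes(StandardCharsets.US_ASCII));
            out.flush();

            InputStream in = socket.getInputStream();
            ByteArrayOutputStream response = new ByteArrayOutputStream();
            byte[] buffer = new byte[4096];
            int n;
            while ((n = in.read(buffer)) != -1) {
                response.write(buffer, 0, n);
            }
            return response.toString("ISO-8859-1");
        }
    }
}
```

A call would look like ProxyFetch.fetchViaProxy(proxyHost, 8080, URI.create("http://example.org/")), where the proxy host and port 8080 are placeholders for the local network's settings.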

Page 14: A Specialised  Search Engine for Neuroscience WebPages

Indexer Engine

1) First, it searches the web page using its advance knowledge. The page is deleted if it is not related to the case study subject.

2) If the page is related to the case study subject (neuroscience), the indexer collects the following information from the document:

3) All the keywords it contains, how many times each is repeated, the title and the contents. It then saves them in the database for later display in the query results and for other calculations.

4) The ranking method.
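The filtering and counting steps might look like the sketch below. It assumes the advance knowledge is available as a set of lower-case single-word keywords and treats a page as related when at least one keyword occurs; the slides do not state the actual threshold or tokenisation, so both are assumptions, as are the names used.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

final class KeywordIndexer {
    // Counts how often each known keyword occurs in the page text.
    // Keywords are assumed to be lower-case single words.
    static Map<String, Integer> countKeywords(String pageText, Set<String> keywords) {
        Map<String, Integer> counts = new TreeMap<>();
        // Split on anything that is not a letter or digit, after lower-casing.
        for (String token : pageText.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (keywords.contains(token)) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Assumed rule: a page with no keyword at all is unrelated and can be discarded.
    static boolean isRelated(Map<String, Integer> counts) {
        return !counts.isEmpty();
    }
}
```

The resulting keyword counts are what would be written to the index database, together with the page title and contents.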

Page 15: A Specialised  Search Engine for Neuroscience WebPages

Query Engine

• It has an interface to accept keywords from the user.

• It gives the user two choices: display only the most relevant results, or display the whole result set, which includes the related results.

• It searches for the query keywords in the index database and returns the results in HTML format.
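The two display choices can be sketched as a single lookup with a flag. Here the index is an in-memory map from keyword to per-URL occurrence counts, and results are ordered by summed count; both the data layout and that ordering rule are assumptions for illustration, since the slides do not describe the ranking formula.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class QuerySketch {
    // index:   keyword -> (url -> occurrence count)
    // related: keyword -> related keywords from the advance knowledge
    static List<String> search(Map<String, Map<String, Integer>> index,
                               Map<String, Set<String>> related,
                               String keyword, boolean includeRelated) {
        List<String> terms = new ArrayList<>();
        terms.add(keyword);
        if (includeRelated) {
            terms.addAll(related.getOrDefault(keyword, Collections.emptySet()));
        }
        Map<String, Integer> scores = new HashMap<>();
        for (String term : terms) {
            for (Map.Entry<String, Integer> hit : index.getOrDefault(term, Collections.emptyMap()).entrySet()) {
                scores.merge(hit.getKey(), hit.getValue(), Integer::sum);
            }
        }
        List<String> urls = new ArrayList<>(scores.keySet());
        // Highest score first; ties are broken alphabetically to keep the order stable.
        urls.sort((a, b) -> {
            int byScore = Integer.compare(scores.get(b), scores.get(a));
            return byScore != 0 ? byScore : a.compareTo(b);
        });
        return urls;
    }

    // Renders the result list as a small HTML fragment.
    static String toHtml(List<String> urls) {
        StringBuilder html = new StringBuilder("<ol>\n");
        for (String url : urls) {
            String safe = url.replace("&", "&amp;").replace("<", "&lt;")
                             .replace(">", "&gt;").replace("\"", "&quot;");
            html.append("  <li><a href=\"").append(safe).append("\">")
                .append(safe).append("</a></li>\n");
        }
        return html.append("</ol>").toString();
    }
}
```

Calling search with includeRelated set to false corresponds to the "most relevant only" choice; setting it to true widens the result set to pages that match related keywords.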

Page 16: A Specialised  Search Engine for Neuroscience WebPages

Query result: this is indeed an edge over other conventional search engines.

Page 17: A Specialised  Search Engine for Neuroscience WebPages

Re-Crawling

1- The re-crawler is a WebCrawler specialised in any subject created in the advance knowledge in the database. It achieves this by reading the URLs from the index database using SQL.

2- Its interface lets special users decide whether to continue crawling a website or cancel it.

3- This part of the software aims to update the index with newly found links. This makes it easier to search and crawl websites related to any “advance knowledge” subject.

Page 18: A Specialised  Search Engine for Neuroscience WebPages

Testing phase

The test phase requires comparing the first 10 ranked results of each “NeuroSearch” query with the first 10 results of the same query on another search engine such as Google.

Query categories (20 tests for each category):

• general keywords
• specific keywords
• abbreviation keywords
• combined keywords
• abbreviation & combined keywords

Total of 1000 tests

Page 19: A Specialised  Search Engine for Neuroscience WebPages

Testing (cont.)

Ranking query test results in General Keywords:

First 10 | Google                  | NeuroSearch
results  | Rank  Keyword  Repeated | Rank  Keyword  Repeated  Related-keyword  Repeated  Quality/percentage
1        | 0     0        0        | 10    1        3         53               3         37%
2        | 10    1        3        | 10    1        3         51               3         27%
3        | 0     0        0        | 10    1        3         37               3         36%
4        | 0     0        0        | 10    1        3         37               3         33.6%
5        | 0     0        0        | 10    1        3         34               3         36.7%
6        | 0     0        0        | 10    1        3         29               3         38.4%
7        | 0     0        0        | 10    1        3         28               3         38.1%
8        | 0     0        0        | 10    1        3         28               3         38%
9        | 0     0        0        | 10    1        3         28               3         24.9%
10       | 0     0        0        | 10    1        3         28               3         13.8%

Average %: 10% 10% 100% 100%

Table 1: (Query 1) Ranking query test result in General Keywords: (Eye)

Page 20: A Specialised  Search Engine for Neuroscience WebPages

Testing (cont.)

[Chart 1: Average of keywords performance for the category-based test results of Google. Chart title: The average ranking performance, engine query test results (category based). Y axis: ranking performance, 0 to 100. X axis: categories 1 to 5. Error bars = +/- 1 standard deviation. Bar values shown: 6.33, 36.66, 1.99, 48.99, 80.96.]

[Chart 2: Average of keywords performance for the category-based test results of NeuroSearch. Chart title: The average keyword performance, engine query test results (category based). Y axis: ranking performance, 0 to 100. X axis: categories 1 to 5. Error bars = +/- 1 standard deviation. Bar values shown: 92.33, 88.49, 92.99, 79.49, 98.16.]

Page 21: A Specialised  Search Engine for Neuroscience WebPages

Analysing the search engines' ranking results by category

Independent Samples T-Test: Google Search Engine * NeuroSearch Search Engine

Category                                        | T-value | Sig. (2-tailed) | df (degrees of freedom) | Result
General keywords                                | -16.920 | .000            | 9                       | Statistically significant
Specific keywords                               | -4.394  | .000            | 19                      | Statistically significant
Abbreviation keywords                           | -63.50  | .000            | 19                      | Statistically significant
Combined keywords                               | -3.387  | .003            | 19                      | Statistically significant
Abbreviations, combined and specific keywords   | -2.904  | .009            | 19                      | Statistically significant

Table 4. The Average Ranking Engines Performance Query test results Category based

Page 22: A Specialised  Search Engine for Neuroscience WebPages

Analysing the Average Ranking Engines Performance Query test results (category based)

t-test result analysis

• The t-test is used to compare two groups' scores on the same variable.

• The differences are statistically significant in every category (p value < .05).

• That indicates that NeuroSearch has a statistically significantly higher mean score across all categories' ranking results (100) than Google (52.35).

• The negative t-values show that Google's mean score is lower than NeuroSearch's in every category. The sign reflects the order in which the two engines are compared, not an inverse relationship between their results.
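For readers who want to see where the sign of a t-value comes from, here is a sketch of the textbook pooled-variance t statistic for two independent samples. It is not the statistics package output used for the slides, and the class and method names are assumptions; any scores passed to it in an example are made up, not the study's data.

```java
final class TTestSketch {
    static double mean(double[] xs) {
        double sum = 0.0;
        for (double x : xs) {
            sum += x;
        }
        return sum / xs.length;
    }

    // Unbiased sample variance (divides by n - 1).
    static double sampleVariance(double[] xs) {
        double m = mean(xs);
        double sumSquares = 0.0;
        for (double x : xs) {
            sumSquares += (x - m) * (x - m);
        }
        return sumSquares / (xs.length - 1);
    }

    // Pooled-variance (Student) t statistic for two independent samples.
    // The result is negative whenever the first sample has the lower mean.
    static double tStatistic(double[] first, double[] second) {
        int n1 = first.length;
        int n2 = second.length;
        double pooledVariance =
            ((n1 - 1) * sampleVariance(first) + (n2 - 1) * sampleVariance(second)) / (n1 + n2 - 2);
        double standardError = Math.sqrt(pooledVariance * (1.0 / n1 + 1.0 / n2));
        return (mean(first) - mean(second)) / standardError;
    }
}
```

Swapping the two arguments flips the sign of the statistic and leaves its magnitude unchanged, which is why a negative value only says which group scored lower on average.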

Page 23: A Specialised  Search Engine for Neuroscience WebPages

Visual representation

[Chart 3: Average of category-based engines' ranking performance. Chart title: Average ranking engines performance, queries based. Axis: ranking performance, 0 to 100. Values: Google 52.35, NeuroSearch 100.]

[Chart 4: Average of the keyword-based results in the documents of the query test results (category-based query), engines' performance. Chart title: Average keywords engines performance, queries based. Axis: average of keywords, 0 to 100. Series: Google, NeuroSearch. Values shown: 90.29 and 34.98.]

Page 24: A Specialised  Search Engine for Neuroscience WebPages

Conclusion

• Although the “NeuroSearch” search engine uses a simple algorithm to judge page quality compared with other conventional search engines,

• “NeuroSearch” proves to be very powerful in obtaining relevant results,

• particularly if its advance knowledge is built by a specialist (domain knowledge), e.g. oil, medical, arts, etc.

Page 25: A Specialised  Search Engine for Neuroscience WebPages

Reference (example..)

Wandell, Brian A. Foundations of Vision. Sunderland, Massachusetts, USA, 1995.

Brin, S. and L. Page. The Anatomy of a Large-Scale Hypertextual Web Search Engine. The Seventh Annual International WWW Conference; Computer Science, Stanford University, Stanford, CA 94305, USA, 1998.

Page 26: A Specialised  Search Engine for Neuroscience WebPages

Ready for Questions!!!