Introduction to Search Engines

Post on 07-May-2015

Category:

Technology

DESCRIPTION

This presentation gives an introduction to search engines: what they are and how they work. It also has a brief introduction to Solr and Lucene.

TRANSCRIPT

ENTERPRISE SEARCH

an introduction

Web Search

Desktop Search

Enterprise Search

So what is a Search Engine?

a piece of SOFTWARE that

• builds an index on text

• answers queries using that index

Any search application has two major components

SEARCH component - of importance to the users

INDEXING component - of importance to us developers (read headache)

[Diagram: the two components]

INDEXING component: data is indexed into INDEX FILES

SEARCH component: user sends search query, receives search results

Let’s start with

INDEXING

is it easy to search here . . .

or here . . .

• that’s information piled up like garbage

• no structure

• comes in all kinds of shapes, sizes, formats

• And this is what indexing does:

• makes data available in a structured format that is easily accessible through search.

So what needs to be Indexed and Searched?

various FILE FORMATS

Text Files

HTML

PDF

MS Word

PPT

coming from various DATA SOURCES

Emails

CMS

File System

Database

Web Pages

[Diagram: indexing pipeline]

data (documents) → text that should be indexed → fed to Analyzer → tokenized text → Index Writer → INDEX FILES

user sends search query, receives search results

The Analyzer prepares the text by:

• removing stop words such as "a" or "the"

• converting all text to lowercase letters for case-insensitive searching

• stemming (a stemming algorithm reduces the words "fishing", "fished", "fish", and "fisher" to the root word "fish")
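
The analyzer steps can be sketched in a few lines of Python. This is only an illustration: the stop-word list, the suffix list, and the function names `stem` and `analyze` are assumptions made for the example, and the naive suffix stripping stands in for a real stemming algorithm.

```python
import re

# Assumed stop-word list; real analyzers ship much longer ones.
STOP_WORDS = {"a", "an", "the", "of", "is"}

# Naive suffix stripping, standing in for a real stemming algorithm.
SUFFIXES = ("ing", "ed", "er", "s")


def stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def analyze(text):
    """Lowercase, tokenize, drop stop words, then stem each token."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [stem(t) for t in tokens if t not in STOP_WORDS]


print(analyze("The fisher fished while fishing for a fish"))
# ['fish', 'fish', 'while', 'fish', 'for', 'fish']
```

The same analysis is normally applied to the query text too, so that a search for "Fishing" can match a document that says "fished".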

Document 1:Coffee isn't my cup of tea.

Document 2: Chocolate, men, coffee - some things are better rich.

INDEX

coffee - 1,2
cup - 1
tea - 1
chocolate - 2
men - 2
things - 2
better - 2
rich - 2
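
A minimal Python sketch of how such an index could be built from the two documents. The stop-word list is an assumption, picked so that the remaining terms match the ones on the slide, and `build_index` is a hypothetical helper name.

```python
import re
from collections import defaultdict

# Assumption: these stop words are chosen so the result matches the slide.
STOP_WORDS = {"isn't", "my", "of", "some", "are"}

DOCUMENTS = {
    1: "Coffee isn't my cup of tea.",
    2: "Chocolate, men, coffee - some things are better rich.",
}


def build_index(documents):
    """Map each term to the sorted list of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in re.findall(r"[a-z']+", text.lower()):
            if token not in STOP_WORDS:
                index[token].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


index = build_index(DOCUMENTS)
print(index["coffee"])  # [1, 2]
print(index["rich"])    # [2]
```

Looking up a term is then a single dictionary access, which is why answering a query from the index is much cheaper than scanning every document.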

And now the

SEARCH Component

[Diagram: data is indexed into INDEX FILES; user sends search query and receives search results]

[Diagram: search flow]

search terms → Search Request Terms → Spelling Index → Correct Search Terms + Incorrect Search Terms → Taxonomy → Search Terms + Related Terms from Taxonomy + Concept IDs → Search engine (INDEX) → Results’ Page

Search results with

1) Actual Location of the result

2) Rank

3) Details

4) Facet Categorization
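
The search flow can be approximated with a toy Python sketch: correct the spelling against the index vocabulary, expand the terms with related terms from a taxonomy, look them up, and rank by the number of matching terms. The index, the taxonomy, the similarity cutoff, and the ranking rule are all illustrative assumptions, not how any particular engine does it.

```python
import difflib

# Toy index and taxonomy; both are illustrative assumptions.
INDEX = {"coffee": [1, 2], "cup": [1], "tea": [1], "chocolate": [2], "rich": [2]}
TAXONOMY = {"coffee": ["tea"]}  # term -> related terms


def correct(term, vocabulary):
    """Return the closest known term, or the term itself if none is close."""
    matches = difflib.get_close_matches(term, vocabulary, n=1, cutoff=0.8)
    return matches[0] if matches else term


def search(query):
    """Correct spelling, expand with related terms, then rank by hit count."""
    terms = [correct(t, list(INDEX)) for t in query.lower().split()]
    expanded = terms + [r for t in terms for r in TAXONOMY.get(t, [])]
    scores = {}
    for term in expanded:
        for doc_id in INDEX.get(term, []):
            scores[doc_id] = scores.get(doc_id, 0) + 1
    # Highest score first; ties are broken by document ID.
    return sorted(scores, key=lambda d: (-scores[d], d))


print(search("cofee"))  # [1, 2]
```

Here the misspelled "cofee" is corrected to "coffee" and expanded with "tea", so document 1 (which contains both) ranks above document 2.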

introducing

LUCENE

Full-text search library

Open Source

Documents are collections of fields (Solr accepts them in XML format)

Can operate on its own or via Solr

Ways of storing the fields of any document:

Indexed means the field is searchable.

Stored means the content can be displayed in the search results, even if you choose not to make the field searchable. Example: “summary associated with a page”.

Tokenized means the field is run through an Analyzer, which converts the content into a sequence of tokens.
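
The three options can be modelled with a toy Python class. This is a stand-in for illustration only, not the Lucene API; `Field` and `index_terms` are hypothetical names, and splitting on whitespace replaces a real Analyzer.

```python
from dataclasses import dataclass


@dataclass
class Field:
    """Toy stand-in for a document field; not the Lucene API."""
    name: str
    value: str
    indexed: bool = True     # searchable
    stored: bool = True      # can be displayed in search results
    tokenized: bool = True   # run through an analyzer before indexing


def index_terms(field):
    """Terms that would go into the index for this field."""
    if not field.indexed:
        return []
    if field.tokenized:
        return field.value.lower().split()
    return [field.value]  # kept as a single exact-match term


summary = Field("summary", "An intro to Solr", indexed=False)
doc_id = Field("id", "DOC-42", tokenized=False)
title = Field("title", "Intro to Solr")

print(index_terms(summary))  # []
print(index_terms(doc_id))   # ['DOC-42']
print(index_terms(title))    # ['intro', 'to', 'solr']
```

The summary is stored but not indexed, so it can be shown with a result without ever matching a query. The identifier is indexed as one untokenized term, so it only matches exactly.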

introducing

SOLR

[Diagram: Solr sits on top of Lucene, which sits on top of the Index]

• open source

• handles index and query requests to Lucene via HTTP and XML (also JSON)

• manages document update, add and delete requests to Lucene

• straightforward schema and config files

• comprehensive HTML Admin Interfaces

• highly configurable

Adding Documents to SOLR

HTTP POST to /update

<add>
  <doc boost="2">
    <field name="type">05991</field>
    <field name="from">Apache Solr</field>
    <field name="subject">An intro...</field>
    <field name="category">search</field>
    <field name="category">lucene</field>
    <field name="body">Solr is a full...</field>
  </doc>
</add>
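
A hedged Python sketch of preparing such a request with the standard library. The URL is an assumption that combines the host and port from the select example with the /update path. `build_add_message` and `build_request` are hypothetical helper names. The request is only constructed, because sending it needs a running Solr server.

```python
import urllib.request
import xml.etree.ElementTree as ET

# Assumed local Solr URL: host and port from the select example, plus /update.
UPDATE_URL = "http://localhost:8983/solr/update"


def build_add_message(fields):
    """Build an <add><doc>...</doc></add> message from (name, value) pairs."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for name, value in fields:
        field = ET.SubElement(doc, "field", name=name)
        field.text = value
    return ET.tostring(add, encoding="unicode")


def build_request(xml_body, url=UPDATE_URL):
    """Prepare (but do not send) the HTTP POST carrying the XML message."""
    return urllib.request.Request(
        url,
        data=xml_body.encode("utf-8"),
        headers={"Content-Type": "text/xml; charset=utf-8"},
        method="POST",
    )


body = build_add_message([("category", "search"), ("category", "lucene")])
request = build_request(body)
print(body)
# Sending requires a running Solr server:
# urllib.request.urlopen(request)
```

Passing the fields as a list of pairs rather than a dictionary keeps repeated names such as "category", which is how the example above gives one document two categories.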

schema.xml: defines how fields are indexed and displayed

solrconfig.xml: defines cache size, faceted field type, request handler customization

Deleting Documents

• Delete by Id

<delete><id>05591</id></delete>

• Delete by Query (multiple documents)

<delete><query>manufacturer:microsoft</query></delete>
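
Both delete messages can be generated rather than written by hand. A small Python sketch follows; `delete_by_id` and `delete_by_query` are hypothetical helper names, and the resulting strings would be POSTed to /update like any other update message.

```python
import xml.etree.ElementTree as ET


def delete_by_id(doc_id):
    """Build the message that deletes one document by its ID."""
    delete = ET.Element("delete")
    ET.SubElement(delete, "id").text = doc_id
    return ET.tostring(delete, encoding="unicode")


def delete_by_query(query):
    """Build the message that deletes every document matching a query."""
    delete = ET.Element("delete")
    ET.SubElement(delete, "query").text = query
    return ET.tostring(delete, encoding="unicode")


print(delete_by_id("05591"))
# <delete><id>05591</id></delete>
print(delete_by_query("manufacturer:microsoft"))
# <delete><query>manufacturer:microsoft</query></delete>
```

Building the XML with a library instead of string concatenation means characters such as "&" or "<" inside a query are escaped automatically.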

Search Results

http://localhost:8983/solr/select?q=video&start=0&rows=2&fl=name,price

Default Parameters

param   default    description
q                  The query
start   0          Offset into the list of matches
rows    10         Number of documents to return
fl      *          Stored fields to return
qt      standard   Query type; maps to query handler
df      (schema)   Default field to search
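
As an illustration of how these parameters combine into a request URL, here is a small Python sketch. `select_url` is a hypothetical helper, and its keyword defaults simply mirror the table above.

```python
from urllib.parse import urlencode

# Base URL taken from the example request.
BASE_URL = "http://localhost:8983/solr/select"


def select_url(q, start=0, rows=10, fl="*"):
    """Build a select URL; the defaults mirror the parameter table."""
    params = {"q": q, "start": start, "rows": rows, "fl": fl}
    # Keep "," and "*" readable instead of percent-encoding them.
    return BASE_URL + "?" + urlencode(params, safe=",*")


print(select_url("video", rows=2, fl="name,price"))
# http://localhost:8983/solr/select?q=video&start=0&rows=2&fl=name,price
```

Paging through results is then a matter of increasing `start` by `rows` on each request.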

http://localhost:8983/solr/select?q=video&start=0&rows=2&fl=name,price

<response>
  <responseHeader>
    <status>0</status>
    <QTime>1</QTime>
  </responseHeader>
  <result numFound="16173" start="0">
    <doc>
      <str name="name">Apple 60 GB iPod with Video</str>
      <float name="price">399.0</float>
    </doc>
    <doc>
      <str name="name">ASUS Extreme N7800GTX/2DHTV</str>
      <float name="price">479.95</float>
    </doc>
  </result>
</response>
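
A client would typically parse this XML into plain values. The sketch below uses Python's standard library on the response shown above. `parse_results` is a hypothetical helper and assumes every document has the "name" and "price" fields requested through fl.

```python
import xml.etree.ElementTree as ET

RESPONSE = """<response>
  <responseHeader><status>0</status><QTime>1</QTime></responseHeader>
  <result numFound="16173" start="0">
    <doc>
      <str name="name">Apple 60 GB iPod with Video</str>
      <float name="price">399.0</float>
    </doc>
    <doc>
      <str name="name">ASUS Extreme N7800GTX/2DHTV</str>
      <float name="price">479.95</float>
    </doc>
  </result>
</response>"""


def parse_results(xml_text):
    """Return (total matches, [(name, price), ...]) from a select response."""
    root = ET.fromstring(xml_text)
    result = root.find("result")
    docs = []
    for doc in result.findall("doc"):
        name = doc.find("str[@name='name']").text
        price = float(doc.find("float[@name='price']").text)
        docs.append((name, price))
    return int(result.get("numFound")), docs


total, docs = parse_results(RESPONSE)
print(total)    # 16173
print(docs[0])  # ('Apple 60 GB iPod with Video', 399.0)
```

Note that numFound is the total number of matches (16173), while only rows=2 documents are actually returned in the response.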

[Diagram: Solr architecture]

Solr Core, built on Lucene: Config, Schema, Analysis, Caching, Concurrency, Replication, Update Handler

Admin Interface

Search requests hit here: HTTP Request Servlet → Standard Request Handler, DisjunctionMax Request Handler, Custom Request Handler → XML Response Writer

New documents are added here: Update Servlet → XML Update Interface
