Document Search Optimization using Lucene API

Harsha Ummerpillai ([email protected])

Shyam Gedela ([email protected])

Michigan State University

11/13/2014


At KD 2013 we presented MSU’s approach to improving the performance of Rice document search using the Lucene API. We are back this year to share the lessons learned and the results of our implementation.

Background

Kuali Days 2014, Indianapolis

Introduction

• Background
• Goals for Lucene implementation
• Technical Recap
• Implementation
• Performance Results
• How to
• Demo


Background

• Document Search - Why is it important
• MSU implementation
  – Go Live - Jan 1 2011
    • Rice 2.1.9
    • KFS 5.0.x
    • KMM 1.x
    • OOI 1.x
  – ~4 yrs of operation
  – ~4 million documents
  – ~50 million search attributes


Goals

• Goals for Lucene implementation
  – Fast – Improved and consistent response times
  – Configurable – can be enabled/disabled using configuration
  – Seamless – No change to user screen
  – Scalable
  – Customizable


Technical Details Recap


Document search

• Client applications define searchable attributes in the Data Dictionary.
• Rice extracts the attributes and builds the index, saving key-value pairs into the DB.
• Attributes are saved into 4 different tables based on data type.
• Existing structure – 1 document to n indexed records
• Standard searchable fields
  – Status codes
  – Initiator
  – Approver
  – Action dates
• Custom attributes defined by document types
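The "1 document to n indexed records" structure can be illustrated with a small sketch. It is not Rice code: the class, the record type, and the four data types (string, long, float, date-time) are assumptions chosen for illustration, since the slide only says that attributes are split across four tables by data type.

```java
import java.math.BigDecimal;
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: none of these names are Rice classes.
final class SearchAttributeSketch {

    // One bucket per data type, standing in for the four index tables (assumed types).
    enum AttrType { STRING, LONG, FLOAT, DATETIME }

    // One indexed record: a key-value pair tied to its document.
    static final class IndexRecord {
        final String documentId;
        final String key;
        final Object value;

        IndexRecord(String documentId, String key, Object value) {
            this.documentId = documentId;
            this.key = key;
            this.value = value;
        }
    }

    // Picks the bucket for a value; anything unrecognized is treated as a string.
    static AttrType typeOf(Object value) {
        if (value instanceof Long || value instanceof Integer) return AttrType.LONG;
        if (value instanceof Double || value instanceof Float || value instanceof BigDecimal) return AttrType.FLOAT;
        if (value instanceof LocalDate) return AttrType.DATETIME;
        return AttrType.STRING;
    }

    // Turns one document's attributes into n records, grouped by data type.
    static Map<AttrType, List<IndexRecord>> extract(String documentId, Map<String, Object> attributes) {
        Map<AttrType, List<IndexRecord>> byType = new EnumMap<>(AttrType.class);
        for (AttrType t : AttrType.values()) {
            byType.put(t, new ArrayList<>());
        }
        for (Map.Entry<String, Object> e : attributes.entrySet()) {
            byType.get(typeOf(e.getValue())).add(new IndexRecord(documentId, e.getKey(), e.getValue()));
        }
        return byType;
    }
}
```

With this layout, a search on several attributes has to join or union across the per-type buckets, which is the cost the Lucene index is meant to avoid.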


Document search


Rice Doc Search


Rice Doc Index w Lucene


Rice Doc Index w Lucene


Key aspects of the implementation

Implementation


Technical features

• Documents are queued for Lucene indexing with four separate stages
  – WAIT_FOR_REALTIME("0")
  – READY_FOR_REALTIME("1")
  – WAIT_FOR_MASTER("2")
  – READY_FOR_MASTER("3")
• Two indexes: master and real-time
• Master refreshed 3 times a day
• Real-time index refreshed every 5 seconds
• Single master node writes index to shared file storage
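The four queue stages read like constants of a Java enum that carries a string code. A minimal sketch of such an enum follows; the constant names and codes are taken from the slide, while the enum name `LuceneIndexStage`, the `getCode` accessor, and the `fromCode` lookup are assumptions added for illustration and may not match the patch.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch only: constant names and codes come from the slide; the rest is assumed.
enum LuceneIndexStage {
    WAIT_FOR_REALTIME("0"),
    READY_FOR_REALTIME("1"),
    WAIT_FOR_MASTER("2"),
    READY_FOR_MASTER("3");

    // Lookup table from the stored code back to the enum constant.
    private static final Map<String, LuceneIndexStage> BY_CODE = new HashMap<>();
    static {
        for (LuceneIndexStage stage : values()) {
            BY_CODE.put(stage.code, stage);
        }
    }

    private final String code;

    LuceneIndexStage(String code) {
        this.code = code;
    }

    // The value that would be stored in the queue table for this stage.
    String getCode() {
        return code;
    }

    // Resolves a stored code; rejects anything outside the four known stages.
    static LuceneIndexStage fromCode(String code) {
        LuceneIndexStage stage = BY_CODE.get(code);
        if (stage == null) {
            throw new IllegalArgumentException("Unknown stage code: " + code);
        }
        return stage;
    }
}
```

Storing a short code rather than the constant name keeps the queue table compact and lets the names change without a data migration.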


Auto warming

Index Readers are auto warmed and queued in all nodes
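One common way to auto warm a reader is to open the new reader, run warm-up work against it, and only then swap it in for live searches. The sketch below shows that pattern with a hypothetical `WarmedReaderHolder` that is generic over the reader type so it runs without Lucene on the classpath. Lucene itself provides `SearcherManager` and `SearcherFactory` for this purpose; the slides do not say whether the MSU patch uses them.

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Consumer;
import java.util.function.Supplier;

// Hypothetical helper: warm a freshly opened reader before exposing it to searches.
final class WarmedReaderHolder<R> {
    private final AtomicReference<R> active = new AtomicReference<>();
    private final Supplier<R> opener;   // opens a reader on the latest index
    private final Consumer<R> warmer;   // runs warm-up work, e.g. a few typical queries

    WarmedReaderHolder(Supplier<R> opener, Consumer<R> warmer) {
        this.opener = opener;
        this.warmer = warmer;
    }

    // Opens and warms a new reader, then publishes it atomically.
    // Returns the previous reader (or null) so the caller can close it.
    R refresh() {
        R fresh = opener.get();
        warmer.accept(fresh);
        return active.getAndSet(fresh);
    }

    // The reader that searches should use right now; null before the first refresh.
    R current() {
        return active.get();
    }
}
```

Because the swap happens after warming, searches never pay the cold-start cost of a new reader; each node would run its own holder against the shared index.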


Index Storage

• Directory structure within Lucene Index store
  – temp: Storage location before merge into active index
  – meta-info: Index stats and message files


Results


Performance Test Scenarios

• 7 business scenarios
• Invaluable for daily operations
  – E.g. how many payment requests are department approved but have not been extracted by PDP (Vendors not paid)


Performance Charts - Comparison

[Chart: comparison of "No Lucene" and "Lucene" for the scenarios ACCT, Approver, PO, REQS, PCDO, PREQ, and CM; axis scale 0 to 350000, units not given in the transcript.]


We have created an open contribution JIRA, CONTRIB-95, and are happy to provide the latest fixes and patches from our production system.

How To


How to guide

• Visit contribution JIRA https://jira.kuali.org/browse/CONTRIB-95
• Download and apply the patch file to the Rice (base version 2.1.9) workspace.
• Add Lucene configuration properties to the Rice application configuration file.
• Set up the shared file store location where the index will be saved and shared.
• Add the Lucene index queue table using lucene-setup.sql.
• Build and start the Rice application with Lucene configuration enabled.
• Visit “Administration > Lucene Administration” and click “Build Master Index”.
• Click the refresh link to see the status; when the index.ready file is listed, the master index is ready for use.
• Create a document and check that it is available in search; if the real-time indexer is working, the new document should appear in search results within 5 to 10 seconds.
• Use the administration page to see the latest status and manage the index.
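The readiness check in the steps above could also be scripted rather than done by clicking refresh. The sketch below polls a directory for the `index.ready` marker file. The class name `IndexReadyCheck` is hypothetical, and which directory of the shared file store holds the marker is an assumption to be verified against the actual installation.

```java
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper: polls for the index.ready marker mentioned in the how-to steps.
final class IndexReadyCheck {

    // True once a regular file named index.ready exists in the given directory.
    // Which directory holds the marker is an assumption, not taken from the slides.
    static boolean isMasterIndexReady(Path dir) {
        return Files.isRegularFile(dir.resolve("index.ready"));
    }

    // Checks up to `attempts` times, sleeping between checks, then once more at the end.
    static boolean waitUntilReady(Path dir, int attempts, long sleepMillis) throws InterruptedException {
        for (int i = 0; i < attempts; i++) {
            if (isMasterIndexReady(dir)) {
                return true;
            }
            Thread.sleep(sleepMillis);
        }
        return isMasterIndexReady(dir);
    }
}
```

A deployment script might call `waitUntilReady` after triggering "Build Master Index" and only then route search traffic to the node.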


Technical – Admin page


References

• Lucene http://lucene.apache.org/core/

• KD 2013 Presentation https://jira.kuali.org/secure/attachment/77886/KD-2013-Optimizing-Document-Search-using-Lucene.pptx

• CONTRIB-95 https://jira.kuali.org/browse/CONTRIB-95