Lucene in Action
For ITCS 6265
Professor: Wensheng Wu
Presented by TA: Xu Fei


Post on 17-Dec-2015


Page 2: What is Lucene

“Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform.”

- A high-performance, scalable Information Retrieval (IR) library
- A project in the Apache Software Foundation
- Mature, free, open-source
- Implemented in Java

Page 3: Full-text indexing and searching

“In text retrieval, full text search refers to a technique for searching a computer-stored document or database. In a full text search, the search engine examines all of the words in every stored document as it tries to match search words supplied by the user.”

“Search engine indexing collects, parses, and stores data to facilitate fast and accurate information retrieval.”

Page 4: Lucene is popular

- A number of ports or integrations to other programming languages: C/C++, C#, Ruby, Perl, Python, PHP, etc.
- 1500+ installations: HP, FedEx, Iron Mountain, Akamai, DSpace, IBM/Yahoo, Healthline, Webmail, CNET, Lookout (acquired by Microsoft), webshots.com (100M docs, 4M queries/day), Siderean, Monster…

Page 5: Lucene is just a hammer!

- NOT a ready-to-use search application, like Google
- A software library, a toolkit
- A single compact JAR file (less than 1 MB!)
- A number of full-featured search applications have been built on top of Lucene.

Page 6: What Lucene can do for you

- Add search capabilities to your application
- Index and make searchable any data that you can extract text from
- Lucene doesn’t care about the source of the data, its format, or even its language, as long as you can derive text from it.
- You can even index data stored in your databases, indirectly!
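The "indirectly" point can be illustrated with a minimal sketch: rows are first pulled out of a database and flattened into text, and only that text is handed to the search layer. This is plain Python with the standard `sqlite3` module, not Lucene's API; the `articles` table, its columns, and the helper names are hypothetical.

```python
import sqlite3

def rows_to_documents(conn):
    """Flatten each database row into a document with a single text field."""
    cursor = conn.execute("SELECT id, title, body FROM articles ORDER BY id")
    return [{"id": row[0], "text": f"{row[1]} {row[2]}"} for row in cursor]

def search(documents, word):
    """Return ids of documents whose text contains the word (case-insensitive)."""
    word = word.lower()
    return [d["id"] for d in documents if word in d["text"].lower().split()]

# Hypothetical in-memory database standing in for an application's data store.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (id INTEGER PRIMARY KEY, title TEXT, body TEXT)")
conn.executemany(
    "INSERT INTO articles (id, title, body) VALUES (?, ?, ?)",
    [
        (1, "Lucene basics", "Lucene is a search library"),
        (2, "Database notes", "Rows can be exported as text"),
    ],
)

# The search layer never sees the database, only the extracted text.
documents = rows_to_documents(conn)
print(search(documents, "lucene"))
```

The search side only depends on the extracted text, which is why the original source, format, or storage system does not matter.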

Page 7: Search Application

Figure 1. Typical components of search application; the shaded components show which parts Lucene handles.

Components for indexing
- Acquire Content
- Build Document
- Analyze Document
- Index Document

Components for searching
- Search User Interface
- Build Query
- Search Query
- Render Results

Others
- Administration Interface
- Analytics Interface
- Scaleout
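The indexing and searching components listed above can be sketched as one small pipeline. This is a toy in plain Python, not Lucene code: every function name is hypothetical and simply mirrors a component name from the slide, and the "analyzer" only lowercases and splits on whitespace.

```python
from collections import defaultdict

def acquire_content():
    """Acquire Content: a hard-coded list stands in for a crawler or file reader."""
    return [
        ("doc1", "Lucene is a search library"),
        ("doc2", "A library is not an application"),
    ]

def build_document(doc_id, raw_text):
    """Build Document: wrap raw text in a simple record with named fields."""
    return {"id": doc_id, "contents": raw_text}

def analyze(text):
    """Analyze Document: lowercase and split on whitespace (a toy analyzer)."""
    return text.lower().split()

def index_document(index, document):
    """Index Document: record which documents contain each token."""
    for token in analyze(document["contents"]):
        index[token].add(document["id"])

def build_query(user_input):
    """Build Query: run the user's input through the same analyzer."""
    return analyze(user_input)

def search_query(index, query_terms):
    """Search Query: return ids of documents containing every query term."""
    if not query_terms:
        return set()
    result = set(index.get(query_terms[0], set()))
    for term in query_terms[1:]:
        result &= index.get(term, set())
    return result

def render_results(hits):
    """Render Results: format hits for display."""
    return [f"hit: {doc_id}" for doc_id in sorted(hits)]

index = defaultdict(set)
for doc_id, raw in acquire_content():
    index_document(index, build_document(doc_id, raw))

rendered = render_results(search_query(index, build_query("search library")))
print(rendered)
```

Using the same `analyze` step for both documents and queries is the important design choice here: terms only match if both sides are normalized the same way.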

Page 8: Ranking formula

$$
\mathrm{score}(Q,D) = \mathrm{coord}(Q,D) \cdot \mathrm{queryNorm}(Q) \cdot \sum_{t \in Q} \Big( \mathrm{tf}(t \text{ in } D) \cdot \mathrm{idf}(t)^2 \cdot t.\mathrm{getBoost}() \cdot \mathrm{norm}(D) \Big)
$$

tf–idf weight (term frequency–inverse document frequency)
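To make the formula concrete, here is a rough Python sketch that evaluates it on tokenized documents. The component definitions are assumptions modeled on the defaults described in the Lucene 2.4 scoring documentation listed in the references (tf as the square root of the term count, idf as 1 + ln(numDocs / (docFreq + 1)), coord as the fraction of query terms matched, queryNorm as 1 / sqrt of the summed squared weights, norm as 1 / sqrt of the document length). It is not Lucene's implementation, and the function and parameter names are hypothetical.

```python
import math

def score(query_terms, doc_tokens, all_docs, boosts=None):
    """Evaluate the slide's score(Q, D) with assumed default components."""
    boosts = boosts or {}
    num_docs = len(all_docs)

    def idf(term):
        # Assumed form: 1 + ln(numDocs / (docFreq + 1)).
        doc_freq = sum(1 for d in all_docs if term in d)
        return 1.0 + math.log(num_docs / (doc_freq + 1))

    def tf(term):
        # Assumed form: square root of the raw term count in the document.
        return math.sqrt(doc_tokens.count(term))

    matched = [t for t in query_terms if t in doc_tokens]
    if not matched:
        return 0.0

    coord = len(matched) / len(query_terms)
    sum_sq = sum((idf(t) * boosts.get(t, 1.0)) ** 2 for t in query_terms)
    query_norm = 1.0 / math.sqrt(sum_sq)
    norm = 1.0 / math.sqrt(len(doc_tokens))  # shorter documents score higher

    total = sum(
        tf(t) * idf(t) ** 2 * boosts.get(t, 1.0) * norm for t in query_terms
    )
    return coord * query_norm * total

docs = [
    ["penn", "state", "football"],
    ["football", "players", "state"],
    ["apache", "lucene"],
]
query = ["penn", "football"]
scores = [score(query, d, docs) for d in docs]
print(scores)
```

With this data the first document matches both query terms and should rank above the second, which matches only "football"; the third matches nothing and scores zero.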

Page 9: Key index files in Lucene

- Segments file
- Fields information file
- Text information file
- Frequency file
- Position file

Page 10: Inverted Index Example

Doc 1: Penn State Football … football
Doc 2: Football players … State

Posting Table

Posting id | word     | doc   | offset
1          | football | Doc 1 | 3
           |          | Doc 1 | 67
           |          | Doc 2 | 1
2          | penn     | Doc 1 | 1
3          | players  | Doc 2 | 2
4          | state    | Doc 1 | 2
           |          | Doc 2 | 13
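A posting table like the one above can be built with a few lines of Python. This is a minimal sketch, not Lucene's index format: the function name is hypothetical, and because the slide elides the middle of each document with "…", the shortened texts below give offsets 4 and 3 where the slide shows 67 and 13.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each lowercased word to a list of (doc name, 1-based word offset)."""
    index = defaultdict(list)
    for name, text in docs.items():
        for offset, word in enumerate(text.lower().split(), start=1):
            index[word].append((name, offset))
    return index

# Shortened stand-ins for the slide's documents (the elided text is omitted).
docs = {
    "Doc 1": "Penn State Football football",
    "Doc 2": "Football players State",
}

index = build_inverted_index(docs)

# Posting ids follow alphabetical word order, as in the slide's table.
for posting_id, word in enumerate(sorted(index), start=1):
    print(posting_id, word, index[word])
```

Looking up a word is then a single dictionary access instead of a scan over every document.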

Page 11: Demo

- How to install Lucene and run the demo
- Boolean retrieval example:
  - apache – lucene
  - apache + lucene
  - apache lucene
- Luke: http://www.getopt.org/luke/
- An online demo (PHP + Lucene): http://tiny.cc/JCA9K
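The three Boolean queries can be mimicked with set operations over posting lists. The sketch below assumes one plausible reading of the slide: the first query excludes "lucene", the second requires both terms, and the third accepts either. Lucene's own query parser may treat these operators differently, so this is only an illustration; the sample documents and names are hypothetical.

```python
# Hypothetical document collection keyed by document id.
docs = {
    1: "apache lucene search library",
    2: "apache http server",
    3: "lucene in action",
    4: "full text search",
}

def postings(term):
    """Return the set of document ids whose text contains the term."""
    return {doc_id for doc_id, text in docs.items() if term in text.split()}

apache = postings("apache")
lucene = postings("lucene")

apache_not_lucene = apache - lucene  # assumed reading of "apache – lucene"
apache_and_lucene = apache & lucene  # assumed reading of "apache + lucene"
apache_or_lucene = apache | lucene   # assumed reading of "apache lucene"

print(sorted(apache_not_lucene))
print(sorted(apache_and_lucene))
print(sorted(apache_or_lucene))
```

Boolean retrieval only decides which documents match; ordering the matches is the job of a ranking formula such as the one on Page 8.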

Page 12: References

- Lucene: http://lucene.apache.org/
- Apache: http://www.apache.org/
- “Lucene in Action” Chapter 1 and code: Link
- Lucene index: http://www.ibm.com/developerworks/library/wa-lucene/
- http://lucene.apache.org/java/2_4_0/scoring.html
- http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/search/Similarity.html
- http://en.wikipedia.org/wiki/Full_text_search
- http://en.wikipedia.org/wiki/Index_%28search_engine%29
- http://en.wikipedia.org/wiki/Tf-idf