Post on 14-Dec-2015

Advanced Indexing Techniques with Apache Lucene - Payloads

Michael Busch (buschmi@apache.org)

Agenda

• Part 1: Inverted Index 101

  – Posting Lists

  – Stored Fields vs. Payloads

• Part 2: Use cases for Payloads

  – BoostingTermQuery

  – Simple facet counting

Lucene’s data structures

[Diagram: a search runs against the inverted index and returns results (Hits); the stored fields of those hits are then retrieved from the store.]

c:\docs\shakespeare.txt:

To be or not to be.

c:\docs\einstein.txt:

The important thing is not to stop questioning.

Query: not

String comparison slow!

Solution: Inverted index

Inverted index

Query: not

Term          Document IDs (0 = shakespeare.txt, 1 = einstein.txt)
be            0
important     1
is            1
not           0, 1
or            0
questioning   1
stop          1
to            0, 1
the           1
thing         1
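The term-to-document mapping above can be sketched in a few lines of plain Java. This is an illustration only and not Lucene's index format; the class `TinyInvertedIndex` and its methods are made up for this example, and tokenization is reduced to lowercasing and splitting on non-letters.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical minimal inverted index: term -> ascending list of document IDs.
class TinyInvertedIndex {
    private final Map<String, List<Integer>> postings = new TreeMap<>();
    private int nextDocId = 0;

    // Indexes one document and returns the ID assigned to it.
    int add(String text) {
        int docId = nextDocId++;
        for (String term : text.toLowerCase().split("[^a-z]+")) {
            if (term.isEmpty()) continue;
            List<Integer> docs = postings.computeIfAbsent(term, t -> new ArrayList<>());
            // IDs are assigned in increasing order, so only the tail can be a duplicate.
            if (docs.isEmpty() || docs.get(docs.size() - 1) != docId) {
                docs.add(docId);
            }
        }
        return docId;
    }

    // A lookup replaces the string comparison against every document.
    List<Integer> search(String term) {
        return postings.getOrDefault(term.toLowerCase(), Collections.<Integer>emptyList());
    }
}
```

After adding the two example documents in order, `search("not")` returns both document IDs without touching the document text again.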

Word positions within each document:

c:\docs\shakespeare.txt (document 0):

To(0) be(1) or(2) not(3) to(4) be(5)

c:\docs\einstein.txt (document 1):

The(0) important(1) thing(2) is(3) not(4) to(5) stop(6) questioning(7)

Query: "not to"

Inverted index with positions

Query: "not to"

Term          Document ID: positions
be            0: 1, 5
important     1: 1
is            1: 3
not           0: 3      1: 4
or            0: 2
questioning   1: 7
stop          1: 6
to            0: 0, 4   1: 5
the           1: 0
thing         1: 2
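To show why positions make the phrase query "not to" answerable, here is a small plain-Java sketch. It is not Lucene code: `PositionalIndex` and `phrase` are names invented for this example, and only two-word phrases are handled.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical positional index: term -> (document ID -> positions in that document).
class PositionalIndex {
    private final Map<String, TreeMap<Integer, List<Integer>>> postings = new HashMap<>();
    private int nextDocId = 0;

    int add(String text) {
        final int docId = nextDocId++;
        int position = 0;
        for (String term : text.toLowerCase().split("[^a-z]+")) {
            if (term.isEmpty()) continue;
            postings.computeIfAbsent(term, t -> new TreeMap<>())
                    .computeIfAbsent(docId, d -> new ArrayList<>())
                    .add(position++);
        }
        return docId;
    }

    // Returns the IDs of documents in which `second` directly follows `first`.
    List<Integer> phrase(String first, String second) {
        List<Integer> hits = new ArrayList<>();
        Map<Integer, List<Integer>> a =
                postings.getOrDefault(first, new TreeMap<Integer, List<Integer>>());
        Map<Integer, List<Integer>> b =
                postings.getOrDefault(second, new TreeMap<Integer, List<Integer>>());
        for (Map.Entry<Integer, List<Integer>> entry : a.entrySet()) {
            List<Integer> following = b.get(entry.getKey());
            if (following == null) continue; // second term not in this document
            for (int p : entry.getValue()) {
                if (following.contains(p + 1)) { // adjacent positions form the phrase
                    hits.add(entry.getKey());
                    break;
                }
            }
        }
        return hits;
    }
}
```

With the two example documents, "not" and "to" sit at positions 3 and 4 in document 0 and at 4 and 5 in document 1, so both documents match the phrase.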

Inverted index with Payloads

[Diagram: the posting lists from the previous slide, with a payload stored next to each position. Columns: Document IDs, Positions, Payloads.]
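As a rough mental model, a payload is a byte array that travels with one occurrence of a term. The following plain-Java sketch illustrates that idea; `PayloadPostings`, `Occurrence`, and `payloadAt` are hypothetical names, and the flat list is a simplification, not how Lucene encodes postings.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical posting lists in which every (document, position) pair may carry a payload.
class PayloadPostings {
    static final class Occurrence {
        final int docId;
        final int position;
        final byte[] payload; // arbitrary bytes; null when nothing is attached

        Occurrence(int docId, int position, byte[] payload) {
            this.docId = docId;
            this.position = position;
            this.payload = payload;
        }
    }

    private final Map<String, List<Occurrence>> postings = new HashMap<>();

    void add(String term, int docId, int position, byte[] payload) {
        postings.computeIfAbsent(term, t -> new ArrayList<>())
                .add(new Occurrence(docId, position, payload));
    }

    // Returns the payload stored for this occurrence, or null if there is none.
    byte[] payloadAt(String term, int docId, int position) {
        for (Occurrence o : postings.getOrDefault(term, Collections.<Occurrence>emptyList())) {
            if (o.docId == docId && o.position == position) return o.payload;
        }
        return null;
    }
}
```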

So far…

• String comparison slow

• Inverted index used to accelerate search

• Store positions in posting lists to allow phrase searches

• Store payloads in posting lists to store arbitrary data with each position

Store

Field 1: title
Field 2: content
Field 3: hashvalue

Documents:

D0: F1 F2 F3 | D1: F1 F2 F3 | D2: F1 F2 F3

Store

D0: F1 F2 F3 | D1: F1 F2 F3 | D2: F1 F2 F3

• Optimized for random access

• Document-locality

Posting list with Payloads

[Diagram: next to the store layout, a posting list with one entry per document; each entry holds the document ID, position 0, and that document's F3 value as its payload.]

• Optimized for scanning and skipping

• Value-locality
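The "scanning and skipping" access pattern can be illustrated with a cursor over parallel arrays. This is a sketch under loud assumptions: `FacetPostingList` is a hypothetical class, payloads are simplified to one `int` per document, and a binary search stands in for the skip data that a real index would use.

```java
import java.util.Arrays;

// Hypothetical posting list with one payload value per document, stored side by side.
class FacetPostingList {
    private final int[] docIds;   // ascending document IDs
    private final int[] payloads; // payloads[i] belongs to docIds[i]
    private int cursor = 0;

    FacetPostingList(int[] docIds, int[] payloads) {
        this.docIds = docIds;
        this.payloads = payloads;
    }

    // Moves forward to the first entry whose document ID is >= target.
    // Returns false when the list is exhausted. Binary search is a stand-in
    // for skip data; it never moves the cursor backwards.
    boolean skipTo(int target) {
        int found = Arrays.binarySearch(docIds, cursor, docIds.length, target);
        cursor = found >= 0 ? found : -found - 1;
        return cursor < docIds.length;
    }

    int doc() { return docIds[cursor]; }

    int payload() { return payloads[cursor]; }
}
```

Because all values of the field sit next to each other, reading the value for a run of hits touches one compact structure instead of one stored document per hit.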

Payloads - API

org.apache.lucene.analysis.Token

    void setPayload(Payload payload)

org.apache.lucene.index.TermPositions

    int getPayloadLength();
    byte[] getPayload(byte[] data, int offset)
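Since the API deals in raw bytes and an offset into a buffer, the application decides how values are encoded. A minimal sketch of one possible encoding for an `int` (for example a hash value) follows; `PayloadCodec` is a hypothetical helper, not part of Lucene, and the 4-byte big-endian layout is an assumption of this example.

```java
import java.nio.ByteBuffer;

// Hypothetical helper that turns an int into payload bytes and back.
class PayloadCodec {
    // Encodes the value as 4 bytes, most significant byte first.
    static byte[] encodeInt(int value) {
        return ByteBuffer.allocate(4).putInt(value).array();
    }

    // Decodes 4 bytes starting at `offset`, mirroring the (data, offset)
    // style in which payload bytes are handed back to the caller.
    static int decodeInt(byte[] data, int offset) {
        return ByteBuffer.wrap(data, offset, 4).getInt();
    }
}
```

At indexing time the encoded bytes would be wrapped in a `Payload`; at search time the decoder would read from the buffer filled by `getPayload`.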

Example: BoostingTermQuery

Analyzer:

    final byte BoldBoost = 5;
    …
    Token token = new Token(…);
    …
    // attach a one-byte boost to tokens that were bold in the source document
    if (isBold) {
      token.setPayload(new Payload(new byte[] {BoldBoost}));
    }
    …
    return token;

Example: BoostingTermQuery

Similarity:

    Similarity boostingSimilarity = new DefaultSimilarity() {
      // overrides Similarity.scorePayload
      public float scorePayload(byte[] payload, int offset, int length) {
        // a one-byte payload is the boost written by the analyzer
        if (length == 1) return payload[offset];
        // any other payload length: neutral score
        return 1;
      }
    };
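To make the effect of the payload score concrete, here is a standalone sketch that does not depend on Lucene. It assumes the payload scores of all matching positions in a document are averaged and multiplied into the base term score; that combination rule, the class `PayloadBoost`, and the method `boostedScore` are assumptions of this example.

```java
// Hypothetical stand-in for how a payload-aware query might combine scores.
class PayloadBoost {
    // Same decoding rule as on the slide: one byte is the boost, anything else is neutral.
    static float scorePayload(byte[] payload, int offset, int length) {
        return length == 1 ? payload[offset] : 1.0f;
    }

    // Assumed combination rule: base score times the average payload score.
    static float boostedScore(float baseScore, byte[][] payloads) {
        if (payloads.length == 0) return baseScore; // no payloads seen: score unchanged
        float sum = 0;
        for (byte[] payload : payloads) {
            sum += scorePayload(payload, 0, payload.length);
        }
        return baseScore * (sum / payloads.length);
    }
}
```

Under these assumptions, a document whose single match was bold (payload 5) scores five times its base score, while a document with one bold and one plain match scores three times its base score.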

Example: BoostingTermQuery

BoostingTermQuery:

    Query btq = new BoostingTermQuery(new Term("field", "searchterm"));

Searching:

    Searcher searcher = new IndexSearcher(…);
    searcher.setSimilarity(boostingSimilarity);
    …
    Hits hits = searcher.search(btq);

Example: Simple facet counting

Analyzer:

    public TokenStream tokenStream(String fieldName, Reader reader) {
      if (fieldName.equals("_facet")) {
        // emit exactly one token whose payload is the hash of the document's URL
        return new TokenStream() {
          boolean done = false;
          public Token next() {
            if (done) return null;
            Token token = new Token(…);
            token.setPayload(new Payload(computeHash(url)));
            done = true;
            return token;
          }
        };
      }
      // other fields are not shown on the slide
    }

Example: Simple facet counting

HitCollector:

• Use different PriorityQueues for different sites

• Instead of returning the top-n results of the whole data set, return the top-n results per site
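The per-site collection described above can be sketched without Lucene as follows. `PerSiteCollector` is a hypothetical class; in a real collector the `siteHash` argument would be decoded from the payload of the "_facet" token, which is assumed here rather than shown.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

// Hypothetical collector that keeps the n best hits for every site separately.
class PerSiteCollector {
    static final class Hit {
        final int doc;
        final float score;
        Hit(int doc, float score) { this.doc = doc; this.score = score; }
    }

    // Lowest score first, so the queue head is always the weakest hit kept so far.
    private static final Comparator<Hit> BY_SCORE = new Comparator<Hit>() {
        public int compare(Hit a, Hit b) { return Float.compare(a.score, b.score); }
    };

    private final int n;
    private final Map<Integer, PriorityQueue<Hit>> queues = new HashMap<>();

    PerSiteCollector(int n) { this.n = n; }

    void collect(int doc, float score, int siteHash) {
        PriorityQueue<Hit> queue = queues.get(siteHash);
        if (queue == null) {
            queue = new PriorityQueue<Hit>(n + 1, BY_SCORE);
            queues.put(siteHash, queue);
        }
        queue.add(new Hit(doc, score));
        if (queue.size() > n) queue.poll(); // drop the weakest hit of this site
    }

    // Returns the document IDs kept for one site, best score first.
    List<Integer> topDocs(int siteHash) {
        List<Integer> docs = new ArrayList<>();
        PriorityQueue<Hit> queue = queues.get(siteHash);
        if (queue == null) return docs;
        List<Hit> hits = new ArrayList<>(queue);
        hits.sort(BY_SCORE.reversed());
        for (Hit hit : hits) docs.add(hit.doc);
        return docs;
    }
}
```

Keeping one bounded queue per site means a site with many strong hits cannot crowd every other site out of the result list.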

Example: Simple facet counting

Summary

• In this example the facet (site) is used for scoring, but the approach can be extended to facet counting

• Good performance due to locality of facet values
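The extension to facet counting mentioned above could look roughly like this: instead of ranking hits per site, count them. `FacetCounter` is a hypothetical class, and the payload is assumed to be a 4-byte big-endian site hash, which the slides do not specify.

```java
import java.nio.ByteBuffer;
import java.util.HashMap;
import java.util.Map;

// Hypothetical counter: number of hits per facet value (here: per site hash).
class FacetCounter {
    private final Map<Integer, Integer> counts = new HashMap<>();

    // Called once per hit with the payload read from that document's facet posting.
    void collect(byte[] payload) {
        int siteHash = ByteBuffer.wrap(payload).getInt(); // assumed 4-byte big-endian hash
        counts.merge(siteHash, 1, Integer::sum);
    }

    int count(int siteHash) {
        return counts.getOrDefault(siteHash, 0);
    }
}
```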

Conclusion

• Payloads offer great flexibility

• Payloads are stored very space-efficiently

• Sophisticated data structures enable efficient skipping over payloads

• Payloads should be used whenever special data is required for finding hits and scoring

Outlook

• Finalize API (currently Beta)

• Add more out-of-the-box query types

• Per-document Payloads

Questions ?