language support and linguistics in lucene solr & its eco system

Language support and linguistics in Apache Lucene™ and Apache Solr™ and the eco-system Gaute Lambertsen Christian Moen [email protected] [email protected]

Upload: lucenerevolution

Post on 24-May-2015


DESCRIPTION

Presented by Christian Moen, Software Engineer, Atilika Inc. In search, language handling is often key to a good search experience. This talk gives an overview of the language handling and linguistics functionality in Lucene/Solr and best practices for using it in Western, Asian and multi-language deployments. Pointers and references within the open source and commercial eco-systems for more advanced linguistics and their applications are also discussed. The presentation is a mix of overview and hands-on best practices that the audience can benefit from immediately in their Lucene/Solr deployments. The eco-system part is meant to inspire how more advanced functionality can be developed using the open source technologies available (predominantly) within the Apache eco-system, while also highlighting some of the commercial options.

TRANSCRIPT

Page 1: Language support and linguistics in lucene solr & its eco system

Language support and linguistics in Apache Lucene™ and Apache Solr™ and the eco-system

Gaute Lambertsen, Christian Moen
[email protected] [email protected]

Page 2: Language support and linguistics in lucene solr & its eco system

Christian Moen
• MSc. in computer science, University of Oslo, Norway
• Worked with search at FAST (now Microsoft) for 10 years
  • 5 years in R&D building FAST Enterprise Search Platform in Oslo, Norway
  • 5 years in Services doing solution delivery, sales, etc. in Tokyo, Japan
• Founded アティリカ株式会社 (Atilika Inc.) in October, 2009
  • We help companies innovate using new technologies and good ideas
  • We do information retrieval, natural language processing and big data
  • We are based in Tokyo, but we have clients everywhere
  • We are a small company, but our customers are typically very big companies
• Newbie Lucene & Solr Committer
  • Mostly been working on Japanese language support (Kuromoji) so far
• Please write to me at [email protected] or [email protected]

Page 3: Language support and linguistics in lucene solr & its eco system

Gaute Lambertsen
• MSc. in computer science, Ritsumeikan University
• Japan Science and Technology Agency (JST)
  • Urban ad-hoc network research led by Prof. Nobuhiko Nishio, Ritsumeikan University
• Nokia Research & Development, Japan
  • Hardware and software prototypes for research applications
• Sony Digital Network Applications, Japan
  • Various software development for consumer electronics
  • Includes software for consumer hits such as Sony CyberShot
• Joined Atilika in April, 2012
• Please write to me at [email protected]

Page 4: Language support and linguistics in lucene solr & its eco system

About us
• We are a small company based in Tokyo
• Our customers are typically big companies “everywhere”
• We focus on innovation - and help big companies innovate using new technologies and new ideas
• We are software engineers, but like business stuff, too
• We know a bit about search, NLP and big data
• We offer consulting services and software products
• We are supporters of open source software
• We are hiring! Write to us at [email protected]

Page 5: Language support and linguistics in lucene solr & its eco system

About this talk
• Search engine review
• Natural language processing and search
• Examples from different languages
• Measuring search quality
• Linguistics in Lucene and Solr
• Architecture overview
• Lucene Analyzers
• Adding and searching documents in Solr
• Code examples
• NLP eco-system
• The bigger picture

Page 6: Language support and linguistics in lucene solr & its eco system

Hands-on 1: Working with analyzers in code

Hands-on 2: French analysis with synonyms

Hands-on 3: Multi-lingual search with Solr

Hands-on 4: Text processing using OpenNLP

Page 7: Language support and linguistics in lucene solr & its eco system

What is a search engine?

Page 8: Language support and linguistics in lucene solr & its eco system

Documents

Two documents (1 & 2) with English text:

1 Sushi is very tasty in Japan
2 Visiting the Tsukiji fish market is very fun

Page 10: Language support and linguistics in lucene solr & its eco system

Text segmentation

Step 1: Two documents (1 & 2) with English text

1 Sushi is very tasty in Japan
2 Visiting the Tsukiji fish market is very fun

Step 2: Documents are turned into searchable terms (tokenization)

1 Sushi is very tasty in Japan
2 Visiting the Tsukiji fish market is very fun

Step 3: Terms/tokens are converted to lowercase form (normalization)

1 sushi is very tasty in japan
2 visiting the tsukiji fish market is very fun
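The tokenization and normalization steps can be sketched in a few lines of Python. This is only an illustration of the idea, not how Lucene implements it; the `tokenize` and `normalize` helpers and the regular expression are assumptions made for this example.

```python
import re

def tokenize(text):
    """Split text into tokens on runs of word characters and apostrophes."""
    return re.findall(r"[\w']+", text)

def normalize(tokens):
    """Convert every token to its lowercase form."""
    return [token.lower() for token in tokens]

docs = {
    1: "Sushi is very tasty in Japan",
    2: "Visiting the Tsukiji fish market is very fun",
}

# Tokenize first, then normalize, mirroring the steps on the slide.
analyzed = {doc_id: normalize(tokenize(text)) for doc_id, text in docs.items()}
```

After this runs, `analyzed[1]` holds the lowercase tokens of document 1 in their original order.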

Page 12: Language support and linguistics in lucene solr & its eco system

Document indexing

Tokenized documents with normalized tokens:

1 sushi is very tasty in japan
2 visiting the tsukiji fish market is very fun

Inverted index - tokens are mapped to the document ids that contain them:

sushi 1
is 1 2
very 1 2
tasty 1
in 1
japan 1
visiting 2
the 2
tsukiji 2
fish 2
market 2
fun 2
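As a rough sketch of the inverted index idea, the following Python builds a mapping from each token to the sorted list of document ids that contain it. The `build_inverted_index` function is a name made up for this example; a real Lucene index stores far more (positions, frequencies, and so on).

```python
from collections import defaultdict

docs = {
    1: "sushi is very tasty in japan",
    2: "visiting the tsukiji fish market is very fun",
}

def build_inverted_index(docs):
    """Map each token to the list of ids of the documents that contain it."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        for token in docs[doc_id].split():
            # Record each document only once per token.
            if doc_id not in index[token]:
                index[token].append(doc_id)
    return dict(index)

index = build_inverted_index(docs)
```

Tokens shared by both documents, such as `is` and `very`, end up with both ids in their list.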

Page 13: Language support and linguistics in lucene solr & its eco system

Searching

Inverted index:

sushi 1
is 1 2
very 1 2
tasty 1
in 1
japan 1
visiting 2
the 2
tsukiji 2
fish 2
market 2
fun 2

query: very tasty sushi

parsed query: very AND tasty AND sushi

hits: 1

Page 20: Language support and linguistics in lucene solr & its eco system

Searching

query: visit fun market

parsed query: visit AND fun AND market

visit ≠ visiting

no hits (all terms need to match)
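The two searches above can be reproduced with a small sketch that intersects the posting lists of all query terms. The `search_and` helper is hypothetical and only models the AND semantics shown on the slides; ranking is left out.

```python
# Inverted index from the document indexing slide.
index = {
    "sushi": [1], "is": [1, 2], "very": [1, 2], "tasty": [1],
    "in": [1], "japan": [1], "visiting": [2], "the": [2],
    "tsukiji": [2], "fish": [2], "market": [2], "fun": [2],
}

def search_and(index, query):
    """Return the ids of documents that contain every term in the query."""
    terms = query.lower().split()
    if not terms:
        return []
    # A term missing from the index has an empty posting list.
    postings = [set(index.get(term, ())) for term in terms]
    return sorted(set.intersection(*postings))

hits_sushi = search_and(index, "very tasty sushi")
hits_visit = search_and(index, "visit fun market")
```

The second query finds nothing because `visit` is not in the index; only `visiting` is. This is the mismatch that stemming is meant to address.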

Page 27: Language support and linguistics in lucene solr & its eco system

What’s the problem?

Search engines are not magical answering machines

They match terms in queries against terms in documents, and order matches by rank

Page 28: Language support and linguistics in lucene solr & its eco system

Key takeaways

Text processing affects search quality in a big way because it affects matching

The “magic” of a search engine is often provided by high quality text processing

Garbage in ⇒ Garbage out

Page 29: Language support and linguistics in lucene solr & its eco system

Natural language and search

Page 30: Language support and linguistics in lucene solr & its eco system

日本語 English Deutsch Français العربية

Page 31: Language support and linguistics in lucene solr & its eco system

English

Pale ale is a beer made through warm fermentation using pale malt and is one of the world's major beer styles.

• How do we want to index world's?
• Should a search for style match styles? And should ferment match fermentation?

Page 34: Language support and linguistics in lucene solr & its eco system

German

Das Oktoberfest ist das größte Volksfest der Welt und es findet in der bayerischen Landeshauptstadt München statt.

The Oktoberfest is the world’s largest festival and it takes place in the Bavarian capital Munich.

• How do we want to search ü, ö and ß?
• Do we want a search for hauptstadt to match Landeshauptstadt?
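One possible answer to the character question is to fold umlauts and ß to plain ASCII on both the index and query side. The sketch below uses only Python's standard library; the `fold` helper is a name chosen for this example, and mapping ü to u (rather than ue) is an assumption that an application may want to change.

```python
import unicodedata

def fold(token):
    """Lowercase a token and strip diacritics; ß becomes ss."""
    # casefold() lowercases and also expands ß to ss.
    token = token.casefold()
    # NFD splits letters such as ü into a base letter plus a combining mark.
    decomposed = unicodedata.normalize("NFD", token)
    # Drop the combining marks, keeping only the base letters.
    return "".join(c for c in decomposed if not unicodedata.combining(c))

folded_city = fold("München")
folded_adjective = fold("größte")
```

With this folding, a query for `munchen` and a document containing `München` produce the same term.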

Page 38: Language support and linguistics in lucene solr & its eco system

French

Le champagne est un vin pétillant français protégé par une appellation d'origine contrôlée.

Champagne is a French sparkling wine with a protected designation of origin.

• How do we want to search é, ç and ô?
• How do we want to search d'origine?

Page 42: Language support and linguistics in lucene solr & its eco system

Arabic

تعـتـبر القهوة العربي?ه ا=صـــــــيلة رمزا من رموز الكرم عـند العرب فى العالم العربي.

Original Arabian coffee is considered a symbol of generosity among the Arabs in the Arab world.

• Reads from right to left
• How do we want to search ا=صـــــــيلة?
• Do we want to normalize diacritics?

Diacritics normalized (removed):

تعتبر القهوة العربية ا=صـــــــيلة رمزا من رموز الكرم عند العرب فى العالم العربي.

• Do we want to correct the common spelling mistake for فى and ه?
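A minimal sketch of this kind of normalization, assuming we only want to strip the tatweel (elongation) character and the short-vowel marks. The `normalize_arabic` helper and the chosen character range are assumptions for illustration; Lucene's own Arabic normalization covers more cases.

```python
import re

TATWEEL = "\u0640"                      # elongation character, as in عـند
HARAKAT = re.compile("[\u064B-\u0652]")  # fathatan through sukun, incl. shadda

def normalize_arabic(text):
    """Remove tatweel and short-vowel diacritics from Arabic text."""
    return HARAKAT.sub("", text.replace(TATWEEL, ""))

# The word written with a tatweel on the slide: ain, tatweel, noon, dal.
elongated = "\u0639\u0640\u0646\u062f"
normalized = normalize_arabic(elongated)
```

Applying the same function to queries and documents makes the elongated and plain spellings match.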

Page 49: Language support and linguistics in lucene solr & its eco system

Japanese

JR新宿駅の近くにビールを飲みに行こうか?

Shall we go for a beer near JR Shinjuku station?

• What are the words in this sentence? Which tokens do we index?
• Words are implicit in Japanese - there is no white space that separates them
• But how do we find the tokens?
• Do we want 飲む (to drink) to match 飲み?
• Do we want ﾋﾞｰﾙ to match ビール? Does half-width match full-width?
• Do we want (emoji) to match?
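The half-width versus full-width question can be illustrated with Unicode NFKC normalization from Python's standard library, which maps half-width katakana to full-width and full-width Latin letters to ASCII. The `normalize_width` name is made up for this example, and NFKC is only one of several ways to do this.

```python
import unicodedata

def normalize_width(text):
    """Apply Unicode NFKC normalization to unify character widths."""
    return unicodedata.normalize("NFKC", text)

half_width_beer = "\uff8b\uff9e\uff70\uff99"  # half-width katakana for "beer"
full_width_beer = "\u30d3\u30fc\u30eb"        # full-width katakana for "beer"
full_width_jr = "\uff2a\uff32"                # full-width Latin letters J and R

normalized_beer = normalize_width(half_width_beer)
normalized_jr = normalize_width(full_width_jr)
```

Word segmentation and matching 飲む against 飲み need a morphological analyzer such as Kuromoji and are outside this sketch.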

Page 58: Language support and linguistics in lucene solr & its eco system

Common traits
• Segmenting source text into tokens
  • Dealing with non-space separated languages
  • Handling punctuation in space separated languages
  • Segmenting compounds into their parts
• Applying relevant linguistic normalizations
  • Character normalization
  • Morphological (or grammatical) normalizations
  • Spelling variations
  • Synonyms and stopwords

Page 59: Language support and linguistics in lucene solr & its eco system

Key take-aways
• Natural language is very complex
  • Each language is different with its own set of complexities
• We have had a high level look at languages
  • Japanese, English, German, French, Arabic
• But there is also...
  • Greek, Hebrew, Chinese, Korean, Russian, Thai, Spanish and many more...
• Search needs per-language processing
  • Many considerations to be made (often application-specific)

Page 60: Language support and linguistics in lucene solr & its eco system

Search quality measurements

Page 61: Language support and linguistics in lucene solr & its eco system

Precision

Fraction of retrieved documents that are relevant

precision = |{relevant docs} ∩ {retrieved docs}| / |{retrieved docs}|

Page 62: Language support and linguistics in lucene solr & its eco system

Perfect precision

Just return a single relevant document!

precision = |{relevant docs} ∩ {document A}| / |{document A}| = |{document A}| / |{document A}| = 1

Page 63: Language support and linguistics in lucene solr & its eco system

Recall

Fraction of relevant documents that are retrieved

recall = |{relevant docs} ∩ {retrieved docs}| / |{relevant docs}|

Page 64: Language support and linguistics in lucene solr & its eco system

Perfect recall

Just return everything!

recall = |{relevant docs} ∩ {all docs}| / |{relevant docs}| = |{relevant docs}| / |{relevant docs}| = 1

Page 65: Language support and linguistics in lucene solr & its eco system

Accuracy - F

A balanced measure of precision and recall (harmonic mean)

F = 2 · precision · recall / (precision + recall)
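The three measures are easy to compute once the sets of relevant and retrieved documents are known. The sketch below uses a made-up example with document ids A to E; the function names are chosen for this illustration.

```python
def precision(relevant, retrieved):
    """Fraction of retrieved documents that are relevant."""
    return len(relevant & retrieved) / len(retrieved) if retrieved else 0.0

def recall(relevant, retrieved):
    """Fraction of relevant documents that are retrieved."""
    return len(relevant & retrieved) / len(relevant) if relevant else 0.0

def f_measure(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if (p + r) else 0.0

# Hypothetical judgement: four relevant documents, three retrieved.
relevant = {"A", "B", "C", "D"}
retrieved = {"A", "B", "E"}

p = precision(relevant, retrieved)  # 2 of 3 retrieved are relevant
r = recall(relevant, retrieved)     # 2 of 4 relevant are retrieved
f = f_measure(p, r)
```

Returning only document A gives perfect precision, and returning every document gives perfect recall, which is why neither measure is useful on its own.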

Page 66: Language support and linguistics in lucene solr & its eco system

Precision vs. Recall

Should I optimize for precision or recall?

That depends on your application!

Page 68: Language support and linguistics in lucene solr & its eco system

Optimizing for Precision

Optimize for Precision if hits are plentiful and several results can meet the user’s information needs.

Make sure people can find a few good relevant hits rather than overwhelming them with likely irrelevant results.

Examples: web search, knowledge management

Page 69: Language support and linguistics in lucene solr & its eco system

Optimizing for Recall

Optimize for Recall if not missing any relevant hits is your top priority.

Return everything that is relevant, but probably also return a lot of documents that are not.

Examples: compliance, patent search, digital forensics, etc.

Page 70: Language support and linguistics in lucene solr & its eco system

A lot of tuning work is in practice often about improving recall without hurting precision


Page 71: Language support and linguistics in lucene solr & its eco system

Linguistics in Lucene and Solr

Page 72: Language support and linguistics in lucene solr & its eco system

Lucene
• Core search & indexing library
• Interface is an API (usually Java)
• No configuration files
• No schema
• Runs on a single server
• Has linguistics support for a range of languages
• Great for embedding search

Solr
• Ready-to-use search server
• Interface is an HTTP server
• Configuration files
• Has a schema
• Runs on multiple servers in a scalable and fault-tolerant fashion
• Adds functionality for facets, spell-checking, highlighting, etc.
• Uses Lucene for search & indexing

Page 73: Language support and linguistics in lucene solr & its eco system

Linguistics in Lucene

Page 74: Language support and linguistics in lucene solr & its eco system

Simplified architecture

document or query → Lucene analysis chain / Analyzer → Index

Lucene analysis chain / Analyzer
1. Analyzes queries or documents in a pipelined fashion before indexing or search
2. Analysis itself is done by an analyzer on a per-field basis
3. Key plug-in point for linguistics in Lucene

Page 76: Language support and linguistics in lucene solr & its eco system

Analyzers

What does an Analyzer do?

• An Analyzer takes text as its input and turns it into a stream of tokens
• Tokens are produced by a Tokenizer
• Tokens can be processed further by a chain of TokenFilters downstream

Page 80: Language support and linguistics in lucene solr & its eco system

Analyzer high-level concepts

Reader → Tokenizer → TokenFilter → TokenFilter → TokenFilter

Reader
• Stream to be analyzed is provided by a Reader (from java.io)
• Can have a chain of associated CharFilters (not discussed)

Tokenizer
• Segments text provided by the Reader into tokens
• Most interesting things happen in the incrementToken() method

TokenFilter
• Updates, mutates or enriches tokens
• Most interesting things happen in the incrementToken() method

Page 81: Language support and linguistics in lucene solr & its eco system

Lucene processing example

FrenchAnalyzer

Input
Le champagne est protégé par une appellation d'origine contrôlée.

StandardTokenizer
Le champagne est protégé par une appellation d'origine contrôlée

ElisionFilter
Le champagne est protégé par une appellation origine contrôlée

LowerCaseFilter
le champagne est protégé par une appellation origine contrôlée

StopFilter
champagne protégé appellation origine contrôlée

FrenchLightStemFilter
champagn proteg apel origin control

Page 94: Language support and linguistics in lucene solr & its eco system

FrenchAnalyzer

StandardTokenizer
• Very commonly used tokenizer
• Is smart and handles punctuation cleverly
• Used by many space-based languages
• Based on a JFlex grammar

ElisionFilter
• Removes elisions from tokens
• See solr/collection1/conf/lang/contractions_fr.txt

LowerCaseFilter
• Lowercases tokens

StopFilter
• Removes very common words
• See solr/collection1/conf/lang/stopwords_fr.txt

FrenchLightStemFilter
• Stems French words
• Also does accent normalization (removal)
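To make the chain concrete, here is a small Python model of the first four stages, written as chained generators. It is not Lucene code: the function names, the regular expression, and the short `ELISIONS` and `STOPWORDS` sets are assumptions for this example (the real lists live in the files mentioned above), and stemming is left out because it needs the actual French light stemmer.

```python
import re

ELISIONS = {"l", "d", "m", "t", "qu", "n", "s", "j"}  # illustrative subset
STOPWORDS = {"le", "est", "par", "une"}               # illustrative subset

def tokenizer(text):
    """Emit tokens made of word characters and straight apostrophes."""
    yield from re.findall(r"[\w']+", text)

def elision_filter(tokens):
    """Strip a leading elided article, e.g. d'origine becomes origine."""
    for token in tokens:
        head, sep, tail = token.partition("'")
        yield tail if sep and head.lower() in ELISIONS else token

def lowercase_filter(tokens):
    """Lowercase every token."""
    for token in tokens:
        yield token.lower()

def stop_filter(tokens):
    """Drop very common words."""
    for token in tokens:
        if token not in STOPWORDS:
            yield token

def analyze(text):
    """Run the tokenizer and the filters in the order shown on the slide."""
    return list(stop_filter(lowercase_filter(elision_filter(tokenizer(text)))))

terms = analyze("Le champagne est protégé par une appellation d'origine contrôlée.")
```

Each filter wraps the previous stage and pulls tokens through one at a time, which mirrors the pipelined model described above.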

Page 95: Language support and linguistics in lucene solr & its eco system

Analyzer processing model
• Analyzers provide a TokenStream
  • Retrieve it by calling tokenStream(field, reader)
  • tokenStream() bundles together tokenizers and any additional filters necessary for analysis
• Input is advanced by incrementToken()
  • Information about the token itself is provided by so-called TokenAttributes attached to the stream
  • Attributes for term text, offset, token type, etc.
  • TokenAttributes are updated on incrementToken()
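The consumption pattern can be modelled with a toy class in Python. This is explicitly not the Lucene API: `ToyTokenStream`, `increment_token`, and the attribute names are invented here to show the idea that the stream object stays the same while its attributes are overwritten on each call.

```python
import re

class ToyTokenStream:
    """Toy model of an attribute-based token stream (not the Lucene API)."""

    def __init__(self, text):
        self._matches = re.finditer(r"\w+", text)
        # Attributes describing the current token.
        self.term = None
        self.start_offset = None
        self.end_offset = None

    def increment_token(self):
        """Advance to the next token; return False when input is exhausted."""
        match = next(self._matches, None)
        if match is None:
            return False
        self.term = match.group().lower()
        self.start_offset, self.end_offset = match.span()
        return True

stream = ToyTokenStream("Sushi is tasty")
tokens = []
while stream.increment_token():
    # Read the attributes after each advance; they change on every call.
    tokens.append((stream.term, stream.start_offset, stream.end_offset))
```

The consumer loops until `increment_token()` returns `False` and copies out whatever attributes it needs on each step.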

Page 96: Language support and linguistics in lucene solr & its eco system

Hands-on: Working with analyzers in code

Page 97: Language support and linguistics in lucene solr & its eco system

Synonyms

Page 98: Language support and linguistics in lucene solr & its eco system

Synonyms
• Synonyms are flexible and easy-to-use
  • Very powerful tools for improving recall
• Two types of synonyms
  • One way/mapping “sparkling wine => champagne”
  • Two way/equivalence “aoc, appellation d'origine contrôlée”
• Can be applied index-time or query-time
  • Apply synonyms on one side - not both
• Best practice is to apply synonyms query-side
  • Allows for updating synonyms without reindexing
  • Allows for turning synonyms on and off easily
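The difference between the two synonym types can be sketched as query-side expansion in Python. The `expand_query` function and the data structures are assumptions for this example, and matching the whole query string is a simplification; Solr's synonym filter works on the token stream instead.

```python
# One way/mapping: the left side is replaced by the right side.
ONE_WAY = {"sparkling wine": ["champagne"]}

# Two way/equivalence: every member expands to all members.
EQUIVALENT = [{"aoc", "appellation d'origine contrôlée"}]

def expand_query(query):
    """Return the list of phrases to search for after synonym expansion."""
    phrase = query.lower().strip()
    if phrase in ONE_WAY:
        return list(ONE_WAY[phrase])
    for group in EQUIVALENT:
        if phrase in group:
            return sorted(group)
    return [phrase]

mapped = expand_query("Sparkling wine")
equivalent = expand_query("AOC")
unchanged = expand_query("champagne")
```

Because expansion happens on the query side, editing `ONE_WAY` or `EQUIVALENT` takes effect without reindexing any documents.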

Page 99: Language support and linguistics in lucene solr & its eco system

Hands-on: French analysis with synonyms

Page 100: Language support and linguistics in lucene solr & its eco system

Linguistics in Solr

Page 101: Language support and linguistics in lucene solr & its eco system

Adding document details

<add>
  <doc>
    <field> ∙∙∙ </field>
  </doc>
</add>

UpdateRequestHandler handles request
1. Receives a document via HTTP in XML (or JSON, CSV, ...)
2. Converts document to a SolrInputDocument
3. Activates the update chain

Document fields: id, title, body

Update chain of UpdateRequestProcessors
1. Processes a document at a time with operation (add)
2. Plugin logic can mutate SolrInputDocument, e.g. add fields or do other processing as desired
3. Update processor added a genre field by analyzing body
4. Finish by calling RunUpdateProcessor (usually)

Document fields: id, title, body, genre
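The update chain can be pictured as a list of functions that each receive the document, possibly change it, and pass it on. The Python below is a conceptual model only: `genre_processor`, `make_run_update`, `process_add`, and the keyword rule for assigning a genre are all hypothetical and are not Solr classes.

```python
def genre_processor(doc):
    """Add a genre field derived from the body (toy keyword rule)."""
    body = doc.get("body", "").lower()
    doc["genre"] = "food" if "sushi" in body else "other"
    return doc

def make_run_update(index):
    """Create the final step of the chain, which stores the document."""
    def run_update(doc):
        index.append(dict(doc))
        return doc
    return run_update

def process_add(doc, chain):
    """Pass one document through every processor in order."""
    for processor in chain:
        doc = processor(doc)
    return doc

index = []
chain = [genre_processor, make_run_update(index)]
process_add({"id": "1", "title": "Sushi", "body": "Sushi is very tasty in Japan"}, chain)
```

The stored document now carries a `genre` field that was not present in the original request.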

Page 112: Language support and linguistics in lucene solr & its eco system

[Diagram: the document fields id, title, body and genre pass through the Lucene analyzer chain into the index, one field at a time]

Lucene analyzer chain
1. Fields are analyzed individually
2. No analysis on id
3. Field title being processed
4. Field body being processed
5. Field genre being processed, using a different analyzer chain
6. All fields analyzed

Adding document details
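The per-field analysis walked through on these slides can be sketched as follows. This is a toy model in plain Python, not Lucene's Analyzer API: the three chains and the `FIELD_ANALYZERS` mapping are assumptions chosen only to show that id is left unanalyzed, that title and body share one chain, and that genre uses a different one.

```python
from typing import Callable, Dict, List

Analyzer = Callable[[str], List[str]]


def no_analysis(value: str) -> List[str]:
    """Index the value verbatim as a single term (as for id)."""
    return [value]


def text_chain(value: str) -> List[str]:
    """Toy text chain: split on whitespace, then lowercase each token."""
    return [token.lower() for token in value.split()]


def keyword_lowercase_chain(value: str) -> List[str]:
    """Toy alternative chain: keep the whole value as one lowercased term."""
    return [value.strip().lower()]


# Hypothetical per-field configuration, one analyzer chain per field.
FIELD_ANALYZERS: Dict[str, Analyzer] = {
    "id": no_analysis,
    "title": text_chain,
    "body": text_chain,
    "genre": keyword_lowercase_chain,
}


def analyze_document(doc: Dict[str, str]) -> Dict[str, List[str]]:
    """Analyze each field on its own, with the chain configured for that field."""
    return {field: FIELD_ANALYZERS[field](value) for field, value in doc.items()}


terms = analyze_document({
    "id": "DOC-1",
    "title": "Language Support",
    "body": "Linguistics in Lucene",
    "genre": "Conference Talk",
})
```

Because each field has its own chain, the same input string can produce different terms depending on the field it is indexed into.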

Page 126: Language support and linguistics in lucene solr & its eco system

[Diagram sequence: query → SearchHandler → search components → analysis chain → index → search components → SearchHandler → result]

Search details

Page 133: Language support and linguistics in lucene solr & its eco system

Hands-on: Multi-lingual search with Solr

Page 134: Language support and linguistics in lucene solr & its eco system

Languages out-of-the-box in Solr

Page 135: Language support and linguistics in lucene solr & its eco system

Field types in schema.xml

• text_ar Arabic

• text_bg Bulgarian

• text_ca Catalan

• text_cjk CJK

• text_cz Czech

• text_da Danish

• text_de German

• text_el Greek

• text_es Spanish

• text_eu Basque

• text_fa Farsi

• text_fi Finnish

• text_fr French

• text_ga Irish

• text_gl Galician

• text_hi Hindi

• text_hu Hungarian

• text_hy Armenian

• text_id Indonesian

• text_it Italian

• text_lv Latvian

• text_nl Dutch

• text_no Norwegian

• text_pt Portuguese

• text_ro Romanian

• text_ru Russian

• text_sv Swedish

• text_th Thai

• text_tr Turkish
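One way to use the field types listed above in a multi-language deployment is to pick a type from a document's language code. The sketch below is an assumption-laden illustration in plain Python, not Solr configuration: the helper `field_type_for` is hypothetical, the mapping of zh, ja and ko to text_cjk is an assumption based only on this list, and the fallback name text_general is assumed to exist in the schema.

```python
# Suffixes of the text_* field types named on the slide.
FIELD_TYPE_SUFFIXES = frozenset(
    "ar bg ca cjk cz da de el es eu fa fi fr ga gl hi hu hy id it "
    "lv nl no pt ro ru sv th tr".split()
)

# Language codes whose field type suffix differs from the code itself.
# Assumption: CJK languages share text_cjk; a dedicated field type,
# where a schema provides one, would be the better choice.
LANGUAGE_TO_SUFFIX = {"cs": "cz", "zh": "cjk", "ja": "cjk", "ko": "cjk"}


def field_type_for(language_code: str, fallback: str = "text_general") -> str:
    """Return the text_* field type for a language code, or the fallback."""
    code = language_code.lower()
    suffix = LANGUAGE_TO_SUFFIX.get(code, code)
    if suffix in FIELD_TYPE_SUFFIXES:
        return f"text_{suffix}"
    return fallback


german_type = field_type_for("de")
czech_type = field_type_for("cs")
unknown_type = field_type_for("xx")
```

Note that the field type suffix is not always the ISO 639-1 code: Czech is text_cz although its language code is cs, so an explicit mapping is safer than string concatenation.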

Page 136: Language support and linguistics in lucene solr & its eco system

Field types in schema.xml

Coming soon! (LUCENE-4956)


• text_kr Korean

Page 137: Language support and linguistics in lucene solr & its eco system

NLP eco-system

Page 138: Language support and linguistics in lucene solr & its eco system

Basis Technology
• High-end provider of text analytics software
• Rosette Linguistics Platform (RLP) highlights
  • Language and encoding identification (55 languages and 45 encodings)
  • Segmentation for Chinese, Japanese and Korean
  • De-compounding for German, Dutch, Korean, etc.
  • Lemmatization for a range of languages
  • Part-of-speech tagging for a range of languages
  • Sentence boundary detection
  • Named entity extraction
  • Name indexing, transliteration and matching
• Integrates well with Lucene/Solr

Page 139: Language support and linguistics in lucene solr & its eco system

Apache OpenNLP
• Machine learning toolkit for NLP
  • Implements a range of common and best-practice algorithms
  • Very easy-to-use tools and APIs
• Features and applications
  • Tokenization
  • Sentence segmentation
  • Part-of-speech tagging
  • Named entity recognition
  • Chunking
• Licensing terms
  • Code itself has an Apache License 2.0
  • Some models are available, but licensing terms and F-scores are unclear...
• See LUCENE-2899 for OpenNLP as a Lucene Analyzer (work in progress)

Page 140: Language support and linguistics in lucene solr & its eco system

Hands-on: Basic text processing with OpenNLP

Page 141: Language support and linguistics in lucene solr & its eco system

Other eco-system options

Page 142: Language support and linguistics in lucene solr & its eco system

Summary

Page 143: Language support and linguistics in lucene solr & its eco system

Summary
• Getting languages right is hard
  • Linguistics helps improve search quality
• Linguistics in Lucene and Solr
  • A wide range of languages supported out-of-the-box
  • Considerations to be made on indexing and query side
  • Lucene Analyzers work on a per-field level
  • Solr UpdateRequestProcessors work on the document level
  • Solr has functionality for automatically detecting language
• Linguistics options also available in the eco-system

Page 144: Language support and linguistics in lucene solr & its eco system

Linguistics in the bigger picture

Page 145: Language support and linguistics in lucene solr & its eco system

Linguistics in the bigger picture
• Understand your content and your users’ needs
  • Understand your language and its issues
  • Understand what users want from search
• Do you have issues with recall?
  • Consider synonyms, stemming
  • Consider WordDelimiterFilter, phonetic matching
• Do you have issues with precision?
  • Consider using ANDs instead of ORs
  • Improve content quality? Search fewer fields?
• Is some content more important than other content?
  • Consider boosting content with a boost query
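Two of the suggestions above, ANDs instead of ORs and a boost query, map onto Solr request parameters. The following is a minimal sketch that only builds and decodes a query string with the Python standard library. It assumes the edismax query parser, and the field names in `qf`, the example boost `genre:tutorial^2` and the helper `build_params` are hypothetical.

```python
from typing import Dict, Optional
from urllib.parse import parse_qs, urlencode


def build_params(
    user_query: str,
    require_all_terms: bool,
    boost_query: Optional[str] = None,
) -> Dict[str, str]:
    """Build Solr request parameters that trade recall against precision."""
    params = {
        "q": user_query,
        "defType": "edismax",   # assumption: edismax query parser is used
        "qf": "title body",     # hypothetical fields to search
        # AND requires every term (precision); OR accepts any term (recall).
        "q.op": "AND" if require_all_terms else "OR",
    }
    if boost_query:
        params["bq"] = boost_query  # boost query: raises matching docs in ranking
    return params


query_string = urlencode(
    build_params("language support", True, boost_query="genre:tutorial^2")
)
decoded = parse_qs(query_string)
```

Searching fewer fields, the other precision suggestion, would correspond to shortening the `qf` list in this sketch.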

Page 146: Language support and linguistics in lucene solr & its eco system

Thank you

Jan Høydahl, www.cominvent.com
Thanks for some slide material

Bushra Zawaydeh
Thanks for fun Arabic language lessons

Page 147: Language support and linguistics in lucene solr & its eco system

Example code
• Example code is on GitHub
  • https://github.com/atilika/lucenerevolution-2013
• Get started using
  • git clone git://github.com/atilika/lucenerevolution-2013.git
  • less lucenerevolution-2013/README.md
• Contact us if you have questions
  • [email protected]

Page 148: Language support and linguistics in lucene solr & its eco system

Q & A

Page 149: Language support and linguistics in lucene solr & its eco system

ありがとうございました

Thank you very much

شكرا جزيلا

Vielen Dank

Merci beaucoup

Page 150: Language support and linguistics in lucene solr & its eco system

CONFERENCE PARTY
The Tipsy Crow: 770 5th Ave
Starts after Stump The Chump
Your conference badge gets you in the door

TOMORROW
Breakfast starts at 7:30
Keynotes start at 8:30
