Japanese Linguistics in Lucene and Solr
DESCRIPTION
Presented by Christian Moen, Founder and CEO, Atilika Inc. See conference video: http://www.lucidimagination.com/devzone/events/conferences/lucene-revolution-2012
This talk gives an introduction to searching Japanese text and an overview of the new Japanese search features available out-of-the-box in Lucene and Solr. Atilika developed a new Japanese morphological analyzer (Kuromoji) in 2010 when they couldn't find any easy-to-use, high-quality morphological analyzer in Java that was good for both search and other Japanese NLP tasks. Kuromoji was built with the goal of donating it to the Apache Software Foundation in order to make Japanese work well for both Lucene and Solr, and is now a standard part of these software packages.
TRANSCRIPT
![Page 2: Japanese Linguistics in Lucene and Solr](https://reader033.vdocuments.net/reader033/viewer/2022052323/558c9a34d8b42a72018b462e/html5/thumbnails/2.jpg)
About me
• MSc. in computer science, University of Oslo, Norway
• Worked with search at FAST (now Microsoft) for 10 years
  • 5 years in R&D building FAST Enterprise Search Platform in Oslo, Norway
  • 5 years in Services doing solution delivery, sales, etc. in Tokyo, Japan
• Founded アティリカ株式会社 (Atilika Inc.) in 2009
  • We help companies innovate using search technologies and good ideas
  • We know information retrieval, natural language processing and big data
  • We are based in Tokyo, but we have clients everywhere
• Newbie Lucene & Solr Committer
  • Mostly been working on Japanese language support (Kuromoji) so far
• Please write to me at [email protected] or [email protected]
![Page 4: Japanese Linguistics in Lucene and Solr](https://reader033.vdocuments.net/reader033/viewer/2022052323/558c9a34d8b42a72018b462e/html5/thumbnails/4.jpg)
Today’s topics
• Japanese 101 - ordering beer and toasting
• Japanese language processing
• Japanese features in Lucene/Solr
![Page 7: Japanese Linguistics in Lucene and Solr](https://reader033.vdocuments.net/reader033/viewer/2022052323/558c9a34d8b42a72018b462e/html5/thumbnails/7.jpg)
Japanese 101
![Page 9: Japanese Linguistics in Lucene and Solr](https://reader033.vdocuments.net/reader033/viewer/2022052323/558c9a34d8b42a72018b462e/html5/thumbnails/9.jpg)
ビールください
bi-ru kudasai
A beer, please
![Page 11: Japanese Linguistics in Lucene and Solr](https://reader033.vdocuments.net/reader033/viewer/2022052323/558c9a34d8b42a72018b462e/html5/thumbnails/11.jpg)
ありがとうございます!
Thank you very much!
arigatō gozaimasu!
![Page 13: Japanese Linguistics in Lucene and Solr](https://reader033.vdocuments.net/reader033/viewer/2022052323/558c9a34d8b42a72018b462e/html5/thumbnails/13.jpg)
乾杯!
kanpai!
Cheers!
![Page 15: Japanese Linguistics in Lucene and Solr](https://reader033.vdocuments.net/reader033/viewer/2022052323/558c9a34d8b42a72018b462e/html5/thumbnails/15.jpg)
JR新宿駅の近くにビールを飲みに行こうか?
JR Shinjuku eki no chikaku ni bi-ru o nomi ni ikō ka?
Shall we go for a beer near JR Shinjuku station?
![Page 21: Japanese Linguistics in Lucene and Solr](https://reader033.vdocuments.net/reader033/viewer/2022052323/558c9a34d8b42a72018b462e/html5/thumbnails/21.jpg)
JR新宿駅の近くにビールを飲みに行こうか?
• Romaji (ローマ字): Latin characters (26+), used for proper nouns, etc. (here: JR)
• Kanji (漢字): Chinese characters (50,000+), used for stems & proper nouns (here: 新宿, 駅, 近, 飲, 行)
• Hiragana (ひらがな): phonetic script (~50), used for inflections & particles (here: の, く, に, を, み, こうか)
• Katakana (カタカナ): phonetic script (~50), typically used for loan words (here: ビール)
![Page 24: Japanese Linguistics in Lucene and Solr](https://reader033.vdocuments.net/reader033/viewer/2022052323/558c9a34d8b42a72018b462e/html5/thumbnails/24.jpg)
JR新宿駅の近くにビールを飲みに行こうか?
What are the words in this sentence?
Words are implicit in Japanese: there is no white space that separates them.
![Page 26: Japanese Linguistics in Lucene and Solr](https://reader033.vdocuments.net/reader033/viewer/2022052323/558c9a34d8b42a72018b462e/html5/thumbnails/26.jpg)
JR新宿駅の近くにビールを飲みに行こうか?
How do we index this for search, then?
We need to segment text into tokens first.
![Page 27: Japanese Linguistics in Lucene and Solr](https://reader033.vdocuments.net/reader033/viewer/2022052323/558c9a34d8b42a72018b462e/html5/thumbnails/27.jpg)
Two major approaches for segmentation
1. n-gramming
2. morphological analysis (statistical approach)
![Page 35: Japanese Linguistics in Lucene and Solr](https://reader033.vdocuments.net/reader033/viewer/2022052323/558c9a34d8b42a72018b462e/html5/thumbnails/35.jpg)
n-gramming (n=2)
JR新宿駅の近くにビールを飲みに行こうか?
Shall we go for a beer near JR Shinjuku station?
Bigrams (n=2): JR, R新, 新宿, 宿駅, 駅の, の近, 近く, ...
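The sliding window on this slide can be reproduced in a few lines. The following is a minimal Python sketch of character n-gramming, not the tokenizer Lucene ships; the function name `char_ngrams` is made up for illustration.

```python
def char_ngrams(text, n=2):
    """Return every overlapping character n-gram of text, left to right."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


sentence = "JR新宿駅の近くにビールを飲みに行こうか?"
bigrams = char_ngrams(sentence)

# The first seven bigrams are the ones shown on the slide.
print(bigrams[:7])
# The 21-character sentence produces 20 overlapping bigrams.
print(len(bigrams))
```

Every character except the first and last ends up in two terms, which is why n-gramming produces many terms per document.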
![Page 47: Japanese Linguistics in Lucene and Solr](https://reader033.vdocuments.net/reader033/viewer/2022052323/558c9a34d8b42a72018b462e/html5/thumbnails/47.jpg)
Problems with n-gramming
Bigrams: JR ●, R新 ×, 新宿 ●, 宿駅 ×, 駅の ×, の近 ×, 近く ●, ...
(● = corresponds to a word in the sentence, × = does not)
Change of semantics: 宿駅 means ‘post town’, ‘relay station’ or ‘stage’
• Does not preserve meaning well and often changes semantics
  • Impacts ranking and search precision (many false positives)
• Also generates many terms per document or query
  • Impacts index size and performance
• Still sometimes appropriate for certain search applications
  • Compliance, e-commerce with special product names, ...
![Page 53: Japanese Linguistics in Lucene and Solr](https://reader033.vdocuments.net/reader033/viewer/2022052323/558c9a34d8b42a72018b462e/html5/thumbnails/53.jpg)
Morphological analysis
JR新宿駅の近くにビールを飲みに行こうか?
Shall we go for a beer near JR Shinjuku station?
[Slide figure: the sentence with 14 ● markers above it showing the token boundaries]
• Tokens reflect what a Japanese speaker considers to be words
• Machine-learned statistical approach
• Conditional Random Fields (CRFs) decoded using Viterbi
• Also does part-of-speech tagging, readings for kanji, etc.
• Several statistical models available with high accuracy (F > 0.97)
• Models/dictionaries are available as IPADIC, UniDic, ...
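To give a feel for the "decoded using Viterbi" point, here is a heavily simplified Python sketch: a dynamic program that picks the cheapest path through a lattice of dictionary words. The word costs, the `unknown_cost` penalty and the function name are invented for illustration. A real analyzer such as Kuromoji takes its costs from a trained model and also scores the connection between adjacent tokens, which this sketch leaves out.

```python
def viterbi_segment(text, word_costs, unknown_cost=10):
    """Return the minimum-cost segmentation of text (word costs only)."""
    n = len(text)
    best = [float("inf")] * (n + 1)  # best[i]: cheapest cost to segment text[:i]
    back = [0] * (n + 1)             # back[i]: start index of the last token on that path
    best[0] = 0
    max_len = max(len(word) for word in word_costs)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            if piece in word_costs:
                cost = word_costs[piece]
            elif len(piece) == 1:
                cost = unknown_cost  # fall back to a single unknown character
            else:
                continue
            if best[start] + cost < best[end]:
                best[end] = best[start] + cost
                back[end] = start
    # Follow the back-pointers from the end of the text to recover the tokens.
    tokens = []
    i = n
    while i > 0:
        tokens.append(text[back[i]:i])
        i = back[i]
    return tokens[::-1]


# Hypothetical costs: lower means more likely. 宿駅 is a real word, but
# choosing it would strand 新 as an expensive unknown character.
toy_costs = {"JR": 3, "新宿": 3, "駅": 3, "宿駅": 6, "の": 1, "近く": 3, "に": 1}
segmented = viterbi_segment("JR新宿駅の近くに", toy_costs)
print(segmented)
```

With these toy costs the cheapest path keeps 新宿 and 駅 apart, avoiding the 宿駅 reading that n-gramming stumbles into.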
![Page 54: Japanese Linguistics in Lucene and Solr](https://reader033.vdocuments.net/reader033/viewer/2022052323/558c9a34d8b42a72018b462e/html5/thumbnails/54.jpg)
How does this actually work?
![Page 55: Japanese Linguistics in Lucene and Solr](https://reader033.vdocuments.net/reader033/viewer/2022052323/558c9a34d8b42a72018b462e/html5/thumbnails/55.jpg)
Demo
![Page 56: Japanese Linguistics in Lucene and Solr](https://reader033.vdocuments.net/reader033/viewer/2022052323/558c9a34d8b42a72018b462e/html5/thumbnails/56.jpg)
Japanese support in Lucene and Solr
![Page 62: Japanese Linguistics in Lucene and Solr](https://reader033.vdocuments.net/reader033/viewer/2022052323/558c9a34d8b42a72018b462e/html5/thumbnails/62.jpg)
Japanese in Lucene/Solr
• New feature in Lucene/Solr 3.6!
• Available out-of-the-box!
• Easy to use with reasonable defaults!
• Provides sophisticated Japanese linguistics!
• Customisable!
![Page 65: Japanese Linguistics in Lucene and Solr](https://reader033.vdocuments.net/reader033/viewer/2022052323/558c9a34d8b42a72018b462e/html5/thumbnails/65.jpg)
How do we use it?
• Lucene: use JapaneseAnalyzer
• Solr: use field type “text_ja” in example schema.xml
![Page 66: Japanese Linguistics in Lucene and Solr](https://reader033.vdocuments.net/reader033/viewer/2022052323/558c9a34d8b42a72018b462e/html5/thumbnails/66.jpg)
Demo
![Page 73: Japanese Linguistics in Lucene and Solr](https://reader033.vdocuments.net/reader033/viewer/2022052323/558c9a34d8b42a72018b462e/html5/thumbnails/73.jpg)
Feature summary / text_ja analyzer chain
• JapaneseTokenizer: Segments Japanese text into tokens with very high accuracy
  • Token attributes for part-of-speech, base form, readings, etc.
  • Compound segmentation with compound synonyms
  • Segmentation is customisable using user dictionaries
• JapaneseBaseFormFilter: Adjective and verb lemmatisation (by reduction)
• JapanesePartOfSpeechStopFilter: Stop-words removal based on part-of-speech tags (see example/solr/conf/lang/stoptags_ja.txt)
• CJKWidthFilter: Character width normalisation (fast Unicode NFKC subset)
• StopFilter: Stop-words removal (see example/solr/conf/lang/stopwords_ja.txt)
• JapaneseKatakanaStemFilter: Normalises common katakana spelling variations
• LowerCaseFilter: Lowercases
![Page 74: Japanese Linguistics in Lucene and Solr](https://reader033.vdocuments.net/reader033/viewer/2022052323/558c9a34d8b42a72018b462e/html5/thumbnails/74.jpg)
Feature details
![Page 78: Japanese Linguistics in Lucene and Solr](https://reader033.vdocuments.net/reader033/viewer/2022052323/558c9a34d8b42a72018b462e/html5/thumbnails/78.jpg)
Compound nouns
How do we deal with compound nouns?

| Japanese | English |
|---|---|
| 関西国際空港 | Kansai International Airport |
| シニアソフトウェアエンジニア | Senior Software Engineer |

These are one word in Japanese, so searching for 空港 (airport) doesn’t match.
We need to segment the compounds, too.
![Page 82: Japanese Linguistics in Lucene and Solr](https://reader033.vdocuments.net/reader033/viewer/2022052323/558c9a34d8b42a72018b462e/html5/thumbnails/82.jpg)
Compound segmentation
• 関西国際空港 (Kansai International Airport) → 関西 (Kansai) + 国際 (International) + 空港 (Airport)
• シニアソフトウェアエンジニア (Senior Software Engineer) → シニア (Senior) + ソフトウェア (Software) + エンジニア (Engineer)
We are using a heuristic to implement this.
![Page 86: Japanese Linguistics in Lucene and Solr](https://reader033.vdocuments.net/reader033/viewer/2022052323/558c9a34d8b42a72018b462e/html5/thumbnails/86.jpg)
Compound synonym tokens

| Position 1 | Position 2 | Position 3 |
|---|---|---|
| 関西 | 国際 | 空港 |
| 関西国際空港 (spans all three positions) | | |

• Segment the compound into its parts
  • Good for recall: we can also search and match 空港 (airport)
• We keep the compound itself as a synonym
  • Good for precision with an exact hit because of IDF
• Approach benefits both precision and recall for overall good ranking
• JapaneseTokenizer actually returns a graph of tokens
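The position layout on this slide can be modelled as tokens carrying a start position and a position length. The Python sketch below is an illustration of that idea only: the tuple layout and the function name `compound_tokens` are assumptions, and the order in which the real JapaneseTokenizer emits its tokens may differ.

```python
def compound_tokens(compound, parts):
    """Emit (term, position, position_length) for each part plus the whole compound."""
    assert "".join(parts) == compound, "parts must concatenate to the compound"
    tokens = []
    for position, part in enumerate(parts, start=1):
        tokens.append((part, position, 1))
        if position == 1:
            # The compound is a synonym that starts where its first part
            # starts and spans all of the parts.
            tokens.append((compound, 1, len(parts)))
    return tokens


tokens = compound_tokens("関西国際空港", ["関西", "国際", "空港"])
indexed_terms = {term for term, _, _ in tokens}

print(tokens)
# Both the part and the full compound are searchable terms.
print("空港" in indexed_terms, "関西国際空港" in indexed_terms)
```

Because the compound covers positions 1 to 3 while the parts cover one position each, the token stream is a small graph rather than a flat sequence.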
![Page 88: Japanese Linguistics in Lucene and Solr](https://reader033.vdocuments.net/reader033/viewer/2022052323/558c9a34d8b42a72018b462e/html5/thumbnails/88.jpg)
Character width normalisation
How do we deal with character widths?

| Half-width・半角 | Full-width・全角 |
|---|---|
| Lucene | Ｌｕｃｅｎｅ |
| ｶﾀｶﾅ | カタカナ |
| 123 | １２３ |

| Input text | Ｌｕｃｅｎｅ | ｶﾀｶﾅ | １２３ |
|---|---|---|---|
| CJKWidthFilter | Lucene | カタカナ | 123 |
| Resulting width | half-width | full-width | half-width |

Use CJKWidthFilter to normalise them (Unicode NFKC subset).
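The effect of width normalisation can be tried out with Python's standard `unicodedata` module. Note the assumption: this sketch applies full Unicode NFKC, which is a superset of the width folding described for CJKWidthFilter, so it may change characters that the Lucene filter would leave alone.

```python
import unicodedata


def normalise_width(text):
    """Fold full-width Latin/digits to half-width and half-width kana to full-width."""
    # Full NFKC is used here as a stand-in for the NFKC subset on the slide.
    return unicodedata.normalize("NFKC", text)


mixed = "Ｌｕｃｅｎｅ ｶﾀｶﾅ １２３"
print(normalise_width(mixed))
```

After normalisation, a query typed in either width matches text indexed in either width.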
![Page 90: Japanese Linguistics in Lucene and Solr](https://reader033.vdocuments.net/reader033/viewer/2022052323/558c9a34d8b42a72018b462e/html5/thumbnails/90.jpg)
Katakana end-vowel stemming
A common spelling variation in katakana is a long-vowel mark (ー) at the end of a word.

| English | Japanese spelling variations |
|---|---|
| manager | マネージャー, マネージャ, マネジャー |

| Input text | コピー | マネージャー | マネージャ | マネジャー |
|---|---|---|---|---|
| JapaneseKatakanaStemFilter | コピー | マネージャ | マネージャ | マネジャ |
| Meaning | copy | manager | manager | “manager” |

We use JapaneseKatakanaStemFilter to normalise/stem the end vowel for long terms.
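The stemming rule in the table above can be sketched in Python as follows. This is an illustration of the idea rather than the Lucene implementation; in particular the `minimum_length=4` threshold for what counts as a "long term" is an assumption chosen so that コピー is left untouched, as on the slide.

```python
PROLONGED_SOUND_MARK = "\u30fc"  # ー, the katakana long-vowel mark


def is_katakana(term):
    """True if every character lies in the Unicode katakana block (U+30A0..U+30FF)."""
    return all("\u30a0" <= ch <= "\u30ff" for ch in term)


def stem_katakana(term, minimum_length=4):
    """Drop a trailing long-vowel mark from sufficiently long katakana terms."""
    if (len(term) >= minimum_length
            and is_katakana(term)
            and term.endswith(PROLONGED_SOUND_MARK)):
        return term[:-1]
    return term


terms = ["コピー", "マネージャー", "マネージャ", "マネジャー"]
stemmed = [stem_katakana(term) for term in terms]
print(stemmed)
```

Only the trailing mark is removed, so マネージャー and マネージャ collapse to the same term while マネジャー becomes マネジャ and stays distinct.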
![Page 94: Japanese Linguistics in Lucene and Solr](https://reader033.vdocuments.net/reader033/viewer/2022052323/558c9a34d8b42a72018b462e/html5/thumbnails/94.jpg)
Lemmatisation
Japanese adjectives and verbs are highly inflected. How do we deal with that?
Dictionary form: 買う (kau, to buy)
Inflected forms (not exhaustive):
買いなさい, 買いなさるな, 買いましたら, 買いましたり, 買いまして, 買いましょう, 買います, 買いますまい, 買いませば, 買いません, 買いませんで, 買いませんでした,
買いませんでしたら, 買いませんでしたり, 買いませんなら, 買うだろう, 買うでしょう, 買うな, 買うまい, 買え, 買えない, 買えば, 買えます, 買えません,
買える, 買おう, 買った, 買ったら, 買ったり, 買って, 買わせない, 買わせます, 買わせません, 買わせられない, 買わせられます, 買わせられません,
買わせられる, 買わせる, 買わない, 買わないだろう, 買わないで, 買わないでしょう, 買わなかった, 買わなかったら, 買わなかったり, 買わなければ, 買われない, 買われます
Use JapaneseBaseFormFilter to normalise inflected adjectives and verbs to dictionary form (lemmatisation by reduction).
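"Lemmatisation by reduction" means the dictionary form is looked up rather than derived by stripping suffixes with rules. The Python sketch below illustrates this with hand-written sample data: the `(surface, base_form)` tuples stand in for the base-form attribute that the tokenizer attaches to tokens, and both the segmentation of 買いました and the function name are assumptions made for the example.

```python
def to_base_forms(tokens):
    """Replace each surface form with its dictionary form when one is recorded."""
    return [base if base is not None else surface for surface, base in tokens]


# Hypothetical analysis of ビールを買いました ("bought a beer"):
# (surface form, dictionary form or None when the token is not inflected)
analysed = [
    ("ビール", None),
    ("を", None),
    ("買い", "買う"),
    ("まし", "ます"),
    ("た", None),
]
lemmas = to_base_forms(analysed)
print(lemmas)
```

Indexing 買う instead of 買い lets a query for any inflected form of the verb match the document.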
![Page 95: Japanese Linguistics in Lucene and Solr](https://reader033.vdocuments.net/reader033/viewer/2022052323/558c9a34d8b42a72018b462e/html5/thumbnails/95.jpg)
User dictionaries
• Own dictionaries can be used for ad hoc segmentation, i.e. to override the default model
• File format is simple and there’s no need to assign weights, etc. before using them
• Example custom dictionary:
# Custom segmentation and POS entry for long entries
関西国際空港,関西 国際 空港,カンサイ コクサイ クウコウ,カスタム名詞
# Custom reading and POS for former sumo wrestler Asashoryu
朝青龍,朝青龍,アサショウリュウ,カスタム人名
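The four comma-separated columns of the example above (surface form, space-separated segmentation, space-separated readings, part-of-speech tag) are simple enough to read with plain string handling. The Python sketch below parses the two entries from the slide; the function name and the dictionary layout it returns are made up for illustration and are not part of Lucene.

```python
def parse_user_dictionary(text):
    """Parse user-dictionary lines into {surface: segments, readings, pos}."""
    entries = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        surface, segmentation, readings, pos = line.split(",")
        entries[surface] = {
            "segments": segmentation.split(" "),
            "readings": readings.split(" "),
            "pos": pos,
        }
    return entries


sample = """\
# Custom segmentation and POS entry for long entries
関西国際空港,関西 国際 空港,カンサイ コクサイ クウコウ,カスタム名詞
# Custom reading and POS for former sumo wrestler Asashoryu
朝青龍,朝青龍,アサショウリュウ,カスタム人名
"""
user_entries = parse_user_dictionary(sample)
print(user_entries["関西国際空港"]["segments"])
```

The part-of-speech tags are free-form labels: カスタム名詞 reads "custom noun" and カスタム人名 reads "custom person name".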
![Page 96: Japanese Linguistics in Lucene and Solr](https://reader033.vdocuments.net/reader033/viewer/2022052323/558c9a34d8b42a72018b462e/html5/thumbnails/96.jpg)
Japanese focus in 4.0
• Improvements in JapaneseTokenizer
  • Improved search mode for katakana compounds
  • Improved unknown word segmentation
  • Some performance improvements
• CharFilters for various character normalisations
  • Dates and numbers
  • Repetition marks (odoriji)
• Japanese spell-checker
  • Robert and Koji almost got this into 3.6, but it was postponed because API changes were necessary
![Page 97: Japanese Linguistics in Lucene and Solr](https://reader033.vdocuments.net/reader033/viewer/2022052323/558c9a34d8b42a72018b462e/html5/thumbnails/97.jpg)
Acknowledgements
• Robert Muir: Thanks for the heavy lifting integrating Kuromoji into Lucene, for always reviewing my patches quickly, and for the friendly help
• Michael McCandless: Thanks for streaming Viterbi and synonym compounds!
• Uwe Schindler: Thanks for performance improvements + being the policeman
• Simon Willnauer: Thanks for doing the Kuromoji code donation process so well
• Gaute Lambertsen & Gerry Hocks: Thanks for presentation feedback and being great colleagues
![Page 98: Japanese Linguistics in Lucene and Solr](https://reader033.vdocuments.net/reader033/viewer/2022052323/558c9a34d8b42a72018b462e/html5/thumbnails/98.jpg)
Q & A
![Page 99: Japanese Linguistics in Lucene and Solr](https://reader033.vdocuments.net/reader033/viewer/2022052323/558c9a34d8b42a72018b462e/html5/thumbnails/99.jpg)
ありがとうございました!
Thank you very much!
arigatō gozaimashita!