Takahiko Ito @Lucene Revolution 2011
![Page 1: Takahiko Ito @Lucene Revolution 2011](https://reader035.vdocuments.net/reader035/viewer/2022062502/568c4aa81a28ab491699136f/html5/thumbnails/1.jpg)
Solr Cluster installation tool "Anuenue" and "Did You Mean?" for Japanese
Takahiko Ito
mixi, Inc.
![Page 2: Takahiko Ito @Lucene Revolution 2011](https://reader035.vdocuments.net/reader035/viewer/2022062502/568c4aa81a28ab491699136f/html5/thumbnails/2.jpg)
mixi?
• One of the largest social networking services in Japan.
• Many services to promote communication among users: blog, news, game platform, etc.
• Most of the services come with search.
• 15M monthly active users
![Page 3: Takahiko Ito @Lucene Revolution 2011](https://reader035.vdocuments.net/reader035/viewer/2022062502/568c4aa81a28ab491699136f/html5/thumbnails/3.jpg)
Our current (urgent) project…
• Replace in-house search engines with an up-to-date search platform!
• We have selected Apache Solr as the search platform.
• We created a simple OSS package (Anuenue) which wraps Solr.
Project URL: http://code.google.com/p/anuenue-wrapper/
![Page 4: Takahiko Ito @Lucene Revolution 2011](https://reader035.vdocuments.net/reader035/viewer/2022062502/568c4aa81a28ab491699136f/html5/thumbnails/4.jpg)
Reason why we made Anuenue
Deployment and daily operation of a Solr search cluster are a bit difficult for ordinary engineers.
• We need to edit the configuration files of every Solr instance individually.
• Commands for whole clusters are not provided.
  • We need to write client commands ourselves.
  • Hadoop, by contrast, provides utility commands for clusters, e.g., start-all.sh (start processes), fsck (check all discs), balancer (rebalance the data blocks).
![Page 5: Takahiko Ito @Lucene Revolution 2011](https://reader035.vdocuments.net/reader035/viewer/2022062502/568c4aa81a28ab491699136f/html5/thumbnails/5.jpg)
What does Anuenue provide?
• Handy configuration of search clusters
• Commands for clusters
  • Simple commands (post, delete, update, commit, etc.)
  • Start and stop commands for processes in the cluster
• Japanese support
  • Implementation of Japanese Did-You-Mean facilities
  • Japanese tokenizers (Sen and Kuromoji)
![Page 6: Takahiko Ito @Lucene Revolution 2011](https://reader035.vdocuments.net/reader035/viewer/2022062502/568c4aa81a28ab491699136f/html5/thumbnails/6.jpg)
Today's Topics
• Anuenue
  • Handy configuration of search clusters
  • Commands for search clusters
• Did-You-Mean facilities for Japanese queries
  • Common problem in Did-You-Mean implementation
  • Mining a Japanese Did-You-Mean dictionary from query log data
![Page 7: Takahiko Ito @Lucene Revolution 2011](https://reader035.vdocuments.net/reader035/viewer/2022062502/568c4aa81a28ab491699136f/html5/thumbnails/7.jpg)
Cluster configuration with Anuenue
• Cluster setup is done with a special configuration file.
• Anuenue assigns one or more roles to each instance.
  • Roles are the functions in a cluster.
  • Anuenue supports three roles (Master, Slave, Merger).
![Page 8: Takahiko Ito @Lucene Revolution 2011](https://reader035.vdocuments.net/reader035/viewer/2022062502/568c4aa81a28ab491699136f/html5/thumbnails/8.jpg)
Role: master
• Indexes input data.
• NOTE: Anuenue provides a command to distribute the input data across the master instances (building Solr shard indexes).
[Diagram: Input Data → Master-1, Master-2, Master-3 (build shard indexes)]
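As a rough illustration of what distributing the input data across master instances could look like, the Python sketch below assigns each document to a master by hashing its id. The hashing scheme, the `assign_master` and `partition` helpers, and the document ids are assumptions made for illustration; the slides do not say how Anuenue actually splits the data.

```python
import zlib

MASTERS = ["Master-1", "Master-2", "Master-3"]  # instance names from the diagram


def assign_master(doc_id: str, masters=MASTERS) -> str:
    """Pick a master with a stable hash of the document id (hypothetical scheme)."""
    return masters[zlib.crc32(doc_id.encode("utf-8")) % len(masters)]


def partition(doc_ids, masters=MASTERS) -> dict:
    """Group document ids by the master that would index them."""
    shards = {master: [] for master in masters}
    for doc_id in doc_ids:
        shards[assign_master(doc_id, masters)].append(doc_id)
    return shards


if __name__ == "__main__":
    for master, ids in partition(f"doc-{i}" for i in range(10)).items():
        print(master, ids)
```

A stable hash keeps a given document on the same shard across runs, which matters if documents are later updated or deleted.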
![Page 9: Takahiko Ito @Lucene Revolution 2011](https://reader035.vdocuments.net/reader035/viewer/2022062502/568c4aa81a28ab491699136f/html5/thumbnails/9.jpg)
Role: slave
Has three functions:
1. Copy (replicate) the index from a master
2. Accept queries from mergers and search its own index
3. Return the results to the merger instance
[Diagram: Input Data → Master-1, Master-2 (index input data) → Slave-1, Slave-2 (replicate index); Merger-1 submits queries to the slaves]
![Page 10: Takahiko Ito @Lucene Revolution 2011](https://reader035.vdocuments.net/reader035/viewer/2022062502/568c4aa81a28ab491699136f/html5/thumbnails/10.jpg)
Role: merger
• Forwards queries from clients to slaves.
  • Note: clients need not know the slave instances (the merger adds the 'shards' parameter listing the slave instances).
• Merges the results from all the slave instances and returns the merged results.
[Diagram: Client-1, Client-2 submit queries → Merger forwards queries → Slave-1, Slave-2]
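To make the forwarding step concrete, here is a minimal Python sketch of how a merger could extend a client's query with Solr's `shards` parameter. The slave addresses, the `/solr` path, and the `build_slave_query` helper are assumptions for illustration and are not taken from Anuenue's code.

```python
from urllib.parse import urlencode

# Hypothetical host:port pairs for the slave instances.
SLAVES = ["dd:8983", "ee:8983"]


def build_slave_query(client_params: dict, slaves=SLAVES) -> str:
    """Return a query string made of the client's parameters plus `shards`."""
    params = dict(client_params)  # leave the caller's dict untouched
    # Solr expects a comma-separated list of shard addresses.
    params["shards"] = ",".join(f"{hostport}/solr" for hostport in slaves)
    return urlencode(params)


if __name__ == "__main__":
    print(build_slave_query({"q": "python", "rows": 10}))
```

Because the merger fills in `shards`, clients only need the merger's address, and slaves can be added or removed without touching client code.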
![Page 11: Takahiko Ito @Lucene Revolution 2011](https://reader035.vdocuments.net/reader035/viewer/2022062502/568c4aa81a28ab491699136f/html5/thumbnails/11.jpg)
Example: Anuenue cluster
• The cluster consists of five machines.
• Each has one Anuenue instance.
• Instances
  • Merger: aa
  • Master: bb, cc
  • Slave: dd, ee
[Diagram: Input Data → masters bb, cc (index input data) → slaves dd, ee (replicate index); Client-1, Client-2 → merger aa (forward queries) → slaves]
![Page 12: Takahiko Ito @Lucene Revolution 2011](https://reader035.vdocuments.net/reader035/viewer/2022062502/568c4aa81a28ab491699136f/html5/thumbnails/12.jpg)
How to assign roles to instances?
Edit the cluster configuration file, anuenue-nodes.xml.
• Add three elements (mergers, slaves and masters).
• In each element, add the information (machine name and port number) for one or more instances.
![Page 13: Takahiko Ito @Lucene Revolution 2011](https://reader035.vdocuments.net/reader035/viewer/2022062502/568c4aa81a28ab491699136f/html5/thumbnails/13.jpg)
Configuration example
Case: there is one merger instance on machine aa (port 7000)

    <mergers>
      <merger>
        <host>aa</host>
        <port>7000</port>
      </merger>
    </mergers>
![Page 14: Takahiko Ito @Lucene Revolution 2011](https://reader035.vdocuments.net/reader035/viewer/2022062502/568c4aa81a28ab491699136f/html5/thumbnails/14.jpg)
Specify the index to replicate

    <masters>
      <master iname="master1">
        <host>aaaa</host>
        <port>8983</port>
      </master>
    </masters>
    <slaves>
      <slave>
        <host>bbbb</host>
        <port>8983</port>
        <replicate>master1</replicate>
      </slave>
    </slaves>

• Name the master instance with the iname attribute.
• Specify the master instance to copy the index from by adding a replicate element.
![Page 15: Takahiko Ito @Lucene Revolution 2011](https://reader035.vdocuments.net/reader035/viewer/2022062502/568c4aa81a28ab491699136f/html5/thumbnails/15.jpg)
Example: simple cluster settings
[Diagram: Input Data → master bb (index input data) → slave cc (replicate index); Client-1, Client-2 → merger aa (forward queries)]

    <mergers>
      <merger>
        <host>aa</host>
        <port>8983</port>
      </merger>
    </mergers>
    <masters>
      <master iname="master1">
        <host>bb</host>
        <port>8983</port>
      </master>
    </masters>
    <slaves>
      <slave>
        <host>cc</host>
        <port>8983</port>
        <replicate>master1</replicate>
      </slave>
    </slaves>
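The Python sketch below reads a configuration like this one and lists the instances per role, as a way to check the role assignment. The slides show only the three role elements, so the enclosing `<nodes>` root and the `read_roles` helper are assumptions and not part of Anuenue.

```python
import xml.etree.ElementTree as ET

# The <nodes> root element is an assumption; the slides show only its children.
CONFIG = """
<nodes>
  <mergers>
    <merger><host>aa</host><port>8983</port></merger>
  </mergers>
  <masters>
    <master iname="master1"><host>bb</host><port>8983</port></master>
  </masters>
  <slaves>
    <slave><host>cc</host><port>8983</port><replicate>master1</replicate></slave>
  </slaves>
</nodes>
"""


def read_roles(xml_text: str) -> dict:
    """Map each role name to the list of instances configured for it."""
    root = ET.fromstring(xml_text)
    roles = {}
    for group, tag in (("mergers", "merger"), ("masters", "master"), ("slaves", "slave")):
        roles[tag] = [
            {
                "host": node.findtext("host"),
                "port": int(node.findtext("port")),
                "iname": node.get("iname"),               # set on masters only
                "replicate": node.findtext("replicate"),  # set on slaves only
            }
            for node in root.findall(f"{group}/{tag}")
        ]
    return roles


if __name__ == "__main__":
    for role, instances in read_roles(CONFIG).items():
        print(role, instances)
```

A check like this makes it easy to confirm that every slave's `replicate` value matches the `iname` of some master before the cluster is started.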
![Page 16: Takahiko Ito @Lucene Revolution 2011](https://reader035.vdocuments.net/reader035/viewer/2022062502/568c4aa81a28ab491699136f/html5/thumbnails/16.jpg)
Cluster setup with Anuenue
Cluster setup is flexible and supports various types of search cluster.
For example…
![Page 17: Takahiko Ito @Lucene Revolution 2011](https://reader035.vdocuments.net/reader035/viewer/2022062502/568c4aa81a28ab491699136f/html5/thumbnails/17.jpg)
Assign multiple roles
[Diagram: a single instance indexes the input data and receives the queries submitted by Client1 and Client2]
![Page 18: Takahiko Ito @Lucene Revolution 2011](https://reader035.vdocuments.net/reader035/viewer/2022062502/568c4aa81a28ab491699136f/html5/thumbnails/18.jpg)
Large clusters to handle huge data with high QPS
[Diagram: Input Data → Master1 to Master6 → Slave1 to Slave6 → Merger1 to Merger3 ← Client1, Client2, Client3, … ClientN]
![Page 19: Takahiko Ito @Lucene Revolution 2011](https://reader035.vdocuments.net/reader035/viewer/2022062502/568c4aa81a28ab491699136f/html5/thumbnails/19.jpg)
After setting up the cluster
We can make use of commands for clusters. Anuenue provides
• start / stop commands
• commands to manipulate the index
![Page 20: Takahiko Ito @Lucene Revolution 2011](https://reader035.vdocuments.net/reader035/viewer/2022062502/568c4aa81a28ab491699136f/html5/thumbnails/20.jpg)
Start and stop clusters
Users can start / stop clusters with a single command (anuenue-distdaemon.sh).
Usage: $ sh bin/anuenue-distdaemon.sh [start|stop]
![Page 21: Takahiko Ito @Lucene Revolution 2011](https://reader035.vdocuments.net/reader035/viewer/2022062502/568c4aa81a28ab491699136f/html5/thumbnails/21.jpg)
Simple commands for clusters
• Anuenue also provides basic commands ('post', 'delete', 'commit', 'optimize' and 'update') for search clusters.
• The commands are multi-threaded.
E.g., $ sh bin/anuenue-distcommands.sh post -arg inputDir
![Page 22: Takahiko Ito @Lucene Revolution 2011](https://reader035.vdocuments.net/reader035/viewer/2022062502/568c4aa81a28ab491699136f/html5/thumbnails/22.jpg)
Today's Topics
• Anuenue
  • Handy configuration of search clusters
  • Commands for search clusters
• Did-You-Mean facilities for Japanese queries
  • Common problem in Did-You-Mean implementation
  • Mining a Japanese Did-You-Mean dictionary from query log data
![Page 23: Takahiko Ito @Lucene Revolution 2011](https://reader035.vdocuments.net/reader035/viewer/2022062502/568c4aa81a28ab491699136f/html5/thumbnails/23.jpg)
What is a Did-You-Mean service?
• Suggests the correct spelling when users submit queries with mistakes
• Increases the usability of the search service
![Page 24: Takahiko Ito @Lucene Revolution 2011](https://reader035.vdocuments.net/reader035/viewer/2022062502/568c4aa81a28ab491699136f/html5/thumbnails/24.jpg)
Example: Did-You-Mean service
(English: Ugly Betty)
![Page 25: Takahiko Ito @Lucene Revolution 2011](https://reader035.vdocuments.net/reader035/viewer/2022062502/568c4aa81a28ab491699136f/html5/thumbnails/25.jpg)
Common implementation
• Many search engines (including Solr) apply distance measures such as Edit Distance [Levenshtein, 1965].
• Edit Distance: a measure of distance between two sequences, counted as the minimum number of single-character insertions, deletions and substitutions needed to turn one into the other. Simply speaking, the more characters two sequences have in common, the smaller the distance.
• E.g., like vs. likes (small distance); like vs. foobar (large distance)
![Page 26: Takahiko Ito @Lucene Revolution 2011](https://reader035.vdocuments.net/reader035/viewer/2022062502/568c4aa81a28ab491699136f/html5/thumbnails/26.jpg)
Common procedure: Did-You-Mean
When a user submits a query,
1. The Did-You-Mean service computes the edit distance between the input query and the words in the index.
2. If there is a word whose distance is small, the Did-You-Mean handler suggests it.
E.g., when a user submits the query "pthon", the Did-You-Mean service suggests "python", a word in the index with a small distance.
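The two-step procedure can be sketched in a few lines of Python. This is a simplified illustration and not Solr's spellchecker: the `suggest` helper, the sample word list, and the threshold of 2 are assumptions.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between a and b, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, free when the characters match
            ))
        prev = curr
    return prev[-1]


def suggest(query: str, index_words, max_distance: int = 2):
    """Return the closest index word within max_distance, or None."""
    if query in index_words:
        return None  # the query already matches an indexed word
    best = min(index_words, key=lambda w: edit_distance(query, w), default=None)
    if best is not None and edit_distance(query, best) <= max_distance:
        return best
    return None


if __name__ == "__main__":
    print(edit_distance("like", "likes"), edit_distance("like", "foobar"))
    print(suggest("pthon", ["java", "python", "tomcat"]))
```

Scanning every indexed word is fine for a sketch; a real index would need a cheaper way to narrow down the candidates first.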
![Page 27: Takahiko Ito @Lucene Revolution 2011](https://reader035.vdocuments.net/reader035/viewer/2022062502/568c4aa81a28ab491699136f/html5/thumbnails/27.jpg)
Problem: Japanese queries
• Simple application of edit distance does not work for Japanese.
• Misspelled queries are sometimes totally different from the correct one (large distance). E.g.,
  • 墨ともふどうさん (correct: 住友不動産)
  • 米事案セット (correct: ベイジアンセット)
• These cases arise from the Japanese input method.
![Page 28: Takahiko Ito @Lucene Revolution 2011](https://reader035.vdocuments.net/reader035/viewer/2022062502/568c4aa81a28ab491699136f/html5/thumbnails/28.jpg)
Typing in Japanese query
We input Japanese (query) words in two steps.
1. Type the reading of the Japanese word in the Latin alphabet.
2. Select the desired word from the list of candidates.
The selection step (step 2) can cause a spelling mistake whose distance to the correct spelling is too large.
![Page 29: Takahiko Ito @Lucene Revolution 2011](https://reader035.vdocuments.net/reader035/viewer/2022062502/568c4aa81a28ab491699136f/html5/thumbnails/29.jpg)
Example: Typing in Japanese queries
Assume a user wants to submit the query オバマ (Obama).
1. Type the reading in the Latin alphabet. Reading: obama
2. Select the correct spelling. Possible candidates: オバマ (correct), おばま, 小浜, etc.
![Page 30: Takahiko Ito @Lucene Revolution 2011](https://reader035.vdocuments.net/reader035/viewer/2022062502/568c4aa81a28ab491699136f/html5/thumbnails/30.jpg)
Japanese Did-You-Mean dictionary
• Because of the large distance problem, simple distance measures (edit distance) do not work.
• To handle this problem, Anuenue supports a special dictionary for the Japanese Did-You-Mean service.
![Page 31: Takahiko Ito @Lucene Revolution 2011](https://reader035.vdocuments.net/reader035/viewer/2022062502/568c4aa81a28ab491699136f/html5/thumbnails/31.jpg)
Dictionary for Japanese Did-You-Mean service
The dictionary has two columns:
1. Query with mistakes
2. Correct query

| Query with mistakes | Correct query |
| --- | --- |
| 墨ともふどうさん | 住友不動産 |
| 歌だ光る | 宇多田ヒカル |
| 米事案セット | ベイジアンセット |
![Page 32: Takahiko Ito @Lucene Revolution 2011](https://reader035.vdocuments.net/reader035/viewer/2022062502/568c4aa81a28ab491699136f/html5/thumbnails/32.jpg)
Implementing Did-You-Mean service with the dictionary
• When a user submits a query that is listed in the dictionary as a query with mistakes, the Did-You-Mean service suggests the correct query.
• NOTE: Anuenue provides handlers for the dictionary format.

| Query with mistakes | Correct query |
| --- | --- |
| 墨ともふどうさん | 住友不動産 |
| 歌だ光る | 宇多田ヒカル |
| 米事案セット | ベイジアンセット |
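A dictionary-based suggestion is a plain lookup, as the Python sketch below shows. It is not Anuenue's handler: the tab-separated layout and the `load_dictionary` and `did_you_mean` helpers are assumptions, and only the three entries come from the slide.

```python
# Entries from the slide; the tab-separated layout is an assumption.
DICTIONARY_TSV = (
    "墨ともふどうさん\t住友不動産\n"
    "歌だ光る\t宇多田ヒカル\n"
    "米事案セット\tベイジアンセット\n"
)


def load_dictionary(text: str) -> dict:
    """Parse 'query with mistakes<TAB>correct query' lines into a mapping."""
    corrections = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        wrong, correct = line.split("\t")
        corrections[wrong] = correct
    return corrections


def did_you_mean(query: str, corrections: dict):
    """Return the correct query for a known mistake, or None."""
    return corrections.get(query.strip())


if __name__ == "__main__":
    corrections = load_dictionary(DICTIONARY_TSV)
    print(did_you_mean("歌だ光る", corrections))
```

Unlike the edit-distance approach, the lookup never compares the surface strings, so it is unaffected by how different the misspelled and correct forms look.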
![Page 33: Takahiko Ito @Lucene Revolution 2011](https://reader035.vdocuments.net/reader035/viewer/2022062502/568c4aa81a28ab491699136f/html5/thumbnails/33.jpg)
Problem…
How can we create the dictionary?
We can make use of a query log mining tool, Oluolu.
![Page 34: Takahiko Ito @Lucene Revolution 2011](https://reader035.vdocuments.net/reader035/viewer/2022062502/568c4aa81a28ab491699136f/html5/thumbnails/34.jpg)
Oluolu
• Creates a spelling correction dictionary from a query log
• Extracts pairs of queries (query with spelling mistakes, query with correct spelling)
• Supports Japanese spelling mistakes (from version 0.2)
• Runs on the Hadoop framework
Project URL: http://code.google.com/p/oluolu/
![Page 35: Takahiko Ito @Lucene Revolution 2011](https://reader035.vdocuments.net/reader035/viewer/2022062502/568c4aa81a28ab491699136f/html5/thumbnails/35.jpg)
Input to Oluolu: query log
Three columns:
1. User Id
2. Query string
3. Time of query submission

| User Id | Query | Time |
| --- | --- | --- |
| 438904 | Pthon | 2009-11-21 11:16:12 |
| 34443 | Java | 2009-11-21 12:16:13 |
| 438904 | Python | 2009-11-21 12:16:20 |
| 8975 | Java Tomcat | 2009-11-21 12:16:25 |
![Page 36: Takahiko Ito @Lucene Revolution 2011](https://reader035.vdocuments.net/reader035/viewer/2022062502/568c4aa81a28ab491699136f/html5/thumbnails/36.jpg)
Procedure: creating Japanese Did-You-Mean dictionary with Oluolu
Oluolu extracts the elements of the Japanese Did-You-Mean dictionary in 2 steps.
1. Extract all the query pairs in the same session
2. Validate the query pairs
![Page 37: Takahiko Ito @Lucene Revolution 2011](https://reader035.vdocuments.net/reader035/viewer/2022062502/568c4aa81a28ab491699136f/html5/thumbnails/37.jpg)
Step 1: extract query pairs
• Oluolu extracts pairs of queries in the same session. E.g., Oluolu extracts the pair (Pthon, Python).
• Queries in the same session: a set of queries submitted by the same user within a small time range.
• An extracted pair may consist of a misspelled query and its correct query.

| User ID | Query | Time |
| --- | --- | --- |
| 438904 | Pthon | 2009-11-21 12:16:12 |
| 34443 | Java | 2009-11-21 12:16:13 |
| 438904 | Python | 2009-11-21 12:16:20 |
| 8975 | Tomcat | 2009-11-21 12:16:25 |
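Step 1 can be sketched in Python on the sample log above. Oluolu itself runs on Hadoop, so this single-process version only illustrates the idea; the 5-minute session window and the `extract_pairs` helper are assumptions.

```python
from datetime import datetime, timedelta
from itertools import combinations

# Sample rows from the slide: (user id, query, time of submission).
LOG = [
    ("438904", "Pthon", "2009-11-21 12:16:12"),
    ("34443", "Java", "2009-11-21 12:16:13"),
    ("438904", "Python", "2009-11-21 12:16:20"),
    ("8975", "Tomcat", "2009-11-21 12:16:25"),
]


def extract_pairs(log, window=timedelta(minutes=5)):
    """Return (earlier query, later query) pairs issued by one user within `window`."""
    by_user = {}
    for user, query, timestamp in log:
        when = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S")
        by_user.setdefault(user, []).append((when, query))
    pairs = []
    for entries in by_user.values():
        entries.sort()  # chronological order within one user
        for (t1, q1), (t2, q2) in combinations(entries, 2):
            if q1 != q2 and t2 - t1 <= window:
                pairs.append((q1, q2))
    return pairs


if __name__ == "__main__":
    print(extract_pairs(LOG))
```

At this stage the pairs are only candidates: a user may simply have searched for two unrelated things in a row, which is why a validation step follows.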
![Page 38: Takahiko Ito @Lucene Revolution 2011](https://reader035.vdocuments.net/reader035/viewer/2022062502/568c4aa81a28ab491699136f/html5/thumbnails/38.jpg)
Step 2: validate candidate pairs
• Oluolu validates all the query pairs extracted in step 1.
• In the validation phase (step 2), Oluolu makes use of query readings.
![Page 39: Takahiko Ito @Lucene Revolution 2011](https://reader035.vdocuments.net/reader035/viewer/2022062502/568c4aa81a28ab491699136f/html5/thumbnails/39.jpg)
Reading of Japanese words
• Japanese words can be converted into their readings in the Latin alphabet.
  • こんにちは (reading: konnichiha)
  • 伊藤 (reading: itou)
• FACT: even when a Japanese query with spelling mistakes is totally different from the correct query, the readings are the same or the distance between them is small!
![Page 40: Takahiko Ito @Lucene Revolution 2011](https://reader035.vdocuments.net/reader035/viewer/2022062502/568c4aa81a28ab491699136f/html5/thumbnails/40.jpg)
Validate candidate pair with reading
Given a query pair, Oluolu validates the queries in 2 steps:
1. Convert the queries into readings in the Latin alphabet
2. Compute the edit distance between the two readings
When the distance is small, the two queries are extracted as an element of the Did-You-Mean dictionary.
![Page 41: Takahiko Ito @Lucene Revolution 2011](https://reader035.vdocuments.net/reader035/viewer/2022062502/568c4aa81a28ab491699136f/html5/thumbnails/41.jpg)
Example: step 2
Given a pair of queries: (墨ともふどうさん, 住友不動産)
1. Convert them into readings
   • The readings are the same, "sumitomofudousan".
2. Compute the distance between the readings
   • The distance is zero.
   • The pair is extracted as an element of the Did-You-Mean dictionary.
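The validation step can be sketched as follows in Python. Converting Japanese text to its reading normally requires a morphological analyzer, so the hard-coded `READINGS` table stands in for that conversion; the table, the `is_valid_pair` helper, and the threshold of 1 are assumptions for illustration.

```python
# Hypothetical stand-in for a reading converter (a real one needs a tokenizer).
READINGS = {
    "墨ともふどうさん": "sumitomofudousan",
    "住友不動産": "sumitomofudousan",
    "歌だ光る": "utadahikaru",
    "宇多田ヒカル": "utadahikaru",
}


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between a and b, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def is_valid_pair(misspelled: str, correct: str, max_distance: int = 1) -> bool:
    """Accept a candidate pair when the readings are (nearly) identical."""
    return edit_distance(READINGS[misspelled], READINGS[correct]) <= max_distance


if __name__ == "__main__":
    pair = ("墨ともふどうさん", "住友不動産")
    print("surface distance:", edit_distance(*pair))
    print("accepted:", is_valid_pair(*pair))
    print("accepted:", is_valid_pair("墨ともふどうさん", "宇多田ヒカル"))
```

The surface strings share no characters, so their edit distance is as large as it can be, while the readings are identical; comparing readings is what lets such pairs through.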
![Page 42: Takahiko Ito @Lucene Revolution 2011](https://reader035.vdocuments.net/reader035/viewer/2022062502/568c4aa81a28ab491699136f/html5/thumbnails/42.jpg)
Creating Japanese Did-You-Mean dictionary with Oluolu
• Installation requirements
  • Java 1.6.0 or greater
  • Hadoop 0.20.0 or greater
  • Oluolu 0.2.0 or greater
• Copy the input query log into HDFS
• Run the spellcheck task of Oluolu
  $ bin/oluolu spellcheck -input testInput.txt -output output -inputLanguage ja
![Page 43: Takahiko Ito @Lucene Revolution 2011](https://reader035.vdocuments.net/reader035/viewer/2022062502/568c4aa81a28ab491699136f/html5/thumbnails/43.jpg)
Preliminary experiments
• Experimental settings
  • Input data: log file from a mixi service (community search), 5 GB of data
• Extracted dictionary
  • The number of elements is over 100,000.
  • Succeeded in extracting query pairs with large edit distance:
    • (議 Ν, ギニュー)
    • (不動有利, 不動裕理)
![Page 44: Takahiko Ito @Lucene Revolution 2011](https://reader035.vdocuments.net/reader035/viewer/2022062502/568c4aa81a28ab491699136f/html5/thumbnails/44.jpg)
Current status
• Finished functional tests and stress tests.
• Now replacing an in-house search engine in a small search service with Anuenue.
• In the next phase, we will apply Anuenue to a search service with large data and high QPS.
![Page 45: Takahiko Ito @Lucene Revolution 2011](https://reader035.vdocuments.net/reader035/viewer/2022062502/568c4aa81a28ab491699136f/html5/thumbnails/45.jpg)
Future work
• Integrate SolrCloud and ZooKeeper
  • Support failover, and rebalance the index
• Kuromoji, a new OSS Japanese tokenizer
![Page 46: Takahiko Ito @Lucene Revolution 2011](https://reader035.vdocuments.net/reader035/viewer/2022062502/568c4aa81a28ab491699136f/html5/thumbnails/46.jpg)
Summary
• Introduced Anuenue
• Described a Did-You-Mean facility for Japanese queries
![Page 47: Takahiko Ito @Lucene Revolution 2011](https://reader035.vdocuments.net/reader035/viewer/2022062502/568c4aa81a28ab491699136f/html5/thumbnails/47.jpg)
Thank you for your attention!