Solr Cluster installation tool "Anuenue"

Solr Cluster installation tool "Anuenue" and "Did You Mean?" for Japanese, Takahiko Ito, mixi, Inc.

Upload: lucidworks-archived

Post on 27-Jun-2015


DESCRIPTION

Mixi is one of the largest social networking services in Japan, providing various communication services for over 14M monthly active users. The latest internal mixi project is to replace the in-house search engine with Apache Solr. This session covers two topics: a simple packaging system for Solr that eases the installation process and daily operations, and an implementation of a "Did you mean" facility for Japanese queries using a log mining tool. These tools have been released as OSS projects.

TRANSCRIPT

Page 1: Solr Cluster installation tool "Anuenue"

Solr Cluster installation tool "Anuenue" and "Did You Mean?" for Japanese

Takahiko Ito
mixi, Inc.

Page 2: Solr Cluster installation tool "Anuenue"

mixi?

• One of the largest social networking services in Japan.
• Many services to promote communication among users: blog, news, game platform, etc.
• Most of the services come with search.
• 15M monthly active users.

Page 3: Solr Cluster installation tool "Anuenue"

Our current (urgent) project …

Replace the in-house search engines with an up-to-date search platform!

• We have selected Apache Solr as the search platform.
• We have created a simple OSS package (Anuenue) which wraps Solr.

Project URL: http://code.google.com/p/anuenue-wrapper/

Page 4: Solr Cluster installation tool "Anuenue"

Reason why we made Anuenue

Deployment and daily operation of a Solr search cluster are a bit difficult for ordinary engineers.

• We need to edit the configuration files of every Solr instance individually.
• Commands for whole clusters are not provided.
  • We need to write client commands by ourselves.
  • Hadoop, by comparison, provides utility commands for clusters, e.g., start-all.sh (start processes), fsck (check all discs), balancer (rebalance the data blocks).

Page 5: Solr Cluster installation tool "Anuenue"

What does Anuenue provide?

• Handy configuration of search clusters
• Commands for clusters
  • Simple commands (post, delete, update, commit, etc.)
  • Start and stop commands for processes in the cluster
• Japanese support
  • Implementation of Japanese Did-You-Mean facilities
  • Japanese tokenizers (Sen and Kuromoji)

Page 6: Solr Cluster installation tool "Anuenue"

Today’s Topics

• Anuenue
  • Handy configuration of search clusters
  • Commands for search clusters
• Did-You-Mean facilities for Japanese queries
  • Common problem in Did-You-Mean implementation
  • Mining a Japanese Did-You-Mean dictionary from query log data

Page 7: Solr Cluster installation tool "Anuenue"

Cluster configuration with Anuenue

• Cluster setup is done with a special configuration file.
• Anuenue assigns one or more roles to each instance.
  • Roles are the functions an instance performs in a cluster.
  • Anuenue supports three roles (Master, Slave, Merger).

Page 8: Solr Cluster installation tool "Anuenue"

Role: master

• Indexes input data.
• NOTE: Anuenue provides a command to distribute the input data across the master instances (building Solr shard indexes).

[Diagram: Input Data is distributed to Master-1, Master-2 and Master-3, which build the shard indexes]

Page 9: Solr Cluster installation tool "Anuenue"

Role: slave

Has three functions:
• Copy (replicate) the index from a master.
• Accept queries from mergers and search its own index.
• Return the results to the merger instance.

[Diagram: Master-1 and Master-2 index the input data; Slave-1 and Slave-2 replicate the index; Merger-1 submits queries to the slaves]

Page 10: Solr Cluster installation tool "Anuenue"

Role: merger

• Forwards queries from clients to slaves.
  • Note: clients need not know the slave instances (the merger adds the ‘shards’ parameter listing the slave instances).
• Merges the results from all the slave instances and returns the merged results.

[Diagram: Client-1 and Client-2 submit queries to the Merger, which forwards them to Slave-1 and Slave-2]
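To make the merger's job concrete, here is a minimal Python sketch of what a distributed Solr request with a `shards` parameter looks like. It is not Anuenue's code: the function name `build_merger_query`, the `/solr` path, the `/select` handler and the slave ports are assumptions chosen for illustration; only the `shards` parameter itself is standard Solr.

```python
from urllib.parse import urlencode

def build_merger_query(merger, slaves, query, core_path="/solr"):
    """Build the URL of a distributed query executed on a merger node.

    merger and slaves are (host, port) tuples. core_path and the /select
    handler are assumed defaults, not values taken from Anuenue.
    """
    # Solr expects a comma-separated list of host:port/path entries.
    shards = ",".join(f"{host}:{port}{core_path}" for host, port in slaves)
    params = urlencode({"q": query, "shards": shards})
    host, port = merger
    return f"http://{host}:{port}{core_path}/select?{params}"

# Hypothetical hosts and ports for one merger and two slaves.
url = build_merger_query(("aa", 7000), [("dd", 8983), ("ee", 8983)], "python")
print(url)
```

Because the merger fills in `shards` itself, a client only has to send `q` to the merger and never needs the list of slaves.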

Page 11: Solr Cluster installation tool "Anuenue"

Example: Anuenue cluster

• The cluster consists of five machines.
• Each machine has one Anuenue instance.
• Instances
  • Merger: aa
  • Master: bb, cc
  • Slave: dd, ee

[Diagram: bb and cc index the input data; dd and ee replicate the index; aa forwards the queries from Client-1 and Client-2]

Page 12: Solr Cluster installation tool "Anuenue"

How to assign roles to instances?

Edit the cluster configuration file, anuenue-nodes.xml.
• Add three elements (mergers, slaves and masters).
• In each element, add the information (machine name and port number) of one or more instances.

Page 13: Solr Cluster installation tool "Anuenue"

Configuration example

Case: there is one merger instance on machine aa (port 7000)

<mergers>
  <merger>
    <host>aa</host>
    <port>7000</port>
  </merger>
</mergers>

Page 14: Solr Cluster installation tool "Anuenue"

Specify the index to replicate

<masters>
  <master iname="master1">
    <host>aaaa</host>
    <port>8983</port>
  </master>
</masters>

<slaves>
  <slave>
    <host>bbbb</host>
    <port>8983</port>
    <replicate>master1</replicate>
  </slave>
</slaves>

• Name the master instance with the iname attribute.
• Specify the master instance whose index a slave copies by adding a replicate element.

Page 15: Solr Cluster installation tool "Anuenue"

Example: simple cluster settings

<mergers>
  <merger>
    <host>aa</host>
    <port>8983</port>
  </merger>
</mergers>
<masters>
  <master iname="master1">
    <host>bb</host>
    <port>8983</port>
  </master>
</masters>
<slaves>
  <slave>
    <host>cc</host>
    <port>8983</port>
    <replicate>master1</replicate>
  </slave>
</slaves>

[Diagram: bb indexes the input data; cc replicates the index; aa forwards the queries from Client-1 and Client-2]
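As a rough illustration of how such a configuration can be read and sanity-checked, the following Python sketch parses the three elements and verifies that every slave's replicate value names a known master. This is not Anuenue's loader: the `<nodes>` root element and the helper names `load_roles` and `check_replication` are hypothetical, since the slides show only the inner elements of anuenue-nodes.xml.

```python
import xml.etree.ElementTree as ET

# Fragment from the slide, wrapped in a hypothetical <nodes> root so that it
# parses as one XML document (the real root element name is not shown).
CONFIG = """
<nodes>
  <mergers>
    <merger><host>aa</host><port>8983</port></merger>
  </mergers>
  <masters>
    <master iname="master1"><host>bb</host><port>8983</port></master>
  </masters>
  <slaves>
    <slave><host>cc</host><port>8983</port><replicate>master1</replicate></slave>
  </slaves>
</nodes>
"""

def load_roles(xml_text):
    """Return {"mergers": [...], "masters": [...], "slaves": [...]}."""
    root = ET.fromstring(xml_text)
    roles = {}
    for group, tag in (("mergers", "merger"), ("masters", "master"), ("slaves", "slave")):
        roles[group] = [
            {
                "host": elem.findtext("host"),
                "port": int(elem.findtext("port")),
                "iname": elem.get("iname"),               # set on masters only
                "replicate": elem.findtext("replicate"),  # set on slaves only
            }
            for elem in root.findall(f"{group}/{tag}")
        ]
    return roles

def check_replication(roles):
    """Return the slaves whose replicate value matches no master iname."""
    known = {master["iname"] for master in roles["masters"]}
    return [slave for slave in roles["slaves"] if slave["replicate"] not in known]

roles = load_roles(CONFIG)
print(roles["slaves"][0]["replicate"])  # master1
print(check_replication(roles))         # []
```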

Page 16: Solr Cluster installation tool "Anuenue"

Cluster setup with Anuenue

Cluster setup is flexible and supports various types of search clusters.

For example…

Page 17: Solr Cluster installation tool "Anuenue"

Assign multiple roles

[Diagram: a single instance indexes the input data and receives the queries submitted by Client1 and Client2]

Page 18: Solr Cluster installation tool "Anuenue"

Large clusters to handle huge data with high QPS

[Diagram: Input Data; Master1 to Master6; Slave1 to Slave6; Merger1 to Merger3; Client1 to ClientN]

Page 19: Solr Cluster installation tool "Anuenue"

After setting up a cluster

We can make use of commands for clusters. Anuenue provides:
• start / stop commands
• commands to manipulate the index

Page 20: Solr Cluster installation tool "Anuenue"

Start and stop clusters

Users can start / stop clusters with a single command (anuenue-distdaemon.sh).

Usage:

$ sh bin/anuenue-distdaemon.sh [start|stop]

Page 21: Solr Cluster installation tool "Anuenue"

Simple commands for clusters

• Anuenue also provides basic commands (‘post’, ‘delete’, ‘commit’, ‘optimize’ and ‘update’) for search clusters.
• The commands are multi-threaded.

E.g.,

$ sh bin/anuenue-distcommands.sh post -arg inputDir

Page 22: Solr Cluster installation tool "Anuenue"

Today’s Topics

• Anuenue
  • Handy configuration of search clusters
  • Commands for search clusters
• Did-You-Mean facilities for Japanese queries
  • Common problem in Did-You-Mean implementation
  • Mining a Japanese Did-You-Mean dictionary from query log data

Page 23: Solr Cluster installation tool "Anuenue"

What is a Did-You-Mean service?

• Suggests the correct spelling when users submit queries with mistakes.
• Increases the usability of the search service.

Page 24: Solr Cluster installation tool "Anuenue"


Example: Did-You-Mean service

(English: Ugly Betty)

Page 25: Solr Cluster installation tool "Anuenue"

Common implementation

Many search engines (including Solr) apply distance measures such as Edit Distance [Levenshtein, 1965].

Edit Distance: a measure of the distance between two sequences, defined as the minimum number of single-character insertions, deletions and substitutions needed to turn one sequence into the other. Simply speaking, the more characters two sequences have in common, the smaller the distance.

E.g.,
like, likes (small distance)
like, foobar (large distance)
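For reference, edit distance can be computed with the standard dynamic-programming recurrence. The sketch below is a textbook Python implementation, not the code used inside Solr, applied to the two example pairs above.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions from a to b."""
    # dist[i][j] holds the distance between the prefixes a[:i] and b[:j].
    dist = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        dist[i][0] = i  # delete all i characters of a[:i]
    for j in range(len(b) + 1):
        dist[0][j] = j  # insert all j characters of b[:j]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
                dist[i - 1][j - 1] + cost,  # match or substitution
            )
    return dist[len(a)][len(b)]

print(levenshtein("like", "likes"))   # 1
print(levenshtein("like", "foobar"))  # 6
```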

Page 26: Solr Cluster installation tool "Anuenue"

Common procedure: Did-You-Mean

When a user submits a query,

1. The Did-You-Mean service computes the edit distance between the input query and the words in the index.
2. If there is a word whose distance is small, the Did-You-Mean handler suggests it.

E.g., when a user submits the query “pthon”, the Did-You-Mean service suggests “python”, a word in the index with a small distance.
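The two-step procedure can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the index is a small in-memory word list, the threshold `max_distance=2` is an arbitrary choice, and a real engine would not scan every indexed word.

```python
def edit_distance(a, b):
    """Levenshtein distance, keeping only two rows of the DP table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def suggest(query, index_words, max_distance=2):
    """Return the closest indexed word, or None if nothing needs suggesting."""
    if query in index_words:
        return None  # the query is already a known word
    best = min(index_words, key=lambda word: edit_distance(query, word))
    return best if edit_distance(query, best) <= max_distance else None

index_words = ["java", "python", "tomcat"]  # hypothetical index vocabulary
print(suggest("pthon", index_words))   # python
print(suggest("python", index_words))  # None
```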

Page 27: Solr Cluster installation tool "Anuenue"

Problem: Japanese queries

Simple application of edit distance does not work for Japanese.

• Misspelled queries are sometimes totally different from the correct one (large distance).
• E.g.,
  • 墨ともふどうさん (correct: 住友不動産)
  • 米事案セット (correct: ベイジアンセット)
• These cases arise from the Japanese input method.

Page 28: Solr Cluster installation tool "Anuenue"

Typing a Japanese query

We input Japanese (query) words in two steps.

1. Type the reading of the Japanese word in the Latin alphabet.
2. Select the desired word from the list of candidates.

Step 2 is where spelling mistakes arise: a wrongly selected candidate can have a very large distance to the correct spelling.

Page 29: Solr Cluster installation tool "Anuenue"

Example: Typing in Japanese queries

Assume a user wants to submit the query オバマ (Obama).

1. Type the reading in the Latin alphabet.
   reading: obama
2. Select the correct spelling.
   Possible candidates: オバマ (correct), おばま, 小浜, etc.

Page 30: Solr Cluster installation tool "Anuenue"


Japanese Did-You-Mean dictionary

Because of the large distance problem, simple distance measures (edit distance) do not work.

To handle this problem, Anuenue supports a special dictionary for Japanese Did-You-Mean service.

Page 31: Solr Cluster installation tool "Anuenue"

Dictionary for Japanese Did-You-Mean service

The dictionary has two columns:

1. Query with mistakes
2. Correct query

Query with mistakes    Correct query
墨ともふどうさん       住友不動産
歌だ光る               宇多田ヒカル
米事案セット           ベイジアンセット

Page 32: Solr Cluster installation tool "Anuenue"

Implementing Did-You-Mean service with the dictionary

• When a user submits a query that is listed in the dictionary as a query with mistakes, the Did-You-Mean service suggests the corresponding correct query.
• NOTE: Anuenue provides handlers for the dictionary format.

Query with mistakes    Correct query
墨ともふどうさん       住友不動産
歌だ光る               宇多田ヒカル
米事案セット           ベイジアンセット
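A dictionary-based suggestion is essentially an exact-match lookup. The Python sketch below illustrates the idea; it is not Anuenue's handler, and the tab-separated file layout and the function names `load_dictionary` and `did_you_mean` are assumptions, since the slides do not show the on-disk format.

```python
# Assumed layout: one "misspelled<TAB>correct" pair per line.
DICTIONARY_TSV = (
    "墨ともふどうさん\t住友不動産\n"
    "歌だ光る\t宇多田ヒカル\n"
    "米事案セット\tベイジアンセット\n"
)

def load_dictionary(text):
    """Parse the two-column dictionary into a {misspelled: correct} mapping."""
    table = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        wrong, correct = line.split("\t")
        table[wrong] = correct
    return table

def did_you_mean(query, table):
    """Return the correct query if the input is a known mistake, else None."""
    return table.get(query)

table = load_dictionary(DICTIONARY_TSV)
print(did_you_mean("歌だ光る", table))  # 宇多田ヒカル
print(did_you_mean("python", table))    # None
```

Unlike the edit-distance approach, this lookup does not depend on how similar the two strings look, which is why it can handle the Japanese cases above.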

Page 33: Solr Cluster installation tool "Anuenue"

Problem…

How can we create the dictionary?

We can make use of a query log mining tool, Oluolu.

Page 34: Solr Cluster installation tool "Anuenue"

Oluolu

• Creates a spelling correction dictionary from a query log.
• Extracts pairs of queries (query with spelling mistakes, query with correct spelling).
• Supports Japanese spelling mistakes (from version 0.2).
• Runs on the Hadoop framework.

Project URL: http://code.google.com/p/oluolu/

Page 35: Solr Cluster installation tool "Anuenue"

Input to Oluolu: query log

Three columns

1. User Id
2. Query string
3. Time of query submission

User Id    Query          Time
438904     Pthon          2009-11-21 11:16:12
34443      Java           2009-11-21 12:16:13
438904     Python         2009-11-21 12:16:20
8975       Java Tomcat    2009-11-21 12:16:25

Page 36: Solr Cluster installation tool "Anuenue"

Procedure: creating a Japanese Did-You-Mean dictionary with Oluolu

Oluolu extracts the elements of the Japanese Did-You-Mean dictionary in 2 steps.

1. Extract all the query pairs in the same session
2. Validate the query pairs

Page 37: Solr Cluster installation tool "Anuenue"

Step 1: extract query pairs

• Oluolu extracts pairs of queries in the same session.
  • E.g., Oluolu extracts the pair (Pthon, Python).
• Queries in the same session: a set of queries submitted by the same user within a small time range.
• An extracted pair may consist of a misspelled query and its correct query.

User ID    Query     Time
438904     Pthon     2009-11-21 12:16:12
34443      Java      2009-11-21 12:16:13
438904     Python    2009-11-21 12:16:20
8975       Tomcat    2009-11-21 12:16:25
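Step 1 can be sketched in Python as follows. This is a single-machine illustration, not Oluolu's Hadoop job: the 60-second session window and the choice to pair only consecutive queries of the same user are assumptions, because the slides do not state how Oluolu defines a "small time range".

```python
from collections import defaultdict
from datetime import datetime, timedelta

# The log rows from the slide: (user id, query, time).
LOG = [
    ("438904", "Pthon", "2009-11-21 12:16:12"),
    ("34443", "Java", "2009-11-21 12:16:13"),
    ("438904", "Python", "2009-11-21 12:16:20"),
    ("8975", "Tomcat", "2009-11-21 12:16:25"),
]

def extract_session_pairs(log, window=timedelta(seconds=60)):
    """Pair consecutive queries of the same user issued within `window`."""
    by_user = defaultdict(list)
    for user, query, timestamp in log:
        when = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S")
        by_user[user].append((when, query))
    pairs = []
    for entries in by_user.values():
        entries.sort()  # chronological order per user
        for (t1, q1), (t2, q2) in zip(entries, entries[1:]):
            if q1 != q2 and t2 - t1 <= window:
                pairs.append((q1, q2))  # (earlier query, later query)
    return pairs

print(extract_session_pairs(LOG))  # [('Pthon', 'Python')]
```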

Page 38: Solr Cluster installation tool "Anuenue"

Step 2: validate candidate pairs

• Oluolu validates all the query pairs extracted in step 1.
• In the validation phase (step 2), Oluolu makes use of query readings.

Page 39: Solr Cluster installation tool "Anuenue"

Reading of Japanese words

• Japanese words can be converted into their readings in the Latin alphabet.
  • こんにちは (reading: konnichiha)
  • 伊藤 (reading: itou)
• FACT: even when a Japanese query with spelling mistakes is totally different from the correct query, the readings are the same or the distance between them is small!

Page 40: Solr Cluster installation tool "Anuenue"

Validate candidate pair with reading

Given a query pair, Oluolu validates it in 2 steps.

1. Convert the queries into readings in the Latin alphabet.
2. Compute the edit distance between the two readings.

When the distance is small, the two queries are extracted as an element of the Did-You-Mean dictionary.

Page 41: Solr Cluster installation tool "Anuenue"

Example: step 2

Given a pair of queries: (墨ともふどうさん, 住友不動産)

1. Convert them into readings.
   • The readings are the same: “sumitomofudousan”.
2. Compute the distance between the readings.
   • The distance is zero.
   • The pair is extracted as an element of the Did-You-Mean dictionary.
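The validation step can be sketched in Python as below. The reading table `READINGS` is a hand-written, hypothetical stand-in for a real reading converter (the slides do not say which one Oluolu uses), and the threshold `max_distance=1` is likewise an assumption.

```python
# Hypothetical reading table built from the readings shown on the slides;
# a real system would derive readings with a morphological analyzer.
READINGS = {
    "墨ともふどうさん": "sumitomofudousan",
    "住友不動産": "sumitomofudousan",
    "こんにちは": "konnichiha",
    "伊藤": "itou",
}

def edit_distance(a, b):
    """Levenshtein distance, keeping only two rows of the DP table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def validate_pair(q1, q2, readings=READINGS, max_distance=1):
    """Accept a candidate pair when the readings are (nearly) identical."""
    r1, r2 = readings.get(q1), readings.get(q2)
    if r1 is None or r2 is None:
        return False  # no reading available, so the pair cannot be validated
    return edit_distance(r1, r2) <= max_distance

pair = ("墨ともふどうさん", "住友不動産")
print(edit_distance(*pair))                          # large: no shared characters
print(edit_distance(READINGS[pair[0]], READINGS[pair[1]]))  # 0
print(validate_pair(*pair))                          # True
print(validate_pair("こんにちは", "伊藤"))           # False
```

Comparing readings rather than surface strings is what lets the pair pass even though the written forms have nothing in common.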

Page 42: Solr Cluster installation tool "Anuenue"

Creating Japanese Did-You-Mean dictionary with Oluolu

• Installation requirements
  • Java 1.6.0 or greater
  • Hadoop 0.20.0 or greater
  • Oluolu 0.2.0 or greater
• Copy the input query log into HDFS.
• Run the spellcheck task of Oluolu.

$ bin/oluolu spellcheck \
    -input testInput.txt \
    -output output \
    -inputLanguage ja

Page 43: Solr Cluster installation tool "Anuenue"

Preliminary experiments

• Experimental settings
  • Input data: log file from a mixi service (community search).
  • 5 GB of data
• Extracted dictionary
  • The number of elements is over 100,000.
  • Succeeded in extracting query pairs with large edit distance.
    • (議 Ν, ギニュー)
    • (不動有利, 不動裕理)

Page 44: Solr Cluster installation tool "Anuenue"

Current status

• Finished functional tests and stress tests.
• Now replacing an in-house search engine in a small search service with Anuenue.
• In the next phase, we will apply Anuenue to a search service with large data and high QPS.

Page 45: Solr Cluster installation tool "Anuenue"

Future work

• Integrate SolrCloud and ZooKeeper
  • Support failover and rebalancing of the index
• Kuromoji, a new OSS Japanese tokenizer

Page 46: Solr Cluster installation tool "Anuenue"

Summary

• Introduced Anuenue.
• Described a Did-You-Mean facility for Japanese queries.

Page 47: Solr Cluster installation tool "Anuenue"


Thank you for your attention!