Simple Fuzzy Name Matching in Solr: Presented by Chris Mack, Basis Technology

OCTOBER 13-16, 2016 AUSTIN, TX

Upload: lucidworks

Post on 16-Apr-2017


TRANSCRIPT

Page 1: Simple Fuzzy Name Matching in Solr: Presented by Chris Mack, Basis Technology

October 13-16, 2016 • Austin, TX

Page 2: Simple Fuzzy Name Matching in Solr: Presented by Chris Mack, Basis Technology

Simple Fuzzy Name Matching in Solr

Chris Mack
Director Customer Engineering
Basis Technology

Page 3: Simple Fuzzy Name Matching in Solr: Presented by Chris Mack, Basis Technology
Page 4: Simple Fuzzy Name Matching in Solr: Presented by Chris Mack, Basis Technology
Page 5: Simple Fuzzy Name Matching in Solr: Presented by Chris Mack, Basis Technology

Why Match Names?

Just a Name...

...Right?

1.  Security
2.  Fraud
3.  Commerce

Page 6: Simple Fuzzy Name Matching in Solr: Presented by Chris Mack, Basis Technology

Quick survey: How many of you...

•  Regularly develop Solr applications?
•  Develop Solr applications that include names of...
   ...People?
   ...Places?
   ...Products?
   ...Organizations?
•  Have names in languages besides English?

Page 7: Simple Fuzzy Name Matching in Solr: Presented by Chris Mack, Basis Technology

What Makes Name Matching Hard?

Page 8: Simple Fuzzy Name Matching in Solr: Presented by Chris Mack, Basis Technology

Name Variety

Page 9: Simple Fuzzy Name Matching in Solr: Presented by Chris Mack, Basis Technology

Name Variety

Page 10: Simple Fuzzy Name Matching in Solr: Presented by Chris Mack, Basis Technology

Name Ambiguity

Page 11: Simple Fuzzy Name Matching in Solr: Presented by Chris Mack, Basis Technology

How Would You Solve It?

Page 12: Simple Fuzzy Name Matching in Solr: Presented by Chris Mack, Basis Technology

Best Practice: field per variation type?

Page 13: Simple Fuzzy Name Matching in Solr: Presented by Chris Mack, Basis Technology

Idea: Create a Custom Solr Field

•  Contribute score that reflects phenomena.
•  Be part of queries using many field types.
•  Have multiple fields per document.
•  Have multiple values per field.

Page 14: Simple Fuzzy Name Matching in Solr: Presented by Chris Mack, Basis Technology

But what if variations co-occur?

“Jesus Alfonso Lopez Diaz” v. “LobezDias, Chuy”

1) Reordered.
2) Nickname for first name.
3) Missing 2nd name.
4) Two spelling differences.
5) Missing space.
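
To see why co-occurring variations defeat a single technique, here is a minimal sketch that scores the two names above with plain Levenshtein edit distance. This is only an illustration and not the matcher described in the talk; the class and method names (EditDistanceDemo, levenshtein, similarity) are hypothetical.

```java
// Illustrative sketch only: plain edit distance, not the matcher described in the talk.
final class EditDistanceDemo {

    // Classic dynamic-programming Levenshtein distance:
    // each insertion, deletion, or substitution costs 1.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // distance to the empty prefix of b
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Similarity in [0, 1]; 1.0 means the strings are identical.
    static double similarity(String a, String b) {
        int longest = Math.max(a.length(), b.length());
        return longest == 0 ? 1.0 : 1.0 - (double) levenshtein(a, b) / longest;
    }

    public static void main(String[] args) {
        String indexed = "jesus alfonso lopez diaz";
        String query = "lobezdias, chuy";
        System.out.printf("full names: distance=%d similarity=%.2f%n",
                levenshtein(indexed, query), similarity(indexed, query));
        System.out.printf("single typo: similarity=%.2f%n", similarity("lopez", "lobez"));
    }
}
```

A one-letter typo such as "lopez" v. "lobez" still scores high, but once reordering, a nickname, and a missing name are stacked on top, the whole-string similarity drops below one half, so a threshold tuned for typos would miss this pair.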

Page 15: Simple Fuzzy Name Matching in Solr: Presented by Chris Mack, Basis Technology

Can We Do Better?

•  Incorporate our proprietary name matching
•  Provide similarity scores for name pairs
•  Use Solr’s Rerank feature
•  Allows for higher-precision ranking and thresholding
•  Provides multilingual name search

Page 16: Simple Fuzzy Name Matching in Solr: Presented by Chris Mack, Basis Technology

Simple to Configure

•  Plugin contains custom field type which does all the work behind the scenes
•  Simple addition to schema.xml to include new field type

<fieldType name="rni_name" class="com.basistech.rni.solr.NameField"/>
<field name="name" type="rni_name" indexed="true" stored="true" multiValued="false"/>
<field name="aka" type="rni_name" indexed="true" stored="true" multiValued="true"/>

Page 17: Simple Fuzzy Name Matching in Solr: Presented by Chris Mack, Basis Technology

Plug-in Implementation

Page 18: Simple Fuzzy Name Matching in Solr: Presented by Chris Mack, Basis Technology

What happens at query time?

•  Step #1: NameField generates analogous keys for a custom Lucene query that finds good candidates for re-ranking

public Query getFieldQuery(QParser parser, SchemaField field, String externalVal) {
    // Parse the raw query string into a structured name.
    Name name = parseNameString(externalVal, parser.getParams());
    // Build a query specification from the parsed name.
    QuerySpec querySpec = buildQuery(name);
    // Translate the specification into a Lucene query against this field.
    return querySpec.accept(new SolrQueryVisitor(field.getName()));
}
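
The slides do not show what the generated keys look like. As a rough sketch of the general idea, assume each name token is reduced to a coarse key that survives small spelling differences, and that a document becomes a candidate when it shares at least one key with the query. The key scheme and all names below (CandidateKeys, coarseKey, fold, keysFor, isCandidate) are hypothetical and are not the plug-in's actual implementation.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch of key-based candidate selection; not the plug-in's real key scheme.
final class CandidateKeys {

    // Reduce a token to a coarse key: lowercase, keep only a-z, fold similar-sounding
    // consonants together, drop vowels except in first position, collapse repeated letters.
    static String coarseKey(String token) {
        String s = token.toLowerCase(Locale.ROOT).replaceAll("[^a-z]", "");
        StringBuilder key = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = fold(s.charAt(i));
            boolean vowel = "aeiouy".indexOf(c) >= 0;
            if (vowel && i > 0) {
                continue; // non-leading vowels carry little signal
            }
            if (key.length() > 0 && key.charAt(key.length() - 1) == c) {
                continue; // collapse runs such as "ss"
            }
            key.append(c);
        }
        return key.toString();
    }

    // Fold consonants that are commonly confused in spelling variants.
    static char fold(char c) {
        switch (c) {
            case 'b': return 'p';
            case 'd': return 't';
            case 'z': return 's';
            case 'c': case 'q': case 'g': return 'k';
            case 'v': return 'f';
            default: return c;
        }
    }

    // One key per token; tokens are split on whitespace and commas.
    static Set<String> keysFor(String name) {
        Set<String> keys = new LinkedHashSet<>();
        for (String token : name.split("[\\s,]+")) {
            String key = coarseKey(token);
            if (!key.isEmpty()) {
                keys.add(key);
            }
        }
        return keys;
    }

    // A document is a candidate if it shares at least one key with the query.
    static boolean isCandidate(String queryName, String indexedName) {
        Set<String> shared = new LinkedHashSet<>(keysFor(queryName));
        shared.retainAll(keysFor(indexedName));
        return !shared.isEmpty();
    }
}
```

With this scheme "Lobez Dias" shares keys with "Jesus Alfonso Lopez Diaz" despite both spelling differences. It does nothing for nicknames or missing spaces, which is one reason a cheap candidate query like this is paired with a more careful rescoring pass.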

Page 19: Simple Fuzzy Name Matching in Solr: Presented by Chris Mack, Basis Technology

What else happens at query time?

•  Step #2: Uses Solr’s Rerank feature to rescore names in top documents and reorder accordingly
   - Tuned for high precision
   - Simple addition to solrconfig.xml

<queryParser name="rniRerank" class="com.basistech.rni.solr.RNIReRankQParserPlugin"/>
<valueSourceParser name="rniMatch" class="com.basistech.rni.solr.NameMatchValueSourceParser"/>

Page 20: Simple Fuzzy Name Matching in Solr: Presented by Chris Mack, Basis Technology

Plug-in Implementation

Page 21: Simple Fuzzy Name Matching in Solr: Presented by Chris Mack, Basis Technology

Ability to Trade Off Accuracy vs. Speed

•  reRankScoreThreshold - Score threshold a top document must meet to be rescored

•  reRankDocs - Controls how many of the top documents to rescore
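
The effect of the two parameters can be sketched in plain Java, outside Solr. This is a simplified illustration under stated assumptions: the threshold is taken to apply to the first-pass score, rescored documents are simply placed ahead of the rest, and the Hit class, the rerank method, and the scorer function are all hypothetical rather than the plug-in's code.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.function.ToDoubleFunction;

// Hypothetical two-pass ranking sketch; not the plug-in's actual rerank code.
final class RerankSketch {

    static final class Hit {
        final String name;
        final double score;

        Hit(String name, double score) {
            this.name = name;
            this.score = score;
        }
    }

    // firstPass must already be sorted by descending first-pass score.
    // Only hits within the top reRankDocs that also meet reRankScoreThreshold
    // (assumed here to be a first-pass score) are passed to the expensive scorer.
    static List<Hit> rerank(List<Hit> firstPass, int reRankDocs, double reRankScoreThreshold,
                            ToDoubleFunction<String> expensiveScorer) {
        int limit = Math.min(reRankDocs, firstPass.size());
        List<Hit> rescored = new ArrayList<>();
        List<Hit> untouched = new ArrayList<>();
        for (int i = 0; i < firstPass.size(); i++) {
            Hit hit = firstPass.get(i);
            if (i < limit && hit.score >= reRankScoreThreshold) {
                rescored.add(new Hit(hit.name, expensiveScorer.applyAsDouble(hit.name)));
            } else {
                untouched.add(hit); // keeps its first-pass score and relative order
            }
        }
        // Reorder the rescored hits by their new score, best first.
        rescored.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
        List<Hit> result = new ArrayList<>(rescored);
        result.addAll(untouched);
        return result;
    }
}
```

Raising reRankDocs or lowering reRankScoreThreshold sends more documents through the expensive scorer, which can improve accuracy at the cost of query time; the opposite settings favor speed.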

Page 22: Simple Fuzzy Name Matching in Solr: Presented by Chris Mack, Basis Technology

Summary: How it works

•  Custom field type
   - Splits a single field into multiple fields covering different phenomena
   - Supports multiple name fields in a document as well as multivalued fields
   - Intercepts the query to inject a custom Lucene query

•  Custom rerank function
   - Rescores documents with algorithm specific to name matching
   - Limits intense calculations to only top candidates
   - Highly configurable

Page 23: Simple Fuzzy Name Matching in Solr: Presented by Chris Mack, Basis Technology

Suggested Questions:

•  What if names are in unstructured text?
•  What if the names are in other text fields?
•  How did you implement multi-valued fields?
•  How does it scale?
•  How do you handle names not in English?
•  How does this relate to the theme of Entity-Centric Search?
•  How do the plug-in’s scores relate to Solr scores?
•  How can I learn more?

Page 24: Simple Fuzzy Name Matching in Solr: Presented by Chris Mack, Basis Technology

Simple Fuzzy Name Matching in Solr

Chris Mack
Director Customer Engineering
Basis Technology