phonetic algorithms os_bridge_2015

WHAT’S IN A NAME? PHONETIC ALGORITHMS FOR SEARCH AND SIMILARITY Mercedes Coyle @benzobot Data Infrastructure Engineer

Upload: mercedes-coyle

Post on 11-Apr-2017


Page 1: Phonetic algorithms os_bridge_2015

WHAT’S IN A NAME? PHONETIC ALGORITHMS FOR SEARCH AND SIMILARITY

Mercedes Coyle @benzobot

Data Infrastructure Engineer

Page 2: Phonetic algorithms os_bridge_2015

WHAT I’M GOING TO COVER TODAY

• Search - how does it work?

• Phonetic Algorithms

• Use cases for Phonetic Algorithms

Page 3: Phonetic algorithms os_bridge_2015

WHEN WE THINK OF SEARCH…

Page 4: Phonetic algorithms os_bridge_2015

HOW DOES GOOGLE SEARCH WORK?

• Web crawling on a very large scale!

• Document rank (importance) and similarity

• Text analysis

image credit: flickr.com/photos/rserrano/

Page 5: Phonetic algorithms os_bridge_2015

HOW DOES GOOGLE SEARCH WORK?

• Obligatory Hand-wavey “Big Data” comment here

image credit: twitter.com/wtrsld/status/424364245648564226

Page 6: Phonetic algorithms os_bridge_2015

DATABASE SEARCH

image credit: Mercedes Coyle

Page 7: Phonetic algorithms os_bridge_2015

*SQL

• Comparison search: LIKE operator

• SELECT * FROM table WHERE word LIKE '%and%'

Page 8: Phonetic algorithms os_bridge_2015

*SQL

• Comparison search: LIKE operator

• basically a wildcard character search

• only returns data that contains the search string; does not account for misspelling

• can be expensive on large datasets
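
The LIKE behaviour described above can be tried out with a minimal sketch using Python's built-in sqlite3 module. The `people` table and the names in it are made up for illustration and are not from the talk.

```python
import sqlite3

# In-memory database with a hypothetical `people` table (illustrative only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT)")
conn.executemany(
    "INSERT INTO people (name) VALUES (?)",
    [("Sean",), ("Shawn",), ("Shaun",), ("Sandra",)],
)

def like_search(term):
    """Return names that contain `term`, using a LIKE wildcard search."""
    rows = conn.execute(
        "SELECT name FROM people WHERE name LIKE ? ORDER BY name",
        (f"%{term}%",),
    )
    return [name for (name,) in rows]

print(like_search("and"))   # substring match anywhere in the name
print(like_search("Sean"))  # only the exact spelling; "Shawn" and "Shaun" are missed
```

A pattern with a leading `%` generally cannot use an ordinary index, which is one reason this kind of search can get expensive on large tables.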

Page 9: Phonetic algorithms os_bridge_2015

*SQL

Page 10: Phonetic algorithms os_bridge_2015

ELASTICSEARCH - TOKENIZATION

• Used in full-text search against a corpus of text

• “The quick brown fox jumped over the lazy dog”

• the, quick, brown, fox, jump, over, lazy, dog
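
As a rough illustration of the idea (not how Elasticsearch analyzers are actually implemented), the token list above can be reproduced with a few lines of Python. The crude "ed" stripping and the removal of repeated tokens are assumptions made only to match the slide's output.

```python
import re

def tokenize(text):
    """Lowercase, split on non-letters, crudely stem, and drop repeats."""
    tokens = []
    for word in re.findall(r"[a-z]+", text.lower()):
        # Naive stemming: strip a trailing "ed" from longer words.
        if len(word) > 4 and word.endswith("ed"):
            word = word[:-2]
        # Keep each token once, in order of first appearance.
        if word not in tokens:
            tokens.append(word)
    return tokens

print(tokenize("The quick brown fox jumped over the lazy dog"))
```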

Page 11: Phonetic algorithms os_bridge_2015

PROBLEM: TEXT-BASED SEARCHES DON’T ALWAYS WORK WELL WITH NAMES

• Wildcard searches return too many results

• Typos or misspelled names don’t return correct results

• e.g. “Shawn” vs “Sean”

Page 12: Phonetic algorithms os_bridge_2015

WHAT IS A PHONEME?

• In language, the smallest unit of sound that distinguishes one word from another

• Can be written as single letters or letter combinations; includes vowel and consonant sounds

Page 13: Phonetic algorithms os_bridge_2015

ENGLISH PHONEMES

Page 14: Phonetic algorithms os_bridge_2015

HOW DO WE TRANSLATE PHONEMES TO CODE?

image credit: demoons.com/2010/09/first-animation-test.html

Page 15: Phonetic algorithms os_bridge_2015

PHONETIC ALGORITHMS

• A method of hashing words and names based on sounds (phonemes).

Page 16: Phonetic algorithms os_bridge_2015

PHONETIC ALGORITHM TYPES

• Soundex

• NYSIIS

• Metaphone and Double Metaphone

• Match Rating, Daitch-Mokotoff Soundex, Kölner Phonetik, Caverphone…

Page 17: Phonetic algorithms os_bridge_2015

SOUNDEX

• Designed in the early 1900s; used to encode names for the US Census

• Available in MySQL (built-in SOUNDEX() function) and in PostgreSQL (fuzzystrmatch extension)

Page 18: Phonetic algorithms os_bridge_2015

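SOUNDEX ALGORITHM

Mercedes = MERCEDES (uppercase the name)

MERCEDES = M0620302 (keep the first letter; replace every other letter with its digit from the table)

{
  0 : ['A', 'E', 'I', 'O', 'U', 'H', 'W', 'Y'],
  1 : ['B', 'F', 'P', 'V'],
  2 : ['C', 'G', 'J', 'K', 'Q', 'S', 'X', 'Z'],
  3 : ['D', 'T'],
  4 : ['L'],
  5 : ['M', 'N'],
  6 : ['R']
}

M0620302 = M6232 (drop the zeros)

M6232 = M623 (truncate to four characters)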
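
The steps above can be sketched in Python as follows. This is a minimal version that follows the slide's table, not a reference implementation: it adds the usual collapsing of adjacent identical digits and zero-padding of short codes, and it treats H and W like vowels, which some Soundex variants handle differently.

```python
# Letter-to-digit table from the slide; 0 marks letters that get dropped.
CODES = {
    "0": "AEIOUHWY",
    "1": "BFPV",
    "2": "CGJKQSXZ",
    "3": "DT",
    "4": "L",
    "5": "MN",
    "6": "R",
}
DIGIT_FOR = {letter: digit for digit, letters in CODES.items() for letter in letters}

def soundex(name):
    """Simplified Soundex: first letter plus up to three digits."""
    letters = [c for c in name.upper() if c in DIGIT_FOR]
    if not letters:
        return ""
    digits = [DIGIT_FOR[c] for c in letters]
    # Collapse runs of the same digit (e.g. "TT" becomes a single 3).
    collapsed = [digits[0]]
    for d in digits[1:]:
        if d != collapsed[-1]:
            collapsed.append(d)
    # Keep the first letter itself, then drop its digit and all zeros.
    tail = [d for d in collapsed[1:] if d != "0"]
    # Pad with zeros and truncate to four characters.
    return (letters[0] + "".join(tail) + "000")[:4]

print(soundex("Mercedes"))
print(soundex("Sean"), soundex("Shawn"))
```

With this sketch, “Sean” and “Shawn” hash to the same code, which is exactly the property a plain text search lacks.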

Page 19: Phonetic algorithms os_bridge_2015

SOUNDEX LIMITATIONS

• Most implementations work for the English language only

• Retaining the first letter means similar-sounding names with different initials never match (e.g. “Catherine” is C365 but “Kathryn” is K365)

Page 20: Phonetic algorithms os_bridge_2015

SOUNDEX LIMITATIONS

• The Postgres Soundex implementation (fuzzystrmatch) does not work well with multibyte encodings such as UTF-8

http://www.postgresql.org/docs/9.4/static/fuzzystrmatch.html

Page 21: Phonetic algorithms os_bridge_2015

NYSIIS

• Developed in 1970 as part of the New York State Identification and Intelligence System

• Slightly improved accuracy over Soundex

Page 22: Phonetic algorithms os_bridge_2015

NYSIIS ALGORITHM

Page 23: Phonetic algorithms os_bridge_2015

NYSIIS ALGORITHM

• MERCEDES

• MARCADAS (vowels after the first letter become A)

• MARCADA (trailing S dropped)

• MARCAD (trailing A dropped)
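
The walkthrough above can be reproduced with a deliberately partial sketch. It implements only the handful of NYSIIS rules visible in this example (plus K written as C and the six-character limit) and leaves out the prefix, suffix, and multi-letter rules of the full algorithm, so treat it as an illustration rather than a usable NYSIIS implementation. The function name `nysiis_sketch` is made up here.

```python
def nysiis_sketch(name):
    """Partial NYSIIS: only the rules needed for the MERCEDES walkthrough."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    key = letters[0]           # the first letter is kept as-is
    for c in letters[1:]:
        if c in "AEIOU":
            c = "A"            # every vowel becomes A
        elif c == "K":
            c = "C"            # K is written as C
        if c != key[-1]:       # skip a letter that repeats the previous key letter
            key += c
    if len(key) > 1 and key.endswith("S"):
        key = key[:-1]         # drop a trailing S
    if len(key) > 1 and key.endswith("A"):
        key = key[:-1]         # drop a trailing A
    return key[:6]             # keys are truncated to six characters

print(nysiis_sketch("Mercedes"))
```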

Page 24: Phonetic algorithms os_bridge_2015

NYSIIS

Page 25: Phonetic algorithms os_bridge_2015

METAPHONE

• Developed in 1990 by Lawrence Philips

• Improved accuracy over Soundex and NYSIIS

• Double Metaphone produces two hashes for each name or word: a primary and a secondary encoding

Page 26: Phonetic algorithms os_bridge_2015

METAPHONE

Page 27: Phonetic algorithms os_bridge_2015

METAPHONE

• Metaphone and Double Metaphone were improved upon in Metaphone 3, which is unfortunately closed source.

Page 28: Phonetic algorithms os_bridge_2015

PHONETIC ALGORITHMS IN PRACTICE

• Use cases for Phonetic Algorithms

• Example uses in Databases

Page 29: Phonetic algorithms os_bridge_2015

PHONETIC ALGORITHMS IN PRACTICE

• Phonetic algorithms are useful for searching by name or word, and tolerate some misspelling.

Page 30: Phonetic algorithms os_bridge_2015

PHONETIC ALGORITHMS IN PRACTICE

• Store the phonetic hash of a name in fields/columns in your db for indexing and querying

{
  "_id" : ObjectId("53e13a73cbcc7a0a6e3078e5"),
  "first_name" : "Arya",
  "last_name" : "Stark",
  "n_first_name" : "AR",
  "n_last_name" : "STARC",
  "report" : "lost_item",
  "item" : "ID Card",
  "timestamp" : 1407269491,
  "report_id" : 50642
}
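
The same pattern can be sketched with sqlite3 standing in for the document store: compute the hash once at write time, index that column, and hash the search term before querying. The `reports` table, the helper names, and the use of a compact Soundex (instead of the NYSIIS-style hashes in the document above) are all assumptions for illustration.

```python
import sqlite3

def soundex(name):
    """Compact simplified Soundex, used here as the phonetic hash."""
    groups = {"1": "BFPV", "2": "CGJKQSXZ", "3": "DT", "4": "L", "5": "MN", "6": "R"}
    table = {c: d for d, cs in groups.items() for c in cs}
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    key, prev = letters[0], table.get(letters[0], "0")
    for c in letters[1:]:
        d = table.get(c, "0")
        if d != prev and d != "0":
            key += d
        prev = d
    return (key + "000")[:4]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE reports (first_name TEXT, last_name TEXT, "
    "n_first_name TEXT, n_last_name TEXT, item TEXT)"
)
# Index the hashed column so lookups are plain equality matches.
conn.execute("CREATE INDEX idx_n_last ON reports (n_last_name)")

def add_report(first, last, item):
    """Store the raw names alongside their phonetic hashes."""
    conn.execute(
        "INSERT INTO reports VALUES (?, ?, ?, ?, ?)",
        (first, last, soundex(first), soundex(last), item),
    )

def find_by_last_name(query):
    """Hash the search term, then match it against the stored hash."""
    rows = conn.execute(
        "SELECT first_name, last_name FROM reports "
        "WHERE n_last_name = ? ORDER BY first_name",
        (soundex(query),),
    )
    return rows.fetchall()

add_report("Arya", "Stark", "ID Card")
add_report("Jon", "Snow", "Sword")
print(find_by_last_name("Starck"))  # misspelled query still finds the Stark record
```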

Page 31: Phonetic algorithms os_bridge_2015

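PHONETIC SEARCH WITH ELASTICSEARCH

• Elasticsearch has support for phonetic matches (via its phonetic analysis plugin), in many different languages!

• Store words/names as documents; the analyzer does the hashing at index and query time, so you don’t store the hashes yourself

GET /my_index/_analyze?analyzer=dbl_metaphone

• Input “Smith Smythe”: both names produce the same tokens (SM0 and XMT), so a search for either one matches both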
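
For context, the `dbl_metaphone` analyzer used in that request has to be defined on the index first. The sketch below builds the index settings as a Python dict and prints the JSON body for `PUT /my_index`, following the Elasticsearch phonetic-matching guide listed in the resources. It assumes the phonetic analysis plugin is installed, and the exact settings may differ between Elasticsearch versions.

```python
import json

# Index settings for a custom analyzer named "dbl_metaphone", assuming the
# Elasticsearch phonetic analysis plugin is installed on the cluster.
index_settings = {
    "settings": {
        "analysis": {
            "filter": {
                # Token filter that encodes each token with Double Metaphone.
                "dbl_metaphone": {
                    "type": "phonetic",
                    "encoder": "double_metaphone",
                }
            },
            "analyzer": {
                # Analyzer that tokenizes text, then applies the filter above.
                "dbl_metaphone": {
                    "tokenizer": "standard",
                    "filter": "dbl_metaphone",
                }
            },
        }
    }
}

# Request body for: PUT /my_index
print(json.dumps(index_settings, indent=2))
```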

Page 32: Phonetic algorithms os_bridge_2015

PHONETIC SEARCH USING ELASTICSEARCH

• As a Developer, I really like using Elasticsearch!

• But as a System Administrator, I have battle scars.

Page 33: Phonetic algorithms os_bridge_2015

PHONETIC ALGORITHMS FOR NON-ENGLISH LANGUAGES

• Grab a linguist and write one?

image credit: flickr.com/photos/opacity

Page 34: Phonetic algorithms os_bridge_2015

RESOURCES

• Libraries

• clj-fuzzy: yomguithereal.github.io/clj-fuzzy/

• python soundex: pypi.python.org/pypi/soundex/1.1.3

• python fuzzy: pypi.python.org/pypi/Fuzzy

• elasticsearch phonetic matching https://www.elastic.co/guide/en/elasticsearch/guide/current/phonetic-matching.html

• http://aspell.net/metaphone/dmetaph.cpp

• Reading:

• http://doughellmann.com/2012/03/03/using-fuzzy-matching-to-search-by-sound-with-python.html

• Fluency, Jen Foehner Wells - http://www.jenniferfoehnerwells.com/fluency.html

Page 35: Phonetic algorithms os_bridge_2015

THANKS FOR LISTENING!

QUESTIONS?

Mercedes Coyle @benzobot

image credit: Mercedes Coyle