Phonetic Search for Multiple Choice Question
By Chung-Hsien (Jacky) Yu
01/06/2016
Problem Definition
selection_list = [
    (1, "Montague Expressway, Milpitas, CA"),
    (2, "5120 North 1st Street, San Jose, CA"),
    (3, "2870 Zanker Road, San Jose, CA")
]
query = "Montag"
• Select the string from a given list whose sound is most similar to the sound of the query string.
• Select by the given index number (1, 2, 3).
• Select by the ordinal sequence (first, second, last).
Beider-Morse Phonetic Matching
• Encoding of the words by the sound.
• The words with the same sound have the same encoding.
• Recognizes that words written in different ways can be phonetically equivalent or sound alike.
• Other encoding methods, such as Soundex, drop the vowels (a, e, i, …) after the first letter, but BMPM keeps them.
Source: http://stevemorse.org/phonetics/bmpm.htm
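To make the contrast with Soundex concrete, here is a minimal sketch of classic American Soundex in plain Python. It is an illustration only and not part of the BMPM implementation used in these slides; the names `soundex` and `SOUNDEX_CODES` are chosen here for the example.

```python
# Consonant groups of classic American Soundex; vowels have no code at all.
SOUNDEX_CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for letter in letters:
        SOUNDEX_CODES[letter] = digit


def soundex(word):
    """Return the 4-character Soundex code of a word (sketch)."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    first = letters[0]
    digits = []
    previous = SOUNDEX_CODES.get(first, "")
    for letter in letters[1:]:
        if letter in "hw":
            continue  # h and w do not separate two equal codes
        code = SOUNDEX_CODES.get(letter, "")
        if code and code != previous:
            digits.append(code)
        previous = code  # a vowel resets this, so a repeated code is kept
    return (first.upper() + "".join(digits) + "000")[:4]


print(soundex("Robert"), soundex("Rupert"))
```

Robert and Rupert collapse to the same code because the vowels that distinguish them are discarded.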
BMPM Encoding List
Source: http://stevemorse.org/phonetics/bmpm.htm
Code  Example
a     Like in part
b     Like in boy
d     Like in dog
e     Like in set
f     Like in flag
g     Like in dog
h     Like in hand
i     Like in Nice (the city), or ee as in fleet
j     Like y in yes; equivalent to German j
k     Like in king
l     Like in lamp
m     Like in man
n     Like in neck
o     Like in port
p     Like in pot
r     Like in ring
s     Like in star
t     Like in tent
u     Like in flu, or oo in good
v     Like in vase
w     Like in wax
x     Like ch in loch; equivalent to German ch
z     Like in zoo
S     Like s in sure, or sh in shop
Z     Like z in azure; equivalent to French j
BMPM Implementation
• http://stevemorse.org/phoneticinfo.htm
• pip install abydos
• from abydos.phonetic import bmpm
BMPM Function
bmpm(word, language_arg=0, name_mode='gen', match_mode='approx', concat=False, filter_langs=False):
    str word: the word to transform
    str language_arg: the language of the term; supported values
    str name_mode: the name mode of the algorithm:
    str match_mode: matching mode: 'approx' or 'exact'
    bool concat: concatenation mode
    bool filter_langs: filter out incompatible languages
    returns: the BMPM value(s)
    rtype: tuple
BMPM Encoding
“Starbucks” = ['sterbuks', 'sterbaks', 'storbuks', 'storbaks', 'starbuks', 'starbaks']
bmpm('Starbucks', 'english', 'gen', 'exact', False, True).split(" ")
The combinations of possible pronunciations
BMPM Combination Codes
“Starbucks” = ['sterbuks', 'sterbaks', 'storbuks', 'storbaks', 'starbuks', 'starbaks']
“startbuck” = ['sterdbuk', 'sterdbak', 'stordbuk', 'stordbak', 'stardbuk', 'stardbak']
Comparing the two lists of codes gives the similarity between the two words.
Comparing Two Codes
• Levenshtein distance : The minimum number of single-character edits (i.e. insertions, deletions or substitutions) required to change one word into the other.
https://en.wikipedia.org/wiki/Levenshtein_distance
Levenshtein Distance
"kitten" and "sitting" = 3
1. kitten → sitten (substitution of "s" for "k")
2. sitten → sittin (substitution of "i" for "e")
3. sittin → sitting (insertion of "g" at the end)
https://en.wikipedia.org/wiki/Levenshtein_distance
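The definition above can be sketched as a short dynamic-programming routine. This is a plain-Python illustration rather than the python-levenshtein implementation; the function name `levenshtein_distance` is an assumption made for the example.

```python
def levenshtein_distance(a, b):
    """Minimum number of single-character edits turning a into b."""
    # previous[j] holds the distance between the processed prefix of a
    # and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # deleting i characters reaches the empty prefix of b
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # deletion
                current[j - 1] + 1,                   # insertion
                previous[j - 1] + substitution_cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


print(levenshtein_distance("kitten", "sitting"))  # 3
```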
Python Levenshtein Distance
pip install python-levenshtein
import Levenshtein
similarity = Levenshtein.ratio(string1, string2)
• Compute similarity of two strings.
• The similarity is a number between 0 and 1.
• 1 means that they are the same string.
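Where the C extension is not available, a score with the same three properties can be sketched in plain Python. The stand-in below normalizes the length of the longest common subsequence by the total length of both strings. It is an assumption made for illustration, not the library's implementation, and its values may differ from those of `Levenshtein.ratio`; the name `similarity_ratio` is hypothetical.

```python
def similarity_ratio(a, b):
    """Similarity in [0, 1]: 2 * LCS(a, b) / (len(a) + len(b))."""
    total = len(a) + len(b)
    if total == 0:
        return 1.0  # two empty strings are identical
    # previous[j] is the LCS length of the processed prefix of a and b[:j].
    previous = [0] * (len(b) + 1)
    for char_a in a:
        current = [0]
        for j, char_b in enumerate(b, start=1):
            if char_a == char_b:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return 2.0 * previous[-1] / total


print(similarity_ratio("starbuks", "starbuks"))  # 1.0
print(similarity_ratio("kitten", "sitting"))
```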
Comparing Two Code Lists
         Q.str1  Q.str2  Q.str3
S.str1   1       0.4     0.6
S.str2   0.5     0.9     0.8
S.str3   0.6     0.3     0.7
Max.     1       0.9     0.8     Avg = 0.9

Query = [str1, str2, str3]
S = [str1, str2, str3]
Each column keeps its maximum similarity, and the average of those maxima, 0.9, is the matching score between Q and S.
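The max-then-average rule in the table can be sketched as follows. The function name `matching_score` and the lookup-based `similarity` are assumptions for the example; in practice the similarity would come from a string comparison such as `Levenshtein.ratio`.

```python
def matching_score(query_codes, selection_codes, similarity):
    """Average, over the query codes, of the best similarity to any selection code."""
    best_per_query_code = [
        max(similarity(q, s) for s in selection_codes)
        for q in query_codes
    ]
    return sum(best_per_query_code) / len(best_per_query_code)


# The similarity values from the table, keyed by (selection code, query code).
TABLE = {
    ("S.str1", "Q.str1"): 1.0, ("S.str1", "Q.str2"): 0.4, ("S.str1", "Q.str3"): 0.6,
    ("S.str2", "Q.str1"): 0.5, ("S.str2", "Q.str2"): 0.9, ("S.str2", "Q.str3"): 0.8,
    ("S.str3", "Q.str1"): 0.6, ("S.str3", "Q.str2"): 0.3, ("S.str3", "Q.str3"): 0.7,
}

score = matching_score(
    ["Q.str1", "Q.str2", "Q.str3"],
    ["S.str1", "S.str2", "S.str3"],
    lambda q, s: TABLE[(s, q)],
)
print(round(score, 2))  # 0.9
```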
Matching the Query
• The selection with the highest matching score is chosen as the best match for the query.
• The index of that selection is returned.
• If the highest matching score is lower than a threshold, the choice is indecisive and None is returned.
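A minimal sketch of this selection step, using the code lists shown on the earlier slides as input. Several things here are assumptions: `difflib.SequenceMatcher` from the standard library stands in for `Levenshtein.ratio`, the threshold of 0.8 is an arbitrary example value, and the names `best_match` and `matching_score` are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    # Stand-in for Levenshtein.ratio (assumption): a value between 0 and 1.
    return SequenceMatcher(None, a, b).ratio()


def matching_score(query_codes, selection_codes):
    best = [max(similarity(q, s) for s in selection_codes) for q in query_codes]
    return sum(best) / len(best)


def best_match(query_codes, selections, threshold=0.8):
    """Return the index of the best-scoring selection, or None if indecisive."""
    best_index, best_score = None, 0.0
    for index, codes in selections:
        score = matching_score(query_codes, codes)
        if score > best_score:
            best_index, best_score = index, score
    return best_index if best_score >= threshold else None


selections = [
    (1, ['sterbuks', 'sterbaks', 'storbuks', 'storbaks', 'starbuks', 'starbaks']),
    (2, ['zenker', 'zonker', 'zanker', 'rout']),
]
query_codes = ['sterdbuk', 'sterdbak', 'stordbuk', 'stordbak', 'stardbuk', 'stardbak']

print(best_match(query_codes, selections))  # 1
print(best_match(['xyz'], selections))      # None
```

The threshold trades missed matches against wrong ones, so it would need tuning on real queries.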
The Numbers
“2870 Zanker Road” = ['zenker', 'zonker', 'zanker', 'rout']
All the numbers are ignored!
Convert the Numbers to Strings
“2870 Zanker Road” = “two thousand, eight hundred and seventy Zanker Road”
pip install num2words
from num2words import num2words
text = num2words(number)
text = num2words(number, ordinal=True)
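One way to apply the conversion to a whole address string is sketched below. The converter is passed in as a parameter so the example runs without num2words installed; `demo_to_words` is a tiny hypothetical stand-in covering only the numbers used here, and the regular expression and the name `spell_out_numbers` are likewise assumptions.

```python
import re

# A run of digits, optionally followed by an ordinal suffix such as "st".
NUMBER = re.compile(r"\b(\d+)(st|nd|rd|th)?\b")


def spell_out_numbers(text, to_words):
    """Replace every number in text with its spelled-out form."""
    def replace(match):
        is_ordinal = match.group(2) is not None
        return to_words(int(match.group(1)), ordinal=is_ordinal)
    return NUMBER.sub(replace, text)


def demo_to_words(number, ordinal=False):
    # Hypothetical stand-in for num2words, limited to the examples below.
    words = {
        (1, True): "first",
        (2870, False): "two thousand, eight hundred and seventy",
    }
    return words[(number, ordinal)]


print(spell_out_numbers("2870 Zanker Road", demo_to_words))
print(spell_out_numbers("North 1st Street", demo_to_words))
```

With the library installed, `spell_out_numbers(text, num2words)` should work the same way, since `num2words` accepts the `ordinal` keyword shown above.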
Why Number to String?
• Converts the numbers in both the query and the selection strings for consistency.
• Allows selecting by the numbers included in the string.
• The query string can use a number, ‘1’, ‘2’, …, or a word, ‘one’, ‘two’, …, for selection.
• Could be extended to select by the ordinal sequence (first, second, last).
Select by Index
selection_list = [
    (6, "Montague Expressway, Milpitas, CA"),
    (7, "5120 North 1st Street, San Jose, CA"),
    (8, "2870 Zanker Road, San Jose, CA")
]
query = "number 6" or "number six"
Add the index to the string:
[
    ("six Montague Expressway, Milpitas, CA"),
    ("seven 5120 North 1st Street, San Jose, CA"),
    ("eight 2870 Zanker Road, San Jose, CA")
]
query = "number six"
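Prepending the spelled-out index can be sketched as below. The lookup table `INDEX_WORDS` is a hypothetical stand-in for a number-to-words converter such as num2words, and the name `add_index` is an assumption; the original index is kept alongside each string so it can still be returned after matching.

```python
# Hypothetical lookup covering only the indices in this example.
INDEX_WORDS = {6: "six", 7: "seven", 8: "eight"}


def add_index(selection_list, to_words):
    """Prefix each selection string with its index spelled out."""
    return [(index, f"{to_words(index)} {text}") for index, text in selection_list]


selection_list = [
    (6, "Montague Expressway, Milpitas, CA"),
    (7, "5120 North 1st Street, San Jose, CA"),
    (8, "2870 Zanker Road, San Jose, CA"),
]

for index, text in add_index(selection_list, INDEX_WORDS.get):
    print(index, text)
```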
Select by Order
selection_list = [
    (6, "Montague Expressway, Milpitas, CA"),
    (7, "5120 North 1st Street, San Jose, CA"),
    (8, "2870 Zanker Road, San Jose, CA")
]
query = "the first one"
Add the ordinal index to the string:
[
    ("first Montague Expressway, Milpitas, CA"),
    ("second 5120 North 1st Street, San Jose, CA"),
    ("third last 2870 Zanker Road, San Jose, CA")
]
query = "the first one"
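The ordinal prefix depends on the position in the list rather than on the given index, and the final entry also receives "last". A minimal sketch, in which the short `ORDINAL_WORDS` list is a hypothetical stand-in for `num2words(position, ordinal=True)` and the name `add_ordinals` is an assumption:

```python
# Hypothetical stand-in for an ordinal converter; covers short lists only.
ORDINAL_WORDS = ["first", "second", "third", "fourth", "fifth"]


def add_ordinals(selection_list, ordinal_words):
    """Prefix each string with its ordinal position; mark the final one 'last'."""
    last_position = len(selection_list) - 1
    result = []
    for position, (index, text) in enumerate(selection_list):
        prefix = ordinal_words[position]
        if position == last_position:
            prefix += " last"
        result.append((index, f"{prefix} {text}"))
    return result


selection_list = [
    (6, "Montague Expressway, Milpitas, CA"),
    (7, "5120 North 1st Street, San Jose, CA"),
    (8, "2870 Zanker Road, San Jose, CA"),
]

for index, text in add_ordinals(selection_list, ORDINAL_WORDS):
    print(index, text)
```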