Pair Hidden Markov Model for Named Entity Matching
Peter Nabende, Jörg Tiedemann, John Nerbonne
Department of Computational Linguistics,
Center for Language and Cognition Groningen,
University of Groningen, Netherlands
{p.nabende, j.tiedemann, j.nerbonne}@rug.nl
Introduction
• Three types of named entities: entity names, temporal expressions, and number expressions
  – Entity names refer to organization, person, and location names
• Many entity names exist across different languages, necessitating proper handling
• Bilingual lexicons cover only a very small percentage of entity names
• An MT system performs poorly on unseen entity names that have translations or transliterations in a target language
• Beyond MT, similarity measurement between cross-lingual entity names is important for CLIR and CLIE applications
Recent Work on Named Entity Matching
• Approaches divide into those that consider phonetic information and those that do not
• Lam et al. (2007) argue that many NE translations involve both semantic and phonetic clues
  – Their approach is formulated as a bipartite weighted graph matching problem
• Hsu et al. (2006) measure the similarity between two transliterations by comparing physical sounds
  – A Character Sound Comparison (CSC) method is used that involves the construction of a speech sound similarity database and a recognition stage
• Pouliquen et al. (2006) compute similarity between pairs of names using letter n-gram similarity, without using phonetic transliterations
• We propose the use of a pair-HMM, which has been successfully used for Dutch dialect similarity measurement by Wieling et al. (2007) and for word similarity by Mackay and Kondrak (2005)
pair-HMM
• The pair-HMM belongs to the family of models called Hidden Markov Models (HMMs)
• The pair-HMM originates from work on biological sequence analysis by Durbin et al. (1998)
• The difference from standard HMMs lies in observing a pair of sequences, or a pairwise alignment, instead of a single sequence (Fig. 1 and Fig. 2)

Fig. 1: An instantiation of a standard HMM
Fig. 2: An instantiation of a pair-HMM
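The distinction can be made concrete by contrasting what each model observes; the symbols below are illustrative only, not data from the paper:

```python
# A standard HMM observes a single symbol sequence; a pair-HMM observes an
# aligned pair of sequences, where "_" marks a gap (illustrative data only).
hmm_observation = ["p", "e", "t", "e", "r"]                  # one sequence

pair_observation = [("p", "п"), ("e", "ё"), ("t", "т"),
                    ("e", "_"), ("r", "р")]                  # symbol pairs
```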
pair-HMM used in previous NLP Tasks
[Figure: state diagram of the pair-HMM, with a match state M emitting symbol pairs (xi, yj), gap states X and Y emitting xi and yj respectively, gap-open probability δ, gap-extension probability ε, gap-switch probability λ, and end-transition probabilities τM and τXY.]

Fig. 3: pair-HMM used in previous work (Wieling et al., 2007; Mackay and Kondrak, 2005)
Proposed pair-HMM
[Figure: modified pair-HMM state diagram in which parameters are no longer shared between the two gap states: separate gap-open probabilities δx and δy, gap-extension probabilities εx and εy, gap-switch probabilities λx and λy, and end probabilities τx, τy, and τm.]

Fig. 4: Diagram illustrating proposed modifications to parameters of the pair-HMM
Name Matching using the pair-HMM
• The pair-HMM is used to compute similarity scores for two input strings
• The similarity scores can be used for different purposes
  – In this paper, for identification of pairs of highly similar strings
• The model uses values of initial, transition, and emission parameters that are determined through a training process
• The example on the next slide illustrates the different parameters required for computing the similarity scores
Name Matching using the pair-HMM

TABLE 1: ILLUSTRATION OF AN ALIGNMENT BETWEEN REPRESENTATIONS OF THE SAME NAME IN DIFFERENT LANGUAGES

  "peter"          p  e  t  e  r
  "пётр"           п  ё  т  _  р
  State sequence   M  M  M  X  M  END

• The equation below shows the parameters needed to calculate the score for the alignment above:

  score = P(M0) · P(p:п) · P(M→M) · P(e:ё) · P(M→M) · P(t:т) · P(M→X) · P(e:_) · P(X→M) · P(r:р) · P(M→END)
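As a sketch (not the trained model), the score of this fixed alignment can be computed by walking the state sequence and multiplying the corresponding transition and emission probabilities; every parameter value below is a made-up toy number:

```python
# Toy pair-HMM parameters (made-up values, not trained estimates).
start_M = 0.9          # P(M0): probability of starting in the match state M
trans = {("M", "M"): 0.8, ("M", "X"): 0.05,
         ("X", "M"): 0.7, ("M", "END"): 0.1}
emit_M = {("p", "п"): 0.9, ("e", "ё"): 0.8,
          ("t", "т"): 0.9, ("r", "р"): 0.9}
emit_X = {"e": 0.5}    # gap state X emits a source symbol against a gap

def alignment_score():
    # Follow the state sequence M M M X M END for "peter" / "пётр".
    score = start_M * emit_M[("p", "п")]
    score *= trans[("M", "M")] * emit_M[("e", "ё")]
    score *= trans[("M", "M")] * emit_M[("t", "т")]
    score *= trans[("M", "X")] * emit_X["e"]      # "e" aligned to a gap
    score *= trans[("X", "M")] * emit_M[("r", "р")]
    score *= trans[("M", "END")]
    return score
```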
Parameter Estimation for pair-HMMs
• Arribas-Gil et al. (2006) reviewed different parameter estimation approaches for pair-HMMs:
  – numerical maximization approaches, and the Expectation Maximization (EM) algorithm with its variants (Stochastic EM, Stochastic Approximation EM)
• An EM approach using the Baum-Welch algorithm had already been implemented, and it is retained for the pair-HMMs adapted in this work
pair-HMM Training Software
• Wieling et al.'s (2007) pair-HMM training software was adapted
• The software was modified to support the use of different alphabets
• Alphabets are generated automatically from the data set to be used for training
  – For the English-Russian dataset, we obtained 76 symbols for the English alphabet and 61 symbols for the Russian alphabet
  – For the English-French dataset, we had 57 symbols for both languages
• Another modification to Wieling's version of the training software was reducing the number of files containing the names used for training
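The automatic alphabet generation can be sketched as collecting the distinct symbols seen on each side of the training pairs; the name pairs below are illustrative, not the actual data:

```python
# Illustrative training pairs (English, Russian); the real data is larger.
pairs = [("peter", "пётр"), ("moscow", "москва")]

def build_alphabets(name_pairs):
    """Collect the distinct symbols on each side of the name pairs."""
    src, tgt = set(), set()
    for s, t in name_pairs:
        src.update(s)
        tgt.update(t)
    return sorted(src), sorted(tgt)

src_alpha, tgt_alpha = build_alphabets(pairs)
```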
pair-HMM Training Data
• Training data comprise pairs of names from two different languages
• English-French and English-Russian name pairs were obtained from the GeoNames data dump and the Wikipedia data dump
  – Full names with spaces in between were not used as-is; they were split and paired with their corresponding matches in the other language
• For the entity name matching task, 850 distinct English-French and 5902 distinct English-Russian pairs of names were extracted
  – For English-French, 600 pairs were used for training (282 iterations)
  – For English-Russian, 4500 pairs were used for training (848 iterations)
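The splitting rule for multi-word names might look like the following sketch; the pair shown is hypothetical, not taken from the actual data:

```python
# Hypothetical multi-word name pair; split on spaces and pair the parts
# in order with their counterparts in the other language.
raw = [("new york", "нью йорк")]

pairs = [(en_tok, ru_tok)
         for en, ru in raw
         for en_tok, ru_tok in zip(en.split(), ru.split())]
```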
pair-HMM Scoring Algorithms
• Two algorithms implemented in the pair-HMM have been used for scoring:
  – Forward algorithm
    • Takes all possible alignments into account to calculate the probability of the observation sequence given the model
  – Viterbi algorithm
    • Considers only the best alignment when calculating the probability of the observation sequence given the model
  – There are also log versions of the two algorithms, which compute the log of the probability of the observation sequence
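A minimal sketch of the two scoring algorithms, assuming a three-state pair-HMM (match M, gap states X and Y) with made-up toy parameters rather than trained ones; the only difference between Forward and Viterbi is summing versus maximizing over incoming paths:

```python
# Transition probabilities for a toy pair-HMM; "E" is the end state.
T = {("M", "M"): 0.8, ("M", "X"): 0.05, ("M", "Y"): 0.05, ("M", "E"): 0.1,
     ("X", "M"): 0.7, ("X", "X"): 0.2, ("X", "E"): 0.1,
     ("Y", "M"): 0.7, ("Y", "Y"): 0.2, ("Y", "E"): 0.1}

def p_match(a, b):
    # Toy match-state emission: reward identical characters.
    return 0.5 if a == b else 0.02

def p_gap(_symbol):
    # Toy gap-state emission probability.
    return 0.1

def score(x, y, combine):
    """Dynamic program over positions (i, j); combine is sum or max."""
    n, m = len(x), len(y)
    fM = [[0.0] * (m + 1) for _ in range(n + 1)]
    fX = [[0.0] * (m + 1) for _ in range(n + 1)]
    fY = [[0.0] * (m + 1) for _ in range(n + 1)]
    fM[0][0] = 1.0                      # start in M by convention
    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0 and j > 0:         # M emits the pair (x[i-1], y[j-1])
                fM[i][j] = p_match(x[i - 1], y[j - 1]) * combine(
                    [T[("M", "M")] * fM[i - 1][j - 1],
                     T[("X", "M")] * fX[i - 1][j - 1],
                     T[("Y", "M")] * fY[i - 1][j - 1]])
            if i > 0:                   # X emits x[i-1] against a gap
                fX[i][j] = p_gap(x[i - 1]) * combine(
                    [T[("M", "X")] * fM[i - 1][j],
                     T[("X", "X")] * fX[i - 1][j]])
            if j > 0:                   # Y emits y[j-1] against a gap
                fY[i][j] = p_gap(y[j - 1]) * combine(
                    [T[("M", "Y")] * fM[i][j - 1],
                     T[("Y", "Y")] * fY[i][j - 1]])
    return combine([T[("M", "E")] * fM[n][m],
                    T[("X", "E")] * fX[n][m],
                    T[("Y", "E")] * fY[n][m]])

def forward(x, y):      # sums over all alignments
    return score(x, y, sum)

def viterbi(x, y):      # keeps only the best alignment
    return score(x, y, max)
```

The log versions mentioned above would run the same recursion in log space to avoid numerical underflow on long strings.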
Evaluation Measures
• Two measures have been considered for evaluating the pair-HMM algorithms
• The first is Average Reciprocal Rank (ARR); equations for Average Rank (AR) and ARR follow from Voorhees and Tice (2000), where R(i) is the rank of the correct match for test item i:

  AR = (1/N) Σ_{i=1}^{N} R(i)

  ARR = (1/N) Σ_{i=1}^{N} 1/R(i)
• The computation for ARR, however, depends on the complexity of the evaluation set

TABLE 2: RANKING EXAMPLE AFTER USING THE FORWARD-LOG ALGORITHM

  English   French     Rank R(i)
  klausen   chiusa     5
  kraków    cracovie   1
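For the two ranks in Table 2, treated as a toy evaluation set of N = 2, the measures work out as follows:

```python
# Ranks at which the correct French match was returned for each English name.
ranks = [5, 1]     # klausen→chiusa at rank 5, kraków→cracovie at rank 1
N = len(ranks)

AR = sum(ranks) / N                      # Average Rank
ARR = sum(1 / r for r in ranks) / N      # Average Reciprocal Rank
```

Here AR = 3.0 and ARR = 0.6; a perfect system would have every R(i) = 1 and hence ARR = 1.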
Evaluation Measures: Cross Entropy (CE)
Used to compare the effectiveness of different language models; useful when we do not know the actual probability distribution that generated the data.
For the pair-HMM, CE is specified by:

  H(p, m) = −lim_{le→∞} (1/le) Σ_{x∈V1, y∈V2} p(x1:y1, …, xle:yle) log m(x1:y1, …, xle:yle)

which is approximated by:

  H(m) ≈ −(1/n) Σ_{x∈V1, y∈V2} (1/le) log m(x1:y1, …, xle:yle)
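The approximation can be read as averaging the per-pair, length-normalized negative log probability over the n test pairs. The sketch below assumes hypothetical model probabilities and base-2 logarithms; real values would come from the pair-HMM Forward or Viterbi algorithm:

```python
import math

# Hypothetical (alignment length le, model probability m(...)) for three
# test pairs -- stand-ins for actual pair-HMM outputs.
test_pairs = [(5, 1e-9), (4, 1e-7), (6, 1e-11)]

# Cross-entropy approximation: average of -(1/le) * log2 m(...) over n pairs.
H = -sum(math.log2(p) / le for le, p in test_pairs) / len(test_pairs)
```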
Results: ARR

TABLE 3: ARR RESULTS FOR ENGLISH-FRENCH DATA

  Algorithm     ARR (for N = 164)
  Viterbi-log   0.8099
  Forward-log   0.8081

TABLE 4: ARR RESULTS FOR ENGLISH-RUSSIAN DATA

  Algorithm     ARR (for N = 966)
  Viterbi       0.8536
  Forward       0.8546
  Viterbi-log   0.8359
  Forward-log   0.8355
Results: Cross Entropy
• ARR results show no significant difference between the accuracy of the two algorithms

TABLE 5: CROSS ENTROPY RESULTS FOR ENGLISH-RUSSIAN DATA

  Algorithm   Name-pair CE (for n = 1000)
  Viterbi     32.2946
  Forward     32.2009

• There is no significant difference in the accuracy of the Viterbi and Forward algorithms
Conclusion
• A pair-HMM has been introduced for application in matching entity names
• The evaluation carried out so far is not sufficient to give critical information regarding the performance of the pair-HMM
• The results show no significant differences between the Viterbi and Forward algorithms
• However, ARR results from the experiments are encouraging
• It is feasible to use the pair-HMM to generate transliterations
Future Work
• It should be interesting to create other structures associated with the pair-HMM, for example to incorporate contextual information
• The pair-HMM needs to be evaluated against other models
  – Alignment-based discriminative string similarity, as proposed in Bergsma and Kondrak (2007) for cognate identification, will be considered
THANKS !
Questions ?
References
1. W. Lam, S-K. Chan and R. Huang, "Named Entity Translation Matching and Learning: With Application for Mining Unseen Translations," ACM Transactions on Information Systems, vol. 25, issue 1, article 2, 2007.
2. C-C. Hsu, C-H. Chen, T-T. Shih and C-K. Chen, "Measuring Similarity between Transliterations against Noisy Data," ACM Transactions on Asian Language Information Processing, vol. 6, issue 2, article 5, 2005.
3. M. Wieling, T. Leinonen and J. Nerbonne, "Inducing Sound Segment Differences using Pair Hidden Markov Models," in J. Nerbonne, M. Ellison and G. Kondrak (eds.), Computing and Historical Phonology: 9th Meeting of ACL Special Interest Group for Computational Morphology and Phonology Workshop, Prague, pp. 48-56, 2007.
4. W. Mackay and G. Kondrak, "Computing Word Similarity and Identifying Cognates with Pair Hidden Markov Models," Proceedings of the Ninth Conference on Computational Natural Language Learning (CoNLL), pp. 40-47, Ann Arbor, Michigan, 2005.
5. R. Durbin, S.R. Eddy, A. Krogh and G. Mitchison, Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press, 1998.
6. A. Arribas-Gil, E. Gassiat and C. Matias, "Parameter Estimation in Pair-hidden Markov Models," Scandinavian Journal of Statistics, vol. 33, issue 4, pp. 651-671, 2006.
7. E.M. Voorhees and D.M. Tice, "The TREC-8 Question Answering Track Report," in Text REtrieval Conference (TREC-8), 2000.
8. C-J. Lee, J.S. Chang and J-S.R. Juang, "A Statistical Approach to Chinese-to-English Back Transliteration," in Proceedings of the 17th Pacific Asia Conference, 2003.
9. S. Bergsma and G. Kondrak, "Alignment-Based Discriminative String Similarity," in Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pp. 656-663, Prague, Czech Republic, June 2007.