Pair Hidden Markov Model for Named Entity Matching
Peter Nabende, Jörg Tiedemann, John Nerbonne
Department of Computational Linguistics,
Center for Language and Cognition Groningen,
University of Groningen, Netherlands
{p.nabende, j.tiedemann, j.nerbonne}@rug.nl
Introduction
• Three types of named entities: entity names, temporal expressions, and number expressions
  – Entity names refer to organization, person, and location names
• Many entity names exist across different languages, necessitating proper handling
• Bilingual lexicons cover only a very small percentage of entity names
• An MT system performs poorly on unseen entity names that have translations or transliterations in a target language
• Beyond MT, similarity measurement between cross-lingual entity names is important for CLIR and CLIE applications
Recent Work on Named Entity Matching
• Approaches divide into those that consider phonetic information and those that do not
• Lam et al. (2007) argue that many NE translations involve both semantic and phonetic clues
  – Their approach is formulated as a bipartite weighted graph matching problem
• Hsu et al. (2006) measure the similarity between two transliterations by comparing physical sounds
  – A Character Sound Comparison (CSC) method is used that involves the construction of a speech sound similarity database and a recognition stage
• Pouliquen et al. (2006) compute similarity between pairs of names using letter n-gram similarity, without using phonetic transliterations
• We propose the use of a pair-HMM, which has been successfully used for Dutch dialect similarity measurement by Wieling et al. (2007) and for word similarity by Mackay and Kondrak (2005)
pair-HMM
• The pair-HMM belongs to the family of models called Hidden Markov Models (HMMs)
• The pair-HMM originates from work on biological sequence analysis by Durbin et al. (1998)
• The difference from standard HMMs lies in observing a pair of sequences, or a pairwise alignment, instead of a single sequence (Fig. 1 and Fig. 2)

Fig. 1: An instantiation of a standard HMM
Fig. 2: An instantiation of a pair-HMM
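The distinction can be made concrete by contrasting what each model observes; the symbols below are illustrative only, not data from the paper:

```python
# A standard HMM observes a single symbol sequence; a pair-HMM observes an
# aligned pair of sequences, where "_" marks a gap (illustrative data only).
hmm_observation = ["p", "e", "t", "e", "r"]                  # one sequence

pair_observation = [("p", "п"), ("e", "ё"), ("t", "т"),
                    ("e", "_"), ("r", "р")]                  # symbol pairs
```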
pair-HMM used in previous NLP Tasks
[Figure: state diagram of the pair-HMM, with a match state M emitting symbol pairs (xi, yj), gap states X and Y emitting xi and yj respectively, gap-open probability δ, gap-extension probability ε, gap-switch probability λ, and end-transition probabilities τM and τXY.]

Fig. 3: pair-HMM used in previous work (Wieling et al., 2007; Mackay and Kondrak, 2005)
Proposed pair-HMM
[Figure: modified pair-HMM state diagram in which parameters are no longer shared between the two gap states: separate gap-open probabilities δx and δy, gap-extension probabilities εx and εy, gap-switch probabilities λx and λy, and end probabilities τx, τy, and τm.]

Fig. 4: Diagram illustrating proposed modifications to parameters of the pair-HMM
Name Matching using the pair-HMM
• The pair-HMM is used to compute similarity scores for two input strings
• The similarity scores can be used for different purposes
  – In this paper, for identification of pairs of highly similar strings
• The model uses values of initial, transition, and emission parameters that are determined through a training process
• The example on the next slide illustrates the different parameters required for computing the similarity scores
Name Matching using the pair-HMM

TABLE 1: ILLUSTRATION OF AN ALIGNMENT BETWEEN REPRESENTATIONS OF THE SAME NAME IN DIFFERENT LANGUAGES

  "peter"          p  e  t  e  r
  "пётр"           п  ё  т  _  р
  State sequence   M  M  M  X  M  END

• The equation below shows the parameters needed to calculate the score for the alignment above:

  score = P(M0) · P(p:п) · P(M→M) · P(e:ё) · P(M→M) · P(t:т) · P(M→X) · P(e:_) · P(X→M) · P(r:р) · P(M→END)
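As a sketch (not the trained model), the score of this fixed alignment can be computed by walking the state sequence and multiplying the corresponding transition and emission probabilities; every parameter value below is a made-up toy number:

```python
# Toy pair-HMM parameters (made-up values, not trained estimates).
start_M = 0.9          # P(M0): probability of starting in the match state M
trans = {("M", "M"): 0.8, ("M", "X"): 0.05,
         ("X", "M"): 0.7, ("M", "END"): 0.1}
emit_M = {("p", "п"): 0.9, ("e", "ё"): 0.8,
          ("t", "т"): 0.9, ("r", "р"): 0.9}
emit_X = {"e": 0.5}    # gap state X emits a source symbol against a gap

def alignment_score():
    # Follow the state sequence M M M X M END for "peter" / "пётр".
    score = start_M * emit_M[("p", "п")]
    score *= trans[("M", "M")] * emit_M[("e", "ё")]
    score *= trans[("M", "M")] * emit_M[("t", "т")]
    score *= trans[("M", "X")] * emit_X["e"]      # "e" aligned to a gap
    score *= trans[("X", "M")] * emit_M[("r", "р")]
    score *= trans[("M", "END")]
    return score
```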
Parameter Estimation for pair-HMMs
• Arribas-Gil et al. (2006) reviewed different parameter estimation approaches for pair-HMMs:
  – numerical maximization approaches, and the Expectation Maximization (EM) algorithm with its variants (Stochastic EM, Stochastic Approximation EM)
• An EM approach using the Baum-Welch algorithm had already been implemented, and it is retained for the pair-HMMs adapted in this work
pair-HMM Training Software
• Wieling et al.'s (2007) pair-HMM training software was adapted
• The software was modified to support the use of different alphabets
• Alphabets are generated automatically from the data set to be used for training
  – For the English-Russian dataset, we obtained 76 symbols for the English alphabet and 61 symbols for the Russian alphabet
  – For the English-French dataset, we had 57 symbols for both languages
• Another modification to Wieling's version of the training software was reducing the number of files containing the names used for training
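The automatic alphabet generation can be sketched as collecting the distinct symbols seen on each side of the training pairs; the name pairs below are illustrative, not the actual data:

```python
# Illustrative training pairs (English, Russian); the real data is larger.
pairs = [("peter", "пётр"), ("moscow", "москва")]

def build_alphabets(name_pairs):
    """Collect the distinct symbols on each side of the name pairs."""
    src, tgt = set(), set()
    for s, t in name_pairs:
        src.update(s)
        tgt.update(t)
    return sorted(src), sorted(tgt)

src_alpha, tgt_alpha = build_alphabets(pairs)
```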
pair-HMM Training Data
• Training data comprise pairs of names from two different languages
• English-French and English-Russian name pairs were obtained from the GeoNames data dump and the Wikipedia data dump
  – Full names with spaces in between were not used as-is; they were split and paired with their corresponding matches in the other language
• For the entity name matching task, 850 distinct English-French and 5902 distinct English-Russian pairs of names were extracted
  – For English-French, 600 pairs were used for training (282 iterations)
  – For English-Russian, 4500 pairs were used for training (848 iterations)
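The splitting rule for multi-word names might look like the following sketch; the pair shown is hypothetical, not taken from the actual data:

```python
# Hypothetical multi-word name pair; split on spaces and pair the parts
# in order with their counterparts in the other language.
raw = [("new york", "нью йорк")]

pairs = [(en_tok, ru_tok)
         for en, ru in raw
         for en_tok, ru_tok in zip(en.split(), ru.split())]
```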
pair-HMM Scoring Algorithms
• Two algorithms implemented in the pair-HMM have been used for scoring:
  – Forward algorithm
    • Takes all possible alignments into account to calculate the probability of the observation sequence given the model
  – Viterbi algorithm
    • Considers only the best alignment when calculating the probability of the observation sequence given the model
  – There are also log versions of the two algorithms, which compute the log of the probability of the observation sequence
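A minimal sketch of the two scoring algorithms, assuming a three-state pair-HMM (match M, gap states X and Y) with made-up toy parameters rather than trained ones; the only difference between Forward and Viterbi is summing versus maximizing over incoming paths:

```python
# Transition probabilities for a toy pair-HMM; "E" is the end state.
T = {("M", "M"): 0.8, ("M", "X"): 0.05, ("M", "Y"): 0.05, ("M", "E"): 0.1,
     ("X", "M"): 0.7, ("X", "X"): 0.2, ("X", "E"): 0.1,
     ("Y", "M"): 0.7, ("Y", "Y"): 0.2, ("Y", "E"): 0.1}

def p_match(a, b):
    # Toy match-state emission: reward identical characters.
    return 0.5 if a == b else 0.02

def p_gap(_symbol):
    # Toy gap-state emission probability.
    return 0.1

def score(x, y, combine):
    """Dynamic program over positions (i, j); combine is sum or max."""
    n, m = len(x), len(y)
    fM = [[0.0] * (m + 1) for _ in range(n + 1)]
    fX = [[0.0] * (m + 1) for _ in range(n + 1)]
    fY = [[0.0] * (m + 1) for _ in range(n + 1)]
    fM[0][0] = 1.0                      # start in M by convention
    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0 and j > 0:         # M emits the pair (x[i-1], y[j-1])
                fM[i][j] = p_match(x[i - 1], y[j - 1]) * combine(
                    [T[("M", "M")] * fM[i - 1][j - 1],
                     T[("X", "M")] * fX[i - 1][j - 1],
                     T[("Y", "M")] * fY[i - 1][j - 1]])
            if i > 0:                   # X emits x[i-1] against a gap
                fX[i][j] = p_gap(x[i - 1]) * combine(
                    [T[("M", "X")] * fM[i - 1][j],
                     T[("X", "X")] * fX[i - 1][j]])
            if j > 0:                   # Y emits y[j-1] against a gap
                fY[i][j] = p_gap(y[j - 1]) * combine(
                    [T[("M", "Y")] * fM[i][j - 1],
                     T[("Y", "Y")] * fY[i][j - 1]])
    return combine([T[("M", "E")] * fM[n][m],
                    T[("X", "E")] * fX[n][m],
                    T[("Y", "E")] * fY[n][m]])

def forward(x, y):      # sums over all alignments
    return score(x, y, sum)

def viterbi(x, y):      # keeps only the best alignment
    return score(x, y, max)
```

The log versions mentioned above would run the same recursion in log space to avoid numerical underflow on long strings.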
Evaluation Measures
• Two measures have been considered for evaluating the pair-HMM algorithms
• The first is Average Reciprocal Rank (ARR); equations for Average Rank (AR) and ARR follow from Voorhees and Tice (2000), where R(i) is the rank of the correct match for test item i:

  AR = (1/N) Σ_{i=1}^{N} R(i)

  ARR = (1/N) Σ_{i=1}^{N} 1/R(i)
• The computation for ARR, however, depends on the complexity of the evaluation set

TABLE 2: RANKING EXAMPLE AFTER USING THE FORWARD-LOG ALGORITHM

  English   French     Rank R(i)
  klausen   chiusa     5
  kraków    cracovie   1
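For the two ranks in Table 2, treated as a toy evaluation set of N = 2, the measures work out as follows:

```python
# Ranks at which the correct French match was returned for each English name.
ranks = [5, 1]     # klausen→chiusa at rank 5, kraków→cracovie at rank 1
N = len(ranks)

AR = sum(ranks) / N                      # Average Rank
ARR = sum(1 / r for r in ranks) / N      # Average Reciprocal Rank
```

Here AR = 3.0 and ARR = 0.6; a perfect system would have every R(i) = 1 and hence ARR = 1.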
Evaluation Measures: Cross Entropy (CE)
Used to compare the effectiveness of different language models; useful when we do not know the actual probability distribution that generated the data.
For the pair-HMM, CE is specified by:

  H(p, m) = −lim_{le→∞} (1/le) Σ_{x∈V1, y∈V2} p(x1:y1, …, xle:yle) log m(x1:y1, …, xle:yle)

which is approximated by:

  H(m) ≈ −(1/n) Σ_{x∈V1, y∈V2} (1/le) log m(x1:y1, …, xle:yle)
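The approximation can be read as averaging the per-pair, length-normalized negative log probability over the n test pairs. The sketch below assumes hypothetical model probabilities and base-2 logarithms; real values would come from the pair-HMM Forward or Viterbi algorithm:

```python
import math

# Hypothetical (alignment length le, model probability m(...)) for three
# test pairs -- stand-ins for actual pair-HMM outputs.
test_pairs = [(5, 1e-9), (4, 1e-7), (6, 1e-11)]

# Cross-entropy approximation: average of -(1/le) * log2 m(...) over n pairs.
H = -sum(math.log2(p) / le for le, p in test_pairs) / len(test_pairs)
```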
Results: ARR

TABLE 3: ARR RESULTS FOR ENGLISH-FRENCH DATA

  Algorithm     ARR (for N = 164)
  Viterbi-log   0.8099
  Forward-log   0.8081

TABLE 4: ARR RESULTS FOR ENGLISH-RUSSIAN DATA

  Algorithm     ARR (for N = 966)
  Viterbi       0.8536
  Forward       0.8546
  Viterbi-log   0.8359
  Forward-log   0.8355
Results: Cross Entropy
• ARR results show no significant difference between the accuracy of the two algorithms

TABLE 5: CROSS ENTROPY RESULTS FOR ENGLISH-RUSSIAN DATA

  Algorithm   Name-pair CE (for n = 1000)
  Viterbi     32.2946
  Forward     32.2009

• There is no significant difference in the accuracy of the Viterbi and Forward algorithms
Conclusion
• A pair-HMM has been introduced for application in matching entity names
• The evaluation carried out so far is not sufficient to give critical information regarding the performance of the pair-HMM
• The results show no significant differences between the Viterbi and Forward algorithms
• However, ARR results from the experiments are encouraging
• It is feasible to use the pair-HMM to generate transliterations
Future Work
• It should be interesting to create other structures associated with the pair-HMM, for example to incorporate contextual information
• The pair-HMM needs to be evaluated against other models
  – Alignment-based discriminative string similarity, as proposed in Bergsma and Kondrak (2007) for cognate identification, will be considered
THANKS !
Questions ?
References
1. W. Lam, S-K. Chan and R. Huang, "Named Entity Translation Matching and Learning: With Application for Mining Unseen Translations," ACM Transactions on Information Systems, vol. 25, issue 1, article 2, 2007.
2. C-C. Hsu, C-H. Chen, T-T. Shih and C-K. Chen, "Measuring Similarity between Transliterations against Noisy Data," ACM Transactions on Asian Language Information Processing, vol. 6, issue 2, article 5, 2005.
3. M. Wieling, T. Leinonen and J. Nerbonne, "Inducing Sound Segment Differences using Pair Hidden Markov Models," in J. Nerbonne, M. Ellison and G. Kondrak (eds.), Computing and Historical Phonology: 9th Meeting of ACL Special Interest Group for Computational Morphology and Phonology Workshop, Prague, pp. 48-56, 2007.
4. W. Mackay and G. Kondrak, "Computing Word Similarity and Identifying Cognates with Pair Hidden Markov Models," Proceedings of the Ninth Conference on Computational Natural Language Learning (CoNLL), pp. 40-47, Ann Arbor, Michigan, 2005.
5. R. Durbin, S.R. Eddy, A. Krogh and G. Mitchison, Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press, 1998.
6. A. Arribas-Gil, E. Gassiat and C. Matias, "Parameter Estimation in Pair-hidden Markov Models," Scandinavian Journal of Statistics, vol. 33, issue 4, pp. 651-671, 2006.
7. E.M. Voorhees and D.M. Tice, "The TREC-8 Question Answering Track Report," in Text REtrieval Conference (TREC-8), 2000.
8. C-J. Lee, J.S. Chang and J-S.R. Juang, "A Statistical Approach to Chinese-to-English Back Transliteration," in Proceedings of the 17th Pacific Asia Conference, 2003.
9. S. Bergsma and G. Kondrak, "Alignment-Based Discriminative String Similarity," in Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pp. 656-663, Prague, Czech Republic, June 2007.