an algorithm to align words for historical comparison

An Algorithm to Align Words for Historical Comparison Michael A. Covington (The University of Georgia) Journal of Computational Linguistics 1996 February 09, 2011 Hyojin Song



Page 1: An Algorithm to Align Words for Historical Comparison

An Algorithm to Align Words for Historical Comparison

Michael A. Covington (The University of Georgia)

Journal of Computational Linguistics 1996

February 09, 2011
Hyojin Song


Contents

Introduction

Algorithm

Experiment

Conclusion


Introduction

Mami’s here!!

Maumi... Why... mam...

Couldn't speech recognition be made a little better?


Introduction

The goal of this paper is to apply the comparative method to a pair of words suspected of being cognate

An algorithm for finding probably correct alignments on the basis of phonetic similarity
› An evaluation metric
› A guided search procedure

For example, the correct alignment of Latin “do” with Greek “didomi” is: [alignment figure not preserved]

Introduction

The segments of two words may be misaligned because of
› affixes (living or fossilized)
› reduplication
› sound changes
› elision
› monophthongization

Motivation
› A guided search algorithm for finding the best alignment of one word with another
• Both words are given in a broad phonetic transcription
• The aligner sees only surface forms, not sound laws or phonological rules


Contents

Introduction

Algorithm
› Alignment
› The Search Space
› The Full Evaluation Metric

Experiment

Conclusion


Algorithm

Alignment
Inexact string matching
› Only identical words can be aligned by exact string matching
› Otherwise, the task is to find the alignment that minimizes the difference between the two words

Dynamic programming algorithm
› Well known for inexact string matching
› However, we do not use it, for several reasons
• The strings being aligned are relatively short
» The efficiency of dynamic programming on long strings is not needed
• It gives only one alignment for each pair of strings, not the n best alternatives

a b c
b d e

a b ─ c
─ b d e


Algorithm

Alignment
An alignment can be viewed as
› A way of stepping through two words concurrently
› Consuming all the segments of each

The aligner can perform either a match or a skip
› A match: the aligner consumes a segment from each of the two words in a single step
› A skip: the aligner consumes a segment from one word while leaving the other word alone

a b ─ c
─ b d e
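
The match/skip view above can be sketched in a few lines of Python. This is only an illustration, not the paper's implementation: the step codes `"M"`, `"1"`, `"2"` and the function name `apply_steps` are hypothetical, and each character is assumed to stand for one segment.

```python
from typing import List, Tuple

GAP = "-"  # marks the position left empty by a skip


def apply_steps(word1: str, word2: str, steps: str) -> Tuple[List[str], List[str]]:
    """Replay a sequence of aligner steps and return the two aligned rows.

    Step codes (hypothetical names chosen for this sketch):
      "M" = match: consume one segment from each word
      "1" = skip:  consume one segment from word1 only
      "2" = skip:  consume one segment from word2 only
    """
    row1: List[str] = []
    row2: List[str] = []
    i = j = 0
    for step in steps:
        if step == "M":
            row1.append(word1[i])
            row2.append(word2[j])
            i += 1
            j += 1
        elif step == "1":
            row1.append(word1[i])
            row2.append(GAP)
            i += 1
        elif step == "2":
            row1.append(GAP)
            row2.append(word2[j])
            j += 1
        else:
            raise ValueError(f"unknown step code: {step!r}")
    # A complete alignment must consume every segment of both words.
    if i != len(word1) or j != len(word2):
        raise ValueError("steps must consume all segments of both words")
    return row1, row2


# The slide's example: skip "a", match "b", skip "d", match "c" with "e".
top, bottom = apply_steps("abc", "bde", "1M2M")
print(" ".join(top))     # a b - c
print(" ".join(bottom))  # - b d e
```
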


Algorithm

Alignment
The aligner is not allowed to perform
› In succession, a skip on one string and then a skip on the other
• The result would be almost equivalent to a match
• This restriction is called the no-alternating-skips rule

To identify the best alignment, the algorithm must assign a penalty to every skip or match
› The best alignment is the one with the lowest total penalty


Algorithm

Alignment
We can use the following penalties: [penalty table not preserved]

Then the possible alignments of Spanish el and French le (phonetically [lə]) are: [alignment figures not preserved]
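
To make the idea of a total penalty concrete, here is a minimal sketch that scores the three alignments of el and [lə] that the no-alternating-skips rule allows. The penalty values are assumptions chosen for illustration, since the slide's penalty table is not preserved in this transcript: 0.0 for identical segments, 0.5 for two different vowels or two different consonants, 1.0 for a vowel against a consonant, and 0.5 for a skip. The vowel inventory and the function names are likewise hypothetical.

```python
VOWELS = set("aeiouə")  # assumed toy vowel inventory
SKIP_PENALTY = 0.5      # assumed value, not taken from the slide
GAP = "-"


def match_penalty(a: str, b: str) -> float:
    """Penalty for matching two segments (illustrative values only)."""
    if a == b:
        return 0.0
    if (a in VOWELS) == (b in VOWELS):
        return 0.5  # vowel with different vowel, or consonant with different consonant
    return 1.0      # vowel against consonant


def total_penalty(row1: str, row2: str) -> float:
    """Sum the penalties of every column of an alignment."""
    total = 0.0
    for a, b in zip(row1, row2):
        total += SKIP_PENALTY if GAP in (a, b) else match_penalty(a, b)
    return total


candidates = [("el", "lə"), ("-el", "lə-"), ("el-", "-lə")]
for row1, row2 in candidates:
    print(row1, row2, total_penalty(row1, row2))

best = min(candidates, key=lambda rows: total_penalty(*rows))
```

Under these assumed values, the alignment that pairs the two l segments and skips both vowels has the lowest total and would be selected as the best.
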


Algorithm

The Search Space
Every possible alignment of a pair of words can be represented as a path through a tree

For example
› The word ‘has’ (English [hæz] and German [hat])
› We know that these words correspond segment by segment, but the aligner doesn’t
› The best alignment [figure not preserved]


Algorithm

The Search Space
Several rules
› The aligner tries first a match, then a skip on the first word, then a skip on the second, and computes all the consequences of each
› After completing each alignment, it backs up to the most recent untried alternative
› “Dead ends” in the tree are places where further computation is blocked by the no-alternating-skips rule
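
The search order described above can be sketched as a depth-first traversal. This is a hedged illustration rather than the paper's code: the function name `enumerate_alignments`, the step codes (`"M"` for a match, `"1"` and `"2"` for a skip on the first or second word), and the `allow_alternating_skips` switch are all hypothetical, and one character is assumed to be one segment.

```python
from typing import Iterator, List


def enumerate_alignments(
    word1: str, word2: str, allow_alternating_skips: bool = False
) -> Iterator[str]:
    """Yield every complete alignment as a string of step codes.

    At each node the search tries a match first, then a skip on word1,
    then a skip on word2. Branches that cannot be completed (dead ends)
    simply yield nothing, and the recursion backs up to the most recent
    untried alternative.
    """

    def walk(i: int, j: int, steps: List[str]) -> Iterator[str]:
        if i == len(word1) and j == len(word2):
            yield "".join(steps)
            return
        last = steps[-1] if steps else None
        # 1. Match: consume one segment from each word.
        if i < len(word1) and j < len(word2):
            yield from walk(i + 1, j + 1, steps + ["M"])
        # 2. Skip on word1, blocked right after a skip on word2.
        if i < len(word1) and (allow_alternating_skips or last != "2"):
            yield from walk(i + 1, j, steps + ["1"])
        # 3. Skip on word2, blocked right after a skip on word1.
        if j < len(word2) and (allow_alternating_skips or last != "1"):
            yield from walk(i, j + 1, steps + ["2"])

    yield from walk(0, 0, [])


print(list(enumerate_alignments("el", "lə")))        # ['MM', '1M2', '2M1']
print(len(list(enumerate_alignments("hæz", "hat")))) # 9
```

Switching the rule off with `allow_alternating_skips=True` lets the same traversal show how much larger the unrestricted tree is.
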


Algorithm

The Search Space
As should be evident, the search tree can be quite large
› Even if the words being aligned are fairly short

Table 1 gives the number of possible alignments for words of various lengths
› When both words are of length n, there are about 3^(n-1) alignments
› Without the no-alternating-skips rule, the number would be about 5^n / 2

Fortunately, the aligner can greatly narrow the search
› It abandons any branch of the search tree as soon as the accumulated penalty exceeds the total penalty of the best alignment found so far
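
The pruning idea can be sketched as a small branch-and-bound search. The penalty values below are assumptions for illustration only (0.0 identical, 0.5 same class, 1.0 vowel against consonant, 0.5 per skip), not the paper's metric, and the names `best_alignments` and `match_penalty` are hypothetical. Ties are kept, since a branch is abandoned only when its penalty exceeds the best total so far.

```python
from typing import List, Tuple

VOWELS = set("aeiouæə")  # assumed toy vowel inventory
SKIP_PENALTY = 0.5       # assumed value


def match_penalty(a: str, b: str) -> float:
    """Illustrative match penalty (assumed values)."""
    if a == b:
        return 0.0
    return 0.5 if (a in VOWELS) == (b in VOWELS) else 1.0


def best_alignments(word1: str, word2: str) -> Tuple[float, List[str]]:
    """Return the lowest total penalty and all step strings that reach it.

    Steps: "M" = match, "1" = skip on word1, "2" = skip on word2.
    """
    best_penalty = float("inf")
    best_steps: List[str] = []

    def walk(i: int, j: int, steps: str, penalty: float) -> None:
        nonlocal best_penalty, best_steps
        # Prune: this branch is already worse than the best complete alignment.
        if penalty > best_penalty:
            return
        if i == len(word1) and j == len(word2):
            if penalty < best_penalty:
                best_penalty, best_steps = penalty, [steps]
            else:  # equal penalty: keep the tie
                best_steps.append(steps)
            return
        last = steps[-1:]  # "" at the root
        # Matches are tried before skips so a good bound is found early.
        if i < len(word1) and j < len(word2):
            walk(i + 1, j + 1, steps + "M",
                 penalty + match_penalty(word1[i], word2[j]))
        if i < len(word1) and last != "2":
            walk(i + 1, j, steps + "1", penalty + SKIP_PENALTY)
        if j < len(word2) and last != "1":
            walk(i, j + 1, steps + "2", penalty + SKIP_PENALTY)

    walk(0, 0, "", 0.0)
    return best_penalty, best_steps


print(best_alignments("hæz", "hat"))  # (1.0, ['MMM'])
print(best_alignments("el", "lə"))    # (1.0, ['1M2'])
```
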


Algorithm

The Search Space
The search tree after pruning
› The total amount of work is roughly cut in half
› With larger trees, the saving can be even greater

It is important, at each stage, to try matches before trying skips
› Otherwise the aligner would start by generating a large number of useless displacements of each string relative to the other

Algorithm

The Full Evaluation Metric
Table 2 shows an evaluation metric
› Developed by trial and error
› Using the 82 cognate pairs

For example
› Maumi vs. Mami

m a u m i
m a - m i

0 + 5 + 50 + 0 + 5 = 60
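
The arithmetic in this example can be reproduced with a short sketch. Only three penalty values are used, and they are inferred from the sum shown on the slide rather than read from Table 2, which is not reproduced here: 0 for identical consonants, 5 for identical vowels, and 50 for a skip. The function name `column_penalties` and the vowel set are hypothetical, and non-identical matches are deliberately left unsupported.

```python
from typing import List

VOWELS = set("aeiou")   # assumed toy vowel inventory
EXACT_CONSONANT = 0     # inferred from the slide's sum
EXACT_VOWEL = 5         # inferred from the slide's sum
SKIP = 50               # inferred from the slide's sum
GAP = "-"


def column_penalties(row1: str, row2: str) -> List[int]:
    """Return the penalty of each column of an alignment."""
    penalties = []
    for a, b in zip(row1, row2):
        if GAP in (a, b):
            penalties.append(SKIP)
        elif a == b:
            penalties.append(EXACT_VOWEL if a in VOWELS else EXACT_CONSONANT)
        else:
            # The rest of the metric is not given in this transcript.
            raise ValueError(f"no penalty known for {a!r} against {b!r}")
    return penalties


penalties = column_penalties("maumi", "ma-mi")
print(" + ".join(str(p) for p in penalties), "=", sum(penalties))
# 0 + 5 + 50 + 0 + 5 = 60
```
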


Contents

Introduction

Algorithm

Experiment
› Results on Actual Data

Conclusion


Experiment

Results on Actual Data
Tables 3 to 10 show how the aligner performed on 82 cognate pairs in various languages
› Tables 5-8 are loosely based on the Swadesh word lists of Ringe 1992

Tables 3, 4: test set of Spanish-French cognate pairs
› These languages were chosen because they are historically close but phonologically very different
› The aligner performed almost flawlessly

Experiment

Results on Actual Data
Tables 5, 6: test set of English and German cognate pairs
› With English and German the aligner did almost as well
› The s in this is aligned with the wrong s in dieses because that alignment gave greater phonetic similarity
› Taking off the inflectional ending would have prevented this mistake

Experiment

Results on Actual Data
Tables 7, 8: test set of English and Latin cognate pairs
› They are much harder to pair up
• They are separated by millennia of phonological and morphological change, including Grimm’s Law
› Nonetheless, the aligner did reasonably well with them, correctly aligning [examples not preserved]
› Although it found the correct alignment of fish with piscis, it could not distinguish it from three alternatives

Experiment

Results on Actual Data
Table 9: test set of Fox-Menomini cognate pairs
› Table 9 shows that the algorithm works well with non-Indo-European languages
› Apart from some minor trouble with the suffix of the first item, the aligner had smooth sailing

Experiment

Results on Actual Data
Table 10: test set of cognate pairs from other languages
› Table 10 shows how the aligner fared with some word pairs involving Latin, Greek, Sanskrit, and Avestan, again without knowledge of morphology
› Because it knows nothing about place of articulation or Grimm’s Law, it cannot tell whether the d in daughter corresponds with the th or the g in Greek thugater


Conclusion

An algorithm for finding probably correct alignments on the basis of phonetic similarity
› An evaluation metric
› A guided search procedure

This alignment algorithm and its evaluation metric are, in effect, a formal reconstruction of something that historical linguists do intuitively.

A possible extension would be to enable the aligner to recognize assimilation, metathesis, and even reduplication
› It could then assign lower penalties to these than to arbitrary mismatches


Thank You!

Any questions or comments?