an algorithm to align words for historical comparison

An Algorithm to Align Words for Historical Comparison

Michael A. Covington (The University of Georgia)

Computational Linguistics, 1996

February 09, 2011
Hyojin Song

Contents

Introduction

Algorithm

Experiment

Conclusion


Introduction

Mami’s here!!

Maumi... Why... mam...

Couldn't speech recognition be done a little better??

Introduction

The first step in applying the comparative method to a pair of words suspected of being cognate is to align their corresponding segments; this paper automates that step

An algorithm for finding probably correct alignments on the basis of phonetic similarity
› An evaluation metric
› A guided search procedure

For example, the correct alignment of Latin “do” with Greek “didomi” is

- - d o - -
d i d o m i

Introduction

The segments of two words may be misaligned because of
› Affixes (living or fossilized)
› Reduplication
› Sound changes
› Elision
› Monophthongization

Motivation
› A guided search algorithm for finding the best alignment of one word with another
• Both words are given in a broad phonetic transcription
• The algorithm sees only surface forms, not sound laws or phonological rules

Contents

Introduction

Algorithm
› Alignment
› The Search Space
› The Full Evaluation Metric

Experiment

Conclusion


Algorithm

Alignment

Alignment is a form of inexact string matching
› Only identical words match exactly
› The task is to find the alignment that minimizes the difference between the two words

Dynamic programming algorithm
› Well known for inexact string matching
› However, we do not use it, for several reasons
• The strings being aligned are relatively short
» The efficiency of dynamic programming on long strings is not needed
• It gives only one alignment for each pair of strings, not the n best alternatives

Unaligned:
a b c
b d e

Aligned:
a b ─ c
─ b d e

Algorithm

Alignment

An alignment can be viewed as
› A way of stepping through two words concurrently
› Consuming all the segments of each

At each step the aligner performs either a match or a skip
› A match: the aligner consumes a segment from each of the two words in a single step
› A skip: it consumes a segment from one word while leaving the other word alone

a b ─ c
─ b d e

Algorithm

Alignment

The aligner is not allowed to perform, in succession, a skip on one string and then a skip on the other
› The result would be almost equivalent to a match
› This restriction is called the no-alternating-skips rule

To identify the best alignment, the algorithm must assign a penalty to every skip or match
› The best alignment is the one with the lowest total penalty

Algorithm

Alignment

We can use the following penalties:

Then the possible alignments of Spanish el and French le (phonetically [lə]) are:

Hyewon's scroll

Seongwi's scroll

Jibeom's scroll
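The three el/le alignments can be reproduced with a short sketch. The penalty table of this slide did not survive in the transcript, so the values below (0.0 for identical segments, 0.5 for vowel with vowel or consonant with consonant, 1.0 for any other match, 0.5 per skip) are assumptions for illustration only, and the names `match_penalty` and `alignments` are hypothetical.

```python
# Sketch only: the penalty values below are assumptions, not taken from the slide.
VOWELS = set("aeiouə")
SKIP_PENALTY = 0.5  # assumed cost of one skip


def match_penalty(a, b):
    """Assumed simple metric: 0.0 identical, 0.5 same class, 1.0 otherwise."""
    if a == b:
        return 0.0
    if (a in VOWELS) == (b in VOWELS):
        return 0.5  # vowel with vowel, or consonant with consonant
    return 1.0


def alignments(w1, w2, last=None):
    """Yield (columns, penalty) for every alignment that obeys the
    no-alternating-skips rule; `last` is the type of the previous step."""
    if not w1 and not w2:
        yield [], 0.0
        return
    if w1 and w2:  # match: consume one segment from each word
        for rest, p in alignments(w1[1:], w2[1:], "match"):
            yield [(w1[0], w2[0])] + rest, match_penalty(w1[0], w2[0]) + p
    if w1 and last != "skip2":  # consume from word 1 only, never right after skip2
        for rest, p in alignments(w1[1:], w2, "skip1"):
            yield [(w1[0], "-")] + rest, SKIP_PENALTY + p
    if w2 and last != "skip1":  # consume from word 2 only, never right after skip1
        for rest, p in alignments(w1, w2[1:], "skip2"):
            yield [("-", w2[0])] + rest, SKIP_PENALTY + p


for columns, penalty in alignments("el", "lə"):
    print(" ".join(a for a, _ in columns))
    print(" ".join(b for _, b in columns))
    print("penalty =", penalty, "\n")
```

Under these assumed penalties the rule leaves exactly three alignments, and the cheapest is `e l -` over `- l ə`, which lines up the two l segments.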

Algorithm

The Search Space

The set of all possible alignments of a pair of words can be represented as a tree

For example
› The word has: English [hæz] and German [hat]
› We know that these words correspond segment by segment, but the aligner does not
› The best alignment pairs the segments one to one: h with h, æ with a, z with t

Algorithm

The Search Space

Several rules
› The aligner first tries a match, then a skip on the first word, then a skip on the second, and computes all the consequences of each
› After completing each alignment, it backs up to the most recent untried alternative
› Dead ends in the tree are places where further computation is blocked by the no-alternating-skips rule

Algorithm

The Search Space

As should be evident, the search tree can be quite large
› Even if the words being aligned are fairly short

Table 1 gives the number of possible alignments for words of various lengths
› When both words are of length n, there are about 3^(n-1) alignments
› Without the no-alternating-skips rule, the number would be about 5^n / 2
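The two growth estimates can be checked by counting alignments directly. The sketch below is not from the paper; `count` is a hypothetical helper that counts complete alignments (dead ends excluded), with or without the no-alternating-skips rule.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def count(m, n, last="match", restricted=True):
    """Count complete alignments when m and n segments remain.
    `last` is the previous step; `restricted` applies the no-alternating-skips rule."""
    if m == 0 and n == 0:
        return 1
    total = 0
    if m and n:  # match
        total += count(m - 1, n - 1, "match", restricted)
    if m and not (restricted and last == "skip2"):  # skip on word 1
        total += count(m - 1, n, "skip1", restricted)
    if n and not (restricted and last == "skip1"):  # skip on word 2
        total += count(m, n - 1, "skip2", restricted)
    return total


print("n  with rule  3^(n-1)  without rule  5^n/2")
for n in range(1, 7):
    print(n, count(n, n), 3 ** (n - 1), count(n, n, "match", False), 5 ** n / 2)
```

For small n the restricted count equals 3^(n-1) exactly and then drifts slightly above it, which is consistent with the word "about" on the slide.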

Fortunately, the aligner can greatly narrow the search

› It abandons any branch of the search tree as soon as the accumulated penalty exceeds the total penalty of the best alignment found so far

Algorithm

The Search Space

The search tree after pruning
› The total amount of work is roughly cut in half
› With larger trees, the saving can be even greater

It is important, at each stage, to try matches before trying skips

› Otherwise the aligner would start by generating a large number of useless displacements of each string relative to the other
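The guided search described on these slides (match first, then each kind of skip, abandoning a branch once it costs more than the best complete alignment so far) might look like the sketch below. The penalty values are again assumptions rather than the slide's own table, and `best_alignments` is a hypothetical name.

```python
VOWELS = set("aeiouəæ")
SKIP_PENALTY = 0.5  # assumed


def match_penalty(a, b):
    """Assumed simple metric: 0.0 identical, 0.5 same class, 1.0 otherwise."""
    if a == b:
        return 0.0
    return 0.5 if (a in VOWELS) == (b in VOWELS) else 1.0


def best_alignments(w1, w2):
    """Depth-first search with pruning.
    Returns (lowest penalty, all alignments with that penalty, nodes visited)."""
    best_penalty = float("inf")
    best = []
    visited = 0

    def search(i, j, last, total, columns):
        nonlocal best_penalty, best, visited
        visited += 1
        if total > best_penalty:  # prune: already worse than the best so far
            return
        if i == len(w1) and j == len(w2):  # both words fully consumed
            if total < best_penalty:
                best_penalty, best = total, []
            best.append(columns)  # ties are kept
            return
        if i < len(w1) and j < len(w2):  # try the match first
            cost = match_penalty(w1[i], w2[j])
            search(i + 1, j + 1, "match", total + cost, columns + [(w1[i], w2[j])])
        if i < len(w1) and last != "skip2":  # then a skip on the first word
            search(i + 1, j, "skip1", total + SKIP_PENALTY, columns + [(w1[i], "-")])
        if j < len(w2) and last != "skip1":  # then a skip on the second word
            search(i, j + 1, "skip2", total + SKIP_PENALTY, columns + [("-", w2[j])])

    search(0, 0, None, 0.0, [])
    return best_penalty, best, visited


penalty, found, visited = best_alignments("hæz", "hat")
print(penalty, found, visited)
```

Because matches are tried first, the segment-by-segment alignment of [hæz] with [hat] is found on the very first descent, and its penalty then cuts off most of the remaining tree. Ties are kept rather than pruned, so the function can return several equally good alignments.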


Algorithm

The Full Evaluation Metric

Table 2 shows an evaluation metric
› Developed by trial and error
› Using the 82 cognate pairs

For example
› Maumi vs. Mami

m a u m i
m a - m i

0 + 5 + 50 + 0 + 5 = 60
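The arithmetic of this example can be reproduced with a small scoring sketch. Only three values are confirmed by the slide's own sum (0 for identical consonants, 5 for identical vowels, 50 for a skip not preceded by another skip). Table 2 itself is not in the transcript, so the remaining values (40, 30, 60, 100) are assumptions, and `column_penalty` and `total_penalty` are hypothetical names.

```python
VOWELS = set("aeiou")
GAP = "-"


def column_penalty(a, b, previous_was_skip):
    """Penalty for one column of an alignment."""
    if a == GAP or b == GAP:
        # 50 appears in the slide's sum; 40 for a repeated skip is assumed
        return 40 if previous_was_skip else 50
    if a == b:
        return 5 if a in VOWELS else 0  # both values appear in the slide's sum
    if a in VOWELS and b in VOWELS:
        return 30  # assumed: dissimilar vowels
    if a not in VOWELS and b not in VOWELS:
        return 60  # assumed: dissimilar consonants
    return 100  # assumed: no discernible similarity


def total_penalty(top, bottom):
    """Sum the column penalties of two space-separated alignment rows."""
    total, previous_was_skip = 0, False
    for a, b in zip(top.split(), bottom.split()):
        total += column_penalty(a, b, previous_was_skip)
        previous_was_skip = GAP in (a, b)
    return total


print(total_penalty("m a u m i", "m a - m i"))  # 60
```

Tracking only whether the previous column was a skip is enough here, because under the no-alternating-skips rule two consecutive skips always fall in the same word.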

Contents

Introduction

Algorithm

Experiment
› Results on Actual Data

Conclusion


Experiment

Results on Actual Data

Tables 3 to 10 show how the aligner performed on 82 cognate pairs in various languages
› Tables 5-8 are loosely based on the Swadesh word lists of Ringe 1992

Tables 3, 4: test set of Spanish-French cognate pairs
› This test set was chosen because the two languages are historically close but phonologically very different
› The aligner performed almost flawlessly

Experiment

Results on Actual Data

Tables 5, 6: test set of English and German cognate pairs
› With English and German it did almost as well
› The s in this is aligned with the wrong s in dieses because that alignment gave greater phonetic similarity
› Taking off the inflectional ending would have prevented this mistake

Experiment

Results on Actual Data

Tables 7, 8: test set of English and Latin cognate pairs
› They are much harder to pair up
• They are separated by millennia of phonological and morphological change, including Grimm’s Law
› Nonetheless, the aligner did reasonably well with them and aligned several pairs correctly
› Although it found the correct alignment of fish with piscis, it could not distinguish it from three alternatives

Experiment

Results on Actual Data

Table 9: test set of Fox-Menomini cognate pairs
› Table 9 shows that the algorithm works well with non-Indo-European languages
› Apart from some minor trouble with the suffix of the first item, the aligner had smooth sailing

Experiment

Results on Actual Data

Table 10: test set of cognate pairs from other languages
› Table 10 shows how the aligner fared with some word pairs involving Latin, Greek, Sanskrit, and Avestan, again without knowledge of morphology
› Because it knows nothing about place of articulation or Grimm’s Law, it cannot tell whether the d in daughter corresponds with the th or the g in Greek thugater

Contents

Introduction

Algorithm

Experiment

Conclusion


Conclusion

An algorithm for finding probably correct alignments on the basis of phonetic similarity
› An evaluation metric
› A guided search procedure

This alignment algorithm and its evaluation metric are, in effect, a formal reconstruction of something that historical linguists do intuitively.

A possible extension would enable the aligner to recognize assimilation, metathesis, and even reduplication
› It could then assign lower penalties to these than to arbitrary mismatches

Thank You!

Any questions or comments?