
E-MELD Conference, 15-18 July 2004

Fixing a Legacy Lexicon

Mike Maxwell
maxwell@ldc.upenn.edu

University of Pennsylvania
Linguistic Data Consortium and Department of Linguistics

3600 Market Street, Philadelphia, PA 19104 U.S.A.

www.ldc.upenn.edu


The Problem

• Shoebox lexicon of Mawukakan
  – Inconsistencies:
    » Inconsistencies among POSs etc. (fixable in Shoebox)
    » Spelling errors: English, French and Mawu (import into Word, use English and French spell correctors)
    » Errors in hierarchy: missing fields, mis-ordered fields
    » Missing reciprocal cross-references
• Absolutely typical of Shoebox-style lexicons
• Repairs needed for
  – Archiving
  – Publication
  – Export/import


Old Solution

• Parse until error, characterize error, find error in Shoebox, fix error…

• Find all errors, send list to user, user fixes them, re-do…


Partial solutions

• Inconsistencies among POSs etc.
  – Fixable in Shoebox
  – Helpful addition: counts of POS tokens
• Spelling errors
  – Import into Word with automatic marking of language, use English and French spell correctors to fix errors, export back to Shoebox
  – No solution for Mawu spelling (n-grams)
• Missing cross-references
  – Easy to find with shell script, send list to users
  – Would be better to mark errors in lexicon
• Missing bi-directional references
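The POS counts and the check for missing reciprocal cross-references can be sketched in a few lines of Python. This is only an illustration, not the shell script mentioned on the slide. The cross-reference marker `\cf`, the function names, and the sample data are assumptions made for the example; the slides do not say which marker the lexicon uses for cross-references.

```python
import re
from collections import Counter

def parse_sfm(text):
    """Split SFM text into records, each a list of (marker, value) pairs.
    A new record starts at every \\w field, the record marker shown on the slides."""
    records = []
    for line in text.splitlines():
        m = re.match(r"\\(\S+)\s*(.*)", line.strip())
        if not m:
            continue  # this sketch ignores blank and continuation lines
        marker, value = m.group(1), m.group(2).strip()
        if marker == "w" or not records:
            records.append([])
        records[-1].append((marker, value))
    return records

def pos_counts(records):
    """Count how often each \\pos value occurs; rare values are likely typos."""
    return Counter(v for rec in records for m, v in rec if m == "pos")

def missing_reciprocals(records, xref="cf"):
    """Return (a, b) pairs where a points to b but b does not point back to a.
    The marker name 'cf' is an assumption, not taken from the lexicon."""
    links = set()
    for rec in records:
        head = next((v for m, v in rec if m == "w"), None)
        if head is None:
            continue
        links.update((head, v) for m, v in rec if m == xref)
    return sorted((a, b) for a, b in links if (b, a) not in links)

# Made-up sample: two records, inconsistent POS labels, one one-way link.
SAMPLE = r"""
\w ba’el
\pos v.i
\cf yax
\w yax
\pos v.i.
"""

if __name__ == "__main__":
    recs = parse_sfm(SAMPLE)
    print(pos_counts(recs))
    print(missing_reciprocals(recs))
```

In the sample, the two spellings `v.i` and `v.i.` each show up with a count of one, which is the kind of inconsistency the token counts are meant to expose.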


Partial solutions

• Errors in hierarchy

\w ba’el
\pos v.i
\ex Yax bo’on ta sna Antonio.
\exEn I’m going to Antonio’s house.|
\ex Ban yax ba’at?
\exEn Where are you going?
\exFr Ou allez-vous?


Repairing the hierarchy

• Solution: special-purpose parser, mark SFM file with errors and suggested fixes
• Need hierarchy
  Cannot (reliably) extract hierarchy from Shoebox typ file
• User or consultant must provide definition of hierarchy, as regex:

(w ( (pos defn (ex exEn exFr)* (syn)?) | (num pos defn (ex exEn exFr)* (syn)?)+ ))

  – Tool to extract a list of all occurring record/field patterns
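One way to use such a hierarchy definition is to reduce each record to its sequence of field markers and test that sequence against the regex. The sketch below is an assumption about how this could be done, not a description of the parser presented in the talk: it translates the slide's notation into a Python regular expression in which every marker name is followed by a single space, and it also lists the record/field patterns that occur. All function names are hypothetical.

```python
import re
from collections import Counter

# Hierarchy definition exactly as written on the slide.
HIERARCHY = "(w ( (pos defn (ex exEn exFr)* (syn)?) | (num pos defn (ex exEn exFr)* (syn)?)+ ))"

def record_patterns(sfm_text):
    """Return one marker sequence per record, e.g. ['w', 'pos', 'defn']."""
    patterns = []
    for line in sfm_text.splitlines():
        m = re.match(r"\\(\S+)", line.strip())
        if not m:
            continue  # blank or continuation line
        if m.group(1) == "w" or not patterns:
            patterns.append([])
        patterns[-1].append(m.group(1))
    return patterns

def compile_hierarchy(spec):
    """Turn the slide's notation into a Python regex over 'marker ' tokens.
    Marker names become literal 'name ' groups; ( ) | * + ? keep their meaning."""
    parts = []
    for tok in re.findall(r"[A-Za-z]\w*|[()|*+?]", spec):
        if tok == "(":
            parts.append("(?:")
        elif tok in ")|*+?":
            parts.append(tok)
        else:
            parts.append("(?:" + re.escape(tok) + " )")
    return re.compile("".join(parts))

def conforms(pattern, hierarchy_re):
    """True if the whole marker sequence is licensed by the hierarchy."""
    return hierarchy_re.fullmatch("".join(name + " " for name in pattern)) is not None

SAMPLE = r"""
\w yax
\pos AUX-V
\defn green
\w ba’el
\pos v.i
\ex Ban yax ba’at?
\exEn Where are you going?
\exFr Ou allez-vous?
"""

if __name__ == "__main__":
    rx = compile_hierarchy(HIERARCHY)
    pats = record_patterns(SAMPLE)
    # List of all occurring record/field patterns, with frequencies.
    for pat, n in Counter(tuple(p) for p in pats).most_common():
        print(n, " ".join(pat), "OK" if conforms(pat, rx) else "does not match")
```

Terminating each marker with a space keeps `ex` from matching as a prefix of `exEn` or `exFr`. A plain match/no-match test like this only detects a bad record; locating the error and suggesting a fix, as the slides describe, needs more than this.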


Sample output

• regex: … (ex exEn exFr)* …

• Input:
…
\ex Yax bo’on ta sna Antonio.
\exEn I’m going to Antonio’s house.|
\ex Ban yax ba’at?
\exEn Where are you going?
\exFr Ou allez-vous?

• Output:
…
\ex Yax bo’on ta sna Antonio.
\exEn I’m going to Antonio’s house.|
\exFr ***Missing field inserted***
\ex Ban yax ba’at?
\exEn Where are you going?
\exFr Ou allez-vous?
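The repair shown above can be imitated for this one group with a small hand-written routine. The sketch below is not the parser from the talk. It hard-codes the `(ex exEn exFr)*` group instead of deriving the repair from the hierarchy regex, and the names `GROUP`, `NOTE` and `repair_example_groups` are invented for the example.

```python
GROUP = ["ex", "exEn", "exFr"]          # required order inside one example group
NOTE = "***Missing field inserted***"   # placeholder text, as on the slide

def repair_example_groups(fields):
    """fields is a list of (marker, value) pairs from one record.
    Returns a new list in which every ex/exEn/exFr group is complete,
    with placeholder fields inserted where a member was missing."""
    out = []
    nxt = 0  # index in GROUP of the next expected member; 0 means no open group

    def fill(start, stop):
        # Insert placeholders for GROUP[start:stop].
        out.extend((GROUP[k], NOTE) for k in range(start, stop))

    for marker, value in fields:
        pos = GROUP.index(marker) if marker in GROUP else None
        if nxt and (pos is None or pos < nxt):
            fill(nxt, len(GROUP))        # close the unfinished group
            nxt = 0
        if pos is not None:
            fill(nxt, pos)               # members skipped before this one
            nxt = (pos + 1) % len(GROUP)
        out.append((marker, value))
    if nxt:
        fill(nxt, len(GROUP))            # record ended inside a group
    return out

if __name__ == "__main__":
    record = [
        ("ex", "Yax bo’on ta sna Antonio."),
        ("exEn", "I’m going to Antonio’s house."),
        ("ex", "Ban yax ba’at?"),
        ("exEn", "Where are you going?"),
        ("exFr", "Ou allez-vous?"),
    ]
    for marker, value in repair_example_groups(record):
        print(f"\\{marker} {value}")
```

Run on the slide's input, the routine inserts a placeholder `\exFr` field after the first `\exEn`, matching the output above. The placeholder text makes the inserted fields easy to search for in Shoebox afterwards.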


More sample output

• Input:
\w yax
\pos AUX-V
\pos Adj
\defn green

• Output:
\w yax
\pos AUX-V
\pos Adj ***Erroneous field***
\defn green


More sample output

• Input:
\w yax
\pos AUX-V
\foo bar
\degn green

• Output:
\w yax
\error ***Unable to parse record structure***
\pos AUX-V
\foo bar
\degn green
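The two kinds of annotation in these samples can be reproduced with simple heuristics. The slides do not say what triggers each message in the actual parser, so the rules below are guesses. A record containing a marker that is absent from the hierarchy is flagged as unparseable. A marker that immediately repeats the previous one is flagged as erroneous, since the hierarchy given earlier never allows the same marker twice in a row. The names `KNOWN` and `annotate` are hypothetical.

```python
# Markers that occur in the hierarchy definition given earlier.
KNOWN = {"w", "num", "pos", "defn", "ex", "exEn", "exFr", "syn"}

def annotate(record):
    """record is a list of (marker, value) pairs starting with the \\w field.
    Returns a copy with error annotations in the style of the sample output."""
    markers = [m for m, _ in record]
    if any(m not in KNOWN for m in markers):
        # Unknown marker: do not attempt a repair, flag the whole record.
        flag = ("error", "***Unable to parse record structure***")
        return [record[0], flag] + list(record[1:])
    out = []
    for i, (marker, value) in enumerate(record):
        if i > 0 and marker == markers[i - 1]:
            # Immediate repetition is never licensed by this hierarchy.
            value = f"{value} ***Erroneous field***"
        out.append((marker, value))
    return out

if __name__ == "__main__":
    repeated = [("w", "yax"), ("pos", "AUX-V"), ("pos", "Adj"), ("defn", "green")]
    unknown = [("w", "yax"), ("pos", "AUX-V"), ("foo", "bar"), ("degn", "green")]
    for rec in (repeated, unknown):
        for marker, value in annotate(rec):
            print(f"\\{marker} {value}")
        print()
```

Writing the annotations into the record itself, rather than into a separate report, lets the user find and fix each problem inside Shoebox.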


The next language

• Nahuatl lexicon
  – 11,000 entries
  – 5,000 record/field patterns
  – 147 SFMs…