TRANSCRIPT
(c) Key-Sun Choi for Pan Localization 2005
Morphology
CHOI, Key-Sun
Korea Advanced Institute of Science & Technology
http://www.kaist.edu/
Korea Terminology Research Center for Language & Knowledge Engineering
http://www.korterm.org/
Overview
This lecture presents morphology and its computational aspects. It gives a general view of morphological analysis, focusing on its components and the methods used to handle them, and outlines an implementation that includes unknown-word processing. In the context of localization, terminology is crucial for domain-specific language processing, so two aspects of handling new terms are shown: translation and transliteration.
Table of Contents
Introduction: Morphology and Characteristics
Complexity of Morphology
  Types of Ambiguities
  Examples in English, Korean, Chinese, Japanese
General Representation
  Components of Morphological Analysis
  Implementation Scheme
  Unknown Words
Terminology and Computation Issues
  Translation (Localization) of New Terms
  Transliteration
  Romanized Form of Local Language
  E.g. "data" = "資料" (translation), "deitə" (데이터) (transliteration)
Writing Systems
Segmented texts
  a space between two word units
  e.g., English
Unsegmented texts
  no space in the text
  e.g., Chinese, Japanese
Segmented + unsegmented texts
  a space before a content word (noun, verb, adjective) in order to signal the new content word (= new concept)
  no space before a functional word
  e.g., "a room+in" for the English meaning "in a room"
Unsegmented Text: English (1)
Thisworkaimstoextractpossiblecausalrelationsthatexistbetweennounphrases.Somecausalrelationsaremanifestedbylexicalpatternslikecausalverbsandtheirsub-categorization.Weuselexicalpatternsasafiltertofindcausalitycandidatesandwetransferthecausalityextractionproblemtothebinaryclassification.Tosolvetheproblem,weintroduceprobabilitiesforwordpair,conceptpairandcuephrasethatcouldbeacausalitypattern.Theseprobabilitiesarelearnedfromtherawcorpusinanunsupervisedmanner. (IPM, 2005, D.S. Chang & K.-S. Choi)
Segmented Text: English (2)
This work aims to extract possible causal relations that exist between noun phrases. Some causal relations are manifested by lexical patterns like causal verbs and their sub-categorization. We use lexical patterns as a filter to find causality candidates and we transfer the causality extraction problem to the binary classification. To solve the problem, we introduce probabilities for word pair, concept pair and cue phrase that could be a causality pattern. These probabilities are learned from the raw corpus in an unsupervised manner. (IPM, 2005, D.S. Chang & K.-S. Choi)
Segmented+Unsegmented Text: English (3)
This work aims to+extract possible causal relations that+exist between+noun-phrases. Some causal relations are manifested by+lexical-patterns like+causal-verbs and their sub-categorization. We use lexical patterns as+a-filter to+find causality candidates+ and we transfer the causality extraction problem to+the-binary classification. To+solve the problem, we introduce probabilities for+word-pair, concept-pair+and cue-phrase that+could be a causality pattern. These probabilities are learned from+the-raw-corpus in+an unsupervised-manner. (IPM, 2005, D.S. Chang & K.-S. Choi)
"in+room" (preposition) = "room+in" (postposition)
German-English-Korean-Japanese Words
German | English | Korean | Japanese
Fremdsprachenkenntnisse | knowledge of foreign languages | 외국어지식 | 外国語知識
Naturwissenschaft | natural sciences | 자연과학 | 自然科學
German examples from Benjamin Tsou (ACL 2000)
Computational Morphology
Morpheme
  A minimal meaningful element
  "computational" = "comput+ation+al"
Morphological Analysis
  Segmentation
    To divide into morphemes or words
    To include lemmatization (= to find its stem)
  Categorization
    To assign part-of-speech categories (POS tags)
    To assign semantic features
Morphological Analysis: e.g.,
Original text
  These probabilities are learned from the raw corpus in an unsupervised manner.
Lemmatized text
  These probability+Plural be+Plural+Present learn+PP from the raw corpus in a+Vowel un_supervise+PP manner .
Part-of-speech categorization (POS tagging/annotation)
  These/pron probability/noun+plural are/be_verb+plural+present learn/verb_ed/pp from/prep the/article raw/adjective corpus/noun in/prep an/article un/prefix_supervis/verb_ed/pp manner/noun ./period
Word Formation
Types of Languages
  Inflectional
    e.g., Latin: canta-bo
  Analytic - derivation
    e.g., English: I will sing
  Agglutinative - concatenation
    e.g., Korean, Japanese, Hindi, Turkish: na-neun norayha-keyssda
Morphology: how to describe your language
e.g., a Korean case
Your native language (Korean): na+neun norayha+keyssda
Grammar: I+subjective sing+will (future & intention)
  I + subjective postposition, sing + future & intention ending
Meaning (English): I will sing
Grammatical morphemes (functional words)
Postposition
  to express a case feature (or a functional role)
    subjective case, objective case, etc.
    e.g., I sing a song. (I = subjective, a song = objective)
  to represent semantic roles of constituents
  typically Noun + postposition
Ending
  to represent features like tense, aspect, mood and voice
  to derive relative clauses
  typically Verb or Adjective + endings for its auxiliary verbs or sentence connectives
  e.g., I song sing+will (I will sing a song.)
Example of Grammatical Morpheme 1/2
Grammatical morpheme in a predicative word phrase (word phrase = a segmentation unit)
"(someone) went but"
Surface form: 가 셨 습니다 만 (ga syeoss seumnida man)
Morphemes:
  가 (ga) | "go"
  시 (si) | honorific
  었 (eoss) | past tense
  습니다 (seumnida) | declarative mood
  만 (man) | concessive
Category labels in the diagram: Verb, AUX, Particle, Ending, Postposition
Predicative word phrase = predicate + grammatical morphemes
Example of Grammatical Morpheme 2/2
Grammatical morpheme in a substantive word phrase
"(It should be prohibited) from home, even (others cannot do)"
Surface form: 집 에서 부터 만 은 (jib eseo buteo man eun)
Morphemes:
  집 (jib) | "house"
  에서 (eseo) | place
  부터 (buteo) | origin
  만 (man) | concessive
  은 (eun) | topical marker
Category labels in the diagram: Noun, postposition, postposition, adverbial postposition
Substantive word phrase = noun + grammatical morphemes
Language units
Grapheme
  Alphabet, a minimal unit
  consonant: e.g., b, c, d, f, ...
  vowel: e.g., a, e, i, o, u, ...
Syllable
  2 or 3 graphemes form a syllable
  e.g., ga, sun, wa, ...
Word phrase (or word)
  Segmentation unit (or spacing unit)
  It consists of one or more morphemes
(Diagram labels: word phrase, syllable, grapheme, morpheme)
Types of Ambiguities - Homonymy -
English examples:
  lead (metal) | lead (past tense is "led")
  swallow (bird) | swallow (through mouth)
  Bill (name) | bill (for payment)
What is the word's part-of-speech?
Types of Ambiguities in Word Phrase 1/2
Types of ambiguity in word phrase analysis
Legend: V = Verb, e = Ending, N = Noun, p = postposition
Ambiguity in segmentation + POS tag: gamyeon
  ga/V + -myeon/e
  ga-myeon/N
Ambiguity in POS tag: gameul
  gam/N + -eul/p
  gam/V + -eul/e
  cf. "goto" (English simulation): go (N or V)
Ambiguity in lemmatization: dowa
  dob/V + -a/e
  do/N + -wa/p
Types of Ambiguities in Word Phrase 2/2
Mixed type: ga-si-neun
Legend: V = Verb, e = Ending, N = Noun, f = Prefinal ending, p = Postposition
  ga/V + -si-/f + -neun/e
  gal/V + -si-/f + -neun/e
  gasi/N + -neun/p
  gasi/V + -neun/e
Metrics of Ambiguity
Considering ambiguities
  Average ambiguities per word phrase: 3.5
  In the case of 1-syllable morphemes: 2.8 ambiguities on average
  In the case of 2-syllable morphemes: 1.13 ambiguities on average
  * 15 morphemes randomly selected from the 10,000 most frequent morphemes
Considering phonetic transformation rules
  About 20 rules are needed.
  More than 50 sub-rules (considering contextual information)
Complexity of n-syllable word phrase segmentation
  At least 2^(n-1) segmentations
  e.g., n=4 (Korean example): 동서남북 [dong-seo-nam-buk] (east-west-south-north) has 2^3 = 8 segmentations:
    동서남북
    동 서남북
    동서 남북
    동서남 북
    동 서 남북
    동 서남 북
    동서 남 북
    동 서 남 북
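The 2^(n-1) count follows from the fact that each of the n-1 gaps between syllables either is or is not a boundary. A minimal sketch of that enumeration, assuming the input is simply a string of syllables (the function name `segmentations` is illustrative and not from the slides):

```python
from itertools import product


def segmentations(syllables):
    """Return every way to split a syllable string into contiguous units."""
    n = len(syllables)
    results = []
    # Each of the n-1 gaps between syllables is either a boundary or not.
    for cuts in product([False, True], repeat=n - 1):
        units, current = [], syllables[0]
        for syllable, cut in zip(syllables[1:], cuts):
            if cut:
                units.append(current)
                current = syllable
            else:
                current += syllable
        units.append(current)
        results.append(units)
    return results


all_splits = segmentations("동서남북")
for units in all_splits:
    print(" ".join(units))
```

A real analyzer would not enumerate blindly like this; it would prune candidates with a dictionary and connection rules, which is why the count is a lower bound on the search space rather than the work actually done.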
Typical Ambiguities of Word Phrase Analysis
Categorial disambiguation
  e.g., gam+eun (Korean example)
    Noun + Particle
    Verb stem + Ending
Segmentation disambiguation
  e.g., gamgineun
    Noun + Particle = gamgi+neun
    Verb stem + Ending + Particle = gam+gi+neun
Stem identification
  e.g., naneun
    Noun + Particle = na+neun
    Verb stem + Ending = nal+neun (with verbal ending change)
Types of Verbal Stem Variations (1/2)
Consonant deletions (Korean examples)
  l-deletions (semi-regular)
    nol+go
    nol+nikka = no+nikka
  s-deletions (semi-irregular)
    is+go
    is+nikka = i+eu+nikka
Consonant alternations
  d/l-alternation
    geod+go
    geod+nikka = geol+eu+nikka
Korean - English glossary
  nol - play, go - and, nikka - because, geod - walk, dob - help
Types of Verbal Stem Variations (2/2)
Glidization
  b/u-alternation
    e.g., dob+go
    dob+nikka = do+u+nikka
Language Processing
Analysis
  input: Cambodianeun
  output: Cambodia/Noun + neun/topical particle
Generation
  input: Cambodia/Noun + neun/topical particle
  output: Cambodianeun
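Analysis and generation are inverse mappings between a surface word phrase and its tagged morphemes. The round trip can be sketched as follows, assuming a toy noun list and particle table (`NOUNS`, `PARTICLES`, `analyze` and `generate` are hypothetical names for illustration, and no phonological rules are applied):

```python
# Toy lexicon: both tables are illustrative assumptions, not a real dictionary.
NOUNS = {"Cambodia"}
PARTICLES = {"neun": "topical particle", "eun": "topical particle"}


def analyze(word_phrase):
    """Return every Noun + particle reading licensed by the toy lexicon."""
    readings = []
    for i in range(1, len(word_phrase)):
        stem, suffix = word_phrase[:i], word_phrase[i:]
        if stem in NOUNS and suffix in PARTICLES:
            readings.append([(stem, "Noun"), (suffix, PARTICLES[suffix])])
    return readings


def generate(morphemes):
    """Concatenate tagged morphemes back into a surface word phrase."""
    return "".join(form for form, _tag in morphemes)


readings = analyze("Cambodianeun")
print(readings)
print(generate(readings[0]))
```

Generation is only this simple when no phonological change occurs at the morpheme boundary; the constraints on the following slides are what make it harder in practice.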
Typical problems of Generation (0/6)
Ordering
Phonological Constraints
  Open vs Closed Syllable
  Vowel Harmony
  Bridging
  Contraction (optional and obligatory)
Recursion
Typical problems of Generation (1/6)
Ordering
  English: ordering of auxiliary verbs in a sentence
    well-ordered: The lion would have been being attacked
    ill-ordered: The lion have would been attacked being
  Many Asian languages: ordering in agglutination
    well-ordered: ga+syeoss+seumnida+man (Korean)
    ill-ordered: ga+man+syeoss+seumnida
  가 셨 습니다 만 (ga syeoss seumnida man) = 가 (ga, "go") + 시 (si, honorific) + 었 (eoss, past tense) + 습니다 (seumnida, declarative mood) + 만 (man, concessive)
  Category labels in the diagram: Verb, AUX, Particle, Ending, Postposition
Typical problems of Generation (2/6)
Phonological Constraints (1/4)
Open vs. Closed Syllable
  Noun with open syllable (last grapheme is a vowel) + postposition starting with a consonant
    e.g., bird + neun
  Noun with closed syllable (last grapheme is a consonant) + postposition starting with a vowel
    e.g., animal + eun
  "The choice of case-marking particles depends on the syllable structure of the last syllable of each nominal stem."
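For text in precomposed Hangul, the open/closed distinction can be read off the Unicode code point: syllables start at U+AC00 and each initial+vowel pair spans 28 code points, the first of which has no final consonant. A minimal sketch under that encoding assumption; the Korean words 새 (bird) and 동물 (animal) are supplied here to match the English glosses on the slide, and the function names are illustrative:

```python
HANGUL_FIRST = 0xAC00  # first precomposed Hangul syllable
HANGUL_LAST = 0xD7A3   # last precomposed Hangul syllable
FINALS_PER_BLOCK = 28  # 27 final consonants plus "no final"


def has_final_consonant(syllable):
    """True if a precomposed Hangul syllable is closed (ends in a consonant)."""
    code = ord(syllable)
    if not HANGUL_FIRST <= code <= HANGUL_LAST:
        raise ValueError("not a precomposed Hangul syllable")
    return (code - HANGUL_FIRST) % FINALS_PER_BLOCK != 0


def attach_topic_particle(noun):
    """Choose 은 (eun) after a closed syllable and 는 (neun) after an open one."""
    particle = "은" if has_final_consonant(noun[-1]) else "는"
    return noun + particle


print(attach_topic_particle("새"))    # open final syllable
print(attach_topic_particle("동물"))  # closed final syllable
```

The same test can drive other particle pairs that alternate on the final syllable; only the topic marker from the slide is shown.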
Typical problems of Generation (3/6)
Phonological Constraints (2/4)
Vowel Harmony: Clear and Dark Vowels
  The first syllable of a verbal stem that ends in a clear (dark) vowel takes the clear (dark) vowel as one of its possible bridges
    clear vowel: e.g., a, o
    dark vowel: e.g., eo [ə]
  e.g.,
    clear+clear: jab+a
    dark+dark: meog+eo
Typical problems of Generation (4/6)
Phonological Constraints (3/4)
Bridging
  Verbal stems ending in a non-liquid consonant normally require the bridge "-eu-" to combine with an ending like "-myeon-"
    jab+eu-myeon
    nol+myeon ("l" of "nol" is a liquid consonant)
  Verbal stems with a derived liquid require the bridge.
    geol+myeon
    geod+myeon = geol+eu-myeon (by d/l-alternation)
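The bridging rule can be sketched over the romanized stems used on the slide. This is an illustration only: it assumes stems are given in the slide's romanization, that a stem-final "l" is the only liquid, and that the caller says whether that "l" was derived by d/l-alternation (the `derived_liquid` flag is a hypothetical parameter):

```python
VOWELS = set("aeiou")
LIQUID = "l"


def attach_myeon(stem, derived_liquid=False):
    """Join a romanized verbal stem with the ending -myeon, adding -eu- if needed."""
    last = stem[-1]
    if last in VOWELS:
        return stem + "+myeon"        # vowel-final stem: no bridge
    if last == LIQUID and not derived_liquid:
        return stem + "+myeon"        # underlying liquid: no bridge
    return stem + "+eu-myeon"         # other consonants and derived liquids


print(attach_myeon("jab"))
print(attach_myeon("nol"))
print(attach_myeon("geol", derived_liquid=True))
```

Keeping the derived/underlying distinction as an explicit flag mirrors the slide's point that the surface form "geol" alone does not determine whether the bridge appears.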
Typical problems of Generation (5/6)
Phonological Constraints (4/4)
Contraction: optional and obligatory
  Optional: bo+ass-da = boa+ss-da
  Obligatory: *o+ass-da vs. oa+ss-da
  No contraction: y+eoss-da vs. *yeo+ss-da
Typical problems of Generation (6/6)
Recursion
  Particle recursion prohibition
    b'ang (bread)+man-eun
    *b'ang (bread)+man-eun+man-eun
  Tense recursion prohibition
    jab+ass-da (past declarative)
    jab+ass-eoss-da (perfect past declarative)
    *jab+ass-eoss-eoss-da
  Bridge recursion prohibition
    jab+eu-si-myeon (honorific-condition)
    jab+eu-si-eoss-eu-myeon (honorific-past-condition)
    *jab+eu-si-eoss-eu-si-eoss-eu-myeon
Language and Morphology
English
  Lemmatizer + POS tagger
  A word has at most several word forms
Chinese
  Segmentation system + POS tagger
  A word has one word form
Korean (Japanese)
  A morphological analyzer is more essential than in other languages
    Complex segmentation
    Frequent agglutination
    Word formation
      The Chinese characters in common use
Complexity Comparison with other languages
Complexity?
  In English, we only have to consider each word form.
  In Chinese, the task is segmentation rather than morphological analysis.
  In Korean, the morphological analyzer (MA) must process segmentation and agglutination simultaneously, so segmentation and the analysis of functional words are much more complex. (Japanese is similar to Korean.)
Language | Spacing | Order of verb forms per one verb | Complexity of segmentation
English | Word form | 5 | Easy
Korean | Word phrase | More than 5000 | Very hard
Chinese | No | 1 | Very hard
General Morphological Analyzer: 2 phases
Candidate generation
  Generates possible sequences of morphemes
    Segmentation
    Lemmatization - recovery of phonetic changes
  Processing for unknown words
Candidate selection
  "POS tagging"
  Methodologies
    Statistical methods: e.g., from two or three consecutive words and their POS tags, we can predict the current word's tag (Noun-Verb-Noun).
    Rule-based methods
    Hybrid methods
General Scheme of Implementation
(Diagram labels)
Modules: Morphological Analyzer, POS Tagger, Parser
Knowledge sources: Statistics, Linguistic Rules
Additional Components: Unknown words, Symbols/Numbers, Exceptional word phrase, Post-processing, Foreign words
Basic Components: Code System, Dictionary, Phonological Rules, Connection Rules, Algorithms and Data Structure
Implementation (2/3)
Morphological connection rules
  idioms
  patterns
  weighted rules
Dictionaries (Lexicons)
  general lexicon
    level of segmentation depends on the lexicon
  domain-specific dictionary
    e.g., economics, law, patent (science & technology), ...
  user-defined dictionary
Implementation (3/3)
Diagram of a word segmentation system for Chinese
(Diagram labels: Choose Lexicon or Lexicons; Access Lexicon; General Lexicon; Domain Specific / User Defined Lexicon; level of segmentation; Generate Substrings; Word Matching; Find word candidates; Morphological Rules, Idioms, Patterns; Unknown Words Identification; Unknown Word Models; Disambiguation; Heuristic Rules; POS-Tagging)
from (Keh-Jiann Chen, 2000)
Morphological Analysis of English
Affix analysis: un+happy, be+er (X)
Additional processing
  Abbreviation, e.g., I'd → I would
  Symbols, numerals and units
  Processing idiomatic expressions
  Unknown word processing
Part of speech Tagging (English)
Stochastic Part-of-Speech Tagging
  Hidden Markov Model and Viterbi Algorithm
    Using the Markov assumption: efficient
  A categorization problem: machine learning
    Decision Tree, Neural Network, Markov Random Field, ...
Example: "Flies like a flower."
  Candidate tags (Noun, Verb, Prep., Art.):
    flies/N, flies/V
    like/V, like/N, like/P
    a/ART, a/N
    flower/V, flower/N
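The tag lattice for "Flies like a flower." can be decoded with the Viterbi algorithm. The sketch below uses the candidate tags from the slide, but every probability in `START`, `TRANS` and `EMIT` is an invented illustrative value, not a figure from the lecture or from any corpus:

```python
def viterbi(words, candidates, start, trans, emit):
    """Return (probability, tag path) of the most probable tag sequence."""
    first = words[0]
    best = {tag: (start.get(tag, 0.0) * emit[(first, tag)], [tag])
            for tag in candidates[first]}
    for word in words[1:]:
        step = {}
        for tag in candidates[word]:
            # Markov assumption: keep only the best predecessor for this tag.
            prob, path = max(
                ((p * trans.get((prev, tag), 0.0) * emit[(word, tag)], prev_path)
                 for prev, (p, prev_path) in best.items()),
                key=lambda item: item[0],
            )
            step[tag] = (prob, path + [tag])
        best = step
    return max(best.values(), key=lambda item: item[0])


WORDS = ["flies", "like", "a", "flower"]
CANDIDATES = {"flies": ["N", "V"], "like": ["V", "N", "P"],
              "a": ["ART", "N"], "flower": ["V", "N"]}
# All numbers below are made up for illustration only.
START = {"N": 0.7, "V": 0.3}
TRANS = {("N", "V"): 0.5, ("N", "N"): 0.2, ("N", "P"): 0.3, ("N", "ART"): 0.1,
         ("V", "V"): 0.1, ("V", "N"): 0.3, ("V", "P"): 0.3, ("V", "ART"): 0.5,
         ("P", "ART"): 0.6, ("P", "N"): 0.3,
         ("ART", "N"): 0.9, ("ART", "V"): 0.05}
EMIT = {("flies", "N"): 0.5, ("flies", "V"): 0.5,
        ("like", "V"): 0.5, ("like", "N"): 0.1, ("like", "P"): 0.4,
        ("a", "ART"): 0.95, ("a", "N"): 0.05,
        ("flower", "N"): 0.9, ("flower", "V"): 0.1}

probability, tags = viterbi(WORDS, CANDIDATES, START, TRANS, EMIT)
print(list(zip(WORDS, tags)))
```

Because only the best path into each tag is kept at every word, the cost grows linearly with sentence length instead of with the number of complete tag sequences, which is the efficiency the Markov assumption buys.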
Part of speech Tagging (English)
Rule-based Part-of-Speech Tagging
  Transformation-Based [Brill95] [Brill92]
    Using contextual information
    if Noun X Noun, then X may be Verb
    (Diagram labels: Unannotated Text, Initial State, Annotated Text, Truth, Learner, Rules)
Hybrid Model
  Integrating a stochastic tagger and a rule-based system
  stochastic: tri-gram
    P(Noun Verb Noun) = 0.2, P(Noun Noun Noun) = 0.01
Integrated Model
  Maximum Entropy Model
  "Classifier Combination for Improved Lexical Disambiguation" [Brill 99]
    Various models have complementary behaviors.
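The tri-gram figures quoted on this slide are enough to show how a stochastic component picks the tag X in the context Noun X Noun. A minimal sketch, assuming the two probabilities from the slide are the only entries in the table (the names `TRIGRAM_P` and `best_middle_tag` are illustrative):

```python
# Tri-gram probabilities as quoted on the slide; any other context scores 0.
TRIGRAM_P = {
    ("Noun", "Verb", "Noun"): 0.2,
    ("Noun", "Noun", "Noun"): 0.01,
}


def best_middle_tag(left, right, options):
    """Pick the tag for X in 'left X right' with the highest tri-gram probability."""
    return max(options, key=lambda tag: TRIGRAM_P.get((left, tag, right), 0.0))


print(best_middle_tag("Noun", "Noun", ["Noun", "Verb"]))
```

Here the statistical choice agrees with the contextual rule "if Noun X Noun, then X may be Verb", which is one way to see why the two approaches can be combined in a hybrid model.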
Corpora
  "A collection of linguistic data, either written texts or a transcription of recorded speech, which can be used as a starting-point of linguistic description or as a means of verifying hypotheses about a language"
    David Crystal, A Dictionary of Linguistics and Phonetics
  "A collection of naturally occurring language text, chosen to characterize a state or variety of a language."
    John Sinclair, Corpus, Concordance, Collocation
Text Corpora
  Written language
  Spoken language
Corpora
  English: Brown, LOB, BNC and Penn TreeBank
  Korean: KAIST Corpus and Sejong Corpus
  Japanese: Nihon Keizai Shimbun, EDR Corpus
  Chinese: Beijing University, LIVAC (HK City U.), Sinica Corpus
Statistics can be obtained from a large corpus.
Code System for Implementation (Korean)
Standard (KSC5601, KSC5700, KSX001)
  Syllable-based: 2-byte code
  including common Chinese characters
Combinative code (one Korean character = CV(C))
  Grapheme-based: 2-byte code
  Bit layout (1 byte + 1 byte): 1 bit MSB | 5 bits initial consonant | 5 bits vowel | 5 bits final consonant
Other internal code systems for MA
  N-byte code
    3-byte code
    2-3 variable-byte symbol code (Romanization style)
(Diagram labels: word phrase, syllable, grapheme, morpheme)
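The 1 + 5 + 5 + 5 bit layout of the combinative code fits exactly into two bytes. The packing can be sketched as below; the field indices passed in are arbitrary placeholders, since the actual index assigned to each consonant and vowel by the standard is not given on the slide:

```python
def pack_syllable(initial, vowel, final):
    """Pack three 5-bit grapheme indices into a 16-bit code with the MSB set."""
    for field in (initial, vowel, final):
        if not 0 <= field < 32:
            raise ValueError("each field must fit in 5 bits")
    return (1 << 15) | (initial << 10) | (vowel << 5) | final


def unpack_syllable(code):
    """Recover the (initial, vowel, final) indices from a 16-bit code."""
    return (code >> 10) & 0x1F, (code >> 5) & 0x1F, code & 0x1F


# Placeholder indices, not values from the real code table.
code = pack_syllable(2, 3, 1)
print(hex(code), code.to_bytes(2, "big"))
print(unpack_syllable(code))
```

A grapheme-based layout like this lets an analyzer strip or replace a final consonant with bit operations, which is convenient for the stem variations described earlier.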
Basic Components of Morphological Analyzer (1/2)
Dictionary
  Lemma, part-of-speech, semantic feature (optional)
  Structure
Connection Rules
  Connection table (POS bigram)
  Word formation graph (n-gram)
Phonological Rules
  Declarative rules
    2-level model
  Procedural rules
Basic Components of Morphological Analyzer (2/2)
Algorithms and Data Structures
  Top-down and bottom-up
    Chart parsing
    Recursive parsing
  Data structure
    Chart, table, tree, lattice, graphs
Disambiguation (tagging phase)
  Dictionary searching
    All words in dictionary + rules for unknown words
    Stem dictionary + rule = derived dictionary
  Stochastic model
  Rule-based model
  Hybrid model, integrated model
  Simple heuristics
    preferring the longest morpheme
Word Segmentation in Chinese - A Heuristic -
Maximal matching rule
  The most plausible segmentation is the three-word sequence with the maximal length.
  This heuristic rule achieves an accuracy as high as 99.69% and an applicability of 93.21%, i.e., 93.21% of the ambiguities were resolved by this rule. [Chen & Liu 1992]
  Candidate three-word sequences:
    完 成 鑑定
    完成 鑑定 報
    完成 鑑定 報告 (finish judgment report)
Additional Components of Analyzer
Unknown word analysis
  Suffix, ending and preposition information
Symbol / number expression
  Templates
  e.g., "2005年 6月 20日" (2005/6/20), "3千 4百圓" ($3,400), "+82-42-869-5565" (telephone number)
Foreign languages
  TV, [kom-pyu-teo] (computer)
Exceptional word phrase
  Word phrase dictionary
Post-processing
  Spacing problem
  Processing non-standard language
Characteristics of different types of unknown words
Possibly infinite number of elements but with closed form representations, such as numeric type compounds:
2005年 = 2005-year
Open-ended types without closed form representations, such as
  proper names: "Microsoft"
  derived words: "computer-ize"
  compounds: "computer desk"
  abbreviation (acronym): "LG" (Lucky-Goldstar)
from ACL [Chen, 2000]
Proper Names with Transliteration
Personal
  Bill Clinton (bil klintən, 克林頓)
  Jiang Zemin (jiang z'əmin, 江澤民)
Geographical
  Korea (koria, 韓國)
  Cambodia (kambodia, カンボジア)
Brands/Organization
  Panasonic (panasonik, 파나소닉, 樂聲)
  Samsung (samsəŋ, 삼성, 三星, サムソン)
Numeric Expression Representation - Regular Expression -
To represent the type of unknown words with a possibly infinite number of elements but with a closed form representation, such as numbers, dates, times, determinant-measure compounds, etc.
e.g.,
  Number → Digit Number | Digit
  Digit → 0 | 1 | 2 | ... | 9
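The grammar Number → Digit Number | Digit describes one or more digits, which is the regular expression `[0-9]+`. A minimal sketch with Python's `re` module; the date pattern is an assumed template modeled on the "2005年 6月 20日" example from the earlier slide, not a pattern given in the lecture:

```python
import re

# Number -> Digit Number | Digit  is equivalent to one or more digits.
NUMBER = re.compile(r"[0-9]+")

# Assumed date template: year, month and day numbers with their unit characters.
DATE = re.compile(r"([0-9]+)年\s*([0-9]+)月\s*([0-9]+)日")


def is_number(token):
    """True if the whole token is derivable from the Number rule."""
    return NUMBER.fullmatch(token) is not None


def parse_date(token):
    """Return (year, month, day) as integers, or None if the template fails."""
    match = DATE.fullmatch(token)
    return tuple(int(group) for group in match.groups()) if match else None


print(is_number("2005"), is_number("20a5"))
print(parse_date("2005年 6月 20日"))
```

A closed-form pattern like this covers an unbounded set of tokens without listing them in the dictionary, which is the point of treating such expressions separately from open-ended unknown words.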
Methodologies
Tabular parsing
  Segmentation by dictionary look-up
Syllable information-based model
  Segmentation by syllable information
Multi-phase filtering
  Brute-force
Etc.
  Head-tail: simple model for segmentation
POS Tagging: Statistical approaches
Statistical models
  HMM-based
    HMM on morpheme sequences (bi-gram, tri-gram)
    HMM using word phrase structure (HMM using intra-/inter-word-phrase information)
  Weighted network
  Maximum entropy model
    More information can be integrated into the model
Pros
  Easy to train; guarantees reasonable performance
Cons
  Difficult to tune or modify
  Requires more space
POS Tagging: other approaches
Rule-based approach
  Transformation rules (Eric Brill's style)
  Pros and cons
    Difficult to get rules and maintain consistency of rules
    Easy to lexicalize
Hybrid approach
  Statistical approach + rule-based approach
  Pros and cons
    Guarantees a good performance
    Difficult to integrate
Using word phrase Structure in Tagging
"An HMM POS tagger for Korean based on Wordphrase" (J. Shin 1994): simple model
"Two-ply HMM" (J. Kim 1997)
  HMM variation using POS tags of head and tail of word phrase
  Each word phrase has a conventional tagging HMM.
(Diagram labels: H, T, $, where H and T are the head and tail of a word phrase)
Word phrase tags and their morpheme / POS tag patterns:
  N | Noun | N
  NS | Noun, subjective | N p
  NO | Noun, objective | N p
  NM | Noun, modifier | N p
  A | Adverb | a
  PC | Verb, connecting | V e
  PF | Verb, final | V e
HMM
Hidden Markov Model
  What is the sequence of nodes?
    transition probability, usually 1-, 2- or 3-gram
  What is the tag (or label) of the node? ("hidden")
Applications
Machine translation
Spell checking and correction
  Spell correction
  Auto-spacing and spacing correction
Information retrieval
  Extracting index terms: nouns and noun phrases
Question answering
Natural language interface
Text-to-speech
Concordance
Application: TTS and IR - A Japanese Example -
Text-to-speech synthesis
  Word segmentation of orthographic text
    試験|の|最中|に|映画|へ|行った
    (I went to a movie during the examination)
  Homograph disambiguation (grapheme to phoneme)
    最中 → saichu (during), monaka (Japanese sweets)
    行った → itta (went), okonatta (did)
Information retrieval (indexing)
  Word segmentation of orthographic text
    試験|の|最中|に|映画|へ|行った
    (I went to a movie during the examination)
  Part of speech tagging, keyword extraction, stemming
    試験 (examination), 映画 (movie)
from the slide (Nagata, ACL2000)
Evaluation of Morphological Analysis
Correctness
  Recall
  Precision
    In units of morphemes and word phrases
Processing speed
Robustness
  Processing erroneous input
    Spacing
    Spelling errors
Effectiveness of result
  Evaluation of tag set
Recall and Precision
Recall = the proportion of the whole correct set that the system hit
Precision = the proportion of the system-generated set that is correct
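Both measures reduce to set arithmetic over the system's analyses and the correct (gold) analyses. A minimal sketch, assuming each analysis is a (morpheme, tag) pair and using the "gamyeon" readings from the ambiguity slide as made-up evaluation data:

```python
def precision_recall(system, gold):
    """Return (precision, recall) for two collections of analyses."""
    system, gold = set(system), set(gold)
    correct = len(system & gold)
    precision = correct / len(system) if system else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Illustrative data: the gold reading is ga/V + myeon/e,
# and the system additionally proposes gamyeon/N.
gold = [("ga", "V"), ("myeon", "e")]
system = [("ga", "V"), ("myeon", "e"), ("gamyeon", "N")]

precision, recall = precision_recall(system, gold)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

An analyzer that outputs every candidate scores high on recall and low on precision, so the two figures are normally reported together.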
Resources and Tools
Corpora and tagsets for morphological analysis
  Corpus
  Contest for standardization of POS tagset
Visualization tool
  To provide a visible process that is understandable to users and easy for developers to debug
What is Terminology?
C.K. Ogden / I.A. Richards, The Meaning of Meaning: A Study in the Influence of Language upon Thought and The Science of Symbolism, London 1923, 10th edition 1969
(Diagram labels: CONCEPT; Referent; "Orange", "ClipArt"; Refers To; Symbolizes; Stands For)
from the slide of [Bargmeyer, Bruce, Open Metadata Forum, Berlin, 2005]
Registering Terminology
(Diagram labels: Referent; Concept; Term; Refers To; Symbolizes; Stands For)
Term: trout, Salmo trutta, brown trout, truite
Definition: Any of several game fishes of the genus Salmo, related to the salmon...
from the slide of [Bargmeyer, Bruce, Open Metadata Forum, Berlin, 2005]
Terminology Localization: Recommending Translations of Term using Term constituents
(Diagram labels: Terminology; Sense; Usage; Constituents; Transliteration; IS-A/Causal relations; Definition)
Goals
To give information about word formation (term constituents)
  Usage of each term constituent
  Frequencies of each term constituent
  Translation patterns of each term constituent
To generate a translation list for a new foreign (English) term
  To recommend translations of new terms
Scope
In order to translate new English terms into Korean terms
  1) Producing translation candidates
  2) Selecting the relevant one among them (or ranking)
Example: "automatic frequency control"
  Automatic: "jadoŋ (自動)"
  Frequency: "jindoŋ (振動)", "jupa (周波)"
  Control: "jeə (制御)", "jojəl (調節)"
Translation candidates, to be ranked:
  1) 自動振動制御
  2) 自動振動調節
  3) 自動周波制御
  4) 自動周波調節
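Producing the candidates amounts to taking the Cartesian product of the per-constituent translations. A minimal sketch using the constituent translations listed on this slide; the dictionary name `CONSTITUENT_TRANSLATIONS` is illustrative, and the ranking step is left out because the slides do not specify a scoring method:

```python
from itertools import product

# Constituent translations as listed on the slide.
CONSTITUENT_TRANSLATIONS = {
    "automatic": ["自動"],
    "frequency": ["振動", "周波"],
    "control": ["制御", "調節"],
}


def translation_candidates(term):
    """Combine every translation of each constituent of a space-separated term."""
    options = [CONSTITUENT_TRANSLATIONS[word] for word in term.lower().split()]
    return ["".join(choice) for choice in product(*options)]


candidates = translation_candidates("automatic frequency control")
for rank, candidate in enumerate(candidates, start=1):
    print(rank, candidate)
```

The number of candidates is the product of the per-constituent counts, so a ranking step based on usage and frequency becomes essential as terms get longer.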
Problems occur in resolving the meaning of the conceptual unit
Homonyms
Synonyms or variations
Homonyms
Mainly caused by Sino-Korean words, the pronunciation of which derives directly from Chinese
  e.g., ‘-gi’ in biology
Need to know their proper meaning and context
English term | Korean term | Meaning of ‘-gi’
Olfactory organ | ‘hu-gak+gi’ | Organ (器)
Growth phase | ‘saeng-jang+gi’ | Stage, phase (期)
Contraction period | ‘su-chuk+gi’ | Period (紀)
Amino group | ‘a-mi-no+gi’ | Group (基)
Synonyms or variations
Mainly caused by various ways of translation
  Because many Korean terms are of foreign origin
  e.g., "abdominal" in biology
    Translated into two Sino-Korean words
    Translated into a pure Korean word
Need to know whether they indicate the same meaning
English term | Korean term | Korean translation of "abdominal"
abdominal fin | ‘bae+ji-neu-leo-mi’ | ‘bae’
abdominal cavity | ‘bok+gang’ | ‘bok’ (腹)
abdominal appendage | ‘bok-bu+bu-sok-ji’ | ‘bok-bu’ (腹部)
Synonyms
Domain | English | Korean
Biology | abdominal | ‘bae’ 배, ‘bok’ 복, ‘bok-bu’ 복부
Physics | curvature | ‘gok-ryul’ 곡률, ‘gub-eun+yul’ 굽은률
Chemistry | acetic | ‘a-se-teu-san’ 아세트산, ‘cho-san’ 초산
Homonyms
Domain | Korean | English
Physics | ‘su-jeong’ 수정 | crystal, correction
Chemistry | ‘yo-so’ 요소 | element, urea
Biology | ‘ji-bang’ 지방 | region, fat
Domain Dependency
Different translations depending on domains
English | Korean | Different meaning | Domain
Cell | ‘jeon-ji’ | A single unit that converts radiant energy into electric energy | Chemistry, Physics
Cell | ‘se-po’ | The smallest structural unit of an organism | Biology
Terminology Localization: Automatic English-to-Korean transliteration
(Diagram labels: Terminology; Sense; Usage; Constituents; Transliteration; IS-A/Causal relations; Definition)
Goal
For the given English word or term,
  Accuracy: to generate correct transliterations
  Diversity: to generate as many transliteration variations as possible
(Diagram) data → Transliteration System → deitə, deita, de-ta (diversity) → ranking (accuracy): 1) 데이터 2) 데이타 3) 데타
Step 1: Input an English word to be transliterated
data
Transliteration
push the transliteration button
Step 2: Transliteration
Ranked transliteration results for ‘data’
Step 3: Result validation (Optional)
Click !!
Language varies over
TIME and SPACE
  different sense/usage in different domains
  different meaning at different times
Contact Point: CHOI, Key-Sun
[email protected] Science Division, KAIST
373-1 Guseong-dong Yuseong-gu Daejeon 305-701 S.Korea
Korea Advanced Institute of Science & Technology
  http://www.kaist.edu/
Korea Terminology Research Center for Language & Knowledge Engineering
  http://www.korterm.org/
Bank of Language Resources, KNRRC
  http://bola.kaist.ac.kr/
ISO/TC37/SC4 for Language Resource Management Standards
  http://www.tc37sc4.org/
International Joint Conference on Natural Language Processing
  http://www.afnlp.org/IJCNLP05/
Thank you!!