An Unsupervised Method for Uncovering Morphological Chains
Karthik Narasimhan, Regina Barzilay, Tommi Jaakkola
CSAIL, Massachusetts Institute of Technology
Morphological Chains
Chains to model the formation of words.
  paint → painting → paintings
Richer representation than traditional scenarios: segmentation, paradigms.
Our Approach
Core Idea: Unsupervised discriminative model over pairs of words in the chain.
  paint → painting
• Orthographic features: Morfessor (Goldwater and Johnson, 2004; Creutz and Lagus, 2007), Poon et al. (2009), Dreyer and Eisner (2009), Sirts and Goldwater (2013)
• Semantic features: Schone and Jurafsky (2000), Baroni et al. (2002)
• Handles transformations (plan → planning)
Textual Cues

Orthographic: patterns in the characters forming words.
  paint → paints, painted
  pain → pains, pained
But orthography alone is misleading for pairs like pain/paint and ran/rant.

Semantic: meaning embedded as vectors.

  A      B        cos(A, B)
  paint  paints   0.68
  paint  painted  0.60
  pain   pains    0.60
  pain   paint    0.11
  ran    rant     0.09
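The semantic cue above can be sketched with a plain cosine-similarity computation. The 3-d vectors below are toy stand-ins for real word vectors learned from a large corpus:

```python
import math

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Toy 3-d vectors (illustrative only; real vectors come from a large corpus)
vec = {
    "paint":  [0.90, 0.10, 0.20],
    "paints": [0.85, 0.15, 0.25],
    "pain":   [0.10, 0.90, 0.30],
}

print(cosine(vec["paint"], vec["paints"]))  # morphologically related: high
print(cosine(vec["paint"], vec["pain"]))    # orthographically close only: low
```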
Task Setup
Training: unannotated word list with frequencies. Example entries: a, ability, able, about, with frequency counts (395134, 17793, 56802, 524355).
Word vector learning: large text corpus (Wikipedia).
nation → national → international → internationally
nation → national → nationally → internationally
Multiple chains possible for a word.
nation → national → international → internationally
nation → national → nationalize
Different chains can share word pairs.
Independence Assumption
Treat word-parent pairs separately.
[Diagram: word (w) = national; candidate (z) = (parent (p) = nation, type (t) = suffix)]
  P(w, z) ∝ exp(θ · φ(w, z))

[Diagram: word (w) = national; candidate (z) = (parent (p) = nation, type (t) = suffix)]
Types: prefix, suffix, transformation, stop.
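The log-linear form above can be sketched as follows. The weights theta and the feature dictionaries are hypothetical stand-ins for φ(w, z), and the softmax here is taken only over one word's candidate set (the full model's normalization is discussed under Learning):

```python
import math

def score(theta, phi):
    """Unnormalized log-linear score theta · phi(w, z)."""
    return sum(theta.get(f, 0.0) * v for f, v in phi.items())

def candidate_probs(theta, candidates):
    """Softmax over a word's candidate (parent, type) set."""
    scores = [score(theta, phi) for _, phi in candidates]
    m = max(scores)                       # stabilize the exponentials
    exps = [math.exp(s - m) for s in scores]
    Z = sum(exps)
    return {z: e / Z for (z, _), e in zip(candidates, exps)}

# Hypothetical weights and features for the word "national"
theta = {"suffix=al": 1.5, "cos_sim": 2.0, "suffix=onal": -0.5}
candidates = [
    (("nation", "suffix"), {"suffix=al": 1.0, "cos_sim": 0.7}),
    (("nati", "suffix"),   {"suffix=onal": 1.0, "cos_sim": 0.1}),
]
probs = candidate_probs(theta, candidates)
best = max(probs, key=probs.get)
print(best)  # ('nation', 'suffix')
```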
Transformations
• Templates for handling changes in the stem during addition of affixes.
• Repetition template: PQ → PQQR, for each character Q in the alphabet. Example: plan → planning (P = pla, Q = n, R = ing).
• Feature template for each transformation.
Transformation Types
Three different transformations:
• Repetition (plan → planning)
• Deletion (decide → deciding)
• Modification (carry → carried)
Trade-off between types of transformation and computational tractability:
• These three do well for a range of languages and are computationally tractable: at most O(|Σ|²) for alphabet Σ.
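The three transformation types can be inverted to propose parent candidates for a word. The sketch below is a deliberate simplification (it only restores a dropped 'e' for deletion and an i → y change for modification), not the paper's full template set:

```python
def transform_parents(word, suffix):
    """Propose parents for word = stem + suffix by inverting the three
    transformation types (simplified sketch)."""
    if not word.endswith(suffix):
        return []
    stem = word[: len(word) - len(suffix)]
    parents = [stem]                       # plain suffixation: paint + ing
    # Repetition: planning = plan + n + ing, so undo the doubled final char
    if len(stem) >= 2 and stem[-1] == stem[-2]:
        parents.append(stem[:-1])
    # Deletion: deciding = decid + ing; restore the dropped 'e'
    parents.append(stem + "e")
    # Modification: carried = carri + ed; change 'i' back to 'y'
    if stem.endswith("i"):
        parents.append(stem[:-1] + "y")
    return parents

print(transform_parents("planning", "ing"))  # ['plann', 'plan', 'planne']
print(transform_parents("carried", "ed"))    # ['carri', 'carrie', 'carry']
```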
Features φ(w, z)

Orthographic:
• Affixes: indicator feature for top affixes
• Affix correlation: pairs of affixes sharing a set of stems, e.g. (inter-, re-), (under-, over-)
• Word frequency of the parent
• Transformation types with character bigrams

Semantic:
• Cosine similarity between word vectors of word and parent
[Chart: cosine similarity with "player" for neighboring words]
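A sketch of how φ(w, z) might combine these cues. All resources here (top affix set, frequency table, vectors) are toy stand-ins, and the affix-correlation and transformation features are omitted for brevity:

```python
import math

def extract_features(word, parent, affix_type, top_affixes, word_freq, vectors):
    """Sketch of phi(w, z): orthographic and semantic features for one
    (word, parent) candidate."""
    phi = {}
    # Orthographic: indicator for the affix, if it is among the top affixes
    affix = word[len(parent):] if affix_type == "suffix" else word[: len(word) - len(parent)]
    if affix in top_affixes:
        phi[f"{affix_type}={affix}"] = 1.0
    # Orthographic: (log) word frequency of the parent
    phi["parent_logfreq"] = math.log(1 + word_freq.get(parent, 0))
    # Semantic: cosine similarity between word and parent vectors
    wv, pv = vectors.get(word), vectors.get(parent)
    if wv and pv:
        dot = sum(a * b for a, b in zip(wv, pv))
        norm = math.sqrt(sum(a * a for a in wv)) * math.sqrt(sum(b * b for b in pv))
        phi["cos_sim"] = dot / norm
    return phi

phi = extract_features(
    "painting", "paint", "suffix",
    top_affixes={"ing", "ed", "s"},
    word_freq={"paint": 2500},
    vectors={"painting": [0.8, 0.2], "paint": [0.9, 0.1]},
)
print(sorted(phi))  # ['cos_sim', 'parent_logfreq', 'suffix=ing']
```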
Learning
• Objective (likelihood of the word list):

  ∏_w P(w) = ∏_w Σ_z P(w, z) = ∏_w Σ_z exp(θ·φ(w, z)) / Σ_{w′∈Σ*, z′} exp(θ·φ(w′, z′))

• Not tractable as stated: computing the normalization constant Z requires summing over all possible strings in the alphabet.
• Optimize the likelihood using convex optimization: LBFGS-B (with regularization).
Contrastive Estimation
• Instead, we use contrastive estimation (Smith and Eisner, 2005):
• A neighborhood of invalid words for each word to take probability mass from.
• Transpose pairs of adjacent characters from the first and last k characters of the word (e.g. painting → paintnig).

  P(w, z) = exp(θ·φ(w, z)) / Σ_{w′∈N(w), z′} exp(θ·φ(w′, z′))
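The transposition neighborhood can be sketched as follows; the choice k = 3 here is an assumption for illustration, not necessarily the paper's setting:

```python
def neighborhood(word, k=3):
    """Invalid neighbor strings for contrastive estimation: transpose each
    adjacent character pair lying within the first k or last k characters."""
    n = len(word)
    neighbors = set()
    for i in range(n - 1):
        in_prefix = i + 1 <= k - 1   # pair (i, i+1) inside the first k chars
        in_suffix = i >= n - k       # pair inside the last k chars
        if in_prefix or in_suffix:
            swapped = word[:i] + word[i + 1] + word[i] + word[i + 2:]
            if swapped != word:      # skip no-ops from doubled letters
                neighbors.add(swapped)
    return neighbors

print(sorted(neighborhood("painting")))  # includes 'paintnig'
```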
Prediction
• Predict the chain recursively (argmax parent candidate each time) until STOP.
  paintings → painting → paint → STOP
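The recursive prediction loop can be sketched as below; best_candidate stands in for the argmax over the learned model's candidate scores:

```python
def predict_chain(word, best_candidate):
    """Greedily follow argmax parents until the model predicts STOP."""
    chain = [word]
    while True:
        parent, ctype = best_candidate(chain[-1])
        if ctype == "stop":
            break
        chain.append(parent)
    return chain

# Toy stand-in for the trained model's argmax choices
toy = {"paintings": ("painting", "suffix"),
       "painting": ("paint", "suffix"),
       "paint": ("paint", "stop")}
print(predict_chain("paintings", lambda w: toy[w]))  # ['paintings', 'painting', 'paint']
```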
Segmentation Experiments
• Three languages: English, Arabic, Turkish
• Evaluation: morphological segmentation, scored by precision, recall, and F1 over individual segmentation points
• Baselines: Morfessor-Baseline, Morfessor CatMAP, AGMorph (Sirts and Goldwater, 2013), and Lee et al. (2011)
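Precision, recall, and F1 over segmentation points can be computed per word as sketched below (hyphens mark segment boundaries); the actual evaluation aggregates counts over the whole test set:

```python
def seg_points(segmentation):
    """Boundary positions of a segmentation, e.g. 'paint-ing-s' -> {5, 8}."""
    points, pos = set(), 0
    for seg in segmentation.split("-")[:-1]:
        pos += len(seg)
        points.add(pos)
    return points

def prf1(gold, pred):
    """Precision/recall/F1 over individual segmentation points."""
    g, p = seg_points(gold), seg_points(pred)
    tp = len(g & p)
    prec = tp / len(p) if p else 0.0
    rec = tp / len(g) if g else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1

print(prf1("paint-ing-s", "paint-ings"))  # precision 1.0, recall 0.5
```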
F1 Scores on MorphoChallenge
[Chart: F1 (0 to 90) for English, Turkish, and Arabic; systems: Morfessor-B, Morfessor-C, AGMorph, Lee (M2), MorphoChain]
Effect of data size
Affix Analysis
Error Analysis
• Errors (on a random subset of 50 words per language):

  Language  Over-segment  Under-segment
  English   10%           86%
  Turkish   12%           78%
  Arabic    60%           40%

• Most errors (58%) in Turkish are due to parent words that are absent or have low counts.
• The root-template morphology of Arabic causes 14% of errors.
Sample segmentations
Morphology in Keyword Spotting
• Keyword spotting (KWS) is the task of identifying keywords in speech utterances.
• Major issue: out-of-vocabulary (OOV) words.

[Chart: ATWV (0 to 0.2) on KWS-Test, supervised vs. unsupervised (Morfessor); KWS results on OOV keywords in Turkish LLP (Narasimhan et al., 2014)]

• Adding morphemes helps KWS.
• Better morphology can lead to better KWS (supervised vs. unsupervised).
• Need for better unsupervised segmentation.
• MorphoChain outperforms a state-of-the-art unsupervised morphological system on KWS.

[Charts: ATWV, Morfessor vs. MorphoChain for KWS, without and with web data; ATWV scores on Bengali VLLP]

*In collaboration with Damianos Karakos and Rich Schwartz at BBN.
Conclusions
• A new method for unsupervised morphological analysis incorporating both orthographic and semantic features.
• Equals or outperforms state-of-the-art systems on morphological segmentation.
• Works well on downstream tasks.

Code: http://people.csail.mit.edu/karthikn/morphochain/