1
LING 696B: Gradient phonotactics and well-formedness
2
Vote on remaining topics
Topics that have been fixed:
- Morpho-phonological learning (Emily) + (LouAnn's lecture) + Bayesian learning
- Rule induction (Mans) + decision tree learning
- Self-organization (Andy's lecture)
3
Voting on remaining topics
Select 2-3 from the following (need a ranking):
- OT and Stochastic OT
- Alternatives to OT: random fields / maximum entropy
- Minimal Description Length word chopping
- Feature-based lexical access
5
Well-formedness of words (following Mike's talk)
A word "sounds like English" if:
- It is a close neighbor of some words that sound really English, e.g. "pand" is a neighbor of sand, band, pad, pan, ...
- It agrees with what English grammar says an English word should look like, e.g. gradient phonotactics says blick > bnick
Today: relate these two ideas to the non-parametric and parametric perspectives
8
Many ways of calculating probability of a sequence
- Unigrams, bigrams, trigrams, syllable parts, transition probabilities ... no bound on the number of creative ways
- What does it mean to say the "probability" of a phonological word? Objective/frequentist vs. subjective/Bayesian: philosophical (but important)
- Thinking "parametrically" may clarify things: "likelihood" = "probability" calculated from a model
10
Parametric approach to phonotactics
Example: "bag of sounds" assumption / exchangeable distributions: p(blik) = p(lbik) = p(kbli)
Unigram models: N - 1 parameters
[Diagram: four independent segment nodes B L I K]
What is θ? How do we get the estimate θ-hat? How do we assign a probability to "blick"? (See the sketch below.)
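A minimal sketch of these three steps in Python, assuming a toy lexicon and maximum-likelihood estimation (the words and counts are invented for illustration):

```python
from collections import Counter

# Toy training lexicon of segment strings (invented).
lexicon = ["blik", "sand", "band", "pad", "pan", "stik"]

# theta-hat: maximum-likelihood unigram probabilities
# p(w) = count(w) / total segments. N segment types give
# N - 1 free parameters, since the probabilities sum to 1.
counts = Counter(seg for word in lexicon for seg in word)
total = sum(counts.values())
theta = {seg: c / total for seg, c in counts.items()}

def unigram_prob(word):
    """Bag-of-sounds probability: product of segment
    probabilities, invariant under reordering."""
    p = 1.0
    for seg in word:
        p *= theta.get(seg, 0.0)  # zero for unseen segments
    return p

print(unigram_prob("blik") == unigram_prob("kbli"))  # True
```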
11
Parametric approach to phonotactics
Unigram model with overlapping observations: N² - 1 parameters
[Diagram: chain of overlapping two-segment units over B L I K]
Note: the input is #B BL LI IK K#
What is θ? How do we get θ-hat? How do we assign a probability to "blick"? (See the sketch below.)
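The same estimator, but over overlapping two-segment windows of the boundary-padded word; a sketch continuing the toy lexicon above, treating each window as an independent draw:

```python
def windows(word):
    """Overlapping units of a boundary-padded word:
    'blik' -> ['#b', 'bl', 'li', 'ik', 'k#']."""
    padded = "#" + word + "#"
    return [padded[i:i + 2] for i in range(len(padded) - 1)]

unit_counts = Counter(u for word in lexicon for u in windows(word))
unit_total = sum(unit_counts.values())

def overlap_prob(word):
    # Each overlapping unit is treated as an independent
    # observation, which is the simplification the slide notes.
    p = 1.0
    for u in windows(word):
        p *= unit_counts.get(u, 0) / unit_total
    return p
```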
12
Parametric approach to phonotactics
Unigram with annotated observations (Coleman and Pierrehumbert)
[Diagram: BL annotated "osif" (onset of strong initial/final syllable); IK annotated "rsif" (rhyme of strong initial/final syllable)]
Input: segments annotated with a syllable parse (see the sketch below)
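A sketch of the annotated-unit idea: observations are (constituent, annotation) pairs, and probabilities are estimated within each annotation class. The one-syllable parse function and its labels here are hypothetical stand-ins, not Coleman and Pierrehumbert's actual parser:

```python
from collections import Counter, defaultdict

def annotate(word):
    """Hypothetical parse: split a CCVC word into an onset and a
    rhyme, tagged as strong initial/final ('osif'/'rsif')."""
    return [(word[:2], "osif"), (word[2:], "rsif")]

class_counts = defaultdict(Counter)
for word in lexicon:  # the toy lexicon from above
    for unit, tag in annotate(word):
        class_counts[tag][unit] += 1

def annotated_prob(word):
    # Product of p(unit | annotation) over the syllable parse.
    p = 1.0
    for unit, tag in annotate(word):
        c = class_counts[tag]
        p *= c.get(unit, 0) / (sum(c.values()) or 1)
    return p
```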
13
Parametric approach to phonotactics
Bigram model: N(N - 1) parameters {p(w_n | w_{n-1})} (how many for a trigram?)
[Diagram: Markov chain B -> L -> I -> K]
Input: segment sequence (see the sketch below)
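A sketch of the bigram model over the same toy lexicon, with # as the word boundary. Conditioning on two previous segments instead of one answers the trigram question: N² contexts times (N - 1) free probabilities each, i.e. N²(N - 1) parameters.

```python
from collections import Counter, defaultdict

bigram_counts = defaultdict(Counter)
for word in lexicon:  # toy lexicon from above
    padded = "#" + word + "#"
    for prev, cur in zip(padded, padded[1:]):
        bigram_counts[prev][cur] += 1

def bigram_prob(word):
    """p(word) = prod_n p(w_n | w_{n-1}), boundaries included."""
    padded = "#" + word + "#"
    p = 1.0
    for prev, cur in zip(padded, padded[1:]):
        total = sum(bigram_counts[prev].values())
        p *= bigram_counts[prev][cur] / total if total else 0.0
    return p

print(bigram_prob("blik") > bigram_prob("bnik"))  # blick > bnick?
```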
15
Ways that theory might help calculate probability
- Probability calculation must be based on an explicit model
- Need a story about what sequences are
- How can phonology help with calculating sequence probability? More delicate representations; more complex models
- But: phonology is not quite about what sequences are ...
16
More delicate representations
- Would CV phonology help? Auto-segmental tiers, features, gestures?
- The chains are no longer independent: more sophisticated models are needed
- Limit: a generative model of speech production (very hard)
[Diagram: multi-tier representation over the segments B L I K I T]
17
More complex models
Mixture of unigrams: used in document classification (see the sketch below)
[Diagram: latent "lexical strata" node selecting a unigram model over B L I K]
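A sketch of a mixture of unigrams for phonotactics: each lexical stratum has its own unigram distribution and a prior weight. The strata and numbers below are invented; in practice both would be learned, e.g. with EM:

```python
# Hypothetical strata (e.g. native vs. loan vocabulary) with
# invented priors p(z) and per-stratum unigram parameters.
strata = {
    "native": (0.7, {"b": 0.2, "l": 0.2, "i": 0.3, "k": 0.3}),
    "loan":   (0.3, {"b": 0.1, "l": 0.1, "i": 0.4, "k": 0.4}),
}

def mixture_prob(word):
    """p(word) = sum_z p(z) * prod_seg p(seg | z)."""
    total = 0.0
    for prior, theta in strata.values():
        p = prior
        for seg in word:
            p *= theta.get(seg, 0.0)
        total += p
    return total

print(mixture_prob("blik"))
```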
18
More complex models
- More structure in the Markov chain
- Can also model the length distribution with so-called semi-Markov models
[Diagram: states "onset", "rhyme V", "rhyme VC" emitting the chunks BL, IK]
19
More complex models
Probabilistic context-free grammar:
  Syllable --> C + VC  (0.6)
  Syllable --> C + V   (0.35)
  Syllable --> C + C   (0.05)
  C --> _  (0.01)
  C --> b  (0.05)
  ...
See 439/539
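A sketch of how a PCFG scores a word: the probability of a derivation is the product of its rule probabilities. The grammar fragment mirrors the slide; the VC expansion probability is an invented placeholder:

```python
# Rule probabilities; the VC expansion is a placeholder
# that was not given on the slide.
rules = {
    ("Syllable", ("C", "VC")): 0.6,
    ("Syllable", ("C", "V")):  0.35,
    ("Syllable", ("C", "C")):  0.05,
    ("C", ("b",)):   0.05,
    ("VC", ("ik",)): 0.1,  # placeholder
}

def derivation_prob(derivation):
    """Product of rule probabilities along a derivation."""
    p = 1.0
    for rule in derivation:
        p *= rules[rule]
    return p

# "bik" parsed as Syllable -> C VC, C -> b, VC -> ik
print(derivation_prob([
    ("Syllable", ("C", "VC")),
    ("C", ("b",)),
    ("VC", ("ik",)),
]))  # 0.6 * 0.05 * 0.1 = 0.003
```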
20
What's the benefit of doing more sophisticated things?
- Recall: maximum likelihood needs more data to produce a better estimate
- Data sparsity problem: training data is often insufficient for estimating all the parameters, e.g. zero counts
- Lexicon size: we don't have infinitely many words from which to estimate phonotactics
- Smoothing: properly done, it has a Bayesian interpretation (though often it is not; see the sketch below)
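A sketch of add-λ smoothing for unigram counts; with a symmetric Dirichlet(λ) prior, the smoothed estimate is exactly the posterior mean, which is the Bayesian interpretation alluded to above. The counts and inventory are invented:

```python
from collections import Counter

def smoothed_theta(counts, inventory, lam=1.0):
    """Add-lambda smoothing: no segment gets zero probability.
    Equals the posterior mean under a Dirichlet(lambda) prior."""
    total = sum(counts.values()) + lam * len(inventory)
    return {seg: (counts.get(seg, 0) + lam) / total
            for seg in inventory}

toy_counts = Counter({"b": 3, "l": 2, "i": 5, "k": 4})
theta = smoothed_theta(toy_counts, set("blikz"))
print(theta["z"])  # unseen segment, but nonzero after smoothing
```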
21
Probability and well-formedness
- Generative modeling: characterize a distribution over strings
- Why should we care about this distribution? Hope: it may have something to do with grammaticality judgements
- But: judgements are also affected by what other words "sound like". Puzzle of mrupect/mrupation
- It may be easier to model a function with input = string, output = judgements
23
Bailey and Hahn
- Tried all kinds of ways of calculating phonotactics and neighborhood density, and saw which combination "works the best"
- Typical reasoning: "metric X and Y as factors explain 15% of the variance"
- Methodology: ANOVA. Model (1-way): data = overall mean + effect + error, i.e. y_ij = μ + α_i + ε_ij
- What can ANOVA do for us? How do we check whether ANOVA makes sense? What is the "explained variance"? (See the sketch below.)
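A sketch of "explained variance" in the one-way layout: the between-group sum of squares as a fraction of the total sum of squares. The group sizes and ratings are invented:

```python
def explained_variance(groups):
    """R^2 = SS_between / SS_total for a one-way layout,
    where `groups` is a list of lists of ratings."""
    values = [v for g in groups for v in g]
    grand = sum(values) / len(values)
    ss_total = sum((v - grand) ** 2 for v in values)
    ss_between = sum(len(g) * (sum(g) / len(g) - grand) ** 2
                     for g in groups)
    return ss_between / ss_total

# Toy ratings grouped by condition (invented numbers)
print(explained_variance([[3, 4, 5], [6, 7, 8], [2, 3, 2]]))
```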
24
Non-parametric approach to similarity neighborhood
- A hint from B&H: in the neighborhood model, d_ij is a weighted edit distance, with A, B, C, D estimated by polynomial regression
- Recall: radial basis functions F(x) = Σ_i a_i K(x, x_i), with K(x, x_i) = e^{-d(x, x_i)}
- The quadratic weighting is ad hoc; one should just do general nonlinear regression with RBFs
25
Non-parametric approach to similarity neighborhood
- Recall: RBF as a "soft" neighborhood model
- Now think of strings also as data points, with the neighborhood defined by some string distance (e.g. edit distance)
- Same kind of regression with RBFs
26
Non-parametric approach to similarity neighborhood
- Key technical point: choosing the right kernel
- Edit-distance kernel: K(x, x_i) = e^{-edit(x, x_i)} (see the sketch below)
- Sub-string kernel: measures the length of common sub-sequences (mrupation)
- Key experimental data: controlled stimuli, split into training and test sets (equal phonotactic probability)
- No need to transform the rating scale
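A sketch of RBF regression over strings with the edit-distance kernel, fit by regularized least squares (kernel ridge regression); the training strings, ratings, and ridge constant are invented for illustration:

```python
import math
import numpy as np

def edit(a, b):
    """Levenshtein distance by dynamic programming."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1,
                                     prev + (ca != cb))
    return dp[len(b)]

def K(x, y):
    """Edit-distance kernel K(x, x_i) = exp(-edit(x, x_i))."""
    return math.exp(-edit(x, y))

# Toy training items with invented well-formedness ratings.
train = ["blick", "bnick", "sand", "mrupe"]
ratings = np.array([6.0, 3.0, 7.0, 2.0])

# Solve (G + lam*I) a = y for the RBF coefficients a_i.
G = np.array([[K(a, b) for b in train] for a in train])
coef = np.linalg.solve(G + 1e-3 * np.eye(len(train)), ratings)

def predict(x):
    """F(x) = sum_i a_i K(x, x_i): a soft neighborhood score."""
    return sum(a * K(x, xi) for a, xi in zip(coef, train))

print(predict("blick"), predict("pand"))
```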
27
Non-parametric approach to similarity neighborhood
A whole range of questions opens up with the non-parametric perspective:
- Would a yes/no task lead to word "anchors", like support vectors?
- Would the new words interact with each other, as seen in transductive inference?
- What type of metric is most appropriate for inferring well-formedness from neighborhoods?
28
Integration
- Hard to integrate with a probabilistic (parametric) model: neighborhood density has a strong non-parametric character -- it grows with the data
- Possible to integrate phonotactic probability into a non-parametric model via kernel algebra: aK1(x,y) + bK2(x,y) and K1(x,y)*K2(x,y) are also kernels
- p-kernel: K(x1, x2) = Σ_h p(x1|h) p(x2|h) p(h), where p comes from a parametric model (see the sketch below)
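A sketch of both combination routes: kernel algebra over the edit-distance kernel from the previous sketch, and a p-kernel summing over a set of toy unigram hypotheses. All weights and parameters are invented:

```python
import math

def unigram_prob(word, theta):
    p = 1.0
    for seg in word:
        p *= theta.get(seg, 1e-6)  # small floor for unseen segments
    return p

# Toy hypotheses h with priors p(h) and unigram parameters p(.|h).
hypotheses = [
    (0.7, {"b": 0.10, "l": 0.10, "i": 0.20, "k": 0.20}),
    (0.3, {"b": 0.05, "l": 0.05, "i": 0.30, "k": 0.30}),
]

def K_p(x, y):
    """p-kernel: K(x1, x2) = sum_h p(x1|h) p(x2|h) p(h)."""
    return sum(ph * unigram_prob(x, th) * unigram_prob(y, th)
               for ph, th in hypotheses)

def K_edit(x, y):
    return math.exp(-edit(x, y))  # edit() as in the sketch above

# Kernel algebra: nonnegative sums and products of kernels
# are again valid kernels.
def K_sum(x, y, a=0.5, b=0.5):
    return a * K_p(x, y) + b * K_edit(x, y)

def K_prod(x, y):
    return K_p(x, y) * K_edit(x, y)
```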